Machine Intelligence in Decoding of Forward Error Correction Codes.


DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

Machine Intelligence in Decoding of Forward Error Correction Codes

NAVNEET AGRAWAL

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING


Abstract

A deep learning algorithm for improving the performance of Sum-Product Algorithm (SPA) based decoders is investigated. The proposed Neural Network Decoder (NND) [22] generalizes the SPA by assigning weights to the edges of the Tanner graph. We elucidate the particular design, training, and operation of the NND. We analyze the edge-weight distribution of the trained NND and provide a deeper insight into its working. The training process of the NND learns the edge weights in such a way that the effects of artifacts in the Tanner graph (such as cycles or trapping sets) are mitigated, leading to a significant improvement in performance over the SPA. We conduct an extensive analysis of the training hyper-parameters affecting the performance of the NND, and present hypotheses for determining their appropriate choices for different families and sizes of codes. Experimental results are used to verify the hypotheses and rationale presented. Furthermore, we propose a new loss function that improves performance over the standard cross-entropy loss. We also investigate the limitations of the NND in terms of complexity and performance. Although the SPA based design of the NND enables faster training and reduced complexity, the design constraints restrict the neural network from reaching its maximum potential. Our experiments show that the NND is unable to reach the Maximum Likelihood (ML) performance threshold for any plausible set of hyper-parameters. However, for short-length (n ≤ 128) High Density Parity Check (HDPC) codes such as polar or BCH codes, the performance improvement over the SPA is significant.

Sammanfattning

En djup inlärningsalgoritm för att förbättra prestanda hos SPA-baserade (Sum-Product Algorithm) avkodare undersöks. Den föreslagna neuronnätsavkodaren (Neural Network Decoder, NND) [22] generaliserar SPA genom att tilldela vikter till bågarna i Tannergrafen. Vi undersöker neuronnätsavkodarens utformning, träning och funktion. Vi analyserar fördelningen av bågvikter hos en tränad neuronnätsavkodare och förmedlar en djupare insikt i dess funktion. Träningen av neuronnätsavkodaren är sådan att den lär sig bågvikter så att effekterna av artefakter hos Tannergrafen (såsom cykler och fångstmängder, eng. trapping sets) minimeras, vilket leder till betydande prestandaförbättringar jämfört med SPA. Vi genomför en omfattande analys av de tränings-hyperparametrar som påverkar prestanda hos neuronnätsavkodaren och presenterar hypoteser för lämpliga val av tränings-hyperparametrar för olika familjer och storlekar av koder. Experimentella resultat används för att verifiera dessa hypoteser och förklaringar presenteras. Dessutom föreslår vi ett nytt felmått som förbättrar prestanda jämfört med det vanliga korsentropimåttet (eng. cross-entropy loss). Vi undersöker också begränsningar hos neuronnätsavkodaren med avseende på komplexitet och prestanda. Neuronnätsavkodaren är baserad på SPA, vilket möjliggör snabbare träning och minskad komplexitet, till priset av begränsningar hos neuronnätsavkodaren som gör att den inte kan nå ML-prestanda för någon rimlig uppsättning tränings-hyperparametrar. För korta (n ≤ 128) högdensitetsparitetskoder (High-Density Parity Check, HDPC), exempelvis polarkoder eller BCH-koder, är prestandaförbättringarna jämfört med SPA dock betydande.

Acknowledgment

First and foremost, I would like to thank my supervisor Dr. Hugo Tullberg. His guidance and support were most helpful in understanding and developing the technical aspects of this work, as well as in the scientific writing of the thesis dissertation. Prof. Ragnar Thobaben was not only the examiner for this thesis but also acted as a co-supervisor. I would like to thank him for the ideas and direction he provided, which motivated a significant part of this research work. I thank my manager, Maria Edvardsson, for providing moral support and guidance. I would also like to extend my gratitude to Mattias W. Andersson, Vidit Saxena, Johan Ottersten and Maria Fresia, who were always present when I needed to discuss ideas or understand complex concepts. Last, but not least, I would like to thank my wife, Madolyn, and my parents for their unconditional love and support.

Contents

Abstract
Acknowledgment
Table of Contents
List of Figures
List of Tables
Abbreviations

1 Introduction
   Background
      Decoder Design
   Problem Formulation and Method
   Motivation
      Neural Networks
      Graphical Modeling and Inference Algorithms
   Societal and Ethical Aspects
   Thesis outline
   Notations

2 Background
   Communication System Model
   Factor Graphs and Sum-Product Algorithm
      Generalized Distributive Law
      Factor Graphs
      Sum-Product Algorithm
   Coding Theory
      Maximum Likelihood Decoding
      Iterative Decoder Design
      Cycles and trapping sets
   Neural Networks
      Introduction
      Network Training
      Parameter Optimization
      Online Phase

3 Neural Network Decoders
   Sum-Product Algorithm revisited
      Network Architecture
      Operations
   Neural Network Decoder Design
      Network Architecture and Operations
      Computational Complexity
   Hyper-parameter Analysis
      Parameters
      Normalized Validation Score
      Common Parameters
      Number of SPA iterations
      Network Architecture
      Loss functions
      Learning rate
      Training and Validation Data
   Summary

4 Experiments and Results
   Experimental Setup
      Tools and Software
      Training
      Testing
   Trained Weights Analysis
      Learning Graphical Artifacts
      Evolution of Weights in Consecutive Layers
   Decoding Results
      (32, 16) polar code
      (32, 24) polar code
      (128, 64) polar code
      (63, 45) BCH code
      (96, 48) LDPC code
   Summary

5 Conclusions and Future Work
   Conclusions
   Future Work
   Recommendations

List of Figures

2.1 Communication system model for decoder design
2.2 Graphical representation of function f in (2.2)
2.3 Tanner graph of parity check matrix for (7,4) Hamming code
2.4 Tanner graph showing cycles for (7,4) Hamming code
2.5 Neural network example
3.1 SPA-NN and Tanner graphs for (7,4) Hamming code
Neural Network Decoder graph for (7,4) Hamming code
Computational complexity comparison
NND performance comparison of different hyper-parameter settings
Comparison for selection of learn-able weights
Comparison of performance of NND for different numbers of SPA iterations
Comparison of performance for different numbers of SPA iterations
Comparison for different network architectures
Comparison for syndrome check loss function
Cross entropy vs energy based loss function
Comparison of different loss functions
Comparison of SNR values for training (32,16) polar code
Comparison of BER performance for different training epochs
Weight distribution analysis - (7,4) Hamming
Weight distribution analysis - (7,4) tree
Evolution of weights
Decoding results and edge weight analysis for (32,16) polar code
Decoding results and edge weight analysis for (32,24) polar code
Decoding results and edge weight analysis for (128,64) polar code
Decoding results and edge weight analysis for (63,45) BCH code
Decoding results and edge weight analysis for (96,48) LDPC code

List of Tables

3.1 Number of operations required to perform one SPA iteration in NND
Hyper-parameter list
List of codes evaluated for their decoding performance with the NND
Parameter settings for (32,16) polar code
Parameter settings for (32,24) polar code
Parameter settings for (128,64) polar code
Parameter settings for (63,45) BCH code
Parameter settings for (96,48) LDPC code

Abbreviations

AWGN     Additive White Gaussian Noise
BCH      Bose Chaudhuri Hocquenghem
BER      Bit Error Rate
BLER     Block Error Rate
BP       Belief Propagation
BPSK     Binary Phase Shift Keying
DNN      Deep Neural Network
EML      Energy based Multi-Loss
FEC      Forward Error Correction
FF-NND   Feed Forward architecture based NND
FFT      Fast Fourier Transform
GF       Galois Field
HDPC     High Density Parity Check
i.i.d.   Independent and identically distributed
IoT      Internet of Things
KL       Kullback-Leibler
LDPC     Low Density Parity Check
LLR      Log Likelihood Ratio
MAP      Maximum a-posteriori
ML       Maximum Likelihood
MPF      Marginalization of Product Function
NND      Neural Network Decoder
NVS      Normalized Validation Score
RNN-NND  Recurrent Neural Network architecture based NND
SPA      Sum Product Algorithm
SPA-NN   Sum Product Algorithm based Neural Network
URLLC    Ultra Reliable Low Latency Communication

Chapter 1
Introduction

1.1 Background

With an estimated 29 billion devices connected to the Internet by 2022 [15], the amount and diversity of mobile communication will grow tremendously. 18 billion of those devices will be related to the Internet of Things (IoT), serving different use-cases such as connected cars, machines, meters, sensors, etc. The 5th generation of communication systems is envisaged to support a large number of IoT devices falling into the scenario of Ultra Reliable Low Latency Communication (URLLC), with strict requirements on latency (within milliseconds) and reliability (Block Error Rate (BLER) < 10⁻⁵, and beyond). Forward Error Correction (FEC) codes are used for channel coding to make the communication reliable. To ensure low latency, the transmission data length has to be kept short, coupled with low-complexity decoding algorithms.

Decoder Design

It has been 70 years since the publication of Claude Shannon's celebrated "A Mathematical Theory of Communication" [26], which founded the fields of channel coding, source coding, and information theory. Although Shannon theoretically proved the existence of codes that can ensure reliable communication (up to a certain information rate below the channel capacity), he did not specify methods or codes which can achieve this practically. Practically successful codes with high error-correcting capabilities must also have a low-complexity decoding algorithm to decode them. The decoding problem has no optimal polynomial time solution (it is NP-Hard), and for years researchers struggled to find an algorithm that achieves desirable performance with low complexity. A major breakthrough came with the introduction of an iterative decoding algorithm, known as the Sum Product Algorithm (SPA) [13, 28], and the re-discovery of Low Density Parity Check (LDPC) codes, which perform near-optimally with the SPA. However, the SPA performance remains sub-optimal for short-length codes with good error-correcting capabilities, as cycles are inherently present in the graphs of good codes (cf. Section 2.3.3). A short cycle in the graph degrades the performance of the SPA by forcing the decoder to operate locally, so that the global optimum is impossible to find. In order to achieve near Maximum a-posteriori

(MAP) performance in decoding, the decoding algorithm must find the globally optimal solution for a cyclic code within polynomial time complexity.

1.2 Problem Formulation and Method

In this thesis, we propose to develop methods that combine expert knowledge of the system with data-driven approaches in machine learning, in an attempt to improve the performance of decoding algorithms. The scope of our study is restricted to binary, symmetric and memory-less channels with Additive White Gaussian Noise (AWGN), and Binary Phase Shift Keying (BPSK) modulation in a single-carrier system. We restrict our study to binary linear block codes since they are the most commonly used codes in modern communication systems. We will study different algorithms in the context of decoding linear block codes, and explore methods to incorporate data-driven learning using neural networks. In order to evaluate the performance of different algorithms, simulations are carried out using various tools developed during this thesis. The objectives of this thesis are:

- Study and investigate:
  - Graphical modeling and inference techniques, with focus on factor graphs and message passing algorithms.
  - Data-driven machine learning techniques, with focus on Deep Neural Networks (DNN) and their variants.
- Decoder design:
  - Study channel coding basics, with focus on standard algorithms for decoding binary linear block codes.
  - Review the literature on methods using neural networks for decoding. Analyze the methods for performance, scalability, and complexity.
  - Implement and analyze neural network based decoder algorithms that improve upon the performance of the standard SPA for short to medium length codes.
  - Evaluate performance by comparing Bit Error Rates (BER) and Block Error Rates (BLER) for different families of codes.

The aim of this thesis is not to design a complete receiver system, but to introduce methods that can enable data-driven learning in communication systems, using already available expert knowledge about the system. The algorithms and analysis developed in this thesis are applicable to a wide variety of problems related to the optimization of multi-variate systems.

1.3 Motivation

The primary motivation for this work came from (a) the recent advances in machine learning algorithms [18], and (b) the development of signal processing and digital communication algorithms as instances of the SPA on factor graphs [20].

1.3.1 Neural Networks

Neural networks have long been applied to solve problems in digital communications [19]. However, due to their high complexity in both training and application, they were mostly considered theoretically and never applied in practice to communication systems. More recently, due to the advent of more powerful algorithms such as the DNN, and the tremendous increase in the computational power of modern processors, there have been renewed efforts towards developing communication systems based on machine learning [8, 24]. The data-driven discriminative learning approach of the DNN uses complex non-linear models to represent the system generating the data. Multi-layered feed-forward neural networks, such as the DNN, are a class of universal function approximators [16].

1.3.2 Graphical Modeling and Inference Algorithms

Communication systems are based on probabilistic analysis of the underlying variables. These systems comprise multiple variables (visible or hidden), and often we are interested in calculating the joint or marginal probability distributions of their variables. Graphical models provide an approach to augment this analysis using a simplified graphical representation of the probability distribution of the variables in the system [6]. Graphical modeling also allows incorporating expert knowledge about the relationships of the variables into the system model. Many algorithms deal with complicated systems of multiple variables by factorizing the global function (joint distribution) into a product of local functions, each of which depends only on a subset of the variables. Such factorization leads to reduced complexity, and provides efficient methods (such as message passing algorithms) for making inferences about the variables. A simple and general graphical construction to represent these factorizations is the factor graph. Message passing algorithms operate on the nodes of factor graphs. When factor graphs are cycle-free, message passing algorithms provide exact inference. However, for cyclic graphs, the inference is approximate. For some problems the algorithm may converge to near-optimal results on cyclic graphs, but for others it will not. The decoding algorithm using the SPA is one instance of message passing on factor graphs. In this work, we investigate methods to incorporate data-driven learning in the SPA decoder, in order to overcome some of its issues. The resemblance of factor graphs to neural networks has motivated us to apply these two methods in conjunction.

1.4 Societal and Ethical Aspects

As our societies move towards greater connectivity and automation, the energy efficiency and sustainability of entire eco-systems become paramount. Communication systems are an essential part of any eco-system involving multiple devices. In communication systems, improvements in the decoding performance of the receiver lead to a reduction in failed transmissions and re-transmissions, and hence to a reduction in the overall energy consumption. Energy efficiency will help ensure the sustainability of IoT devices that are part of the massive machine-to-machine communication or URLLC framework. Experimental results (see Chapter 4) show that for short-length HDPC codes, by

using the DNN based decoding algorithm, we can achieve a power gain of 2-4 dB compared to the standard SPA.

DNN algorithms designed to model stochastic systems require parallel training of the model on the actual online data in order to continuously adapt. Since devices usually possess minimal computational resources, online data is sent to a centralized system for processing. This may lead to ethical issues regarding the privacy of the data. However, a coding system based on linear block codes is a deterministic system. A DNN based algorithm for a deterministic system need not adapt to the online data. A DNN based decoder, trained sufficiently on artificially generated coding data, will provide its best decoding performance during its online run. Hence, the DNN based decoding algorithm may be exempt from data-privacy related ethical issues.

1.5 Thesis outline

In Chapter 2, we introduce the concepts required to understand the rest of the thesis. The Neural Network Decoder (NND) design and analysis is presented in Chapter 3. There, we also cover the previous related work in this field and present an in-depth analysis of the hyper-parameters crucial to the performance and implementation of the NND. In Chapter 4, we present the experimental setup, analysis and decoding results for different families and sizes of codes. Finally, in Chapter 5, we present the important conclusions drawn from the work and give recommendations for future research.

1.6 Notations

In this section, we provide the basic notation used throughout this thesis. We follow the notation of set theory and coding theory. Note that some specific notations or symbols are introduced within the report as they are used. An element x belonging to the set S is denoted by x ∈ S. A set S with discrete elements is specified by listing its elements in curly brackets, for example S = {1, 2, 3}. The size of a set is denoted by |S|. If a set S consists of only those elements of another set X that satisfy a property F, then S can be denoted by S = {x ∈ X : F(x)}, or, if X is known, S = {x : F(x)}. Notations such as union (∪), intersection (∩), subset (⊂), and their negations (for example ∉) have their usual meaning from set theory. The operation × denotes the matrix product and ⊙ denotes the matrix element-wise product. Consider, as an example, a set of multiple variables X = {x₁, x₂, x₃, x₄}. The operator backslash (\) between two sets denotes exclusion of the set of elements on the right from the set on the left, for example S = X \ {x₁, x₂} = {x₃, x₄}. The operator tilde (∼) before an element or set denotes the set formed by excluding that element or set from its parent set, for example S = ∼{x₁} = X \ {x₁} = {x₂, x₃, x₄}. In such cases, the parent set X must be known. Sets are denoted by calligraphic capital letters, such as X. Bold lowercase characters denote vectors (x), and bold capital letters denote matrices (X). The ith element of a vector x is denoted by x(i), whereas the (i, j)th element

of a matrix is denoted by X(i, j). Superscript and subscript letters following a variable (e.g., X_{A,a}^{B,b}) denote special properties of the variable.

Chapter 2
Background

In this chapter, we provide the basics required to understand the rest of the thesis. We organize this chapter as follows. First, we present the basic communication system and channel model used in this study. Factor graphs and the Sum-Product Algorithm (SPA) are introduced next. Then we provide a brief overview of coding theory and extend the SPA in the context of decoding applications on the Tanner graph. We introduce neural networks in the last section.

2.1 Communication System Model

The goal of a communication system is to retrieve a message transmitted through a noisy communication channel. In the analysis of the decoder, we will use the simplified communication system model shown in Figure 2.1. The system is based on an AWGN channel and BPSK modulation. The encoded bits s_i ∈ {0, 1} are mapped to BPSK symbols y_i ∈ {+1, −1}. The modulated signals, y ∈ {+1, −1}ⁿ, are transmitted through the AWGN channel, with Gaussian distributed noise samples n_i ∼ N(0, σ²), σ ∈ ℝ. The received signal is given by r = y + n. The signals received at the receiver are demodulated to give the likelihood of a symbol having been transmitted. In an AWGN-BPSK communication system, the received signal r_i is a Gaussian random variable with mean µ ∈ {−1, +1} and variance σ². The likelihood ratio is generally represented in the log domain, as a Log-Likelihood Ratio (LLR), given by (2.1). These LLR values are fed into the decoder as input, and the decoder attempts to correct each bit using the redundancy introduced in the code through the encoding process.

   LLR_AWGN(s_i | r_i) = log [ P(r_i | s_i = 0) / P(r_i | s_i = 1) ] = log [ P(r_i | y_i = +1) / P(r_i | y_i = −1) ] = 2 r_i / σ²   (2.1)

where r_i is the received signal and σ² is the variance of the channel AWGN. We make the following assumptions: (a) the channel noise is real-valued with power spectral density N₀/2, and (b) the transmitted symbols are independent and identically distributed (i.i.d.). Notice that the modulation scheme maps 0 → +1 and 1 → −1, that is, y_i = (−1)^{s_i}.
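As a minimal numerical illustration of this model, the BPSK mapping, the AWGN channel and the channel LLRs of (2.1) can be simulated as in the following sketch; the block length, noise level and random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 7                      # block length (arbitrary, for illustration only)
sigma = 0.8                # AWGN standard deviation

s = rng.integers(0, 2, n)            # encoded bits s_i in {0, 1}
y = (-1.0) ** s                      # BPSK mapping: 0 -> +1, 1 -> -1
r = y + rng.normal(0.0, sigma, n)    # AWGN channel: r = y + n

llr = 2.0 * r / sigma**2             # channel LLRs as in (2.1)
hard = (llr < 0).astype(int)         # a positive LLR favours s_i = 0
print("bit errors before decoding:", np.count_nonzero(hard != s))
```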

For the analysis of the decoder, we use a BPSK modulation scheme in order to keep the complexity of the demodulator low, and to focus only on the performance of the decoder. The assumption of an AWGN channel is valid for decoder design because, in general, the communication system is designed such that the correlations induced by the channel are removed before information is sent to the decoder. Correlations in the received information can be removed by scrambling the input bits before transmission, as well as through the process of channel synchronization at the receiver. The performance of the SPA decoder degrades significantly for signals with correlated noise. Hence, it is important that the decoder receives i.i.d. information as input.

[Figure 2.1: Communication system model for decoder design. Source b ∈ {0,1}^k → Encoder s = ξ(b), s ∈ {0,1}ⁿ, n > k → Modulator y_i = (−1)^{s_i} → Channel r = y + n, n_i ∼ N(0, σ_n²) → Demodulator P_AWGN(s_i | r_i) → Decoder ŝ = argmax_s P(r | s) → Sink b̂ = ξ⁻¹(ŝ).]

2.2 Factor Graphs and Sum-Product Algorithm

In this section, we introduce factor graphs and the SPA as a general message passing algorithm for making inferences about the variables in a factor graph. Here, we develop the foundations for the work presented in this thesis. The introduction to factor graphs and the SPA is also necessary to motivate the extension of this work to a wider range of problems. For a general review of graphical modeling and inference algorithms, we refer the reader to Chapter 8 of [6].

2.2.1 Generalized Distributive Law

Many algorithms utilize the way in which a complicated global function factorizes into a product of local functions; for example, the forward/backward algorithm, the Viterbi algorithm, the Kalman filter, and the Fast Fourier Transform (FFT) algorithm. The general problem these algorithms solve can be stated as the marginalization of a product function (MPF). The MPF problem can be solved efficiently (exactly or approximately) using the Generalized Distributive Law (GDL) [3] or the SPA [20]. Both methods are essentially the same, that is, they are based on the humble distributive law, which states that ab + ac = a(b + c). Let us take a simple example to show the power of the distributive law.

Example. Consider a function f that factorizes as follows:

   f(x₁, x₂, x₃, x₄, x₅) = f₁(x₁, x₅) f₂(x₁, x₄) f₃(x₂, x₃, x₄) f₄(x₄)   (2.2)

where x₁, x₂, x₃, x₄, and x₅ are variables taking values in a finite set A with q elements. Suppose that we want to compute the marginal function f(x₁),

   f(x₁) = Σ_{∼x₁} f(x₁, x₂, x₃, x₄, x₅) = Σ_{x₂} Σ_{x₃} Σ_{x₄} Σ_{x₅} f₁(x₁, x₅) f₂(x₁, x₄) f₃(x₂, x₃, x₄) f₄(x₄)   (2.3)

which is a marginal of products. How many arithmetic operations are required for this task? For each of the q values of x₁ there are q⁴ terms in the sum defining f(x₁), with each term requiring one addition and three multiplications, so that the total number of arithmetic operations required is 4q⁵. We apply the distributive law to convert the marginal of products into a product of marginals as follows:

   f(x₁) = [ Σ_{x₅} f₁(x₁, x₅) ] [ Σ_{x₄} f₂(x₁, x₄) f₄(x₄) ( Σ_{x₂,x₃} f₃(x₂, x₃, x₄) ) ]   (2.4)

which is a product of marginals. The number of arithmetic operations required to compute (2.4) is 2q² + 6q⁴. Moreover, if we wish to calculate other marginals, the intermediate terms in the product in (2.4) can be reused directly, without re-computing them. If we follow the operations in (2.3), we have to re-compute the marginal for each variable separately, each requiring 4q⁵ operations. For larger systems with many variables, the distributive law reduces the complexity of computing the MPF problem significantly. The notion of addition and multiplication in the GDL can be further generalized to operations over a commutative semi-ring [3]. Hence, the GDL holds for other commutative semi-rings which satisfy the associative, commutative and distributive laws over the elements and operations defined in the semi-ring, such as max-product or min-sum.
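The saving can also be checked numerically. The sketch below builds random local-function tables for the factorization in (2.2), evaluates the marginal f(x₁) both by the brute-force sum (2.3) and by the factored form (2.4), and verifies that the two agree; the alphabet size q and the random tables are arbitrary choices.

```python
import numpy as np

q = 4                      # alphabet size (arbitrary)
rng = np.random.default_rng(0)

# Random non-negative local-function tables for (2.2):
f1 = rng.random((q, q))      # f1[x1, x5]
f2 = rng.random((q, q))      # f2[x1, x4]
f3 = rng.random((q, q, q))   # f3[x2, x3, x4]
f4 = rng.random(q)           # f4[x4]

# Brute-force marginal (2.3): build the full 5-dimensional product, sum out x2..x5.
full = np.einsum('ae,ad,bcd,d->abcde', f1, f2, f3, f4)   # axes a..e = x1..x5
marg_naive = full.sum(axis=(1, 2, 3, 4))

# Factored marginal (2.4): product of partial sums, as the distributive law suggests.
term_x5 = f1.sum(axis=1)                        # sum_{x5} f1(x1, x5)
term_f3 = f3.sum(axis=(0, 1))                   # sum_{x2,x3} f3(x2, x3, x4)
term_x4 = (f2 * (f4 * term_f3)).sum(axis=1)     # sum_{x4} f2(x1, x4) f4(x4) [...]
marg_fact = term_x5 * term_x4

assert np.allclose(marg_naive, marg_fact)       # (2.3) and (2.4) give the same marginal
```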

2.2.2 Factor Graphs

A factor graph provides a visual representation of the factorized structure of a system. Factor graphs can be used to represent systems with complex inter-dependencies between their variables. A stochastic system can be modeled as the joint probability of all its underlying variables, while a system with a specific deterministic configuration of variables can be modeled using an identity function that specifies the valid configurations. Factor graphs are a straightforward generalization of Tanner graphs [29]. The SPA operates on the factor graph to compute the various marginal functions associated with the global function. The function f in (2.2) can be represented as a tree-structured graph, as shown in Figure 2.2a, or as a bipartite factor graph, as shown in Figure 2.2b. Notice that in the factor graph of a tree-structured system, starting from one node and following the connected edges, we can never reach the same node again.

[Figure 2.2: Graphical representation of the function f in (2.2). (a) Tree representation of f, with variable nodes x₁, ..., x₅ and factor nodes f₁, ..., f₄. (b) Bipartite graph representation of f.]

2.2.3 Sum-Product Algorithm

The SPA operates by passing messages over the edges of the factor graph. For tree-structured graphs, such as the one shown in Figure 2.2a, the SPA gives exact inferences, while for graphs with cycles, the inferences made by the SPA are approximate. The basic operations in the SPA can be defined using just two equations: the variable to local-function message µ_{x→f}, given by (2.5), and the local-function to variable message µ_{f→x}, given by (2.6).

   Variable to local function:   µ_{x→f}(x) = ∏_{h ∈ n(x)\{f}} µ_{h→x}(x)   (2.5)

   Local function to variable:   µ_{f→x}(x) = Σ_{∼{x}} ( f(X) ∏_{y ∈ n(f)\{x}} µ_{y→f}(y) )   (2.6)

where n(x) denotes the neighboring elements of x in the bipartite graph, and the sum in (2.6) is over all arguments of f except x. For example, n(x₁) = {f₁, f₂} and n(f₂) = {x₁, x₄}. We will explain the SPA by solving the MPF problem given by (2.4), by message passing over the corresponding tree-structured graph shown in Figure 2.2a. The idea is to compute the marginal f(x₁) by passing messages along the edges of the graph, applying the two basic SPA operations alternately until all nodes are covered. We start the computations at the leaf factor nodes with the least number of variables, that is, f₃ and f₄. The messages reaching an intermediate variable node (for example x₄) should only be a function of that variable, with the other variables being marginalized out of the message before reaching this node. The function f₃ is marginalized over the variables x₂ and x₃ using (2.6), sending the message µ_{f₃→x₄}(x₄) towards node x₄.

This step gives us the rightmost element in the product in (2.4), that is, Σ_{x₂,x₃} f₃(x₂, x₃, x₄). A node must receive information from all connected edges, except the edge connecting the parent node, before it can send information to the parent node. Hence, variable node x₄ has to receive the message µ_{f₄→x₄}(x₄) from node f₄ before sending a message to node x₁. Once node x₄ has received information from its connected edges, it calculates the message µ_{x₄→f₂}(x₄) = µ_{f₃→x₄}(x₄) µ_{f₄→x₄}(x₄) using (2.5) and sends it to node f₂. The factor node f₂ passes the message µ_{f₂→x₁}(x₁) by marginalizing out the variable x₄, using (2.6). Similarly, variable node x₁ receives the message µ_{f₁→x₁}(x₁) from f₁ by marginalizing out variable x₅. The two messages reaching variable node x₁ represent the two multiplying factors in (2.4). The marginal f(x₁) is obtained by taking the product of these two messages reaching node x₁. Hence, we see that the SPA essentially applies equations (2.6) and (2.5) alternately, starting from the leaf nodes, until we reach the root node. By applying these equations to the entire graph, one can calculate the marginals of all variables.

Now consider the SPA operations over the factor graph of the same problem shown in Figure 2.2b. The variable nodes x_i initialize the messages µ_{x_i→f_j}(x_i) as some constant value or as observed values from the system. The factor nodes f_j apply (2.6) to calculate the messages that are passed back to the variable nodes. Next, the variable nodes apply (2.5) to calculate new messages to send forward to the factor nodes. Notice that a factor node f_j (or variable node x_i, respectively) calculates the message µ_{f_j→x_k}(x_k) (or µ_{x_i→f_k}(x_i)) using the incoming information from all nodes connected to f_j (or x_i), except the node x_k (or f_k) to which it will forward this information. If the variable nodes have initial observations, these observations are added to the variable node's outgoing messages. The variable and factor nodes process information iteratively in this fashion, until there are no more nodes remaining to pass the information (in the case of a tree structure), or time runs out (in the case of cyclic structures). The application of the SPA on a graph with cycles leads to an approximate inference of the variables. We will come back with some more details on the SPA for Tanner graphs with cycles in Section 2.3.3.

2.3 Coding Theory

A linear block code, denoted by C(n, k), is a code of length n which introduces redundancy into a block of k information bits by adding n − k parity-check bits that are functions ("sums and products") of the code and the information bits. A block code is linear if its codewords form a vector space over the Galois field GF(q), that is, any linear combination of codewords is also a codeword. In this thesis, we consider binary codes restricted to the field GF(2), consisting of the set of binary elements {0, 1} with modulo-2 arithmetic. Note that the all-zero codeword is a member of any linear block code, since adding an all-zero codeword to any other codeword gives the same codeword (identity property). The encoding process can be described as a vector-matrix multiplication of the information bit vector with a generator matrix G of size [n, k], that is, y = G b. The rows of a generator matrix form the basis of the linear space in GF(2ⁿ).

The dual space of G is given by the parity-check matrix H of size [n, n − k], which possesses the property Gᵀ H = 0. The rate of the code is defined by r = k/n, that is, the number of information bits per transmitted bit. The Hamming distance between two codewords is defined as the number of positions in which the bit values of the two codewords differ; in terms of GF(2) arithmetic, d_h(y, x) is the number of non-zero positions in y ⊕ x, for y, x ∈ C. The Hamming weight of a codeword is defined as its Hamming distance from the all-zero codeword. The minimum distance of a code is defined as d_min = min_{y,x ∈ C, y ≠ x} d_h(y, x), or, equivalently, as the minimum Hamming weight over all non-zero codewords in the code-book.

2.3.1 Maximum Likelihood Decoding

The task of the decoder is to find the transmitted codeword s, given the received signal r. An optimal decoder chooses the codeword ŝ ∈ C which gives the maximum probability for the received signal, p(r | s). This is the Maximum Likelihood (ML) decoder:

   ŝ = argmax_{s : s ∈ C} p(r | s)   (2.7)

In order to find an ML decoding solution, one has to examine the entire set of 2^k possible codewords, that is, the binary words satisfying the parity check Hᵀ sᵀ = 0ᵀ, to find the one which gives the maximum probability of the received signal being r. This problem becomes prohibitively complex as the codeword becomes longer. In fact, the ML decoding problem has no polynomial time solution (see [25]). Hence, it is reasonable to look for heuristic solutions, including neural networks.

2.3.2 Iterative Decoder Design

The iterative decoder is based on the SPA, which operates by passing messages over the Tanner graph of the code. Let us first give a brief review of the Tanner graph representation of a linear block code.

Tanner graph representation of a code

The Tanner graph is a bipartite graph that represents the linear constraints present in the code C(n, k). Any codeword must satisfy the parity check condition Hᵀ sᵀ = 0ᵀ. Using this property of the parity check matrix, the Tanner graph can be constructed by representing the columns of the parity check matrix as the variable nodes v, and the rows as the check nodes c. An edge connects the variable node v_j to the check node c_i if there is a 1, instead of a 0, at position (i, j) in H. Any binary vector y ∈ {0, 1}ⁿ is a codeword of the code C(n, k) if it satisfies every check defined by the modulo-2 sum of the values of the variable nodes connected to the corresponding check node. For example, consider the (7,4) Hamming code with parity check matrix H given by

   H = [3 × 7 parity-check matrix of the (7,4) Hamming code]   (2.8)
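The mapping from a parity-check matrix to its Tanner graph can be sketched as follows. The matrix used here is one standard choice of (7,4) Hamming parity-check matrix and may differ from the matrix in (2.8), for example by a permutation of columns.

```python
import numpy as np

# One standard (7,4) Hamming parity-check matrix (an assumption; the thesis's
# matrix in (2.8) may be ordered differently).
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]], dtype=int)

m, n = H.shape             # m = n - k check nodes, n variable nodes

# Tanner-graph edges: variable node v_j is connected to check node c_i iff H[i, j] = 1.
edges = [(i, j) for i in range(m) for j in range(n) if H[i, j] == 1]
print("check-node degrees   :", H.sum(axis=1))
print("variable-node degrees:", H.sum(axis=0))
print("edges (c_i, v_j)     :", edges)

# A binary word s is a codeword iff it satisfies every check: H s^T = 0 (mod 2).
def is_codeword(s):
    return not np.any(H @ s % 2)

print(is_codeword(np.zeros(n, dtype=int)))   # the all-zero word is always a codeword
```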

The Tanner graph given by the matrix in (2.8) is shown in Figure 2.3. Notice the correspondence between the configuration of the edges connecting the two sets of nodes in the Tanner graph and the 1s in the parity-check matrix.

[Figure 2.3: Tanner graph of the parity check matrix of the (7,4) Hamming code, with variable nodes v₀, ..., v₆ and check nodes c₀, c₁, c₂.]

Decoding using the Sum-Product Algorithm

The SPA introduced in Section 2.2.3 is the general message passing algorithm for making inferences on factor graphs. We now look at the SPA for the special case of decoding on Tanner graphs. We will see that the exact formulation is greatly simplified for the case of binary variables. For the binary case, we can think of the messages µ_{f→x} and µ_{x→f} as real-valued vectors of length 2, given by {µ(1), µ(−1)}. The initial message for a variable node x_i is the probability of the bit received at the ith realization of the channel, given by µ(x_i) = {p(r_i | x_i = 1), p(r_i | x_i = −1)}. Recall that for a variable node, the outgoing message takes the form (2.5),

   µ_{x→f}(1) = ∏_k µ_{f_k→x}(1),   µ_{x→f}(−1) = ∏_k µ_{f_k→x}(−1)   (2.9)

Let us consider the check node messages. A check node f sends a message to a variable node x about its belief that the variable node is +1 or −1:

   µ_{f→x}(x) = { µ_{f→x}(1), µ_{f→x}(−1) } = { Σ_{x₁,...,x_J} f(1, x₁, ..., x_J) ∏_j µ_j(x_j),  Σ_{x₁,...,x_J} f(−1, x₁, ..., x_J) ∏_j µ_j(x_j) }   (2.10)

We introduce the LLR using (2.1) to obtain a single value denoted by l = ln [µ(1)/µ(−1)]. Furthermore, the check node function can be written as an indicator function whose value is 1 if the parity check is satisfied and 0 otherwise, that is

f(x, x₁, ..., x_J) = I( ∏_j x_j = x ). Notice that since the symbols take values x ∈ {1, −1}, we can write ∏_j x_j = x instead of the modulo-2 sum ⊕_j x_j = x. The outgoing message at a check node is given by equation (2.6), so the message in (2.10) can be written as

   l_{f→x} = ln [ Σ_{x₁,...,x_J} f(1, x₁, ..., x_J) ∏_j µ_j(x_j) ] / [ Σ_{x₁,...,x_J} f(−1, x₁, ..., x_J) ∏_j µ_j(x_j) ]

   e^{l_{f→x}} = [ Σ_{x₁,...,x_J : ∏_j x_j = 1} ∏_j µ_j(x_j)/µ_j(−1) ] / [ Σ_{x₁,...,x_J : ∏_j x_j = −1} ∏_j µ_j(x_j)/µ_j(−1) ]
              = [ Σ_{x₁,...,x_J : ∏_j x_j = 1} ∏_j e^{l_j (1 + x_j)/2} ] / [ Σ_{x₁,...,x_J : ∏_j x_j = −1} ∏_j e^{l_j (1 + x_j)/2} ]
              = [ ∏_j (e^{l_j} + 1) + ∏_j (e^{l_j} − 1) ] / [ ∏_j (e^{l_j} + 1) − ∏_j (e^{l_j} − 1) ]
              = [ 1 + ∏_j (e^{l_j} − 1)/(e^{l_j} + 1) ] / [ 1 − ∏_j (e^{l_j} − 1)/(e^{l_j} + 1) ]   (2.11)

The last two steps follow by expanding ∏_j (e^{l_j} + 1) and ∏_j (e^{l_j} − 1) over the configurations with ∏_j x_j = 1 and ∏_j x_j = −1 in the numerator and denominator, respectively. Using the identity tanh(l_j / 2) = (e^{l_j} − 1)/(e^{l_j} + 1), we get the simplified form

   l_{f→x} = 2 tanh⁻¹ ( ∏_j tanh(l_j / 2) )   (2.12)

Similarly, the outgoing message of a variable node, given by the standard SPA equation (2.5), can be written in the context of decoding as

   l_{x_i→f_j} = ln ∏_k [ µ_{f_k→x_i}(1) / µ_{f_k→x_i}(−1) ] = Σ_k l_k   (2.13)

where the summation is over the k factor nodes connected to the variable node x_i, except the factor node f_j. The final output LLR is calculated by adding the channel information to all messages received by the variable node after one iteration:

   l̂_k = l_k + Σ_i l_{f_i→x_k}   (2.14)

Hence, we have obtained two simplified equations for the messages µ_{x→f} (2.13) and µ_{f→x} (2.12). The general flow of the SPA is given in Algorithm 2.1.

Algorithm 2.1 SPA algorithm
   Initialize: l_j = LLR_j;  l_{x_j→f_i} = l_j;  l̂_j = l_j;  c = 0;  C = max iterations
   repeat
      c = c + 1
      Parity check:  ŝ_k = 0 if l̂_k > 0, and ŝ_k = 1 otherwise
      if ŝ H = 0 then
         estimated codeword = ŝ;  END
      end if
      Check-node update:     l_{f_i→x_k} = 2 tanh⁻¹ ( ∏_{j\k} tanh(l_{x_j→f_i} / 2) )
      Variable-node update:  l_{x_k→f_i} = Σ_{j\i} l_{f_j→x_k}
      Output LLR:            l̂_k = l_k + Σ_i l_{f_i→x_k}
   until END or c = C
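Algorithm 2.1 can be sketched compactly in the LLR domain as follows; the function name, the iteration limit and the clipping thresholds (used to avoid the exploding values of tanh⁻¹ discussed later in Section 2.4) are our own choices.

```python
import numpy as np

def spa_decode(H, llr_ch, max_iters=20):
    """LLR-domain sum-product decoding, a sketch of Algorithm 2.1.

    H       : (m, n) binary parity-check matrix
    llr_ch  : length-n channel LLRs, e.g. 2*r/sigma**2 for BPSK over AWGN (2.1)
    returns : (hard-decision codeword estimate, parity-check-satisfied flag)
    """
    m, n = H.shape
    # Messages live on the edges of the Tanner graph; store them in (m, n)
    # arrays that are only meaningful where H[i, j] = 1.
    msg_vc = H * llr_ch           # variable-to-check, initialized with channel LLRs
    msg_cv = np.zeros((m, n))     # check-to-variable
    llr_out = llr_ch.copy()

    for _ in range(max_iters):
        s_hat = (llr_out < 0).astype(int)      # hard decision: 0 if LLR > 0
        if not np.any(H @ s_hat % 2):          # parity check satisfied -> stop
            return s_hat, True

        # Check-node update (2.12): product of tanh(l/2) over all edges but the target.
        t = np.tanh(np.clip(msg_vc, -30, 30) / 2.0)
        for i in range(m):
            idx = np.flatnonzero(H[i])
            for j in idx:
                prod = np.prod(t[i, idx[idx != j]])
                msg_cv[i, j] = 2.0 * np.arctanh(np.clip(prod, -0.999999, 0.999999))

        # Variable-node update and output LLR (cf. (2.13), (2.14)): channel LLR plus
        # extrinsic check messages, excluding the target check's own contribution.
        total = msg_cv.sum(axis=0)
        msg_vc = H * (llr_ch + total) - msg_cv
        llr_out = llr_ch + total

    s_hat = (llr_out < 0).astype(int)
    return s_hat, not np.any(H @ s_hat % 2)
```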

2.3.3 Cycles and trapping sets

The SPA works optimally for Tanner graphs that form a tree structure, in which the variable relationships can be factored exactly, leading to the optimal solution of the MPF problem through iterative message passing. However, codes represented by cycle-free graphs have low minimum distance, and hence perform poorly. This can be explained through the following argument (see Section 2.6 in [25]).

Lemma (borrowed from Section 2.6 in [25]): A binary linear code C, with rate r and a Tanner graph forming a tree, contains at least (2r − 1)n/2 codewords of Hamming weight 2.

Proof: The graph of C contains n variable nodes (corresponding to each bit of the encoded information) and (1 − r)n check nodes. The total number of nodes in the tree is therefore 2n − nr. Hence, the average number of edges connected to each variable node is upper bounded by 2 − r. Each internal variable node (a variable node that is not a leaf node) has degree at least 2. It follows that the number of leaf variable nodes x must be at least nr (proof: x + 2(n − x) ≤ 2n − nr ⇒ x ≥ nr). Since every leaf variable node is connected to only one check node, we have at least rn − (1 − r)n = (2r − 1)n leaf variable nodes that are connected to check nodes with multiple adjacent leaf variable nodes. Each of these (2r − 1)n leaf variable nodes can be paired with another such leaf variable node, giving rise to a codeword of weight 2, for rates above one-half.

Even for codes with rate less than one-half, codes based on tree-structured Tanner graphs contain low-weight codewords. It has been observed that the SPA performance degrades due to two major artifacts of the code or its Tanner graph: one is the minimum distance of the code, and the other is the trapping sets (or stopping sets) in the Tanner graph. A trapping set T is a subset of the variable nodes V such that all neighbors of T, that is, all check nodes connected to T, are connected to T at least twice. A cycle is a trapping set, but the opposite is not always true. Trapping sets lead to situations from which the SPA fails to recover. An example of a cycle and a trapping set is shown in Figure 2.4.

[Figure 2.4: Tanner graph of the (7,4) Hamming code with the cycles marked by thick edges. The variable nodes {v₀, v₁, v₂} form a trapping set with the check nodes {c₀, c₁, c₂}.]

2.4 Neural Networks

In this section, we introduce some of the basic concepts of neural networks required to understand the rest of the thesis. For a more comprehensive text, we refer the reader to the book on deep learning by Ian Goodfellow [18].

2.4.1 Introduction

The basic idea behind supervised machine learning techniques is to learn a function f, modeled using a parameter set w, to represent the system generating the target data x. Consider a model represented by a function of non-linear activations and basis functions. In machine learning terminology, this model is called a linear regression model, which performs well for regression or classification problems:

   f(x, w) = α( Σ_{i=0}^{N−1} w_i φ_i(x) ) = α( wᵀ φ(x) )   (2.15)

where the φ_i(·) are non-linear basis functions and α(·) is a non-linear activation function. Machine learning algorithms perform regression or classification tasks by learning the model parameters w during the training phase. The training phase essentially provides the model with some experience (data) to transform itself in order to resemble the actual data-generating system. Neural networks use the same form as (2.15), but each basis function in turn becomes a non-linear function of a linear combination of the inputs. Hence, the basic neural network model can be described as a series of linear and non-linear transformations of the input. Figure 2.5 shows a simple fully-connected feed-forward neural network.
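The forward pass of such a network can be sketched as below; the layer sizes, the tanh activation and the random weights are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def forward(x, weights, activation=np.tanh):
    """Forward pass of a fully-connected feed-forward network (cf. Figure 2.5).

    Each layer applies a non-linear activation to a linear combination of the
    previous layer's output, as in (2.15).
    """
    out = x
    for W in weights:
        out = activation(W @ out)
    return out

# Example: 7 inputs -> two hidden layers of 16 units -> 7 outputs.
sizes = [7, 16, 16, 7]
weights = [rng.normal(scale=0.5, size=(sizes[i + 1], sizes[i])) for i in range(3)]
y = forward(rng.normal(size=7), weights)
print(y.shape)   # (7,)
```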

The input x is fed into the model through the input layer, which has the same number of nodes as x. At the first hidden layer, we introduce the first set of learn-able parameters w⁽¹⁾. The output of the first hidden layer is obtained by applying a non-linear activation function α⁽¹⁾(·) to linear combinations of the input. Notice that the basis function for the input layer is φ_i(x) = x_i. This output is transformed again in the second hidden layer by another set of learn-able parameters w⁽²⁾ and a non-linear activation function α⁽²⁾(·). Finally, we obtain the output y at the last layer using a last set of transformations. The training data must contain the pairs of input and output values (x, y) that are used to train the parameters w⁽¹⁾, w⁽²⁾, w⁽³⁾. The number of nodes in the input and output layers is fixed, but the number of hidden layers and the number of nodes in each hidden layer are hyper-parameters of the model that can be set to any value based on the complexity of the system. Similarly, the non-linear activation functions are another set of hyper-parameters. As the number of hidden layers increases and the network becomes deeper, the neural network model becomes capable of representing a highly complex and non-linear system. A similar effect is seen as the number of hidden nodes in the network is increased. However, as the complexity of the model grows, the number of learn-able parameters increases, and the training process therefore requires more training data to find the optimal values for these parameters.

Activation functions

The role of the activation function in a neural network is to introduce some non-linearity into the model, and to keep the output of the network within some desired range of values. In general, activation functions must be continuous (smooth with finite first-order derivatives) and finite-valued for the entire range of inputs. For a list and description of the most common activation functions, we refer to [18]. However, in our implementation we use the tanh⁻¹ activation, which leads to exploding output values as the input approaches ±1. The problem of exploding values and gradients can be circumvented by clipping the function's output to a finite range; the derivative is then clipped to match the clipped output value as well.

2.4.2 Network Training

The aim of the training process is to find the optimal values of the learn-able parameters w, such that the error in estimation or classification is minimized during the online phase of the model. However, since the model is trained to minimize the errors on the training data, the estimation errors during the online phase may not be the same, and hence one has to validate the trained model separately to qualify it for the desired online performance. The online data used for validation is called the validation data. The function designed to quantify the model's performance during training is called the loss function. The ML estimator is commonly used to evaluate a loss function for the system. A learn-able parameter is a variable subject to adjustment during the training process; the online phase of a model refers to operating the trained model on input data generated by the online or real system.

[Figure 2.5: A fully-connected neural network with 2 hidden layers. Layer outputs o_i^(k): input layer (k = 0): o_i^(0) = φ_i(x) = x_i; hidden layer (k = 1): α^(1)( Σ_i w_i^(1) o_i^(0) ); hidden layer (k = 2): α^(2)( Σ_i w_i^(2) o_i^(1) ); output layer (k = 3): α^(3)( Σ_i w_i^(3) o_i^(2) ).]

Loss function

The optimal parameters for a neural network are the ones that minimize a loss function L(x, y = f(x, w)). The loss function defines an optimization problem that is usually written as

   w_opt = argmax_w p_model(x, y = f(x, w))   (2.16)

The exact formulation of the loss function depends on the system and the problem. We will discuss loss functions specific to the problems tackled in this thesis in Chapter 3. The loss function is fundamental to the performance of a neural network. An ill-formed loss function will lead to poor performance, even if the network learns the optimal parameters for the given training data.

Regularization

Regularization helps keep the neural network from over-fitting. Different methods and techniques can have a regularization effect on the network's parameters. The most significant are weight regularization and early stopping. Weight regularization puts constraints on the values of the learn-able parameters by adding an L_p norm of all parameters to the network's loss function. The most common form is the L₂ norm, which constrains the weights so as to keep the norm of the parameter values close to 1.
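As an example of the quantities discussed above, the following sketch combines a standard binary cross-entropy loss with an L₂ weight penalty; the regularization weight lam is a hypothetical hyper-parameter, not a value used in the thesis.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Per-bit cross-entropy between target bits and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def loss_with_l2(y_true, y_pred, weights, lam=1e-3):
    """Cross-entropy plus an L2 penalty on all learn-able parameters."""
    l2 = sum(np.sum(W ** 2) for W in weights)
    return binary_cross_entropy(y_true, y_pred) + lam * l2
```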

Early Stopping Criteria

Early stopping is often necessary to keep the network from deviating from the optimal values. We use Algorithm 2.2 for early stopping of the training process. The algorithm lets the network train for n epochs before validating its performance on a validation data-set. If the validation score of the current validation test is better than that of all previous tests, then we save the weights at the current state of the network. Once the network reaches its best performance, the validation scores will start to worsen. We observe the validation scores for at least p subsequent validation tests, that is, p·n training epochs, to find a better score. If none is found, we stop the training and return the weights that gave the best validation score recorded.

Algorithm 2.2 Early Stopping Algorithm
   n = number of training epochs before a validation run
   p = number of validations with a worsening validation score to observe before giving up
   θ₀ = initial learn-able parameters
   Initialize: θ ← θ₀, i ← 0, j ← 0, v* ← ∞, θ* ← θ, i* ← i
   while j < p do
      Conduct training for n epochs, update θ
      i ← i + n
      v ← ValidationScore(θ)
      if v < v* then
         j ← 0;  θ* ← θ;  i* ← i;  v* ← v
      else
         j ← j + 1
      end if
   end while
   θ* are the best parameters; training ended at epoch i* with validation score v*
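A Python sketch of Algorithm 2.2 is shown below; model, train_n_epochs and validation_score are assumed to be objects and callables supplied by the surrounding training code, and a lower validation score is assumed to be better.

```python
import copy

def train_with_early_stopping(model, train_n_epochs, validation_score, n=5, p=10):
    """Sketch of Algorithm 2.2: train in chunks of n epochs, stop after p
    consecutive validation runs without improvement, return the best weights."""
    best_params = copy.deepcopy(model.parameters)
    best_score = float("inf")
    best_epoch, epoch, strikes = 0, 0, 0

    while strikes < p:
        train_n_epochs(model, n)          # train for n epochs, updating model.parameters
        epoch += n
        score = validation_score(model)   # lower is better
        if score < best_score:
            strikes = 0
            best_params = copy.deepcopy(model.parameters)
            best_score, best_epoch = score, epoch
        else:
            strikes += 1
    return best_params, best_epoch, best_score
```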

2.4.3 Parameter Optimization

After designing the loss function and training process, we move to the task of solving the optimization problem and finding the optimal weights w that minimize the designed loss function L(w). Changing the weights w by a small amount δw leads to a change in the loss function given by δL ≈ δwᵀ ∇L(w). The vector ∇L(w) points in the direction of the greatest rate of change of the loss function. Assuming that the designed loss function is a continuous and smooth function of w, its minimum will occur at a point where ∇L(w) = 0 (a zero gradient is also obtained at maximum or saddle-point solutions; we apply the stochastic gradient descent method to search for the global minimum). Due to the non-linearity of the loss function with respect to the weights, and the large number of points in the weight space, the solution to ∇L(w) = 0 is not straightforward. The most common method for optimization in neural networks is stochastic gradient descent. The gradient of the loss function with respect to a parameter is calculated using the back-propagation method, and the weights are then updated in the negative direction of the gradient.

Back-Propagation

The back-propagation method, or simply backprop, provides an efficient technique for evaluating gradients of the loss function in neural networks. The loss function is a function of the parameters w, and the backprop method is based on the chain rule of derivatives. For the network shown in Figure 2.5, consider the basis function φ_i(x) = x. The partial derivative of the loss function for the nth training example with respect to the ith parameter of the jth hidden layer, w_i^(j), is given by

   ∂L_n / ∂w_i^(j) = ( ∂L_n / ∂o_i^(j) ) ( ∂o_i^(j) / ∂w_i^(j) )

The second partial-derivative term is simply equal to the output of the previous layer, o_i^(j−1) (since φ_i(x) = x). The first term can be calculated by applying the chain rule again:

   ∂L_n / ∂o_i^(j) = Σ_k ( ∂L_n / ∂o_k^(j+1) ) ( ∂o_k^(j+1) / ∂o_i^(j) )

where the sum runs over all units k in layer j + 1 to which unit i in layer j sends connections. The chain rule is applied until we reach the last layer, where we calculate the gradient of the loss function with respect to the output of the last layer. Only this last partial-derivative term depends on the specific design of the loss function. However, since we are propagating the errors backwards, the design of the loss function has a significant effect on the gradients of all parameters. The final gradient is calculated as a sum of gradients over a set of training data:

   ∂L / ∂w_i^(j) = Σ_n ∂L_n / ∂w_i^(j)   (2.17)

The stochastic gradient descent method samples randomly a subset of the training data to accumulate the gradients. The new weight is calculated by shifting the old value in the negative direction of the gradient:

   w = w₀ − η Σ_n ∂L_n / ∂w   (2.18)

The hyper-parameter η is called the learning rate of the optimization process. This parameter is adaptively adjusted to enable the gradient descent algorithm to slowly move towards the optimal minimum point. We use the RMSProp optimizer for adaptive learning [18].
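The update rules (2.17) and (2.18) can be sketched as follows; the learning rate and batch size are hypothetical values, and the training set is assumed to be stored in NumPy arrays.

```python
import numpy as np

def sgd_step(weights, grads, eta=1e-3):
    """One stochastic-gradient-descent update, as in (2.18); eta is the learning rate."""
    return [W - eta * g for W, g in zip(weights, grads)]

def sample_minibatch(x_train, y_train, batch_size=32, rng=np.random.default_rng()):
    """Random mini-batch over which the gradients of (2.17) are accumulated."""
    idx = rng.choice(len(x_train), size=batch_size, replace=False)
    return x_train[idx], y_train[idx]
```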

2.4.4 Online Phase

The online, or test, phase of a neural network is when we operate the trained neural network model on inputs from the real data-generating system. In the online phase, the learned parameters of the network are fixed to their optimal trained values, and the outputs are generated by applying the trained model to the online input data. Since there is no learning during this operation, the computations in the online phase are limited to the forward pass of information through the model. During validation, we perform the same operations as in the online phase. However, in order to validate the performance, the validation data contains pairs of both inputs and outputs, which are used to calculate the validation scores.

31 Chapter 3 Neural Network Decoders Linear block codes provide an efficient representation of information by adding parity checks for error correction. The decoder solves the linear optimization problem that arises due to the correlations present in the encoded data. Optimal Maximum Likelihood (ML) decoding of linear block codes can be classified as an NP-Hard problem (see Section in [25], or [5]), and therefore it is reasonable to consider sub-optimal polynomial time solutions, including the neural networks. In this chapter, we will introduce a method to incorporate data-augmented learning to the problem of decoding linear block codes. The restriction to binary codes comes from the fact that the Neural Network Decoder (NND) is built using operations from the iterative decoder (cf. Section 2.3.2), where messages are the Log-Likelihood Ratios (LLR) of binary variables. The NND algorithm can be applied to solve a set of optimization problems similar to the decoding problem, that involves optimization of a linear objective function in a constrained system of binary variables. The optimal solution to these problems is obtained by marginalizing out all variables except the desired variable in the system. Different algorithms, such as the Belief Propagation (BP) [25] or Linear Programming Relaxation [11], have been proposed to find a polynomial time solution to the NP-Hard problem of decoding. These algorithms can be represented over the factor graphs, and solved using message passing algorithms such as the Sum Product Algorithm (SPA) [20]. The NND enables data-augmented learning over the factor graph of linear block codes (the Tanner graph) and implements the SPA for a system of binary variables. Neural networks have been very successful in the representation learning of non-linear data [4]. In [7], authors presented ways of relating decoding of error correcting codes to a neural network. Many neural network designs and algorithms for decoding emerged later on, such as feed-forward neural networks [9], Hopfield networks [10], Random neural networks [2], etc. Although these algorithms perform better than the standard SPA, they failed to scale with the length of the code. Even after the recent growth in the performance of neural network algorithms and the computational power of the modern processors, scalability of these algorithms remains as a bottleneck [12]. This curse of dimensionality is due to the fact that the number of possible words grows exponentially with the length of the code (n : C(n, k)). A decoder must learn to map all 2 n possible words to 2 k possible codewords. The neural network decoder has to be trained on a large proportion of the entire code-book for achieving satisfactory 21

results [12]. For example, in the case of a code C(50, 25), 2⁵⁰ possible words have to be mapped to 2²⁵ possible codewords. Moreover, the size of the neural network will also grow with the length of the code, adding to the complexity of the algorithm.

Recently, the authors of [22] presented a neural network based decoding algorithm that is based on the iterative decoding algorithm, the SPA. The neural network is designed as an unrolled version of the Tanner graph and performs the SPA over a fixed graph. Hence, the neural network used in this algorithm inherits the structural properties of the code, and the algorithmic properties (symmetry, etc.) of the SPA. It shows significant improvements over the SPA by learning to alleviate the effects of artifacts of the Tanner graph such as cycles or trapping sets.

Our contributions in this thesis work are as follows:

- Analysis of the various parameters affecting the training and online performance of the NND algorithm. The plethora of parameters considered in this work also provides insights into hyper-parameter selection for similar neural network algorithms designed in the context of wireless communications.
- Introduction of a new loss function for training the NND that improves performance compared to the standard cross-entropy based loss function. The new function bolsters the model towards correct predictions where the SPA shows uncertainty, but prevents pinning of parameters to extreme values due to strong SPA predictions, which are generally correct.
- Analysis of the weight distribution of the trained NND and deeper insight into the working of the NND based on this distribution. We extend the analysis provided by [22] by looking into the evolution of weights across iterations, and compare different architectures.
- Analysis of the NND's performance on different families and sizes of linear block codes, such as Hamming, BCH, polar, and LDPC codes.

The rest of this chapter is organized as follows. In the next section, we revisit the SPA in order to define its graph and operations over an unrolled version of the Tanner graph. Next, we provide a description of the NND's architecture and operations. Further, we present an analysis of the NND's hyper-parameters related to design, optimization and training, as well as their optimal selection process.

3.1 Sum-Product Algorithm revisited

In this section, we provide an alternative representation of the Tanner graph and the SPA, previously introduced in Section 2.3.2. This new representation, called the SPA over Neural Networks (SPA-NN), provides a method to implement the SPA operations (eqs. 2.13, 2.12) using neural networks. The SPA involves iterating messages forward and backward over the nodes of the Tanner graph. In the SPA-NN, the neural network nodes correspond to the edges of the Tanner graph on which the SPA messages are transmitted.

shows the SPA-NN graph and an unrolled version of the Tanner graph for the (7,4) Hamming code, corresponding to two full SPA iterations.

Consider L iterations of the SPA. Each iteration corresponds to passing the SPA messages twice (once in each direction) over the corresponding edges of the Tanner graph. This can be equivalently represented by unrolling the Tanner graph 2L times, as shown in Figure 3.1b. One drawback of the SPA-NN is that it is designed for a fixed number of iterations, whereas the SPA can be run for any number of iterations until satisfactory results are found. The hidden layers in the SPA-NN graph are indexed by i = 1, 2, ..., 2L, and the input and final output layers by i = 0 and i = 2L+1, respectively. We refer to the hidden layers corresponding to i = {1, 3, 5, ..., 2L-1} and i = {2, 4, 6, ..., 2L} as odd and even layers, respectively. The number of processing nodes in each hidden layer equals the number of edges in the Tanner graph, indexed by e = (v, c), e in E. Each hidden layer node calculates an SPA message transmitted in one of the two directions over an edge of the Tanner graph. An odd (respectively even) hidden layer node e = (v, c) computes the SPA message passing from the variable node v (respectively check node c) to the check node c (respectively variable node v). Final marginalized LLR values can be obtained after every even layer. The channel information (initial LLR values) at the decoder input is denoted by l_v, and the updated LLR information received at an edge e of any even layer is denoted by l_e. Each set of odd, even, and output layers corresponds to one full iteration of the SPA in the SPA-NN graph. Figure 3.1b shows how the edges in either direction of the unrolled Tanner graph (shown as red and blue edges) are translated into hidden layer nodes of the SPA-NN in Figure 3.1a. The neural network architecture and operations translating the Tanner graph based SPA into the SPA-NN are described in the following sections.

3.1.1 Network Architecture

The connections between different layers in the SPA-NN graph can be understood by following the flow of information over the edges of the unrolled Tanner graph. We use Figure 3.1 as an example to get a better understanding of this message flow and of the architecture of the SPA-NN graph. Consider the Tanner graph based SPA as described in Chapter 2. A single SPA iteration entails passing of information (a) from variable nodes to check nodes (red edges), (b) from check nodes to variable nodes (blue edges), and (c) to the final output at the output node (green node and edges). The extrinsic information obtained at the variable nodes in step (b) is passed on to the next iteration. Figure 3.1b converts the forward and backward information flow in the Tanner graph into a single direction of information flow in an unrolled graph.

Now consider the SPA-NN graph shown in Figure 3.1a. The initial channel information is inserted into the network at the input layer nodes (i = 0). A node in the first hidden layer of the SPA-NN (red node in Figure 3.1a), indexed as e = (v, c), represents an edge in the Tanner graph (red edge in Figure 3.1b) that sends channel information from the input node associated with variable node v towards the check node c. Notice that any node in the first hidden layer of the SPA-NN is connected to only one input layer node. The even layers of the

(a) Network graph of the SPA-NN with nodes corresponding to the edges of the Tanner graph. (b) Unrolled Tanner graph and the SPA message flow for two full iterations.

Figure 3.1: The SPA-NN and the Tanner graph for the (7,4) Hamming code, representing two full iterations of the SPA. In Figure (a), red (respectively blue) nodes correspond to odd (even) hidden layers in the SPA-NN. The output layer nodes are shown in green. The Tanner graph in Figure (b) is the unrolled version for two SPA iterations. Information flows from left to right in both graphs. The SPA message flow leading to the output LLR of variable v0 in the Tanner graph is shown by dashed lines in Figure (b). The nodes in the SPA-NN corresponding to the dashed edges of Figure (b) are shown by bold circles in Figure (a).

SPA-NN (blue nodes in Figure 3.1a) represent the passage of extrinsic information l_e over the edges connecting check nodes to variable nodes (blue edges in Figure 3.1b). That is, every even layer node e = (v, c) is connected to the nodes of the previous odd layer i-1 associated with the edges e' = (v', c) for v' != v (cf. eq. 2.12). After every iteration in the Tanner graph, we obtain the final output for that iteration by adding the channel information l_v to the updated LLR information l_e received from all check nodes (cf. eq. 2.14). This operation is performed after every even layer, at the output layers of the SPA-NN (green nodes in Figure 3.1a). The number of output layer nodes is equal to the number of variable nodes in the Tanner graph. An output layer node receives extrinsic information l_e from the corresponding nodes of the previous even layer, and the channel information l_v from the input layer nodes. Hence, each output layer node, indexed by v, has two sets of connections: one to the previous even layer nodes and one to the input layer nodes. Notice that the output layers receive extrinsic information l_e from all edges connected to the variable node v, without exception. Note that the green line in Figure 3.1b corresponds to the green nodes in Figure 3.1a.

From the second SPA iteration onwards, a variable node forwards the sum of the information it received from (a) the corresponding check nodes (except the message it sent to that check node in the previous step) and (b) the channel information (cf. eq. 2.13). Similarly in the SPA-NN, an odd hidden layer node e = (v, c) from the second iteration onwards (i > 2) receives extrinsic information l_e' from the previous even layer (i-1) associated with the edges e' = (v, c') for c' != c, as well as channel information l_v from the input layer. The channel information input to the odd layers is shown by small black rectangular boxes in Figure 3.1a.

Design Parameters

The architectural design of the SPA-NN can be represented using a set of configuration matrices. These matrices define the connections between the layers of the SPA-NN. This compact matrix description of the neural network architecture allows the SPA-NN to be formulated entirely in terms of matrix operations.

Notation: We use the following notation to define different sets of nodes in the graph:

V = set of all variable nodes in the Tanner graph of C
G = set of all check nodes in the Tanner graph of C
E = set of all edges in the Tanner graph of C
Phi(e) = {v in V : e = (v, c), c in G, e in E} = set of all variable nodes connected to a set of edges e
Sigma(e) = {c in G : e = (v, c), v in V, e in E} = set of all check nodes connected to a set of edges e

Network layer sizes: The sizes of the network layers are defined as follows:

Input layer size = output layer size = number of variable nodes = n
Hidden (odd, even) layer size = number of 1s in H: n_o = n_e = \sum_{row,col} H(row, col)
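Concretely, the edge set and the layer sizes can be read directly off the parity check matrix. The following NumPy sketch is a minimal illustration under our own naming (the particular H shown is just one common choice of (7,4) Hamming parity check matrix, not necessarily the one behind Figure 3.1); the same enumeration is used to build the configuration matrices defined next.

    import numpy as np

    def tanner_edges(H):
        """Enumerate Tanner graph edges e = (v, c) from a binary parity check matrix H.

        Returns the list of edges and the hidden layer size n_o = n_e = number of 1s in H.
        """
        checks, variables = np.nonzero(H)            # H[c, v] = 1  <=>  edge (v, c)
        edges = [(int(v), int(c)) for c, v in zip(checks, variables)]
        return edges, len(edges)

    # One common (7,4) Hamming parity check matrix, used here only as an example
    H = np.array([[1, 1, 1, 0, 1, 0, 0],
                  [1, 1, 0, 1, 0, 1, 0],
                  [1, 0, 1, 1, 0, 0, 1]])

    edges, n_e = tanner_edges(H)
    n = H.shape[1]                                   # input/output layer size
    print("n =", n, " hidden layer size n_o = n_e =", n_e)   # 7 and 12 for this H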

Configuration matrices: We now define the connections between the layers of the SPA-NN by binary configuration matrices. A matrix W_l defining the connections between two layers (l-1) and l has size [m, n], with m being the number of nodes in layer (l-1) and n the number of nodes in layer l. An element W_l(i, j) = 1 means that the ith node of layer (l-1) is connected to the jth node of layer l; W_l(i, j) = 0 means no connection between the corresponding nodes.

1. Input to hidden odd layers: W_{i2o} of size [n, n_o]

   W_{i2o}(i, j) = 1 if i in Phi(j), and 0 otherwise,   (3.1)

   where i = {1, ..., n} and j = {1, ..., n_o}.

2. Hidden odd to even layer: W_{o2e} of size [n_o, n_e]

   W_{o2e}(i, j) = 1 if j in {e in E : e = (v, c), c in Sigma(i), v in V \ Phi(i)}, and 0 otherwise,   (3.2)

   where i = {1, ..., n_o} and j = {1, ..., n_e}. (An odd layer node i = (v, c) is connected to an even layer node j = (v', c) if they share the same check node c but not the same variable node, v' != v.)

3. Hidden even to odd layer: W_{e2o} of size [n_e, n_o]

   W_{e2o}(i, j) = 1 if j in {e in E : e = (v, c'), v in Phi(i), c' in G \ Sigma(i)}, and 0 otherwise,   (3.3)

   where i = {1, ..., n_e} and j = {1, ..., n_o}. (An even layer node i = (v, c) is connected to an odd layer node j = (v, c') if they share the same variable node v but not the same check node, c' != c.)

4. Hidden even to output layer: W_{e2x} = W_{i2o}^T of size [n_e, n]

   W_{e2x}(i, j) = 1 if j in Phi(i), and 0 otherwise,   (3.4)

   where i = {1, ..., n_e} and j = {1, ..., n}.

3.1.2 Operations

The operations in the SPA-NN differ only slightly from those defined in Section 2.3.2. In the standard SPA, the information passed from a variable node to a check node at any iteration (cf. eq. 2.13) is the sum of the values obtained from the corresponding check nodes in the previous iteration. The check node then applies equation 2.12, where the operations tanh and tanh^{-1} are applied in a single step. In the SPA-NN, we separate these two operations by applying the tanh operation to equation 2.13 in the first step, and tanh^{-1} to equation 2.12

in the second step. This allows us to apply these operations as activation functions on the nodes of the SPA-NN.

Odd layer

For any odd layer i, the output at e = (v, c) is given by

x_{i,e=(v,c)} = tanh( (1/2) ( l_v + \sum_{e'=(v,c'), c' != c} x_{i-1,e'} ) )   (3.5)

where l_v is the input LLR value of variable v. For the first hidden layer, i = 1, there is no initial information from the check nodes, so x_{0,e} = 0. For any odd layer i, the output in matrix form is given by

X_i = tanh( (1/2) ( W_{i2o}^T L + W_{e2o}^T X_{i-1} ) )   (3.6)

where tanh is applied element-wise, X_i is the output vector (size n_o) of the ith odd layer, L is the vector (size n) of channel input LLR values, and X_{i-1} is the output vector of the previous even hidden layer. The information available at the input layer, X_0, is a zero vector.

Even layer

Similarly, for any even layer i, the output at e = (v, c) is given by

x_{i,e=(v,c)} = 2 tanh^{-1}( \prod_{e'=(v',c), v' != v} x_{i-1,e'} )   (3.7)

In (3.7), we have to compute the product of the elements of x_{i-1,e'} that correspond to W_{o2e} = 1, that is, the product of the messages coming from all variable nodes connected to check node c, except variable node v. To compute this with matrix operations, we apply the following transformations to obtain the product term:

1. Repeat the odd layer output row vector X_{i-1} (size n_o) n_e times as columns to form a matrix M_{i-1} of size [n_o, n_e].
2. Calculate M'_{i-1} = W_{o2e} (elementwise product) M_{i-1}.
3. Replace the zeros in M'_{i-1} with ones.
4. Calculate the product along the column elements of M'_{i-1} to obtain the vector X'_{i-1} of size n_e.

The vector X'_{i-1} corresponds to the product terms in (3.7). Now we can calculate the output of the even layer in matrix form as

X_i = 2 tanh^{-1}( X'_{i-1} )   (3.8)

Output layer

We can obtain an output at every iteration of the SPA by performing operations

at every even layer (shown in green in Figure 3.2). This enables us to obtain results from intermediate iterations and to run a parity check on the estimated codeword obtained after each iteration. The LLR output obtained after (i-1)/2 iterations is given by

\hat{l}_{v,i} = l_v + \sum_{e'=(v,c')} x_{i-1,e'}   (3.9)

where i = {3, 5, ..., 2L+1}. Transforming (3.9) into matrix form, we get

\hat{L}_i = L + W_{e2x}^T X_{i-1}   (3.10)

where i = {3, 5, ..., 2L+1}, L represents the initial channel information vector, and \hat{L}_i represents the estimated LLR after the ((i-1)/2)th iteration.

In the SPA-NN, the SPA is implemented using a neural network graph and operations. However, no learning parameters have been introduced yet, and the SPA-NN performs only fixed operations. In the next section, we introduce learnable weights over the edges of the SPA-NN in order to enable data-driven learning for improving the SPA decoding performance.

3.2 Neural Network Decoder Design

In order to incorporate data-driven learning, we introduce weights into the network defined in Section 3.1. The weights are introduced such that they can be learned using standard stochastic gradient descent methods (see Section 2.4). The activation functions at the nodes of the hidden layers, that is tanh and tanh^{-1}, have finite first and second order derivatives, so learning with stochastic gradient descent methods can be applied.

3.2.1 Network Architecture and Operations

The network architecture of the Neural Network Decoder (NND) is similar to the SPA-NN architecture described in Section 3.1. Learnable weights are introduced between the nodes of (a) the even (i-1)th and odd ith layers, and (b) the input and odd ith layers. The weights at different layers can either be trained independently or shared, leading to two different network architectures: feed-forward and recurrent. The architecture of the NND for three full iterations of the (7,4) Hamming code is shown in Figure 3.2, with learnable weights shown by dashed lines. As in the SPA-NN, we can perform parity checks at any intermediate iteration by taking a hard decision on the LLR information obtained at the output layers (shown by the green nodes in Figure 3.2). The intermediate decisions can be used to drive learning in the early layers during the neural network's training process (see Section 3.3.6).

At any odd layer l, the operations are the same as in the SPA-NN (cf. (3.6)), except that we introduce learnable weight matrices W̃_{i2o} and W̃_{e2o}, unique to each iteration, instead of the binary weight matrices W_{i2o} and W_{e2o}, respectively. At the output layers, we only introduce learnable weights between the output and the even layers (W̃_{e2x,l}) (cf. (3.10)).
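Before any weights are introduced, the fixed SPA-NN of equations (3.6), (3.8) and (3.10) can be written compactly; the NumPy sketch below is a minimal illustration under our own naming, not the implementation used in this work. The NND is then obtained by replacing the binary configuration matrices with learnable weighted versions of the same sparsity pattern, which gives the weight sets listed next.

    import numpy as np

    def spa_nn_iteration(L, X_prev, W_i2o, W_o2e, W_e2o, W_e2x, clip=15.0):
        """One full SPA iteration in the SPA-NN formulation.

        L      : channel LLR vector, shape (n,)
        X_prev : previous even-layer output, shape (n_e,) (zeros for the first iteration)
        W_*    : binary configuration matrices from (3.1)-(3.4)
        Returns the new even-layer output and the marginalized LLRs of (3.10).
        """
        # Odd layer, eq. (3.6)
        X_odd = np.tanh(0.5 * (W_i2o.T @ L + W_e2o.T @ X_prev))

        # Even layer, eq. (3.8): product of incoming messages sharing the same check node
        M = W_o2e * X_odd[:, None]          # mask the repeated odd-layer outputs
        M[W_o2e == 0] = 1.0                 # neutral element for the product
        prod = np.clip(M.prod(axis=0), -0.999999, 0.999999)
        X_even = np.clip(2.0 * np.arctanh(prod), -clip, clip)   # LLR clipping keeps tanh^-1 finite

        # Output layer, eq. (3.10)
        L_hat = L + W_e2x.T @ X_even
        return X_even, L_hat

For L iterations, the function is simply applied repeatedly, feeding X_even back in as X_prev and reading the marginalized LLRs L_hat after every call.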

Hence, the NND network has three sets of weights to learn:

W̃_{i2o,l}, l = {1, 3, ..., 2L+1},
W̃_{e2o,l}, l = {3, 5, ..., 2L+1},
W̃_{e2x,l}, l = {2, 4, ..., 2L}.   (3.11)

Each learnable weight matrix W̃_{Omega,l} is uniquely defined by the type of layers it connects (represented by Omega) and the layer number (given by l):

W̃_{Omega,l}(i, j) = w_{Omega,l}(i, j) if W_Omega(i, j) = 1, and 0 otherwise,

where Omega in {i2o, e2o, e2x}, and each w_{Omega,l}(i, j) is a distinct real weight introduced as a learnable parameter in the network. Notice that when W̃_{Omega,l} = W_Omega, the NND behaves exactly as the SPA-NN. Hence, learning the parameters optimally cannot lead to a performance loss compared to the SPA. The LLR output at any output layer is given by

beta = L + W̃_{e2x,l}^T X_l   (3.12)

where beta is the vector of output LLR values obtained at an output layer using the information received from the even layer indexed by l = 2, 4, ..., 2L. The output layer can be further transformed to obtain bit probabilities instead of LLRs. This transformation is required for applying the cross-entropy loss function between the output of the network and the target binary values (see Section 3.3.6). A sigmoid function, sigma(x) = (1 + e^{-x})^{-1}, is applied to the output layer to obtain the probability of bit y = 0 from its LLR value:

P_l = sigma(beta)   (3.13)

where P_l is the vector of bit probabilities evaluated using the extrinsic information obtained from the lth even layer, and sigma is the sigmoid function. (Note that the sigmoid of an LLR gives sigma(LLR) = (1 + exp(-ln(P(0)/(1-P(0)))))^{-1} = P(0).)

3.2.2 Computational Complexity

We analyze the computational complexity of the NND algorithm in terms of the number of basic mathematical operations, such as multiplications, sums, and activations, required for decoding a single codeword. The computations described here are specific to the online phase of the NND. The complexity of the training phase depends on the loss function and the optimization method used. Although the training process complexity is not explicitly analyzed in this work, the analysis of the graph's size and complexity carries over to it. The following parameters are useful for calculating the number of computations specific to the NND graph of a code C(n, k):

n = size of a codeword in the code,
n_o = n_e = |E| = number of edges in the Tanner graph of the code,

Figure 3.2: Neural Network Decoder graph for the (7,4) Hamming code for three full iterations, with intermediate outputs (green) at every even layer. The edges carrying learnable weights are shown with dashed lines.

n_i = number of SPA iterations in the NND

The decoder receives LLR values as input. The number of computations required for one SPA iteration of the NND, that is, one set of odd, even, and output layer operations, is given below. The computations required for basic matrix transformations, such as repeating or reshaping a matrix, are ignored. (Activation functions are implemented using approximations such as look-up tables.)

Odd-layer operations (cf. (3.6)):
- Matrix product 1: (W̃_{i2o}^T L), [n_o, n] x [n, 1], O(n_o n); multiplications: n n_o, sums: n n_o
- Matrix product 2: (W̃_{e2o}^T X_{i-1}), [n_o, n_e] x [n_e, 1], O(n_o n_e); multiplications: n_o n_e, sums: n_o n_e
- Matrix sum: (W̃_{i2o}^T L + W̃_{e2o}^T X_{i-1}), O(n_o); multiplications: 0, sums: n_o
- Scalar product: (1/2) x [n_o, 1], O(n_o); multiplications: n_o, sums: 0
- Activation: tanh applied to n_o values

Even-layer operations (cf. (3.8)):
- Matrix element-wise product: (W_{o2e} with M_{i-1}), [n_o, n_e] x [n_o, n_e], O(n_o n_e); multiplications: n_o n_e, sums: 0
- Product along the column elements of M'_{i-1}: [n_o, n_e] to [1, n_e], O(n_o n_e); multiplications: n_o n_e, sums: 0
- Activation: tanh^{-1} applied to n_e values
- Scalar product: 2 x [n_e, 1], O(n_e); multiplications: n_e, sums: 0

Output-layer operations (cf. (3.13)):

- Matrix product: (W̃_{e2x}^T X_{i-1}), [n, n_e] x [n_e, 1], O(n n_e); multiplications: n n_e, sums: n n_e
- Matrix sum: (L + W̃_{e2x}^T X_{i-1}), O(n); multiplications: 0, sums: n
- Activation: sigma applied to n values

where O() denotes computational complexity in big-O notation. Table 3.1 lists the number of operations required at each layer and the total number of operations for one complete SPA iteration in the NND. For n_i SPA iterations, the total number of operations is n_i times larger. In the case of the SPA-NN algorithm, since the weight matrices are binary, the multiplications between binary matrices and real vectors can be equivalently implemented as sums. Hence, the only multiplication operations required are those related to the cumulative product over column elements in the even layer (n_e^2 operations). Therefore, the SPA-NN performs 2n_e^2 + 2n_e(n + 1) fewer multiplications than the NND.

Table 3.1: Number of operations required to perform one SPA iteration in the NND.

Layer          Multiplications         Sums                     Activations
Odd layer      n_e(n_e + n + 1)        n_e(n + n_e + 1)         n_e tanh
Even layer     n_e(2n_e + 1)           0                        n_e tanh^{-1}
Output layer   n_e n                   n_e n                    n sigma
Total          3n_e^2 + 2n_e(n + 1)    n_e^2 + 2n_e(n + 0.5)    n_e(tanh + tanh^{-1}) + n sigma

The size of the NND graph can be described by the number of layers and the number of nodes in each layer. The number of hidden layers in the NND is twice the number of SPA iterations, that is, 2n_i. Each hidden layer has a size equal to the number of edges in the Tanner graph, n_e. The input and output layers have a size equal to the number of variable nodes in the Tanner graph, n. Hence, the total number of nodes in the NND graph for n_i SPA iterations is 2n_i n_e + 2n. The size of the graph grows linearly with the number of SPA iterations, at a rate equal to the number of edges in the Tanner graph. The number of edges in a Tanner graph depends on the type of code. For a High Density Parity Check (HDPC) code, the number of edges grows almost exponentially with the size and dimension of the code. Hence, the NND graph can become extremely large, which is a bottleneck for the scalability of the NND algorithm to longer HDPC codes. For example, the NND graph built for 5 SPA iterations of the (128,64) polar code has 12 layers, and its total number of nodes, 2n_i n_e + 2n, is dominated by the large number of edges n_e of the dense parity check matrix. Figure 3.3 shows the complexity of the NND algorithm for different sizes of HDPC codes (BCH and polar codes). The computational complexity of the product between a sparse matrix and a vector can be significantly reduced by using sparse matrix operations; however, in this work we have only considered full matrix multiplications for simplicity of the analysis.
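As a small illustration of Table 3.1, the helper below (our own sketch, using only the totals stated in the table and the node-count formula above) estimates the online cost and graph size of an NND from its parity check matrix and number of SPA iterations.

    import numpy as np

    def nnd_cost(H, n_iter):
        """Rough online-phase cost of an NND per decoded codeword, following Table 3.1."""
        n = H.shape[1]                      # codeword length / input and output layer size
        n_e = int(H.sum())                  # edges in the Tanner graph = hidden layer size
        mults_per_iter = 3 * n_e**2 + 2 * n_e * (n + 1)
        sums_per_iter = n_e**2 + 2 * n_e * (n + 0.5)
        return {
            "multiplications": n_iter * mults_per_iter,
            "sums": int(n_iter * sums_per_iter),
            "graph_nodes": 2 * n_iter * n_e + 2 * n,
        }

    # Example with the (7,4) Hamming matrix H from the earlier sketch and 5 iterations:
    # print(nnd_cost(H, n_iter=5))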

Figure 3.3: Comparison of the graph size (bars) and total number of multiplications (line) required by the NND of different codes, built for 5 SPA iterations.

3.3 Hyper-parameter Analysis

In this section, we present an in-depth analysis of various network hyper-parameters essential to the design and performance of the NND. We start by providing a list of important hyper-parameters used to set up the training process of the NND. We then present results from various tests conducted to compare the performance of the NND for different settings of these hyper-parameters. For a fair comparison of the parameters analyzed in this section, we used an NND designed for 5 SPA iterations for decoding the (32,16) polar code, trained for a fixed length of 2^18 training epochs. The decoding results of the standard SPA run for 5 iterations are also provided for reference. The experimental setup and tools used for the experiments are described in the next chapter.

3.3.1 Parameters

The NND is characterized by its parameter settings. The parameter settings used during the training process have a significant effect on the performance of the NND during the online phase. We use BER and BLER measures to quantify the test performance of the NND. We also define another metric, the Normalized Validation Score, in Section 3.3.2, to obtain a normalized measure of how well the NND performs with respect to the SPA decoder. A list of parameters is provided in Table 3.2. The parameters listed there are discussed and analyzed in the subsequent sections. In Figure 3.4, we analyze some of the parameter settings and compare their performance for the (32,16) polar and (63,45) BCH codes (both NNDs designed for 5 SPA iterations and trained for 2^18 epochs). In Figure 3.4a, the test performance of the NND differs between parameter settings, with the Feed-Forward NND (FF-NND) architecture giving the best performance when using the Energy based Multi-loss (EML) function. All results shown in Figure 3.4a use a training SNR of 2.0 dB.

However, note that the behaviour of a parameter setting on a particular code cannot be generalized to other families or lengths of codes. For example, for the (63,45) BCH code in Figure 3.4b, the best performance is obtained with a lower learning rate, an RNN architecture, and a training SNR of 3.0 dB. The analysis of the various parameters provided in this chapter helps in making decisions about appropriate values of these hyper-parameters.

Table 3.2: A list of parameters required to set up the NND for training. A typical example of parameter settings is provided for reference.

Class          Parameter                          Value                                             Typical
Design         Code (n, k) type                                                                     (32,16) polar
Design         Parity Check Matrix                Binary Matrix                                     -
Design         Number of SPA iterations           Integer                                           5
Design         Network Architecture               FF or RNN                                         RNN
Weights        Train input weights (W̃_{i2o})      True or False                                     False
Weights        Train output weights (W̃_{e2x})     True or False                                     False
Weights        Weights Initialization             Random or Fixed                                   Fixed
Optimization   Loss Function                      Cross-entropy, Syndrome Cross-entropy or Energy
Optimization   Loss Function type                 Single or Multiple                                Multiple
Optimization   Optimizer                          RMSProp                                           RMSProp
Optimization   Learning Rate                      float (< 1.0)
Training       Training codeword type             0 or random                                       0
Training       SNR Training (dB)                  float or array                                    [2.0]
Training       SNR validation (dB)                float or array                                    {-3, -2, ..., 9}
Training       Training Batch length              Integer                                           120
Training       Validation Batch length            Integer                                           600
Training       Total training epochs              Integer                                           2^18
Training       Validate after n epochs            Integer                                           500
Training       LLR Clipping                       Integer

3.3.2 Normalized Validation Score

A desired characteristic of the NND is that it should be able to perform better than the SPA. In order to quantitatively analyze and compare the performance of different choices of training hyper-parameters, we use the Normalized Validation Score (NVS) [12], given by:

NVS(zeta_t) = (1/S) \sum_{s=1}^{S} BER_NND(zeta_t, rho_{v,s}) / BER_SPA(rho_{v,s})   (3.14)

where zeta_t is the hyper-parameter setting used for training the NND and rho_{v,s} is the SNR of the validation data-set. The validation data set is created using

(a) Comparison of performance for different parameter settings for the (32,16) polar code. (b) Comparison of performance for different parameter settings for the (63,45) BCH code.

Figure 3.4: NND performance for BCH and polar codes under different parameter settings. L2 stands for l2-norm regularization of the weights with the given scale, len denotes the training length, l is the learning rate, and the SNR (dB) value given is the one used for generating the training data-set.

multiple SNR values denoted by rho_{v,s}, s = {1, ..., S}, where S is the total number of discrete SNR values considered. BER_NND and BER_SPA are the BER results of running the NND and SPA algorithms, respectively, on the validation data-set corresponding to a particular training hyper-parameter setting zeta_t and validation SNR rho_{v,s}. In other words, the NVS evaluates the decoding performance on a validation data-set created using different values of SNR, for an NND trained with a specific set of hyper-parameters. The hyper-parameter setting used for training can be any setting listed in Table 3.2. The performance of the NND is compared with the performance of the SPA on the same validation data-set, in terms of BER. When there are hyper-parameters common to the NND and the SPA, such as the number of SPA iterations, either the parameters are kept the same for both algorithms, or the SPA is run using the parameters that give it the best performance; for example, the NVS is calculated using a larger number of iterations for the SPA. The lower the value of the NVS, the better the performance of the NND compared to the SPA. We use the NVS both to compare various hyper-parameters and to validate the NND's performance.

3.3.3 Common Parameters

Here we present a list of common parameters that are either fixed or do not affect the NND performance significantly.

Code C(n, k)

An NND is designed specifically for the Tanner graph of a linear block code. The code is specified by the family it belongs to, for example BCH, Hamming, LDPC, or polar, and by its dimensions, given by the pair (n, k). The code C(n, k) is characterized by its parity check matrix H. The parity check matrix is used to create the Tanner graph of the code, which in turn determines the size and configuration of the NND graph. The hyper-parameter selection relies heavily on the network size of the NND graph, which is determined by the parity check matrix of the code.

LLR Clipping

The application of tanh^{-1} in the even layers of the NND can lead to a sudden explosion of the output LLR values, since tanh^{-1}(x) tends to infinity as x tends to 1. In order to keep the numeric LLR values within a plausible range, we clip the output of every even layer to (-c, c), where 10 <= c <= 20. LLR clipping is applied in the standard SPA as well, and does not lead to any performance loss as long as the value of c is kept in the specified range.

Weights Settings

Once the neural network graph has been designed, we need to select suitable learnable weights and initialize them properly before starting the NND training process.

Selection

There are three sets of learnable parameters in the NND: input weights (W̃_{i2o}), output weights (W̃_{e2x}), and even-to-odd edge weights (W̃_{e2o}) (see Section 3.2). The even-to-odd edge weights govern the flow of information through the edges of the Tanner graph, and the NND essentially mitigates the artifacts of the Tanner graph by assigning weights to its edges. Hence, the even-to-odd edge weights are essential to the NND, and we always include them as learnable parameters. The other two types of weights, that is, the input and output weights, are not mandatory. Removing them from the model reduces the total number of free parameters to be learned. For NND graphs where the size or complexity of the graph becomes a bottleneck for implementation and performance, we can remove these weights without significant loss in performance. Figure 3.5 compares the BER performance of two NND graphs for the (32,16) polar code, one trained with all learnable weights and the other trained only with even-to-odd edge weights. The performance improvement from training the input and output weights is not very significant.

Figure 3.5: Comparison of BER performance for NNDs trained with different selections of learnable weights.

Initialization

The learnable weights of the neural network (cf. (3.11)) have to be initialized before we start the training process. We propose two ways of initializing them. One way is to draw them from a random normal distribution with a specific mean and variance. The other is to initialize all weights to a fixed value (W̃ = W), making the

network perform identically to the SPA initially, and learn to improve from there. The choice of initialization affects the convergence of the training process. As the SPA performance is close to the optimal decoder performance (optimal in the case of graphs without cycles), initializing the weights such that the NND performs identically to the SPA leads to faster convergence.

Quantization

The learnable weights in the NND are real numbers, quantized to a certain level of floating point precision. Finer quantization means higher precision, and hence more computation. Conversely, coarser quantization leads to a loss in performance but a reduction in computation. A recent study shows that the performance loss incurred by reduced precision can be limited significantly by using proper quantization [17]. In this work, as the Tensorflow software package only supports relatively high levels of floating point precision (32-bit and 64-bit), we use 32-bit floating point quantization.

Optimizer

The loss function defines the gradients for the optimization process. The optimization of neural networks uses gradient descent methods to find optimal parameters (see Section 2.4). We use the RMSProp optimizer for all the experiments conducted in this work. The choice of optimizer does not affect the training process significantly; hence, we do not perform any analysis focused on the selection of the optimizer. The gradients in the optimization process are calculated symbolically by the Tensorflow software [1]. Since all operations in the NND are differentiable, the gradients can be calculated using the chain rule of derivatives. This process is handled internally by Tensorflow, which creates symbolic gradients for each operation in the graph. For more information, please refer to [1].

3.3.4 Number of SPA iterations

The NND graph is designed for a fixed number of SPA iterations. The size of the graph and the complexity of the NND algorithm are directly proportional to the number of SPA iterations the graph is built for (see Section 3.2.2). As the number of iterations grows and the NND graph becomes bigger, the model has more learnable parameters. The approximation capability of a neural network grows with the number of free parameters in the model [16]. Therefore, as we increase the number of SPA iterations in the NND, the performance should improve. This behaviour is confirmed in Figure 3.6 for the (32,16) polar code. It can be seen in the figure that the performance gain between 5 and 10 SPA iterations is not very significant compared to the increase in complexity of the corresponding NND graphs. The selection of the number of SPA iterations in the NND depends on this trade-off between performance and complexity. Notice that even after 10 SPA iterations, the NND performance is far from the ML performance. The reason for this behavior is two-fold. First, the connections between different layers in the NND are enforced to a specific configuration, as the NND is built under the constraints of the Tanner graph. This restricts the NND

Figure 3.6: Comparison of NND performance for different numbers of SPA iterations.

from performing like a fully-connected feed-forward neural network, which may achieve ML performance in some settings [12]. Second, as the number of SPA iterations (or hidden layers) in the NND increases, the network becomes deeper. The deeper the network becomes, the harder it is for errors to propagate back to the early layers of the network. Therefore, growing the size of the network beyond a certain point may not lead to any improvement in the performance of the NND. We analyze this behavior by training the (32,16) polar code NND for 20 SPA iterations. We compare the performance of this 20-iteration network with a similar network built using the weights of an NND trained for only 5 SPA iterations; this 5-iteration NND is unrolled 4 times to create a 20-iteration network, keeping the weights of the 5-iteration training. Figure 3.7 shows the performance comparison for these two NNDs, together with the standard SPA for 20 iterations. As we can see, the network trained for 20 iterations performs similarly to, or sometimes worse than, the network trained for only 5 iterations. This shows that training a deeper network does not always improve its performance.

3.3.5 Network Architecture

The NND architecture (see Section 3.2.1) defines how the learnable weights are configured across different SPA iterations. Two types of architecture arise: feed-forward, with independent weights, and recurrent, with shared weights.
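The difference between the two architectures lies only in how the learnable weights are allocated across iterations. The sketch below is a minimal illustration under our own naming (one scalar weight per learnable connection, initialized to 1 so that the untrained decoder matches the SPA-NN): the feed-forward variant owns an independent weight vector per iteration, while the recurrent variant shares a single vector.

    import numpy as np

    def init_nnd_weights(n_weights, n_iterations, architecture="RNN"):
        """Allocate one set of learnable weights (e.g. those of W_e2o) for an NND.

        n_weights : number of learnable weights per iteration,
                    e.g. the number of nonzero entries of the W_e2o configuration matrix.
        'FF'      : independent weights for every iteration (feed-forward NND).
        'RNN'     : a single weight vector shared by all iterations (recurrent NND).
        """
        if architecture == "FF":
            return [np.ones(n_weights) for _ in range(n_iterations)]
        if architecture == "RNN":
            shared = np.ones(n_weights)
            return [shared] * n_iterations   # every iteration refers to the same array
        raise ValueError("architecture must be 'FF' or 'RNN'")

    weights_ff = init_nnd_weights(n_weights=48, n_iterations=5, architecture="FF")
    weights_rnn = init_nnd_weights(n_weights=48, n_iterations=5, architecture="RNN")
    print(len(np.unique([id(w) for w in weights_ff])),    # 5 independent parameter vectors
          len(np.unique([id(w) for w in weights_rnn])))   # 1 shared parameter vector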

Figure 3.7: Comparison of performance for two NND networks, one trained for 20 iterations and one trained for 5 iterations but unrolled 4 times to operate over 20 SPA iterations. The performance of both NNDs is compared with the SPA for 20 iterations.

Feed-forward Architecture

Each SPA iteration in the NND is characterized by a set of operations involving learnable parameters (W̃_{Omega,l}). The feed-forward architecture based NND (FF-NND) allows the parameters to be learned independently for each iteration. In other words, the back-propagation algorithm of a feed-forward neural network calculates gradients independently for every learnable parameter in every iteration, leading to a higher degree of freedom in the model. Consequently, the training process of the FF-NND model takes longer to converge to the optimal values.

Recurrent Architecture

Since the NND repeats the same connections at each iteration, the learnable parameters of each iteration can be shared. This sharing of parameters leads to a Recurrent Neural Network architecture. The recurrent architecture puts an additional constraint on the parameters to be learned, which has an effect similar to regularization in neural networks [18]. This architecture has fewer learnable parameters in total and leads to faster training convergence. However, it also restricts the degrees of freedom of the model during training. The Recurrent Neural Network architecture based NND (RNN-NND)

performs similarly to the SPA: in both algorithms, the operations in each SPA iteration are exactly the same. In terms of performance, an RNN-NND trained on a graph corresponding to L SPA iterations performs at its best for L SPA iterations, whereas the SPA performance as a function of the number of iterations is characterized by the density evolution of the LLR messages in its Tanner graph [25].

Figure 3.8: Comparison of BER performance and training convergence for the feed-forward and recurrent neural network architectures in the NND.

Comparison

Figure 3.8 compares the feed-forward and recurrent architectures (both trained for 2^18 epochs) in terms of BER performance and training loss convergence for the (32,16) polar code. It is clear from the plot that the RNN-NND converges faster, but has a worse test performance than the FF-NND. In terms of complexity, both architectures require the same number of operations in the online phase, while in the training phase the gradient computations are more complex for the RNN-NND and can create vanishing or exploding gradient issues [18]. A comparative analysis of the trained weights of the RNN-NND and FF-NND is provided in Section 4.2.

3.3.6 Loss functions

The NND training process aims to find the optimal set of parameters that leads to ML decoding. The problem can be formally defined as follows:

\hat{theta}_{ML} = argmax_theta P_m(r | y; theta)   (3.15)

where P_m is the probability distribution of the model output for the given data and parameters, theta is the set of all parameters in the model, r is the received signal vector, and y is the target value.
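In practice the parameters are not obtained from (3.15) directly, but by minimizing an empirical loss with a gradient-based optimizer such as RMSProp. The TensorFlow 1.x style sketch below is a hedged illustration of that training step (the tensor names are our own, and a trivial one-weight-per-bit scaling stands in for the full unrolled NND graph); the losses that can be plugged in here are derived in the remainder of this section.

    import tensorflow as tf

    n = 32                                             # codeword length of the example code
    llr_in = tf.placeholder(tf.float32, [None, n])     # channel LLRs fed to the decoder graph
    y = tf.placeholder(tf.float32, [None, n])          # target bits

    # Stand-in for the NND graph: a single learnable scaling of the input LLRs.
    # In the real decoder, beta is the output of the unrolled, weighted SPA-NN graph.
    w = tf.Variable(tf.ones([n]))
    beta = w * llr_in                                  # output LLRs

    p = tf.sigmoid(beta)                               # bit probabilities, cf. (3.13)
    eps = 1e-12                                        # numerical safety for the logarithms
    loss = tf.reduce_mean(-(y * tf.log(1.0 - p + eps) + (1.0 - y) * tf.log(p + eps)))

    train_op = tf.train.RMSPropOptimizer(learning_rate=0.001).minimize(loss)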

If we consider i.i.d. codewords transmitted over an AWGN channel, the received signals will be i.i.d. as well. Equation (3.15) can then be decomposed as:

\hat{theta}_{ML} = argmax_theta \prod_{i=1}^{N} P_m(r(i) | y(i); theta) = argmax_theta \sum_{i=1}^{N} log P_m(r(i) | y(i); theta)   (3.16)

where the second equality follows from the fact that log is a monotonically increasing function. For neural networks trained on a limited set of data, we obtain the optimal parameters for an empirical distribution \hat{P}_{data} of the training data y:

\hat{theta}_{ML} = argmax_theta E_{y ~ \hat{P}_{data}} log P_m(r | y; theta)   (3.17)

where E_{y ~ \hat{P}_{data}} is the expectation over the training data. The loss function for the training process must be designed such that the NND can find its optimal parameters by training on a limited set of data, yet perform ML decoding during its online phase. We propose different loss functions and analyze them on the basis of the performance of the trained network on the validation data (see Section 3.3.8).

Cross Entropy based Loss function

The optimization problem formed by minimization of the cross entropy loss function (cf. (3.17)) is equivalent to training the model such that the probability distribution of the model's outputs comes as close as possible, in the Kullback-Leibler (KL) distance, to the probability distribution of the training data. That is,

\hat{theta}_{ML} = argmax_theta E_{y ~ \hat{P}_{data}} log P_m(r | y; theta)
               = argmin_theta [ -E_{y ~ \hat{P}_{data}} log P_m(r | y; theta) ]
               = argmin_theta E_{y ~ \hat{P}_{data}} [ log \hat{P}_{data}(r | y) - log P_m(r | y; theta) ]
               = argmin_theta D_{KL}( \hat{P}_{data} || P_m )   (3.18)

where D_{KL}(p || q) is the KL distance between two probability distributions. This shows that minimizing the loss function given by -E_{y ~ \hat{P}_{data}} log P_m(r | y; theta) drives the probability distribution of the output close to the target probability distribution in the KL sense. For Bernoulli distributed target variables y, the loss function takes the form

of a cross-entropy loss function,

L^{CE}_f(p, y) = -(1/N) \sum_{n=1}^{N} log P_m(r | y; theta)
              = -(1/N) \sum_{n=1}^{N} log( (1 - p(n))^{y(n)} (p(n))^{1-y(n)} )
              = -(1/N) \sum_{n=1}^{N} ( y(n) log(1 - p(n)) + (1 - y(n)) log p(n) )   (3.19)

where p(n) is the estimated probability of y(n) = 0 obtained from (3.13), and y is the binary vector of the target codeword. The subscript f in L^{CE}_f denotes that the loss function is based only on the final output of the network. In order to calculate the derivative of this loss function, we decompose (3.19) for a single bit using the sigmoid function p(n) = 1/(1 + e^{-beta(n)}), where beta(n) is the output LLR of the nth bit obtained at the output layer, given by (3.12):

L^{CE}_f(n) = -( y(n) log( e^{-beta(n)}/(1 + e^{-beta(n)}) ) + (1 - y(n)) log( 1/(1 + e^{-beta(n)}) ) )
           = log(1 + e^{-beta(n)}) + y(n) beta(n)   (3.20)

Applying the chain rule of derivatives, we can find the gradient of the loss function at the final output layer with respect to a parameter W̃(i, j) as

dL^{CE}_f / dW̃(i, j) = (1/N) \sum_{n=1}^{N} ( dL^{CE}_f(n)/dbeta(n) ) ( dbeta(n)/dW̃(i, j) )   (3.21)

The derivative of L^{CE}_f(n) with respect to beta(n), using (3.20), is given by

dL^{CE}_f(n) / dbeta(n) = y(n) - 1/(1 + e^{beta(n)})   (3.22)

The derivative of beta(n) with respect to the parameter W̃(i, j), that is dbeta(n)/dW̃(i, j), can be calculated from (3.12). The gradient of the loss function can thus be decomposed into products of the partial derivatives of the nested functions leading up to the parameter W̃(i, j). The first partial derivative, dL^{CE}_f(n)/dbeta(n), gives the gradient induced by the specific loss function used. The second (and subsequent) partial derivative term, dbeta(n)/dW̃(i, j), is the same for different loss functions, as the function beta(n) is independent of the loss function (cf. (3.12)). Hence, we can analyze the effect of different loss functions by studying the first partial derivative term in (3.21).

Cross Entropy based Multi-loss function

We discussed in Section 3.2 that the NND can output bit estimates intermediately, after every even layer of the network. A multi-loss function adds the losses from these intermediate outputs in order to enforce learning of the parameters in earlier layers. The cross entropy multi-loss function is given by

L^{CE}_m(p, y) = -(1/(NL)) \sum_{l=2,4,...}^{2L} \sum_{n=1}^{N} ( y(n) log(1 - p(l, n)) + (1 - y(n)) log p(l, n) )   (3.23)

where p(l, n) is the network output probability of the nth bit at the (l+1)th output layer (cf. (3.13)), and l = {2, 4, ..., 2L} is the index of an even layer of the NND. The subscript m stands for multi-loss. The multi-loss function leads to faster training and an overall improvement in decoding performance. An additional computational cost is incurred during training with a multi-loss function, compared to the single-output loss function. However, since the number of weights and operations of the NND is the same in both cases, there is no additional cost during the online phase.

Syndrome Check

We can perform the syndrome check (y_hat H^T = 0) at every intermediate output layer to determine whether a valid codeword has been found. If the syndrome check is satisfied, we do not have to iterate any further and can stop. We incorporate this idea into the loss function as follows:

L^{SC}_f(p, y) = -(1/N) \sum_{n=1}^{N} ( y(n) log(1 - p(n)) + (1 - y(n)) log p(n) )   (3.24)

where p(n) is the network output probability of the nth bit at the output layer where the syndrome check is satisfied. For the multi-loss case,

L^{SC}_m(p, y) = -(1/(MN)) \sum_{l=2,4,...}^{2M} \sum_{n=1}^{N} ( y(n) log(1 - p(l, n)) + (1 - y(n)) log p(l, n) )   (3.25)

where p(l, n) is the network output probability of the nth bit at the (l+1)th output layer. If the syndrome check is satisfied at layer 0 < k < 2L, then 2M = k, else 2M = 2L. The loss function with syndrome check leads to a slight improvement in BLER at low SNR values compared to the plain cross entropy based loss functions. However, the BLER (and BER) performance at high SNR is worse for the syndrome check based loss functions. Figure 3.9 compares the loss functions with and without syndrome check for the (32,16) polar code. The performance degradation of the syndrome check based loss function at high SNR is due to the fact that high SNR leads to fewer errors, causing the syndrome check to be passed at earlier layers of the NND (possibly even at the input layer). This leaves very few cross entropy loss terms in the loss function. Hence, at high SNR, a network trained with the syndrome check loss function does not learn as well to push the bit probabilities towards the correct values.
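A minimal NumPy sketch of the syndrome-check multi-loss of (3.25) is given below (our own helper names; it assumes the per-layer bit probabilities have already been collected from the intermediate output layers and uses hard decisions on them for the syndrome check).

    import numpy as np

    def syndrome_check_multiloss(P_layers, y, H):
        """Cross-entropy multi-loss with syndrome-based early stopping, cf. (3.25).

        P_layers : list of per-even-layer bit probability vectors p(l, :), each of shape (N,)
        y        : target codeword bits, shape (N,)
        H        : parity check matrix, shape (n-k, N)
        """
        eps = 1e-12
        losses = []
        for p in P_layers:                       # l = 2, 4, ..., 2L in order
            losses.append(-np.mean(y * np.log(1.0 - p + eps)
                                   + (1.0 - y) * np.log(p + eps)))
            y_hat = (p < 0.5).astype(int)        # p is the probability of bit = 0
            if np.all(H @ y_hat % 2 == 0):       # syndrome satisfied: stop adding loss terms
                break
        return float(np.mean(losses))            # average over the M even layers actually used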

Figure 3.9: BLER performance comparison for loss functions with and without syndrome check.

Energy based loss function

Bruck and Baum showed in [7] that ML decoding of a codeword y of the code C(n, k) is equivalent to maximizing the energy function E, defined as

E(w, y) = \sum_{n=1}^{N} w(n) t(n)   (3.26)

where w, t in {-1, +1}^N, with w = (-1)^r for the received word r in {0, 1}^N, and t = (-1)^y. Finding the maximizer of E is an NP-hard problem. In order to find an approximate solution, we relax the condition on w from {-1, +1} to (-1, +1), that is, we let w(n) be a real number between -1 and +1. The output of the NND (cf. (3.13)) is the probability of the bit y(n) = 0, and the BPSK mapping is {0, 1} to {+1, -1}. We can convert the probability p(n) to the range (-1, +1) using the simple linear transformation w(n) = 1 - 2p(n). The loss function can then be formulated as the negative of the energy function E(w, y), given by

L^E_f(p, y) = (1/N) \sum_{n=1}^{N} (1 - 2p(n)) (-1)^{y(n)}   (3.27)

where p(n) is the network output probability of the nth bit at the final output layer (cf. (3.13)). The energy based loss function for a single bit can be written in terms of the LLR output as

L^E_f(n) = (1 - 2p(n)) (-1)^{y(n)}
         = (1 - 2(1 + e^{-beta(n)})^{-1}) (-1)^{y(n)}
         = ((1 - e^{beta(n)})/(1 + e^{beta(n)})) (-1)^{y(n)}
         = -tanh(beta(n)/2) (-1)^{y(n)}   (3.28)

The first partial derivative of (3.28) with respect to the LLR output is given by

dL^E_f(n)/dbeta(n) = -(1/2) [ 1 - tanh^2(beta(n)/2) ] (-1)^{y(n)}   (3.29)

The energy based loss function trains the NND to output probabilities close to 0.5, on the correct side of 0.5. For the multi-loss case, (3.27) becomes

L^E_m(p, y) = (1/(MN)) \sum_{l=2,4,...}^{2M} \sum_{n=1}^{N} (1 - 2p(l, n)) (-1)^{y(n)}   (3.30)

where p(l, n) is the network output probability of the nth bit at the (l+1)th output layer (cf. (3.13)).

Comparison of Cross entropy and Energy based loss functions

The cross entropy loss function puts strong weights on the edges to the hidden units, which pins their activations towards extreme LLR values (-infinity or +infinity). This makes it impossible to propagate errors back through these hidden units. The energy based loss function, on the contrary, tries to keep the output LLRs close to 0. Strong LLR outputs from the SPA generally give correct estimates; the false estimates usually end up in a region of uncertainty, close to 0. The characteristic of the energy based loss function of placing correct estimates close to the region of uncertainty prevents the network from altering the strongly estimated LLR outputs. Figure 3.10 shows the loss and the gradient of both functions for a target bit y = 0. The correct estimate for this bit is obtained if the NND outputs a positive LLR value. The cross entropy loss function adds a heavy penalty for wrong estimates, while the energy based loss function keeps the penalty constant beyond a certain LLR. Similarly, from the gradient plot we can infer that the cross entropy loss function makes significant changes to the parameters that lead to strongly incorrect estimates, whereas the energy based loss function keeps the gradient constant for strongly estimated outputs. This leads to an overall improvement in the performance of the NND trained with the energy based loss function compared to the cross entropy loss function. Experiments conducted on the (32,16) polar code, shown in Figure 3.11, confirm this hypothesis.
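The saturation behaviour described above is easy to reproduce numerically. The short sketch below (our own illustration) evaluates the per-bit cross-entropy loss of (3.20) and the per-bit energy based loss of (3.28) over a range of output LLRs for a target bit y = 0.

    import numpy as np

    beta = np.linspace(-10.0, 10.0, 9)          # output LLRs; positive means "bit = 0"
    y = 0                                       # target bit

    ce_loss = np.log1p(np.exp(-beta)) + y * beta            # eq. (3.20)
    energy_loss = -np.tanh(beta / 2.0) * (-1.0) ** y        # eq. (3.28)

    for b, ce, en in zip(beta, ce_loss, energy_loss):
        print(f"beta={b:6.1f}  cross-entropy={ce:7.3f}  energy={en:7.3f}")
    # The cross-entropy loss keeps growing for strongly wrong (very negative) LLRs,
    # while the energy based loss saturates at +1.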

Figure 3.10: Comparison of the cross entropy and energy based loss functions for the LLR output of a single bit, given the target bit y = 0.

Figure 3.11: BER performance comparison of the energy based and cross-entropy loss functions for the (32,16) polar code.

3.3.7 Learning rate

The learning rate of the optimization process decides how fast the gradient descent algorithm approaches the minimum of the loss. If the learning rate is too high, the optimizer might fail to converge; if it is too low, it takes a very long time to reach the minimum. Adaptive optimization techniques such as RMSProp use information about the change of the gradients in previous steps to adapt the learning rate during the training

process. Nonetheless, a poor initial setting of the learning rate adversely affects the optimization process, even with adaptive optimization. We choose the initial learning rate based on the number of learnable parameters in the network. For small NND graphs, for example in the case of the (32,16) polar code, a relatively high initial learning rate leads to convergence in less than 100,000 training epochs, while for large networks, for example the (128,64) polar code, a much lower initial learning rate is required to achieve good results. If the NND has a large number of learnable parameters, training with a high learning rate may lead to a failure of the optimization process to converge.

3.3.8 Training and Validation Data

The input data for training the decoder are obtained by transmitting a set of codewords through a channel corrupted with AWGN. The magnitude and the patterns of the errors induced by the channel are functions of the noise variance (sigma, or equivalently the SNR) and of the randomness of the noise process, respectively.

Training Data

A major simplification in the selection of training data comes from the fact that the performance of the SPA is independent of the transmitted codeword. The decoder output at the variable or check nodes in any iteration of the SPA is only a function of the error patterns in the channel input (LLRs). The performance of the decoder does not depend on the exact values of the transmitted bits; what matters is how the channel induces error patterns on the transmitted signal. Consequently, we are free to choose any codeword to analyze the performance of the SPA (please refer to definition 4.81 in [25] for a proof of this property). In the NND, the weights introduced over the edges of the Tanner graph do not alter the structure and flow of information in the SPA decoder, and hence this property holds for the NND as well. For simplicity of implementation, we choose to train the network using the all-zero codeword. Decoders designed using fully-connected feed-forward neural networks do not satisfy this property, and hence they require training over a large proportion of the entire code-book [12], which makes it infeasible to train them for codes of larger dimension. In the NND, this simplification, due to the Tanner graph based design, leads to a significant reduction in the size and complexity of the training data. This is one of the reasons why the NND algorithm scales better than previous algorithms. However, the design of the NND also restricts its learning capabilities; for example, the NND performance does not reach the ML threshold even if it is trained on data spanning the entire code-book (see Section 3.3.4).

Validation Data

For validation of the network's performance, we use a fixed data-set generated in a similar way as the training data. The validation data-set is constructed using random codewords and a wide range of SNR values, to represent a more realistic set of data. Hence, the decoder's performance on the validation data-set gives a reliable indication of the performance of the decoding algorithm during its online execution.
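Both data-sets are produced by the same BPSK/AWGN pipeline; the sketch below (our own naming, assuming the BPSK mapping 0 to +1 and the standard LLR expression 2r/sigma^2 for an AWGN channel, with the noise standard deviation from (3.31) below) generates a training batch of all-zero codewords directly in LLR form. Validation batches additionally require encoding random information words, which is omitted here.

    import numpy as np

    def training_batch_llr(n, batch_size, snr_db, rate, rng=None):
        """Generate channel LLRs for a batch of all-zero codewords over BPSK/AWGN.

        n       : codeword length
        snr_db  : Eb/N0 in dB used to generate this batch
        rate    : code rate k/n (needed to convert Eb/N0 to the noise variance)
        """
        if rng is None:
            rng = np.random.default_rng()
        sigma = np.sqrt(1.0 / (2.0 * rate * 10.0 ** (snr_db / 10.0)))  # cf. (3.31), E_s = 1
        x = np.ones((batch_size, n))              # all-zero codewords -> BPSK symbols +1
        r = x + sigma * rng.standard_normal((batch_size, n))
        llr = 2.0 * r / sigma**2                  # channel LLRs of the received samples
        y = np.zeros((batch_size, n))             # training targets (all-zero bits)
        return llr, y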

The SNR values and the size of the data-set need to be specified to generate the validation data-set. The SNR values are specified in decibels (dB) as a set of discrete values. The data-set size is specified by an integer denoting the number of codeword signals generated for each discrete value in the SNR set. Unrealistic SNR values or a small data-set may lead to unreliable scores, while a large data-set may slow down the training process significantly. The choice of these parameters is subjective, and a standard method may be required in the future.

Batch Processing

At every training epoch, the network receives a batch of training data as input. Hence, for n_t training epochs and a batch containing n_b training codewords, the network is trained on n_t n_b codewords in total. The stochastic gradient descent method for training requires a batch of training data as input (see Chapter 8 in [18]). A larger batch leads to more accurate gradients (averaged over a larger set of data), but requires more computation per training epoch. Additionally, when the training data is generated from a range of SNR values, a single batch of data can cover the entire SNR range by splitting the batch into portions generated from uniformly selected SNR values in the range.

SNR

The SNR (or sigma) of the AWGN channel is directly related to the pattern of errors in the received signals. The relationship between the SNR and the noise standard deviation sigma of the transmission channel (AWGN, BPSK) can be written as follows:

sigma = \sqrt{ E_s / ( 2 R 10^{SNR_dB/10} ) }   (3.31)

where E_s = 1 is the energy of the transmitted signal, R is the rate of the code, and SNR = E_b/N_0 is usually given in decibels (dB), that is, SNR_dB = 10 log_10(SNR).

A desired characteristic of the NND is that it should be able to perform optimally for any plausible input data, obtained from any arbitrary channel SNR, during online execution. However, in our experiments we have observed that the SNR values used for generating data during the training phase have a significant effect on the online performance of the NND. Training at low SNR leads to too many errors in the input, preventing the NND from learning the structure of the code constraints in the Tanner graph. Conversely, training at a very high SNR leads to too few errors, and does not expose the network to enough errors that are uncorrectable by the SPA. Hence, it is important to find suitable SNR values for the training process, such that the network is exposed to different error patterns and learns to correct them. In order to compare the performance of the NND for different values of the training SNR, we use the NVS metric introduced in Section 3.3.2. The training SNR can be a single fixed value or a set of values. A fixed training SNR is a single real value kept fixed for the entire training data. For a training data-set created using a set of discrete SNR values, the data in each training batch is created using SNR values picked from the set. Different

distributions of SNR can be considered to obtain more sophisticated data-sets. In our experiments, we have considered a discrete uniform set of SNR values and generated equal portions of each training batch from each value in this set. Hence, every batch of training data contains an equal number of codewords transmitted at each SNR in the set. For the (32,16) polar code trained with both a fixed SNR and a set of SNR values, a plot of the NVS is shown in Figure 3.12. In this experiment, we generate the validation data set using 20 different values of SNR (SNR = {-3, -2.5, -2, ..., 6.5} dB), that is, S = 20. It is clear from the plot that the optimal choice of the training SNR lies somewhere between -2 dB and 4 dB. Training on data generated using a fixed SNR of 1.0 dB, or a varying SNR range of (-2.0, 4.0) dB, gives the best performance on the validation data used in this experiment. Note that the best training SNR may differ between code families and code sizes. Determining the optimal SNR for a code is a lengthy process, and further study is required to understand its implications in more detail. However, one can safely choose to create the training data from an SNR range of about (-1.0, 5.0) dB and train on a large training data-set, so that the network is exposed to different error patterns in each training epoch.

Figure 3.12: Comparison of SNR values for training the (32,16) polar code.

Training Length and Stopping Criteria

The performance of the NND usually improves as we train on more data. Figure 3.13 shows the BER performance for increasing training lengths. The performance improvement is significant in the earlier stages of training, but eventually the network reaches a stable convergence point. However, if the learning rate is too high, the network might not reach a stable convergence point, and the performance of the NND might become worse with more training.
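The comparison above, and the stopping criterion described next, both rely on the NVS of (3.14). A minimal sketch of its computation is given below (our own helper names; the BER arrays are assumed to come from decoding the same validation data with the NND and with the SPA).

    import numpy as np

    def normalized_validation_score(ber_nnd, ber_spa):
        """NVS of (3.14): average ratio of NND BER to SPA BER over the validation SNR points.

        ber_nnd, ber_spa : arrays of BER values, one entry per validation SNR value rho_{v,s}.
        A value below 1.0 means the trained NND outperforms the SPA on average.
        """
        ber_nnd = np.asarray(ber_nnd, dtype=float)
        ber_spa = np.asarray(ber_spa, dtype=float)
        return float(np.mean(ber_nnd / ber_spa))

    # Hypothetical example: validation BERs at S = 4 SNR points
    print(normalized_validation_score([2e-2, 6e-3, 9e-4, 8e-5],
                                      [3e-2, 1e-2, 2e-3, 2e-4]))   # < 1.0: NND better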

The NND is trained until the number of training epochs reaches a maximum value, or until the stopping criterion (cf. Section 2.4.2) is satisfied. The stopping criterion is based on the NVS for a validation test set, created using SNR values from the set {-3, -2.5, ..., 6.0} dB. The stopping criterion may result in different training lengths in different training runs. Hence, when comparing NND parameters, a fixed training length (2^18 epochs) is used to keep the comparison fair.

Figure 3.13: Comparison of BER performance for different numbers of training epochs.

3.4 Summary

In this section, we provide a summary of the chapter on Neural Network Decoders. We started with an alternative formulation of the SPA, which enabled us to implement it using neural networks. This method, called the SPA-NN, essentially maps the SPA messages flowing over the edges of the Tanner graph to nodes in a neural network. The SPA-NN operations at the odd and even hidden layers follow the same form as the SPA operations in either direction of the Tanner graph, with minor adjustments. The connections between the different layers of the SPA-NN are motivated by the flow of SPA messages in the Tanner graph; we define them using configuration matrices that help in formulating the operations in matrix form. The NND has a similar structure to the SPA-NN, but incorporates learnable parameters over the edges between the hidden layers. We discussed the architectural and operational changes transforming the SPA-NN into the NND. We also provided an approximate analysis of the complexity of the NND with respect to the parameters of the corresponding code: the complexity of the NND is directly proportional to the number of SPA iterations and to the number of edges in the Tanner graph of the code.

We provide an extensive analysis of different hyper-parameters specific to the design and training of the NND. The performance of the NND is affected by these hyper-parameters, and their optimal settings may vary for different families and sizes of codes. The selection of hyper-parameter settings largely depends on the trade-off between the complexity and performance of the NND. Some hyper-parameters, such as the weight initialization, loss function, optimizer, learning rate, or training data type, can either be fixed or chosen safely from a set of values without affecting the performance of the NND significantly. Other hyper-parameters, such as the SNR of the training data, affect the NND performance significantly and require a deeper understanding. We provide a method to quantitatively motivate the choice of such hyper-parameters. Our study is supported by results from extensive experimentation, which we have presented throughout this chapter.

Chapter 4 Experiments and Results

In this chapter, we present the experimental setup and decoding results for the NND algorithm. We start by introducing the tools, software, and methods used to perform the experiments. Next, we take a closer look at the weights of the trained NND and analyze the behavior and properties that lead to its performance improvement. This analysis helps in developing a deeper understanding of the decoding capabilities of the NND. In the previous chapter, we gave an in-depth analysis of the different hyper-parameters of the training process, and the selection of hyper-parameters for different codes is based on that study. In this chapter we provide the list of hyper-parameters selected for the results presented, but do not discuss the motivations behind this selection. The focus of this chapter is to analyze the decoding capabilities of the NND for different families and sizes of codes.

4.1 Experimental Setup

In this section, we introduce the experimental setup so that the results presented in this work can be replicated. The experiments involving neural networks proceed in a sequence of steps. Initially, suitable hyper-parameters are selected for the specific code. Then we conduct the training and let the network converge until a stopping criterion is reached. If the final validation scores (NVS) at the end of the training are poor, we change the parameters based on the behavior observed during training. Finally, once the network is trained, we perform tests for analyzing the decoding performance of the trained NND. The training phase is usually much longer, and requires various adjustments to the hyper-parameters before the best setting can be found. Specific details of training and testing are described in the following subsections. We start with an introduction to the various open-source and custom-built tools and software used to train and test the NND.

Tools and Software

The neural network is trained and tested using the open source machine learning library TensorFlow (ver. 1.2) [1]. The experiments are

performed using multiple NVIDIA GeForce GTX 1080 Graphics Processing Units (GPUs); the resources are provided by Ericsson Research. We have used the Python (ver. 2.7) language to develop the framework and to call the TensorFlow library's API functions. Moreover, we have made extensive use of additional open source Python libraries, such as NumPy and Scikit-Learn. The communication system model, as described in Chapter 2, Section 2.1, is developed as a Python library module for this project. In order to compare the results of the NND with the SPA, we use an open source SPA implementation in the C language developed by Radford M. Neal [23]. We made some adjustments to the source code of this software to allow raw LLR values as input and output.

Training

First and foremost, we start by designing the NND graph, for which we need to select a code and its parity check matrix. We conduct experiments on different families of codes: Hamming, polar, BCH, and LDPC (Low Density Parity Check) codes. The parity check matrices for different code lengths are obtained from an online database [14]. We refer the reader to Section 3.3 of Chapter 3 for a detailed discussion of hyper-parameter selection for the training process. The typical hyper-parameter settings provided in Table 3.2 form the base for most of the settings in all the codes; we only list the hyper-parameters that differ from this typical example.

Training Process

The training process runs for a fixed number of epochs and validates the model using the NVS as the validation score. The NVS is evaluated at the current state of the network, using S = 20 values of SNR given by {-3, -2.5, ..., 6.5} dB (see Section 3.3.2). The trained weights are saved for the state of the network that gives the lowest NVS measurement over the entire training.

Testing

The NND is set up in the same way as for training, except that the network weights are kept fixed. The decoding results are presented in the form of BER and BLER. The data-set is created using random codewords transmitted through a channel corrupted with AWGN. The tests are conducted on a range of SNR values given by {-5.0, -4.5, -4.0, ..., 8.5} dB. For each SNR value, decoding is performed until either at least 500 codewords are found in error, or a total of 50,000 codewords have been tested.
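The stopping rule used in the tests can be summarized in a short sketch. This is an illustrative NumPy implementation, not the project code; transmit and decode are placeholders standing in for the AWGN channel model and the trained NND respectively.

import numpy as np

def evaluate_point(transmit, decode, snr_db, n,
                   min_word_errors=500, max_words=50000):
    """Monte-Carlo BER/BLER estimate at a single SNR point.

    transmit(snr_db) -> (llr, codeword) : channel model (placeholder)
    decode(llr)      -> hard decisions of length n (trained NND, placeholder)
    """
    bit_errors = word_errors = words = 0
    while word_errors < min_word_errors and words < max_words:
        llr, c = transmit(snr_db)
        c_hat = decode(llr)
        errs = int(np.sum(c_hat != c))
        bit_errors += errs
        word_errors += int(errs > 0)
        words += 1
    return bit_errors / float(words * n), word_errors / float(words)

# Example sweep over the SNR grid used in the experiments:
# for snr in np.arange(-5.0, 9.0, 0.5):
#     ber, bler = evaluate_point(transmit, decode, snr, n=32)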

4.2 Trained Weights Analysis

In this section, we take a closer look at the trained weights (W_e2o) of the NND, between the edges of its even and odd layers. First, we look at a trained NND for the (7,4) Hamming code to get an understanding of how the NND tries to mitigate the effect of cycles. Then we look at different architectures of the NND and analyze the evolution of the weight distributions over the layers in different iterations. We also try to understand the learning capabilities of the network for a growing number of SPA iterations, based on the weight distributions at different layers.

Learning Graphical Artifacts

In order to get a deeper understanding of the NND's behavior and performance, we train an NND for the (7,4) Hamming code. The parity check matrix of the (7,4) Hamming code is given by (4.1). This code has d_min = 3 and its Tanner graph contains multiple cycles of girth 4.

H = (4.1)

Figure 4.1a shows the trained edge weights (W_e2o) between the even and odd layers of the NND for the (7,4) Hamming code. The (i, j)th colored block in the weight matrix corresponds to W_e2o(i, j), that is, an edge between the ith even and jth odd layer in the NND (see Section 3.1 and Figure 3.1a). The Tanner graph of the same code is shown in Figure 4.1b. The edges numbered {0, 1, 4, 6} form a cycle of girth 4 in the Tanner graph. Consider the point (i, j) = (0, 4) in the weight matrix. This weight corresponds to an edge between the 0th even layer and 4th odd layer nodes. In the Tanner graph, this corresponds to the passage of information from edge 0 to edge 4, that is, from check node c0 to variable node v0, and finally to check node c1. The information received at c1 is passed back to c0 via v2 using edges (1, 6), since edges (0, 4) form a cycle with edges (1, 6). Similar behavior can be seen for the information flow in the opposite direction over the same edges. Looking at the edge weight distribution for the corresponding nodes in the NND, one can see how the NND tries to mitigate the effect of this cycle on the output. The edge weights connecting the edges (0, 4) and (1, 6) have opposite values. This means that the pair of messages received by c1 from c0, that is (c0 -> e0 -> v0 -> e4 -> c1) and (c0 -> e1 -> v2 -> e6 -> c1), are nullified to some extent by the opposite weights, if the messages are equal. Notice also the varying intensity of the weights over the pair of edges forming a cycle in one direction of information flow; for example, (0, 4) and (6, 1) correspond to information flow around the cycle e0 -> e4 -> e6 -> e1, that is, c0 -> v0 -> c1 -> v2 -> c0. These edges have the same sign but different intensities. This effect allows the network to pass on the useful information from other edges while reducing the unnecessary information from the cyclic edges. The flow of information in one half of the cycle is magnified while the other half is diminished, finally leading to a normalized flow, which reduces the adverse effects of the cycle. Similar effects can be seen for other cycles in the graph.
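The girth-4 cycles discussed above can be enumerated directly from the parity check matrix: any two rows (check nodes) that share two or more columns (variable nodes) close a length-4 cycle. The sketch below is not part of the original code, and the example matrix is one common choice of a (7,4) Hamming parity check matrix; the matrix in (4.1) may differ.

import numpy as np
from itertools import combinations

def length4_cycles(H):
    """List all length-4 cycles of the Tanner graph of H.

    Each cycle is returned as (check i, check j, variable a, variable b):
    the four edges (i,a), (i,b), (j,a), (j,b) all exist in H.
    """
    cycles = []
    for i, j in combinations(range(H.shape[0]), 2):
        shared = np.flatnonzero(H[i] & H[j])      # variables seen by both checks
        for a, b in combinations(shared.tolist(), 2):
            cycles.append((i, j, a, b))
    return cycles

H = np.array([[1, 1, 1, 0, 1, 0, 0],   # one common (7,4) Hamming parity check matrix
              [1, 1, 0, 1, 0, 1, 0],
              [1, 0, 1, 1, 0, 0, 1]], dtype=int)
print(len(length4_cycles(H)))          # number of girth-4 cycles in this graph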

Figure 4.1: Learned weight distribution over the edges of the NND for the (7,4) Hamming code. (a) Trained edge weights. (b) Tanner graph of the (7,4) Hamming code. The effect of cycles is nullified by assigning complementary weights to the sets of edges forming cycles.

In order to further strengthen the conclusions from the previous analysis regarding the behavior of the NND on a graph with cycles, we conduct another experiment on a (7,4) code with a tree-structured Tanner graph. The parity check matrix for this (7,4) tree-structured code is given by (4.2). As shown in Figure 4.2a, training the NND on this tree-structured Tanner graph does not lead to any learning; the edge weight distribution remains close to the initial value of 1.0.

H_tree = (4.2)

This property enables the NND to mitigate the effects of artifacts in the Tanner graph such as cycles and trapping sets. However, it also restricts the NND to perform only as well as the SPA for codes without such artifacts in their Tanner graphs, such as tree-structured or LDPC codes.

Evolution of Weights in Consecutive Layers

We study the evolution of the learned edge weights over consecutive layers of the NND for different numbers of SPA iterations. Figure 4.3 shows the correlation coefficients of the weights for consecutive layers of a (32,16) polar code FF-NND. The plots in black show the evolution of the weights with respect to the un-learned fixed edge weights W_e2o, for two NNDs trained for 5 and 20 iterations, respectively. The larger the correlation coefficient, the less the weights of that particular layer have been trained. The NND trained for a large number of SPA iterations learns weights more significantly at the last layers of the network; this behavior occurs for the reasons discussed earlier. As shown in Figure 4.3a (black lines), the initial correlation coefficients are close to 1 and keep decreasing as we move closer to the last layer. However, when the NND is trained for a small number of SPA iterations (Figure 4.3b), the behavior is the opposite: the network tries to learn as much as possible at the initial layers.
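The correlation analysis of Figure 4.3 can be reproduced along the following lines. This is a sketch under the assumption that the trained W_e2o matrices of the FF-NND are available layer by layer; the reference is either the fixed, un-learned W_e2o (black curves) or the RNN-NND weights (red curves), and np.corrcoef gives the Pearson correlation coefficient.

import numpy as np

def layer_correlations(trained_layers, reference):
    """Pearson correlation of each trained weight layer with a reference matrix.

    trained_layers : list of 2-D arrays, the trained even-to-odd weight matrices
                     of the FF-NND, ordered by SPA iteration
    reference      : the fixed, un-learned W_e2o matrix (or the RNN-NND weights)
    """
    ref = np.asarray(reference, dtype=float).ravel()
    return [np.corrcoef(np.asarray(w, dtype=float).ravel(), ref)[0, 1]
            for w in trained_layers]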

Figure 4.2: Learned weight distribution over the edges for the (7,4) tree-structured code. (a) Trained edge weights. (b) Tanner graph of the (7,4) tree-structured code. Since the tree structure has no cycles, the network training leads to no significant change in the edge weights.

In Section 3.3.5, we discussed the two architectural designs for the NND: the Recurrent Neural Network (RNN-NND) and the Feed-Forward (FF-NND) architectures. The RNN-NND trains a single set of weights shared across the SPA iterations, while the FF-NND trains them separately for each iteration. We compare the learned weights of the different layers of the FF-NND with the weights of the RNN-NND; these plots are shown in Figure 4.3 with red dashed lines. The weights of the RNN-NND have a high correlation with the most learned weights of the FF-NND, that is, the FF-NND weights that are least correlated with the un-learned edge weights W_e2o. This shows that the RNN-NND tries to capture most of the learned weight distribution in its single set of weights, while the FF-NND has a varying weight distribution across its layers.

4.3 Decoding Results

In this section, we present the decoding results for different families and sizes of codes. We motivate the choice of hyper-parameters for training each code and analyze the resulting weight distributions. The tested codes and their properties are listed in Table 4.1.

Table 4.1: List of codes evaluated for their decoding performance with the NND (columns: code, rate, d_min, n_e, and BER gain of the NND over the SPA in dB).
  (32,16) polar
  (32,24) polar
  (128,64) polar
  (63,45) BCH
  (96,48) LDPC

Figure 4.3: Analysis of the evolution of weights for different numbers of SPA iterations and different NND architectures: (a) 20 iterations, (b) 5 iterations. Correlation coefficients are calculated between learned and un-learned edge weights (black lines), and between FF-NND and RNN-NND weights (red lines).

(32, 16) polar code

The hyper-parameters used for training the NND for the (32,16) polar code that differ from the typical parameters are given in Table 4.2 (refer to Table 3.2 for the typical set of hyper-parameters). The FF-NND gives better performance than the RNN-NND, and for small codes the additional computational cost of training the FF-NND compared to the RNN-NND is not significant. For similar reasons, the input weights are also trained, as they provide an additional degree of freedom for the model. Other parameters are chosen based on the discussions in Section 3.3.

Table 4.2: Parameter settings for the (32,16) polar code.
  Number of SPA iterations: 5
  Network architecture: FF-NND
  Loss function: Energy based multi-loss
  Train input weights (W_i2o): True
  Learning rate:
  Training batch length: 120
  Training SNR: 2.0 dB
  Training length (epochs):

The BER and BLER results are shown in Figure 4.4a. The network learns a weight distribution that leads to more than a 2 dB improvement over the SPA performance in BER. Polar codes have many small-girth cycles in their Tanner graphs, and the NND learns to diminish the effects of these small cycles to boost the SPA performance. The edge weights are analyzed using the heat-map plot in Figure 4.4b. The colored points represent the learned weights, with negative values shown in blue and positive values in red.
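The heat-map and histogram views used throughout this section can be produced along the following lines. The use of matplotlib is an assumption (the plotting library is not named in the text), and this is only a sketch of how plots such as Figures 4.4b and 4.4c may be generated, not the code used in this work.

import numpy as np
import matplotlib.pyplot as plt

def plot_weight_views(w_e2o, mask=None):
    """Heat-map and histogram of one trained W_e2o layer.

    w_e2o : 2-D array of trained even-to-odd weights
    mask  : optional boolean array marking the positions that are actually
            trained (connections in the Tanner graph); only those are histogrammed
    """
    fig, (ax_map, ax_hist) = plt.subplots(1, 2, figsize=(9, 4))
    limit = np.max(np.abs(w_e2o))
    # blue = negative weights, red = positive weights
    im = ax_map.imshow(w_e2o, cmap="bwr", vmin=-limit, vmax=limit)
    fig.colorbar(im, ax=ax_map)
    values = w_e2o[mask] if mask is not None else w_e2o.ravel()
    ax_hist.hist(values, bins=50)
    ax_hist.set_xlabel("weight value")
    plt.show()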

It can be observed from the heat-map that the NND has learned complementary weights across the diagonal. As discussed in Section 4.2.1, this behavior leads to the mitigation of cycles in the Tanner graph. Figure 4.4c shows a histogram of the first layer of learned edge weights (W_e2o, l=3), plotted for the learned weights only. The weight values are spread across the region (-1.0, 6.0). Interestingly, by visual inspection of the histogram, we can approximate the distribution of the weights as a mixture of Gaussian distributions. The peak of the distribution is around 1.0, which is also the value at which we initialized the weights. Hence, the NND changes the values of those weights that have a high gradient of the loss function, and keeps the values of the other weights close to 1.0. Similarly, the learned weights of every hidden layer are approximately normally distributed.

(32, 24) polar code

The (32, 24) polar code is a rate 3/4 code of small code-length. The hyper-parameters used for training the NND of the (32, 24) polar code are listed in Table 4.3. The learning rate is high since the number of learnable weights (128 x 4 = 512) is low. The results are presented for the RNN architecture, although due to the low number of parameters, a feed-forward architecture might also be a good choice.

Table 4.3: Parameter settings for the (32,24) polar code.
  Number of SPA iterations: 5
  Network architecture: RNN
  Loss function: Cross-entropy based multi-loss
  Learning rate: 0.01
  Training batch length: 120
  Training SNR: 1.0 dB
  Training length (epochs):

The decoding results for the (32, 24) polar code are shown in Figure 4.5a. At an SNR of 6.0 dB, we see an improvement of around 2.0 dB in BER and 1.5 dB in BLER. Figure 4.5b shows the weight distribution heat-map, and Figure 4.5c shows the histogram. We can see that corresponding edge weights across the diagonal are assigned opposite values, a behavior similar to the (32, 16) polar code. The histogram shows that the distribution has a wide spread of values in the range (-1, 5), with a peak at 1.0. This shows that the NND pushed the values of certain edge weights significantly away from their initial values.

(128, 64) polar code

The (128, 64) polar code has a total of 1792 edges in its Tanner graph, so each hidden layer of the NND has 1792 nodes. As we grow the size of the NND, its complexity grows quickly; hence we can only test this code for a maximum of 5 SPA iterations. The other parameters of this test are listed in Table 4.4. Due to the long computational time for this code, we used a relatively high learning rate. However, the training process starts to diverge after a certain number of epochs. A smaller learning rate might give better results, but will take longer to train.

Figure 4.4: Decoding results and edge weight analysis for the (32, 16) polar code. (a) BER and BLER decoding results. (b) Edge-weight distribution heat-map. (c) Edge-weight distribution histogram.
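The mixture-of-Gaussians reading of the histogram in Figure 4.4c can be checked numerically. The sketch below uses scikit-learn, which is already among the libraries listed in the experimental setup; the number of mixture components is an illustrative choice, not a value taken from this work, and the helper name is hypothetical.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_weight_mixture(weights, n_components=2):
    """Fit a Gaussian mixture to the learned edge weights of one layer."""
    w = np.asarray(weights, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components).fit(w)
    # one component is expected to sit close to the initialization value 1.0
    return gmm.means_.ravel(), np.sqrt(gmm.covariances_).ravel(), gmm.weights_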

Figure 4.5: Decoding results and edge weight analysis for the (32,24) polar code. (a) BER and BLER decoding results. (b) Edge-weight distribution heat-map. (c) Edge-weight distribution histogram.

Table 4.4: Parameter settings for the (128, 64) polar code.
  Network architecture: FF-NND
  Loss function: Energy based multi-loss
  Train input weights (W_i2o): True
  Train output weights (W_e2x): True
  Number of SPA iterations: 5
  Learning rate:
  Training SNR: 2.0 dB
  Training batch length: 30
  Training length (epochs):

As shown in Figure 4.6a, the performance improvement of the NND compared to the SPA is significant (more than 3 dB at high SNR). However, the performance is still far from the ML threshold. Figure 4.6b shows the heat-map of a small section of the trained edge weights. There are some corresponding anti-symmetric elements across the diagonal for the (128, 64) polar code as well. The histogram of the weights, shown in Figure 4.6c, can be approximated as a normal distribution with mean at 1.0. The number of weights with values above 1.0 is almost the same as the number below. This shows that the network is trying to assign complementary weights to corresponding edges to remove the cycle effects.

(63, 45) BCH code

BCH codes are an algebraic family of codes and their parity check matrices are high density (HDPC). The SPA performs poorly for codes from this family, as they contain a lot of small-girth cycles in their Tanner graphs. The NND is trained for the (63, 45) BCH code using the parameters listed in Table 4.5. We use the energy based loss function, as the cross-entropy loss leads to worse performance at lower SNR values. Using this loss function, instead of a multi-loss, leads to better performance, as it keeps the performance consistent for SNR values that are away from the training SNR. The other parameters are set to typical values, and the network is trained beyond its convergence point.

Table 4.5: Parameter settings for the (63,45) BCH code.
  Network architecture: FF-NND
  Loss function: Energy based loss
  Number of SPA iterations: 5
  Learning rate:
  Training SNR: 2.0 dB
  Training batch length: 120
  Training length (epochs):

The decoding results of the NND for the (63, 45) BCH code are shown in Figure 4.7a. We achieve a 1.5 dB gain in BER performance at an SNR of 6.0 dB. A section of the edge weight distribution heat-map is shown in Figure 4.7b.

Figure 4.6: Decoding results and edge weight analysis for the (128,64) polar code. (a) BER and BLER decoding results. (b) Edge-weight distribution heat-map for a section. (c) Edge-weight distribution histogram.

The NND trains the weights to take values in the range (-0.5, 1.0). We can see that some corresponding weights across the diagonal are assigned opposite values, in an attempt to reduce the cycle effects. The histogram of the weight distribution is shown in Figure 4.7c. Again, the distribution can be approximated as a normal distribution. However, the peak of the distribution is no longer at 1.0; it is around a value of 0.4. Hence, in the case of the BCH code, the network pushes most of the weights away from their initial value.

(96, 48) LDPC code

LDPC codes perform close to optimal with the SPA decoder. They have parity check matrices of far lower density than BCH or polar codes. However, cycles or trapping sets are still present in well-performing codes of small length. The NND decoder is trained for a (96, 48) LDPC code obtained from [14]. The training is conducted using the parameters listed in Table 4.6. The recurrent neural network works better in this case. We also trained the input weights, since the number of edge weights in the (96, 48) LDPC code is only 296, and more weights need to be trained at 5 iterations.

Table 4.6: Parameter settings for the (96,48) LDPC code.
  Network architecture: RNN-NND
  Loss function: Cross-entropy multi-loss
  Number of SPA iterations: 5
  Train input weights (W_i2o): True
  Learning rate:
  Training SNR: 1.0 dB
  Training batch length: 120
  Training length (epochs):

As shown in Figure 4.8a, the NND decoder performs only slightly better than the SPA. LDPC codes do not have many small-girth cycles or trapping sets; hence, the NND performance is not far better than that of the SPA. The weight distribution heat-map is shown in Figure 4.8b, and the histogram in Figure 4.8c. These plots also show that the weights are only slightly modified from their initial values.

Figure 4.7: Decoding results and edge weight analysis for the (63,45) BCH code. (a) BER and BLER decoding results. (b) Edge-weight distribution heat-map for a section. (c) Edge-weight distribution histogram.

Figure 4.8: Decoding results and edge weight analysis for the (96,48) LDPC code. (a) BER and BLER decoding results. (b) Edge-weight distribution heat-map. (c) Edge-weight distribution histogram.
