Design and Implementation of FPGA- based Systolic Array for LZ Data Compression

Design and Implementation of FPGA- based Systolic Array for LZ Data Compression Mohamed A. Abd El ghany Electronics Dept. German University in Cairo Cairo, Egypt E-mail: mohamed.abdel-ghany@guc.edu.eg Aly E. Salama and Ahmed H. Khalil Electronics and Communication Dept. Cairo University Cairo, Egypt E-mails: alisalama@cu.edu.eg, eoip@gega.net Abstract Hardware Implementation of Data compression algorithms is receiving increasing attention due to exponentially expanding network traffic and digital data storage usage. Among lossless data compression algorithms for hardware implementation, Lempel Ziv algorithm is one of the most widely used. The main objective of this paper is to enhance the efficiency of systolic- array approach for implementation of Lempel- Ziv algorithm. The proposed implementation is area and speed efficient. The compression rate is increased by more than 40% and the design area is decreased by more than 30%. The effect of the selected buffer s size on the compression ratio is analyzed. An FPGA implementation of the proposed design is carried out. It verifies that data can be compressed and decompressed onthe- fly. Index Terms CAM, data compression, LZ77, LZ78 LZSS, LZW, systolic array, FPGA. I. INTRODUCTION Data compression is becoming an essential component of high speed data communications and storage. Lossless data compression is the process of encoding ("compressing") a body of data into a smaller body of data which can at a later time be uniquely decoded ("decompressed") back to the original data. In lossy compression, the decompressed data are some approximation of the original data. Many lossless data compression techniques have been proposed in the past and widely used, e.g., Huffman code [1]-[2], arithmetic code [3]-[4], run-length code [5], and Lempel Ziv (LZ) compression algorithms [6]-[9]. Among them, the LZ algorithms are the most popular scheme when no prior knowledge or statistical characteristics of the data being compressed are available. The principle of the LZ algorithms is to find the longest match length in the buffer containing recently received data string with the incoming string and represent it with the position and length of the longest match in the buffer. Since the repeated data is linked to an older one, more concise representation is achieved and compression is done. To fulfill real-time requirements, several works on hardware realization of LZ77 or its variants have been presented in the literature. Some hardware architectures, including content addressable memory (CAM) [10], Systolic array [11] - [13], and embedded processor [14], have been proposed in the past. The microprocessor approach is not attractive for real- time applications, since it dose not fully explore hardware parallelism [13]. CAM has been considered the fastest architecture among all proposed hardware solutions for searching for a given string, as required in LZ77. A CAM- based LZ77 data compressor can process one input symbol per clock cycle, no matter what the buffer size and string length are. Thereby it achieves optimal speed for compression. However, CAM's are used high hardware complexity and high power consumption. The CAM approach performs string match by full parallel searching, while the systolic-array approach does it by pipelining. As compared with CAM- based designs, systolic array compressors are slower, but better in hardware cost and testability [13]. A first systolic array design for LZbased data compression was proposed in [11], in which thousands of processor elements (PEs) are required for a usable scale. High speed designs were then reported later [12] - [13], requiring only tens of PEs. In this paper, we present an efficient systolic array design which is high speed and area efficient. The organization of this paper is as follows: In Section II, we briefly explain the LZ compression algorithm. The results and comments about some software simulations are discussed. Then we show the dependence graph (DG) to investigate the data dependency of every computation step in the algorithm. Section III, we show the most recent systolic array architecture and propose an area and speed efficient architecture. In Section IV, the proposed systolic array structure is compared with the most recent structures [13] in terms of area, and latency. An FPGA implementation for our architecture showing the real time operations is demonstrated in Section V. Finally, conclusions are given in section VI.

II. LEMPEL-ZIV CODING ALGORITHM The LZ77 algorithm was proposed by Ziv and Lempel in [15]. Since the buffer size (n) and match length (Ls) determine the compression efficiency, the relationship between n and Ls for optimal compression performance is briefly examined. Then we investigate the data dependency of every computation step in the LZ compression algorithm. A. The compression algorithm: The LZ77 algorithm and its variants use a sliding window that moves a long with the cursor. The window can be divided into two parts, the part before the cursor, called the dictionary, and the part starting at the cursor, called the look-ahead buffer. The sizes of these two parts are input parameters to compression algorithm. The basic algorithm is very simple, and loops executing the following steps: 1. Find the longest match of a string starting at the cursor and completely contained in the look-ahead buffer to a string starting in the dictionary. 2. Output a triple (I p, L max, S) containing the position I p of the occurrence in the window, the length L max of the match and the next symbol S past the match. 3. Move the cursor L max + 1 symbols forward. Let us consider an example with window size of (n=9) and look-ahead buffer size (Ls=3) shown in Figure 1. Let the content of the window be denoted as X i, i = 0,1,..,n-1 and that of the look-ahead buffer be Y j, j = 0,1, Ls-1 (i.e., Y j = X i+ n-ls ). According to LZ algorithm, the content of look-ahead buffer is compared with the dictionary content starting from X0 to X n-ls-1 to find the longest match length. If the best match in the window is found to start from position I p and the match length is L max. Then L max symbols will represent by a codeword (I p, L max ). The codeword length is Lc: Lc = 1 + [log 2 (n-l s )] + [Log 2 L s ] bits (1) Lc is fixed. Assume w bits are required to represent a symbol in the window, l = [log2 Ls] bits are required to represent Lmax, and p = [log2 (n-ls)] bits are required to represent Ip. Then the compression ratio is (l + p) / (Lmax * w), where 0 Lmax Ls. Hence the compression ratio depends on the match situation. Figure 1 window of the LZ compressor example The codeword design and the choice of widow size are crucial in achieving maximum compression. The LZ technique involves the converting of variable length substrings into fixed length codewords that represent the pointer and the length of match. Hence, selection of values of n and L s can greatly influence the compression efficiency of the LZ algorithm. B. Compression Algorithm Paramters Selection Simulation for the performance of the LZSS algorithm for different buffers sizes is performed using the Calgary corpus and the Silesia Corpus [16], as shown in Figure 2 and Figure 3 In these experiments, the codeword is up to 2 bytes long. compression ratio 1.2 1.1 1 0.9 0.8 0.7 0.6 0.5 0.4 4 8 16 32 64 128 Ls values n=256 n=512 n=1024 n=2048 n=4096 n=8192 n=16384 Figure 2 The relationship between the compression ratio of Calgary corpus and Ls for different values of n compression ratio 1.2 1.1 1 0.9 0.8 0.7 0.6 0.5 0.4 4 8 16 32 64 128 Ls values n=256 n=512 n=1024 n=2048 n=4096 n=8192 n=16384 Figure 3 The relationship between the compression ratio of Silesia corpus corpus and Ls for different values of n. From Figure 2 and Figure 3, Improvement of the compression ratio can be achieved when n is greater than 1024. The above improvement can be obtained only when L s = 8, 16, or 32. Based on the results, the best L s for a good compression ratio is 2 4. Increasing L s beyond that will require a much faster growing n (as well as hardware cost and computation time), with saturating or even worsening compression effect. The reason is that repeating patterns tend to be short, and that increasing L s and n also increases the codeword length (l + p). To achieve a good performance for different data formats, L s may range from 8 to 32, while n may be from 1k to 8 k. The simulation results verify the results proposed in [17]. C. Dependency graph: A dependence graph (DG) is a graph that shows the dependence of the computations that occur in an algorithm. The DG of the LZ algorithm can be obtained as shown in Figure 4. In the DG, L (match length) and E (match signal) are propagated from cell to cell. X (content of the window) and Y (content of the look-ahead buffer) are broadcast horizontally and diagonally to all cells, respectively. The DG shown in Figure 4 is called a global DG, because it contains global signals. The global DG can be transformed into a localized DG, by propagating the input data Y and X from cell to cell instead of broadcasting them. Processor assignment can be done by projection of the DG onto the surface normal to the projection vector selected. After

processor assignment, we schedule the events using a schedule vector. i Match results block MRB is needed to determine L max among the serially produced L i s. MRB is shown in Figure 8. The PEs need not store their ids to record the position of the L i s. A special counter is needed to generate the sequence which interleaves the position of the first half of L i s and the position for the second half as shown in Figure 9. The compression time of the interleaved design is n clock cycles. Figure 4 The dependence graph of the LZ compression algorithm. III. SYSTOLIC ARRAY ARCHITECTURE A. The Interleaved Design: From the dependence graph shown in Figure 4, the interleaved design is obtained by projecting all the nodes in a particular raw to a single processor element. This design was first proposed in [13]. The space time diagram and the resulting array of the interleaved design are given in Figure 5. The Y j s which are needed to be accessed in parallel do not change during the encoding step. The special buffer is needed to generate the interleaving symbols of X i and X i + [n- Ls/2] as shown in Figure 6. The first L i will be obtained after L s clock cycles from the leftmost PE and subsequent ones will be obtained every one clock cycles. Figure 5 The space-time diagram and the resulting array of interleaved design. Figure 6 buffer design for generating the interleaving symbols. Before the encoding process, we preload the Y i s which take L s extra cycles. During the encoding process, the time to preload new source symbols depend on how many source symbols were compressed in the previous compression step, L max. The block diagram of the processor element is shown in Figure 7. Figure 7 The block diagram of interleaved design PE. Figure 8 the match results block Figure 9 Special position counter in MRB for interleaved Design. B. Proposed Design (Design- P) From the dependence graph shown in Figure 4, we project all the nodes in a particular row to a single processor element. This produces an array of length Ls. The spacetime diagram and the resulting array of Design-P are given in Figure 10, where D represents a unit delay on the signal line between two processing elements. As shown in Figure 10, the architecture consists of L s processing elements which are used for the comparison and L-encoder which is used to produce the matching length. Consequently, the look-ahead buffer symbols Y j s which do not change during the encoding step, stay in PEs. The dictionary element X i moves systolically from left to right with a delay of 1 clock cycle. The match signal E i of the processing element moves to the L-encoder. The output L i of the encoder is the matching length resulting from the comparisons at step i-1. The first L i will be obtained after one clock cycle and the subsequent ones will be obtained every clock cycles. Before the encoding process, we preload the Y i s to be processed and this takes L s extra cycles. During the encoding process, the time to preload new source symbols depends upon how many source symbols were compressed in the previous compression step, L max. The functional block of the PE is shown in Figure 11. Only one equality is needed for comparing Y j and incoming X i. The result E i (match signal) propagates to L- encoder. The block diagram of L- encoder is shown in Figure 12. According to E i s (match signals), L- encoder computes the match length L i corresponding to position i.

X7. X1 X0 PE2 PE1 PE0 Y2 D D Y1 Y0 t space E2 E1 E0 Figure 10 Architecture of Design-P. L-encoder PE2 PE1 PE0 L i 1 X2-Y0 X1-Y1 X0-Y0 2 X3-Y0 X2-Y1 X1-Y0 L0 3 X4-Y0 X3-Y1 X2-Y0 L1 4 X5-Y0 X4-Y1 X3-Y0 L2 5 X6-Y0 X5-Y1 X4-Y0 L3 6 X7-Y0 X6-Y1 X5-Y0 L4 7 L5 Figure 11 The functional block of Design-P PE. L i L i I p Reg counter Lmax Reg a > b p M ux M ux L s a = b done n - Ls a = b Figure 13 The match results block (MRB) for design-p. Parallel compression can be achieved by using an appropriate number of Design- P modules. For example, we can use two modules of Design P, as shown in Figure 14. The input sequence of the first module (Xi) is obtained from the first position of Buffer. The input sequence of the second module (X i +[(n Ls)/2 ] ) is obtained starting at ((n Ls)/2 ) position in the Dictionary. Note that the MRB now needs to determine L max among LI, LII that are produced at the same time. So we need to modify MRB. The speed is two times the Design P array. l p L max I p Code word ready Figure 12 L-encode From Figure 10, it is clear that the maximum matching length is not produced by the L-encoder. So, we need a match results block (MRB), as shown in Figure 13, to determine L max among the serially produced L i s. Also, the PEs need not store their ids to record the position of the L i s (Ip). Since p = [log2(n-ls)] bits are required to represent Ip, only a p- bit counter is required to provide the position i associated with each L i, since the time when L i is produced corresponds to its position. MRB uses a to compare the current input L i and the present longest match length L max stored in the register. If the current input L i is larger than L max, then L i is loaded into the register and the content of the position counter is also loaded into another register which is used to store the present Ip. Another is used to determine whether the whole window has been searched. It compares the content of the position counter with n- L s, whose output is used as the codeword ready signal. During the searching process, L i might be equal to L s when I< n- L s, i.e., the content in the look-ahead buffer can be fully matched to a subset of the dictionary, and hence searching the whole window is not always necessary. An extra is used to determine whether L max is equal to L s, and hence the string matching process is completed. Therefore, we can immediately start encoding a new set of data. This will reduce the average compression time. The number of clock cycles needed to produce a codeword is [(n- L s ) + 1] clock cycles, so the utilization rate of each PE is (n- L s ) / [(n-l s ) + 1], which almost equals to one. This result is consistent since the PE is busy once L i is determined until the time at which the codeword is produced. IV. Figure 14 Parallel compression. THE COMPARISON BETWEEN THE DESIGN-P AND INTERLEAVED DESIGN (DESIGN-I) The comparison of two designs is given in Table 1. Our Design (Design-P), which has the maximum utilization rate, minimum latency time (high compression speed) and minimum area, is the best array architecture for LZSS. TABLE 1 THE COMPARISON BETWEEN DESIGN-P AND DESIGN-I. Interleaved Design Design-P PEs Ls Ls Utilization (n-ls)/ n (n-ls)/ (n- Ls+ 1) Latency n n- Ls+ 1 DFFs per PE 2w+l+1 2w + 1 accumulator Per PE 1 --- Counter per MRB 2 1 Multiplexer per MRB 3 2 Multiplexer per buffer 1 ----- Equality per MRB 1 2 V. FPGA IMPLEMENTATION In this section, we present the implementation methodology of the design utilizing FPGA. As shown in Figure 15, the architecture consists of 3 major components: systolic-array LZ component SALZC, block RAM, and host controller. In our implementation, the length of window size (n) is 1K, and the length of look-ahead buffer (L s ) is 16. SALZ component contains 16 PEs and implements the most cost effective array architecture (Design P). Full custom

layout is easy since the array is very regular. Only a single cell (PE) was hand laid out. The other 15 PEs are just its copies. Since the array is systolic, routing also is simplified. Block RAM that is used as the datuffer (Dictionary) is not included in SALZ component. Thereby we can increase the dictionary size by directly replacing the block RAM with a larger one. The dictionary size is a parameter to cover a broad range of applications, from text compression to lossless image compression. VI. CONCLUSIONS In this paper, we described a parallel algorithm for LZ based data compression by transforming a data dependent algorithm to a data independent regular algorithm. We propose our structure for LZ compression. To further improve the latency, we used a control variable to indicate early completion. The proposed implementation is area and speed efficient. The compression rate is increased by more than 40% and the design area is decreased by more than 30%. The design can be integrated into real time systems so that data can be compressed and decompressed on the - fly. Figure 15 LZ compression chip The implementation of our architecture (Design-P) and interleaved design (Design-i) are carried out using XILINX (Spartan II XC200) FPGA, for n= 1 k, L s =16, w =8. The implementation results are shown in Table 2. TABLE 2 THE IMPLEMENTATION RESULTS OF DESIGN-P AND DESIGN-I of Slices of Slice Flip Flops of 4 input LUTs of BRAMs Maximum Frequency Design-P 310 13% 408 8% 419 8% 2 14% 105 MHz Design-i 471 20% 511 10% 650 13% 2 14% MHz The compression rate of a compressor is defined as the number of input bits which can be compressed in one second. The compression rate (Rc) can be estimated as follows: Rc = clk [(LsW) /(n-l s +1)] (2) Where clk is the clock rate. Note that only estimated Rc can be obtained, since it depends on the input data. We can not predict exactly how many words will be compressed (L s at most) and how many clock cycles will be required ((n L s +1) at most) for every compression step. In our Implementation, if the window size (n) is 1k, L s =16, w =8, and clk = 105 MHz, The Rc is about 13 M bit per second. If we want to use the parallel scheme to increase the compression rate, we can directly modify the host controller and connect an appropriate number of LZ compressor components in parallel, as shown in Figure 16. Note that the MRB now needs to determine L max among L I, L II and L III. e.g., ten components could be implemented in one chip of large size to achieve a compression rate about 130 M bits per second. Moreover, by modifying the host controller and including, e.g., dictionaries, our chip can be used for other string-matching based LZ algorithms, such as LZ78 and LZW. Our architecture is flexible. Figure 16 Multiple SALZC system. REFERENCES [1] D. A. Huffman "A method for the construction of minimumredundancy codes" Proc. IRE, Vol. 40, pp. 1098-1101, Sept. 1952. [2] R. Gallager, Variations on a theme by Huffman, IEEE Trans.Inform. Theory, vol. IT-24, pp. 668-674, 1978. [3] E. Bodden, M. Clasen, J. Kneis, R. Schluter, "Arithmetic coding revealed" pp. 8-57, May 2004. [4] A. Said, " Introduction to Arithmetic coding Theory and Practice" Imaging Systems Laboratory, HP Laboratories Palo Alto, April 2004. [5] S. Golomb, Run-length encoding, IEEE Trans. Inform. Theory, Vol. IT-12, pp. 399-401, July 1966. [6] J. Ziv and A. Lempel, A universal algorithm for sequential data compression, IEEE Trans. Inform. Theory, Vol. IT-23, pp. 337-343, 1977. [7] J. Ziv and A. Lempel, Compression of individual sequences via variable rate coding, IEEE Trans. Inform. Theory, Vol. IT-24, pp. 530-536, 1978. [8] T. Welch, "A technique for high-performance data compression," IEEE computer, Vol. 17, pp. 8-19, 1984. [9] D. Salomon, "Data Compression" The complete reference, Computer Science Department, California State University, Northridge, pp. 22-205, 2004. [10] K.-J. Lin and C.-W. Wu, "A low-power CAM design for LZ data compressor," IEEE Transactions on Computers, Vol.49, No.10, pp. 1139-1145, Oct. 2000. [11] N. Ranganathan, and S. Henriques, High-Speed VLSI Designs for Lempel-Ziv-Based Data Compression, IEEE Trans. Circuits and Systems. II: Analog and Digital Signal Processing, vol. 40, pp.96-106, Feb. 1993. [12] B. Jung and W. P. Burleson, Efficient VLSI for Lempel-Ziv compression in wireless data communication networks, IEEE Trans. VLSI Syst. Vol. 6, pp. 475-483, Sept. 1998. [13] S.-A. Hwang and C. -W. Wu, Unified VLSI Systolic Array Design for LZ Data compression, IEEE Trans. VLSI Systems, vol. 9, pp. 489-499, August 2001. [14] J. Chang, H. J Jih, and J. W. Liu, A lossless data compression processor, in Proc. 4 th VLSI Design/CAD Workshop, Aug. 1994, pp.134-137. [15] J. Ziv and A. Lempel, A universal algorithm for sequential data compression, IEEE Trans. Inform. Theory, vol. IT-23, pp. 337-343, 1977. [16] S. Deorowicz Universal lossless data compression algorithms Dissertation, Silesian University, pp 92-119, 2003. [17] M. Arias, R. Cumplido, C. Feregrino "On the Design and Implementation of an FPGA-based Lossless Data Compressor.", ReConFig 04, Colima, Mexico, Sep., 2004. [18] M. M. Sandoval and C. F. Uribe, A hardware Architecture for Elliptic Curve Cryptography and Lossless Data Compression, IEEE Computer Society, May 2005.