A High-Performance FPGA-Based Implementation of the LZSS Compression Algorithm

2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum

A High-Performance FPGA-Based Implementation of the LZSS Compression Algorithm

Ivan Shcherbakov, Christian Weis, Norbert Wehn
Microelectronic Systems Design Research Group, TU Kaiserslautern, Germany
{shcherbakov, weis, wehn}@eit.uni-kl.de

Abstract: The increasing growth of embedded networking applications has created a demand for high-performance logging systems capable of storing huge amounts of high-bandwidth, typically redundant data. An efficient way of maximizing logger performance is to compress the logged stream in real time. In this paper we present a flexible high-performance implementation of the LZSS compression algorithm capable of processing up to 50 MB/s on a Virtex-5 FPGA. We exploit the independently addressable dual-port block RAMs inside the FPGA to achieve an average performance of 2 clock cycles per byte. To make the compressed stream compatible with the ZLib library [1], we encode the LZSS algorithm output using the fixed Huffman table defined by the Deflate specification [2]. We also demonstrate how changing the amount of memory allocated to various internal tables impacts performance and compression ratio. Finally, we provide a cycle-accurate estimation tool that allows finding a trade-off between FPGA resource utilization, compression ratio and performance for a specific data sample.

I. INTRODUCTION

Compression is a way of representing data using fewer bits. Compression algorithms can be divided into lossy and lossless ones. Lossy compression algorithms discard certain less important parts of the information and are highly tied to the structure of the data. In many applications (e.g. video coding) the trade-off between a quality drop and a significant size reduction is more than acceptable. Lossless algorithms exploit the redundancy and predictability of the compressed information: highly redundant data needs fewer bits to be stored. In the worst case, when no redundancies are found, the compressed block will actually be bigger than the uncompressed one. However, in any case, decompression restores the original data bit-for-bit.

Many lossless compression algorithms are widely used in modern servers and workstations for a variety of tasks. Such algorithms, e.g. LZMA [3], provide high compression ratios, but require tens to hundreds of megabytes of RAM and a fast CPU. This suits most use cases (e.g. backups), where compression ratio is more important than speed.

Another use for lossless compression is the rapidly growing world of embedded networking systems. Keeping a log of inter-node communications significantly simplifies profiling and debugging tasks. Compressing the logged stream in real time relaxes the size and bandwidth requirements for the underlying storage media. Unlike workstation/server applications, compression throughput becomes one of the most important constraints.

Modern FPGAs allow building powerful embedded systems on one chip. A typical high-end FPGA contains tens to hundreds of independent dual-port block RAMs (several kilobytes each), one or more built-in CPUs and a large amount of reconfigurable logic. The logic operates at lower frequencies than the CPU, but allows exploiting massive algorithmic parallelism.
This defines the layout of a typical FPGA-based system-on-chip: a central CPU handling high-level tasks and several accelerators performing highly parallelizable computations. Building a high-performance FPGA-based compressor requires selecting an algorithm capable of exploiting the advantages of the above-mentioned FPGA architecture. Having considered several compression algorithms, we have chosen a subset of the Deflate specification [2] (LZSS [4] + fixed-table Huffman encoding), as it can be efficiently implemented using FPGA logic and block RAMs, keeping the on-chip CPU available for higher-level tasks.

The rest of this paper is organized as follows. Section 2 overviews work related to LZSS-like algorithms and corresponding hardware. Section 3 briefly describes the data format. Section 4 describes the presented hardware implementation. Section 5 provides a comparison with a software compressor and shows various trade-offs between speed, compression ratio and memory use. Section 6 summarizes the results.

II. RELATED WORK

Since the publication of the LZSS algorithm [4], which is based on LZ77 [5], there have been many improvements to both algorithmic and implementation aspects. They can be categorized in the following way:

- Further algorithmic improvements that raise the compression ratio at a cost of more operations and memory, e.g. the LZMA algorithm [3] used in the 7-Zip program.
- Algorithm variations (e.g. [6]) simplifying random access to the compressed data.
- Hardware implementations that rely on content-addressable memories [7] and systolic arrays [8], [9].
- Applications of fast hardware decompression for dynamic FPGA reconfiguration [10].
- FPGA/ASIC-based implementations: [11], [12].

The presented implementation falls into the last category. We have used the approach presented in [11] (an FSM with several independent memories) and significantly optimized the design performance by decomposing and parallelizing several processes (employing the dual-port block RAM architecture), making use of wide internal data buses and advanced caching/prefetching techniques. As we are targeting embedded logging applications, we have optimized for compression speed while keeping a feasible compression ratio, taking the minimum ZLib compression level as a reference point. Nevertheless, the memory sizes and the algorithm parameters are generics that can be easily adjusted to increase the compression ratio at a cost of additional clock cycles and/or extra block RAMs.

III. THE DATA FORMAT

Before describing the hardware architecture, we give an overview of the LZSS algorithm (the ZLib-based variant, which has minor differences from the original LZSS [4]). The algorithm consumes a stream of literals (i.e. bytes) and produces a stream of decompressor commands. There are 2 command types: "output 1 literal" and "copy-paste L literals encountered D literals ago". E.g. compressing the string "snowy snow" results in 7 commands: 6 describing each byte of "snowy " and 1 command copying 4 bytes ("snow") from distance 6. To detect that a string has been encountered in the past, the compressor has to store the last N bytes of the input stream; N is referred to as the dictionary size or sliding window size.

On the bit level, every command has 2 fields: D (log2(N) bits) and L (8 bits). If D is 0, the command is "output byte" and L contains the byte. Otherwise, D contains the copying distance and L contains the copying length minus 3. If the length is less than 3, normal "output byte" commands are emitted instead.
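To make the command format concrete, the following is a minimal software sketch of a decompressor for the D/L stream just described. The struct layout and names are illustrative (the real hardware packs D into log2(N) bits on the wire), but the semantics follow Section III:

```cpp
#include <cstdint>
#include <string>

// Illustrative rendering of the D/L command stream from Section III.
struct Command {
    uint16_t d; // 0 = "output literal", otherwise the copy distance
    uint8_t  l; // the literal byte, or the copy length minus 3
};

// Decompressor view: replaying the commands restores the original stream.
std::string replay(const Command* cmds, size_t count) {
    std::string out;
    for (size_t i = 0; i < count; ++i) {
        if (cmds[i].d == 0) {
            out.push_back(static_cast<char>(cmds[i].l)); // literal command
        } else {
            size_t len = cmds[i].l + 3;          // stored length is len - 3
            size_t src = out.size() - cmds[i].d; // start D literals back
            for (size_t j = 0; j < len; ++j)     // byte-wise: copies may overlap
                out.push_back(out[src + j]);
        }
    }
    return out;
}
// Example: "snowy snow" decodes from six literal commands for "snowy "
// followed by one copy command with D = 6 and L = 4 - 3 = 1.
```

Note that the byte-wise copy loop deliberately handles the case where D is smaller than the copy length, which LZ-style formats rely on for run-length-like repetitions.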
IV. HARDWARE ARCHITECTURE

The LZSS compressor uses handshake interfaces for both the input and the output stream. The compressor consumes 32-bit words (an LSB-first or MSB-first format can be selected) and produces D/L pairs (see Section III) consumed by a fixed-table Huffman coder that produces a stream of packed 32-bit words. The use of stream interfaces allows connecting to high-performance interfaces (e.g. LocalLink [13]) and compressing real-time streaming data on the fly, without separate buffering and compression stages.

The LZSS compressor consists of the main finite state machine, 5 independently addressable dual-port memories and several auxiliary FSMs. Figure 1 illustrates the overall structure.

Fig. 1. Overall structure of the LZSS compressor: filling logic, hash cache, lookahead buffer, comparer, dictionary, the main FSM and the head/base/next memories.

The main task of the compressor is finding previous occurrences of a string (matching). To determine whether a string S has been encountered before, the compressor computes a hash value from its first 3 bytes and looks through the list of all strings in the dictionary with the same hash value. Matching requires comparing the front of the uncompressed stream against several offsets inside the dictionary to find the longest match. To speed up the comparison, the compared data is placed into 2 independently addressable ring buffers:

- Lookahead buffer: contains the front of the input stream (up to 512 bytes).
- Dictionary (a.k.a. sliding window): contains the last N bytes of the input stream that have just been processed.

Having the compared memories in independent block RAMs allows performing one comparison iteration every clock cycle. Furthermore, the data bus width of both memories is 32 bits, which allows comparing 1 to 4 bytes during the first clock cycle and exactly 4 bytes during each following one. E.g. comparing two 50-byte strings takes no more than ⌈(50 - 1)/4⌉ + 1 = 14 clock cycles instead of 100 if both buffers resided in a single byte-addressed memory [11]. Moreover, both ring buffers reside in dual-port block RAMs and are filled in the background, requiring no extra clock cycles from the main FSM. If hash caching is enabled, hash values for every offset of the source stream are computed during the background filling and stored in a separate memory.

The hash table structure is similar to that of ZLib. The following independently addressable tables are maintained:

- Head table: for each value of the hash function it contains the offset in the dictionary buffer of the last string having it.
- Next table: for each offset of the dictionary it contains the relative offset of the previous string having the same hash value.

Matching means finding the longest string in the dictionary that is equal to the beginning of the lookahead buffer. E.g. if the dictionary contains the strings "123" and "1234" and the lookahead buffer starts with "1234", both "123" and "1234" will be candidates and "1234" will be the longest match. If a hash collision occurs (i.e. an unrelated 3-byte sequence produces the same hash value), the corresponding string will also be considered as a candidate and rejected by the byte comparison. The addresses of the candidate strings are obtained by reading the head/next tables. Once the matching is done, a decompressor command is produced and the lookahead buffer/dictionary pointers are advanced. An optional hash table update takes place afterwards.
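The matching process described above can be summarized in software. The sketch below is a behavioral model, not the paper's RTL: the hash function and table sizes are placeholders, and the next table stores absolute positions for clarity, whereas the optimized hardware stores relative offsets (see the improvements listed later in this section). The word-wise comparison loop mimics the 32-bit buses that compare up to 4 bytes per clock cycle:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

constexpr unsigned HASH_BITS = 15;   // example "hash bit count" parameter
constexpr unsigned DICT_SIZE = 4096; // N, the sliding window size (example)

static unsigned hash3(const uint8_t* p) {      // hash of the first 3 bytes
    return ((p[0] << 10) ^ (p[1] << 5) ^ p[2]) & ((1u << HASH_BITS) - 1);
}

// Compare word-by-word, mimicking the 32-bit buses: up to 4 bytes per step.
static int match_len(const uint8_t* a, const uint8_t* b, int max_len) {
    int len = 0;
    while (len + 4 <= max_len) {
        uint32_t wa, wb;
        std::memcpy(&wa, a + len, 4);
        std::memcpy(&wb, b + len, 4);
        uint32_t diff = wa ^ wb;
        if (diff)                               // first mismatch in this word
            return len + __builtin_ctz(diff) / 8; // GCC/Clang; little-endian
        len += 4;                               // all four bytes matched
    }
    while (len < max_len && a[len] == b[len]) ++len; // tail bytes
    return len;
}

struct Matcher {
    // head[h]: position of the most recent string with hash h (-1 = none).
    // next[pos % N]: position of the previous string with the same hash.
    std::vector<long> head = std::vector<long>(1u << HASH_BITS, -1);
    std::vector<long> next = std::vector<long>(DICT_SIZE, -1);

    long find(const uint8_t* buf, long pos, int max_len, int max_iters,
              int* best_len) {
        long best_pos = -1;
        *best_len = 0;
        for (long c = head[hash3(buf + pos)];
             c >= 0 && pos - c <= (long)DICT_SIZE && max_iters-- > 0;
             c = next[c % DICT_SIZE]) {
            // Hash collisions put unrelated strings on the chain; the byte
            // comparison below filters them out naturally.
            int len = match_len(buf + c, buf + pos, max_len);
            if (len > *best_len) { *best_len = len; best_pos = c; }
        }
        return best_pos;
    }

    void insert(const uint8_t* buf, long pos) { // register the string at pos
        unsigned h = hash3(buf + pos);
        next[pos % DICT_SIZE] = head[h];
        head[h] = pos;
    }
};
```

In the hardware, the head, next, dictionary and lookahead memories are separate block RAMs, so the chain read, the table update and the comparison proceed in parallel rather than sequentially as in this model.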

To illustrate how the high compression performance is achieved, we describe a typical state flow of the main FSM:

- Initially, the compressor waits until the lookahead buffer contains at least 262 bytes (the maximum match length plus the minimum match length plus one, following ZLib) and the hash value of its front is available. As the filling runs in the background, this state typically takes only 1 clock cycle. The hash value from the lookahead buffer is routed to the head table address.
- As soon as the data is available, the matching preparation occurs. The value from the head table is used as the first string address. It is also routed to the next table to get the address of the next string with the same hash value. The head and next tables are updated in this cycle to allow finding the currently processed string in the future.
- At the next clock cycle the matching begins. The next table is read in parallel, so the bottleneck here is the actual comparison of the strings (accelerated by the 32-bit buses).
- When the matching is complete, the output is produced. The values for D and L, depending on the matching results, are output to the compressed stream interface. If the sink requests a delay, the main FSM is stalled.
- If a full hash table update can be performed (decided based on the match length), the FSM updates the head/next tables for every byte of the matched string. Every update iteration takes 1 clock cycle.
- When the hash table update is done (or was skipped), the FSM re-enters the initial waiting state.

Depending on the properties of the input data, 30-85% of the matching operations will be unsuccessful and end up producing "output literal" commands, requiring at least 3 clock cycles (plus the matching itself). We have implemented a special hash prefetching mechanism accelerating this scenario. A separate FSM is active during the match preparation and the matching. It buffers the data from the lookahead buffer and the hash cache and uses the available clock cycles to prefetch (or precompute) the hash value at offset 1 in the lookahead buffer. If no match was found (i.e. the lookahead buffer is going to be advanced by 1 byte), the prefetched value is routed to the head table address and the FSM goes directly to the match preparation state, skipping the waiting state and requiring only 2 non-matching cycles instead of 3.

The concept of head/next tables was introduced in ZLib [1] and mentioned in [11]. Originally, both the head and the next table contain absolute string offsets inside the dictionary. Every N bytes (where N is the dictionary size), ZLib rotates the dictionary: the last N bytes are moved up (a total of 64 KB is allocated) and each head/next value is adjusted accordingly (the ones pointing outside the buffer are zeroed). The time overhead is negligible in the slower software implementation; however, it would consume 25-75% of the clock cycles (depending on the hash/dictionary sizes) in the fast hardware implementation. We have made 3 improvements that reduce the clock cycle overhead to 1-2%:

- The next table contains relative addresses. This requires 1 extra adder to compute the absolute address, but eliminates the need to rotate the next table.
- Every record inside the head table contains k extra generation bits, as if the dictionary were 2^k times bigger. The real dictionary size is still used to detect whether a record points outside the dictionary, but the rotation has to be performed 2^k times less often.
- The head table memory is internally split into M sub-memories, each having the size of a single block RAM inside the FPGA. The rotation happens in parallel and requires M times fewer cycles.
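The generation-bit trick can be illustrated with a few lines of arithmetic. This is a hypothetical rendering of the idea with example parameter values, not the paper's RTL: positions are tracked modulo 2^k * D, and a head entry is treated as valid only while it still lies inside the real D-byte window.

```cpp
#include <cstdint>

// Example parameters; both must be powers of two for the mask below.
constexpr uint32_t D = 4096;       // real dictionary size
constexpr uint32_t K = 2;          // generation bits
constexpr uint32_t VIRT = D << K;  // virtual dictionary size, 2^k * D

// 'cur' is the current input position modulo VIRT, 'entry' a head record
// (assumed already written once). The entry is valid only if it still lies
// inside the last D bytes; older entries read as stale, exactly as if the
// dictionary were 2^k times bigger.
bool entry_valid(uint32_t cur, uint32_t entry) {
    uint32_t age = (cur - entry) & (VIRT - 1); // wrap-around distance
    return age < D;                            // stale entries point outside
}
// Rebasing all head entries ("rotation") is now needed only every VIRT
// bytes, i.e. 2^k times less often; splitting the head memory into M
// block-RAM-sized banks lets those rebasing passes run in parallel.
```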
The output interface of the LZSS compressor is connected to a fixed-table pipelined Huffman encoder that produces a ZLib-compatible stream. As the table is fixed, no additional clock cycles or memories are required to build it, and the encoder does not introduce any delays into the stream produced by the LZSS compressor. The cost of the high performance is less efficient compression compared to dynamic Huffman coders; however, this can be compensated by increasing the LZSS compression level.

Our implementation is generic. Various compile-time parameters can be customized to find an optimal trade-off between FPGA resource utilization, compression ratio and speed. The dictionary size, hash bit count, exact hash function, generation bit count and head table division factor can all be customized at compile time. Run-time parameters (e.g. the matching iteration limit) can also be changed. We have provided an interactive estimation tool that compresses a given file using several presets and produces reports regarding the block RAM amount, compression ratio and clock cycle usage.

To maintain high design modularity and decouple the architecture from the low-level details (e.g. hash function, data types and bus sizes), we have used the policy-class-based design approach and the THDL++ language [14], which extends VHDL semantics with object-oriented features. THDL++ code can be compiled to VHDL-93 using the freely available compiler and IDE [15].
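For reference, the fixed literal/length table that the fixed-table encoder mentioned above implements comes straight from the Deflate specification (RFC 1951, section 3.2.6). The sketch below is a software rendering of that mapping, not the paper's pipelined encoder:

```cpp
#include <cstdint>

// Fixed Huffman code for literal/length symbols per RFC 1951, sec. 3.2.6.
struct HuffCode { uint16_t bits; uint8_t len; };

HuffCode fixed_litlen_code(unsigned sym) {      // sym in 0..287
    if (sym <= 143) return { uint16_t(0x030 + sym),         8 }; // 00110000..
    if (sym <= 255) return { uint16_t(0x190 + (sym - 144)), 9 }; // 110010000..
    if (sym <= 279) return { uint16_t(0x000 + (sym - 256)), 7 }; // 0000000..
    return             { uint16_t(0x0C0 + (sym - 280)),     8 }; // 11000000..
}
// Distance symbols 0..29 simply use 5-bit codes equal to their own value.
// Per RFC 1951, codes are packed starting from their most significant bit.
```

Because this mapping is pure combinational logic, a hardware encoder can translate each D/L pair in the cycle it arrives, which is why the fixed-table coder adds no delay to the LZSS output stream.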

V. RESULTS

In this section we evaluate the LZSS compressor design by comparing its performance to a software implementation, provide FPGA utilization figures and show the impact of various design settings on the design size and performance.

Our test system is the ML507 development board based on a Virtex-5 FPGA. We have developed a testbench that receives a data block from the PC over Ethernet, stores it in the DDR2 memory, compresses it and sends the result back. The compression time includes the DMA [13] setup times but excludes the Ethernet transmission time. We have compared a software implementation (ZLib [1] running on the PowerPC processor inside the XC5VFX70T FPGA) with the hardware implementation using parameters optimized for speed (4KB dictionary, 15-bit hash). The clock frequency of the PowerPC was 400 MHz, while the compressor was connected to a 100 MHz clock (post-route analysis reported a maximum clock frequency above 100 MHz).

We have used 2 data sets: a fragment of a Wikipedia text snapshot [16] (referred to as "Wiki") and sample data obtained from an automotive CAN logger (referred to as "X2E"). We have run the test with 10 MB and 50 MB fragments to factor out the DMA setup time. Table I shows the performance comparison (the parameters were identical and the input and output streams were equal for both implementations).

TABLE I. PERFORMANCE EVALUATION

Data sample | SW speed (MB/s) | HW speed (MB/s) | Speedup | Compression ratio
Wiki 50MB   | -               | -               | -       | 1.69
Wiki 10MB   | -               | -               | -       | 1.68
X2E 50MB    | -               | -               | -       | 1.7
X2E 10MB    | -               | -               | -       | 1.7

In addition to the 15-20x performance increase, the use of the DMA engine to transfer the data between the DRAM and the hardware compressor allows running high-level tasks on the CPU in parallel with the compression.

Table II shows that the FPGA utilization in terms of lookup tables (LZSS + fixed-table Huffman) remains insignificant and almost the same (approximately 0.6% of the Virtex-5 FPGA) for all reasonable dictionary and hash sizes.

TABLE II. FPGA UTILIZATION
(LUT and register counts for several hash sizes, the first row using a 15-bit hash, with 16KB, 8KB and 4KB dictionaries, against the totals available in the XC5VFX70T FPGA; the numeric cell values are not recoverable from the transcription.)

To simplify design space exploration we have developed a software estimator tool [17]. The tool consists of a flexible cycle-accurate C++ model and a C# front-end. The C++ model accepts various design parameters (e.g. the window size), compresses reference data blocks and produces various cycle-accurate statistics. The C# front-end allows constructing series of parameter sets (e.g. iterating an arbitrary parameter over a given range), iteratively runs the C++ model and visualizes the obtained results. The rest of this section describes several trade-offs explored by running a 100 MB Wikipedia snapshot [16] through the software estimator.

First of all, increasing the dictionary size improves the compression ratio (Fig. 2). Moreover, the improvement is more significant for larger hash sizes.

Fig. 2. Compressed size (MB) of a 100 MB Wiki fragment [16] as a function of the dictionary size (1K-16K) for several hash sizes.

Increasing the dictionary size slightly slows down the compression. This can be compensated by increasing the hash size (Fig. 3), thus lowering the hash collision probability and reducing the number of matching iterations. However, increasing the hash size raises the memory requirements exponentially: the head table requires 2^H * (log2(D) + G) bits, where H is the hash bit count, D is the dictionary size and G is the number of generation bits.

Fig. 3. Compression speed (MB/s) for a 100 MB Wiki fragment [16] as a function of the dictionary size (2K-16K) for several hash sizes.
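The head table formula is easy to sanity-check numerically; the snippet below (parameter values are just examples) evaluates it for a speed-oriented configuration:

```cpp
#include <cstdio>
#include <cmath>

// Head-table size per Section V: 2^H * (log2(D) + G) bits, where
// H = hash bit count, D = dictionary size, G = generation bits.
unsigned long long head_table_bits(unsigned H, unsigned D, unsigned G) {
    unsigned log2D = (unsigned)std::log2((double)D);
    return (1ULL << H) * (log2D + G);
}

int main() {
    // 15-bit hash, 4KB dictionary, 2 generation bits:
    // 2^15 * (12 + 2) = 458752 bits = 56 KB of head-table storage.
    // Doubling H doubles this figure, which is why the hash size
    // dominates the memory requirements.
    std::printf("%llu bits\n", head_table_bits(15, 4096, 2));
}
```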
Another way of improving the compression efficiency is adjusting the algorithm parameters (e.g. the number of matching attempts before giving up). This can improve the compression by 20% at a cost of an 82% performance decrease (Fig. 4).

Fig. 4. Compressed size and compression speed for a 100 MB Wiki fragment [16] at the minimum and maximum compression levels with 9-bit and 15-bit hashes, over dictionary sizes 1K-16K.

As the other hardware implementations [11], [12] do not provide exact performance results, we have analyzed the impact of the 3 main optimization techniques relative to the design described in [11] by temporarily disabling them and measuring the performance impact. Table III summarizes the results.

TABLE III. COMPRESSION SPEED FOR A 100 MB WIKI FRAGMENT WITHOUT OPTIMIZATIONS

Configuration                          | Window size 4KB | Window size 16KB
A) Original (15-bit hash; 32-bit data) | 49.0 MB/s       | 46.2 MB/s
B) 8-bit data bus as in [11]           | 30.3 MB/s       | 25.9 MB/s
C) Disabled hash prefetching           | 45.2 MB/s       | 45.0 MB/s
D) Reduced generation bits to 0        | -               | 33.8 MB/s
Disabled all 3 optimizations over [11] | 10.2 MB/s       | 21.2 MB/s

The most efficient optimization for small window sizes is the introduction of generation bits, as using k generation bits makes the head-table rotation occur 2^k times less often (with k = 0, rotation happens every D bytes, where D is the dictionary size). Using wide data buses provides a 63-78% performance increase, and hash prefetching increases the performance by an additional 8%.

The overall performance increase due to the described optimizations is 2.2x-4.8x, depending on the window size.

As an indirect metric of LZSS compressor efficiency we have measured the number of clock cycles spent on actually comparing the data from the dictionary with the lookahead buffer (as opposed to the clock cycles spent on updating hash tables, computing read addresses, etc.). Figure 5 shows the state distribution for the 100 MB Wiki fragment with a 16KB dictionary and a 15-bit hash.

Fig. 5. Time spent on different operations (100 MB Wiki fragment): finding match (68.5%), updating hash table (11.6%), producing output (11.0%), waiting for data (8.4%), rotating hash (0.3%), fetching data (0.2%).

Most of the time (68.5%) is spent on reading and comparing the data (up to 4 bytes per cycle from each of the dictionary and the lookahead buffer). Producing the output and prefetching the next hash value in parallel takes 11% of the time. Another 11.6% of the time is spent on inserting every byte of a short match (up to 4 bytes) into the hash table. Finally, 8.4% of the time is spent waiting for the head table to be read when the prefetched hash value is not useful (i.e. when a valid match is found and several bytes are skipped).

VI. CONCLUSION

In this paper we have presented a high-performance, flexible implementation of the LZSS algorithm on a Virtex-5 FPGA. We have exploited the independently addressable dual-port block RAMs and performed several specific FSM and data structure optimizations, resulting in a 15-20x performance increase compared to the optimized software implementation [1]. The compressor design is flexible and allows tuning various parameters to achieve trade-offs between speed, compression ratio and block RAM utilization. An estimation tool available online [17] allows performing design space exploration and finding optimal parameters based on real data samples. We have verified the quality of our design by compressing more than 1 TB of data on the FPGA and comparing the results to a software reference model.

REFERENCES

[1] (2011, Sep.) ZLib compression library. [Online].
[2] Network Working Group. (1996, May) DEFLATE compressed data format specification version 1.3 (RFC 1951). [Online].
[3] (2011, Sep.) LZMA SDK. [Online]. Available: http://www.7-zip.org/sdk.html
[4] (2011, Sep.) LZSS algorithm. [Online].
[5] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 337-343, May 1977.
[6] S. Kreft and G. Navarro, "LZ77-like compression with fast random access," in Data Compression Conference (DCC 2010), March 2010.
[7] P. Rauschert, Y. Klimets, J. Velten, and A. Kummert, "Very fast gzip compression by means of content addressable memories," in TENCON 2004, IEEE Region 10 Conference, vol. D, Nov. 2004.
[8] J.-M. Chen and C.-H. Wei, "VLSI design for high-speed LZ-based data compression," IEE Proceedings - Circuits, Devices and Systems, vol. 146, no. 5, Oct. 1999.
[9] B. Jung and W. Burleson, "A VLSI systolic array architecture for Lempel-Ziv-based data compression," in IEEE International Symposium on Circuits and Systems (ISCAS '94), vol. 3, May-June 1994.
[10] M. Huebner, M. Ullmann, F. Weissel, and J. Becker, "Real-time configuration code decompression for dynamic FPGA self-reconfiguration," in Proceedings of the 18th International Parallel and Distributed Processing Symposium, April 2004.
[11] S. Rigler, W. Bishop, and A. Kennings, "FPGA-based lossless data compression using Huffman and LZ77 algorithms," in Canadian Conference on Electrical and Computer Engineering (CCECE 2007), April 2007.
[12] (2011, Sep.) Gzip compression/gunzip decompression core. [Online].
[13] (2011, Sep.) LocalLink interface. [Online]. Available: UserInterface.htm
[14] I. Shcherbakov, C. Weis, and N. Wehn, "Bringing C++ productivity to the VHDL world: from language definition to a case study," in Forum on Specification and Design Languages (FDL 2011), Sept. 2011.
[15] (2011, Sep.) VisualHDL website. [Online].
[16] (2011, Sep.) Large text compression benchmark. [Online].
[17] (2011, Sep.) Compression performance analyzer. [Online].
