Error Resilient LZ'77 Data Compression
Stefano Lonardi, Wojciech Szpankowski, Mark Daniel Ward
Presentation by Peter Macko
Motivation
Lempel-Ziv '77 lacks any form of error correction: introducing a single error into the compressed stream corrupts O(n^(2/3) log n) symbols, where n is the length of the stream.
LZRS'77 provides a way to add error-correction bits without losing compression power or backward compatibility.
The idea on which LZRS'77 is based can be extended to other algorithms such as LZW.
Example of Error Propagation
In a sliding-window implementation of LZ'77 ('_' marks a space):
Original text:   THE_THEFT_OF_THE_IDE:_THE_IDENTITY
Compressed text: THE_(4,3)FT_OF_(13,4)IDE:(9,8)NTITY
Example of Error Propagation
Compressed:            THE_(4,3)FT_OF_(13,4)IDE:(9,8)NTITY
Compressed with error: THE_(4,4)FT_OF_(13,4)IDE:(9,8)NTITY
Decompressed:          THE_THE_FT_OF_HE_TIDE:_HE_TIDENTITY
A single corrupted pointer, (4,3) becoming (4,4), garbles the rest of the decoded text.
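The propagation is easy to reproduce with a minimal LZ'77 decoder sketch. Assumed token format: literal strings and (distance, length) pairs, where the distance counts back from the current end of the output; under this convention the example's final pointer is (9,8).

```python
def lz77_decode(tokens):
    """Decode LZ'77 tokens: a token is either a literal string or a
    (distance, length) pair copying from `distance` characters back."""
    out = []
    for tok in tokens:
        if isinstance(tok, str):
            out.extend(tok)
        else:
            distance, length = tok
            for _ in range(length):          # byte-by-byte, so the copy
                out.append(out[-distance])   # may overlap its own output
    return "".join(out)

good = ["THE_", (4, 3), "FT_OF_", (13, 4), "IDE:", (9, 8), "NTITY"]
bad  = ["THE_", (4, 4), "FT_OF_", (13, 4), "IDE:", (9, 8), "NTITY"]  # one corrupted pointer

print(lz77_decode(good))  # THE_THEFT_OF_THE_IDE:_THE_IDENTITY
print(lz77_decode(bad))   # THE_THE_FT_OF_HE_TIDE:_HE_TIDENTITY
```

Note that only the first pointer differs, yet every later pointer copies from already-wrong text, so the damage cascades.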
Motivation for LZS'77
LZS'77 is capable of storing error-correction bits in LZ'77 files while:
- not losing compression power
- remaining backward-compatible with generic LZ'77 decoders
The Basic Idea of LZS'77
[Figure: sliding window Z B C D B C D Z B C B X B C Z B C, with the current position at the following X]
The choice of the reference inside the sliding window can be used to carry extra information.
M = multiplicity, the number of occurrences of the matched substring in the sliding window.
Using Redundant Information
[Figure: the same window, with the candidate references labeled 10, 01, 00, 11]
To store a value X, choose the (X+1)-th reference in the sliding window, counting from the right.
floor(log2 M) bits can be stored at this position.
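The encoding and decoding of the hidden value can be sketched as follows (helper names are hypothetical; the window string follows the slide's figure):

```python
import math

def occurrences(window, phrase):
    """Start positions of every (possibly overlapping) occurrence."""
    out, pos = [], window.find(phrase)
    while pos != -1:
        out.append(pos)
        pos = window.find(phrase, pos + 1)
    return out

def choose_reference(window, phrase, x):
    """Encoder: pick the (x+1)-th occurrence counting from the right,
    so the pointer itself carries the value x."""
    occ = occurrences(window, phrase)
    return occ[len(occ) - 1 - x]

def recover_value(window, phrase, chosen_pos):
    """Decoder: the rank of the chosen occurrence from the right is x."""
    occ = occurrences(window, phrase)
    return len(occ) - 1 - occ.index(chosen_pos)

window = "ZBCDBCDZBCBXBCZBC"        # sliding-window contents
occ = occurrences(window, "BC")     # [1, 4, 8, 12, 15], so M = 5
capacity = int(math.log2(len(occ))) # floor(log2 5) = 2 bits fit here
pos = choose_reference(window, "BC", 3)
assert recover_value(window, "BC", pos) == 3
```

Any of the M references decompresses to the same text, which is why a plain LZ'77 decoder is unaffected by the choice.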
How Many Redundant Bits?
Theoretically:
  Pr(M_n = j) → (p^j · q + q^j · p) / (j · h)   as n → ∞
M_n = multiplicity after n bits are compressed
p = the probability of encountering a 0
q = 1 − p = the probability of a 1
h = −p log p − q log q = Shannon entropy
How Many Redundant Bits?
Theoretically (p = 0.5, q = 0.5):
  Pr(M_n = j) → (p^j · q + q^j · p) / (j · h)
Observations:
- the probability is maximal for M_n = 1
- the probability for M_n = 2 is 4 times smaller
- M_n is well concentrated around its mean
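The limiting distribution can be checked numerically; a small sketch (h is taken in nats, so the probabilities sum to 1):

```python
import math

def pr_multiplicity(j, p):
    """Limiting Pr(M_n = j) = (p^j q + q^j p) / (j h) for a memoryless
    binary source with P(0) = p; h is the Shannon entropy in nats."""
    q = 1.0 - p
    h = -p * math.log(p) - q * math.log(q)
    return (p**j * q + q**j * p) / (j * h)

p = 0.5
print(pr_multiplicity(1, p) / pr_multiplicity(2, p))   # 4.0: M_n = 2 is four times rarer
print(sum(pr_multiplicity(j, p) for j in range(1, 200)))  # ~1.0, a proper distribution
```

For p = q = 0.5 the ratio Pr(M = 1) / Pr(M = 2) is exactly 4, matching the observation above.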
How Many Redundant Bits?
In practice:
[Figure: the increasing value of the multiplicity M over increasing prefixes of paper2 (left) and news (right) from the Calgary corpus]
Experimental Results

File     Original   gzip      gzip-S    Redundant bytes   Size increase
bib      111,261    39,473    39,511    1,721  (4.36%)     0.10%
book1    768,771    333,776   336,256   14,524 (4.35%)     0.74%
book2    610,856    228,321   228,242   10,361 (4.54%)    -0.03%
geo      102,400    69,478    71,168    4,101  (5.90%)     2.43%
news     377,109    155,290   156,150   5,956  (3.84%)     0.55%
obj1     21,504     10,584    10,783    353    (3.34%)     1.88%
obj2     246,814    89,467    89,757    3,628  (4.06%)     0.32%
paper1   53,161     20,110    20,204    937    (4.66%)     0.47%
paper2   82,199     32,529    32,507    1,551  (4.77%)    -0.07%
Not Enough?
- Look only for sufficiently long matches.
- This increases the file size by O(log log n / log n), where n is the length of the original file.
- To be addressed in future research.
Compatibility with LZ'77
The only difference between LZS'77 and plain LZ'77 is that in LZS'77 the references inside the sliding window are chosen deliberately rather than arbitrarily.
A generic LZ'77 decoder does not care which occurrence in the sliding window is referenced, so the output remains decompressible.
Motivation for LZRS'77
LZRS'77 is a compression algorithm based on LZS'77 that uses Reed-Solomon error-correcting codes.
It is capable of fixing a fixed number of errors in each block of data.
Reed-Solomon Codes
The data is divided into blocks of at most 2^s − 1 bytes:
  Data (2^s − 1 − 2e bytes) | Parity (2e bytes)
s = the size of a symbol in bits
e = the maximum number of errors tolerated per block
Reed-Solomon in LZRS'77
The data is divided into 255-byte blocks:
  Data (255 − 2e bytes) | Parity (2e bytes)
s = 8 (one symbol = one byte)
e = the maximum number of errors tolerated per block
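The block geometry follows directly from these parameters; a minimal sketch:

```python
def rs_block_layout(e, s=8):
    """Reed-Solomon block over GF(2^s): 2^s - 1 symbols in total,
    2e of them parity; up to e symbol errors are correctable."""
    total = 2**s - 1
    parity = 2 * e
    return total, total - parity, parity   # (block, data, parity) sizes

print(rs_block_layout(1))  # (255, 253, 2)
print(rs_block_layout(2))  # (255, 251, 4)
```
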
Compression Algorithm
1. Use plain LZ'77 to compress the data.
2. Split the compressed data into blocks of size 255 − 2e bytes: Block 1, Block 2, Block 3, ..., Block N.
Compression Algorithm
3. Process the blocks in reverse order: generate the error-correction code (RS parity) for the i-th block and embed it into the previous, (i−1)-th, block.
The RS codes are embedded by adjusting the sliding-window references inside the block, as in LZS'77; the RS code for the first block is sent separately.
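The reverse-order pass can be sketched as below. `rs_parity` and `embed` are hypothetical stand-ins: a real `embed` rewrites the block's LZS'77 references rather than appending bytes, so the decoded text is unchanged.

```python
def protect_blocks(blocks, rs_parity, embed):
    """LZRS'77 step 3: walk the blocks last-to-first, embedding each
    block's RS parity into its predecessor. Reverse order matters:
    block i's parity must cover the bits already embedded in block i.
    Returns the parity of the first block, which travels separately."""
    for i in range(len(blocks) - 1, 0, -1):
        blocks[i - 1] = embed(blocks[i - 1], rs_parity(blocks[i]))
    return rs_parity(blocks[0])

# Toy stand-ins, for illustration only:
rs_parity = lambda block: bytes([sum(block) % 256])   # not a real RS code
embed = lambda block, parity: bytes(block) + parity   # real embed: rewrite references

blocks = [b"AA", b"BB", b"CC"]
first_parity = protect_blocks(blocks, rs_parity, embed)
```

After the call, each block carries its successor's parity, and `first_parity` is the only piece that needs a separate channel.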
Decompression
1. After receiving a block of data, use its error-correction code to check and, if possible, recover the block.
2. Extract the data using LZ'77.
3. Extract the error-correction code for the next block using LZS'77, then repeat for the next block.
Problems
- The entire set of buffers needs to be in memory during compression. Solution: divide the compressed file into parts.
- The encoder cannot process the data as it arrives, since blocks are handled in reverse order.
- The RS code of the first block needs to be sent separately. Solution: do not send it.
Experimental Results
[Figure: probability of decompressing incorrectly vs. number of errors, for e = 1 and e = 2 with 100 blocks]
Example: with 2 bytes of parity per 252-byte block and a file of 100 blocks, the probability of decompressing correctly is 90% under 20 uniformly distributed errors.
Applicability of the Idea
The underlying idea is applicable to many compression schemes.
Multiplicity (if not already present) can be created at a small cost in compression power.
Applicability to LZW / LZ'78
Further research showed that the error-resilient scheme can be adapted to LZW.
LZW is based on a dynamic dictionary.
LZW is used by Unix compress, WinZip, GIF, TIFF, PDF, V.42bis, and others.
Applicability to LZW / LZ'78
Use a shorter match instead of the longest one to carry additional information.
Two ways to do this:
- the same as in LZS'77: the choice among candidate matches carries the bits
- use the shorter match to end a block, so that the size of the block carries the extra information
Summary
- Multiplicity in LZ'77 can be exploited to add extra error-correction bits
- with virtually no loss of compression power
- while preserving backward compatibility.
This idea can be extended to other compression schemes such as LZW.