LZ-UTF8

LZ-UTF8 is a practical text compression library and stream format designed with the following objectives and properties:

1. Compress UTF-8 and 7-bit ASCII strings only. No support for arbitrary binary content or other string encodings (such as ISO-8859-1, UCS-2, UTF-16 or UTF-32).

2. Fully compatible with UTF-8. Any valid UTF-8 byte stream is also a valid LZ-UTF8 byte stream (but not vice versa).

3. Individually compressed strings can be freely concatenated and intermixed with each other, as well as with plain UTF-8 strings, and yield a byte stream that can be decompressed to the equivalent concatenated source strings (see the sketch after this list).

4. An input string may be incrementally compressed with some arbitrary partitioning and then decompressed with any other arbitrary partitioning.

5. Decompressing any input block should always yield a valid UTF-8 string. Trailing incomplete codepoint sequences should be saved and only sent to the output when a valid sequence is available. The decompressor should always output the longest sequence possible. Consequently, no flushing is needed when the end of the stream is reached.

6. Should run efficiently in pure JavaScript, including on mobile. Decompression, in particular, should be fast.

7. Achieve a reasonable compression ratio.

8. Simple algorithm. Relatively small amount of code (the decompressor should be less than 100 lines of code).

9. One single permanent, standard scheme. No variations or metadata apart from the raw bytes. Try to adopt the best possible practices and find reasonable parameters that would be uniform across implementations (having uniform compressor output between different implementations would significantly simplify testing and quality assurance).
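As a quick illustration of property 3, here is a hypothetical usage sketch. It assumes the compress/decompress entry points of the reference JavaScript library (lzutf8.js), with Uint8Array as the default byte format; the exact API names are an assumption here, not part of the stream format specification.

    // Hypothetical sketch: two independently compressed strings plus plain
    // UTF-8 bytes, concatenated and decompressed in a single pass.
    const part1 = LZUTF8.compress("Hello, ");
    const part2 = LZUTF8.compress("world");
    const plain = new TextEncoder().encode("!"); // plain UTF-8, never compressed

    const joined = new Uint8Array(part1.length + part2.length + plain.length);
    joined.set(part1, 0);
    joined.set(part2, part1.length);
    joined.set(plain, part1.length + part2.length);

    LZUTF8.decompress(joined); // "Hello, world!"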

The Algorithm

The algorithm is the standard LZ77 algorithm with a window of 32767 bytes. The minimum match length is 4 bytes, the maximum is 31 bytes. The compressed stream format is relatively simple, as it is completely byte-aligned (necessary in order to allow some of the properties mentioned above).

The Byte Stream

There are two types of byte sequences in the stream. The first is a standard UTF-8 codepoint sequence: a leading byte followed by 0 to 3 continuation bytes. The second is a sized pointer sequence (or alternatively, a length-distance pair), made of 2 or 3 consecutive bytes, referencing the length of, and distance to, a previous sequence in the stream.

The 5 least significant bits of the lead byte of a sized pointer sequence contain the length of the sequence, which can have a value between 4 and 31. The 3 most significant bits of the lead byte are either 110 for a 2-byte pointer or 111 for a 3-byte pointer. This was intentionally chosen to be similar to the standard UTF-8 lead byte identifier bits*. The decoder differentiates between a codepoint and a sized pointer sequence by the second byte's top bit, which is always 1 in a codepoint sequence but 0 in a sized pointer sequence.

The second and optionally third bytes of a pointer sequence represent the distance to the start of the pointed sequence. If the distance is smaller than 128, a single byte is used, containing the literal 7-bit value. Otherwise, two bytes are used, encoded as a big-endian 15-bit integer.

* The reason has to do with being easily able to detect invalid or truncated streams, and issues with decompression of incomplete parts.
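To make the encoding concrete, here is a minimal sketch of a pointer encoder in JavaScript (a hypothetical helper, not code from the reference implementation): the length occupies the 5 low bits of the lead byte, and the distance is emitted as either a single 7-bit byte or a big-endian 15-bit integer.

    // Encode a (length, distance) pair as a sized pointer sequence.
    // length must be in 4..31, distance in 1..32767.
    function encodeSizedPointer(length, distance) {
        if (distance < 128) {
            // 2-byte form: 110lllll 0ddddddd
            return [0b11000000 | length, distance];
        } else {
            // 3-byte form: 111lllll 0ddddddd dddddddd
            // (15-bit big-endian distance; the second byte's top bit is 0)
            return [0b11100000 | length, distance >>> 8, distance & 0xff];
        }
    }

    encodeSizedPointer(9, 100);  // [0xC9, 0x64]
    encodeSizedPointer(9, 1000); // [0xE9, 0x03, 0xE8]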

The following table demonstrates the unambiguity of the two encoding patterns (where l bits represent length and d bits represent distance):

            Sized pointer sequence        UTF-8 codepoint sequence
    1 byte  n/a                           0xxxxxxx
    2 bytes 110lllll 0ddddddd             110xxxxx 10xxxxxx
    3 bytes 111lllll 0ddddddd dddddddd    1110xxxx 10xxxxxx 10xxxxxx
    4 bytes n/a                           11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The Compression Process (reference implementation)

Note: this describes the exact algorithm used by the reference implementation. In terms of decompressor compatibility, there is no actual necessity that every implementation follows it, but having uniform results from other implementations would greatly simplify testing and overall quality assurance.

During compression, every consecutive 4-byte input sequence is hashed (including overlapping sequences). The hash function used is a Rabin-Karp polynomial hash with a prime constant of 199 (b0*199^3 + b1*199^2 + b2*199^1 + b3*199^0). A rolling hash implementation may be used, though in practice it has proven to be slower than simply recalculating the hash.

The hash is then looked up in a hash table (having a capacity of exactly 65537 buckets, with a hash index of hash mod 65537) to find a match among previous occurrences of the 4-byte sequence it represents. If the bucket representing the hash of the sequence is empty, no match has been found: the literal byte is sent to the output and the read position is incremented to the next byte. Otherwise, a match may have been found: all sequences (within the 32K window) referenced in the bucket are then tested. For every possible match tested, all bytes (including additional bytes beyond the first 4) are linearly compared up to the longest match possible (or the maximum allowed, 31 bytes).
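The following is a brief sketch (not the reference code itself) of the hash and the bucket search described above; the constant 199 and the 65537-bucket table come directly from the text, while the helper names and data layout are assumptions.

    // Rabin-Karp style polynomial hash of the 4-byte sequence at pos:
    // b0*199^3 + b1*199^2 + b2*199 + b3, reduced to a bucket index mod 65537.
    function hashSequence(bytes, pos) {
        return (bytes[pos] * 7880599 +      // 199^3
                bytes[pos + 1] * 39601 +    // 199^2
                bytes[pos + 2] * 199 +
                bytes[pos + 3]) % 65537;
    }

    // buckets: an array of 65537 arrays of earlier sequence positions.
    function findLongestMatch(bytes, pos, buckets, windowStart) {
        const bucket = buckets[hashSequence(bytes, pos)];
        if (!bucket) return null;

        let best = null;
        for (const candidate of bucket) {
            if (candidate < windowStart) continue; // outside the 32K window

            // Linear comparison, up to the maximum match length of 31 bytes.
            // Hash collisions are harmless: they simply fail to match here.
            let length = 0;
            while (length < 31 && pos + length < bytes.length &&
                   bytes[candidate + length] === bytes[pos + length]) {
                length++;
            }

            if (length >= 4 && (best === null || length > best.length)) {
                best = { length: length, distance: pos - candidate };
            }
        }
        return best;
    }

(The reference implementation additionally applies the length/byte-count preference for distances smaller than 128, described in the note below.)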

Note: the bucket may contain references to non-matching sequences, which may happen because of hash collisions. This is not a problem, as they will not pass the linear comparison.

The longest match found* in the bucket is the one eventually used. A length-distance pair is then encoded to a sized pointer sequence of 2 or 3 bytes (described above) and sent to the output. The read position is then incremented past the matched bytes; sequences within the matched range are hashed and added to the hash table, but not matched against previous ones. Matching resumes at the first byte following the matched range. A reference to the current sequence is then added to its bucket.

Every bucket in the hash table may contain up to 64 elements. When filled to its maximum capacity, the oldest 32 elements are discarded.

* Since distances smaller than 128 are encoded in 2 bytes and larger ones in 3, the compressor optimizes for the highest length/byte-count ratio. It will reject a match of distance >= 128 that is not more than 1.5x longer than a previously found match of distance < 128. This may also slightly improve compression speed.

The Decompression Process

Decompression is very simple. The compressed stream is iterated one byte at a time. If the current byte's top bits are 111 or 110 and the following byte has 0 as its top bit, the current byte and the one(s) following it are interpreted and decoded as a sized pointer sequence (described above). Otherwise, the raw input byte value is sent to the output. Decoding a sized pointer is as simple as finding its start position (= current read position - distance) and copying length bytes from there to the output.

Note: when decompressing an arbitrary, partial block from the input stream, special care is needed to always output valid UTF-8 strings. If any of the last 1-3 output bytes are part of a truncated codepoint or pointer sequence, they are saved and only decoded with the next block.
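A condensed sketch of this loop in JavaScript (illustrative only: it assumes a complete input stream and omits the partial-block handling described in the note above):

    // Decompress a complete LZ-UTF8 byte stream into a UTF-8 byte array.
    function decompress(input) {
        const output = [];
        for (let pos = 0; pos < input.length; pos++) {
            const b = input[pos];
            // Top bits 110 or 111 followed by a 0-top-bit byte: sized pointer.
            if (b >>> 5 >= 0b110 && pos + 1 < input.length &&
                (input[pos + 1] & 0x80) === 0) {
                const length = b & 0b00011111;
                let distance;
                if (b >>> 5 === 0b110) {
                    distance = input[pos + 1];                         // 7-bit
                    pos += 1;
                } else {
                    distance = (input[pos + 1] << 8) | input[pos + 2]; // 15-bit
                    pos += 2;
                }
                // Copy 'length' bytes starting 'distance' bytes back in the
                // output. Byte-by-byte copying handles overlapping runs.
                const start = output.length - distance;
                for (let i = 0; i < length; i++) {
                    output.push(output[start + i]);
                }
            } else {
                output.push(b); // literal byte of a UTF-8 codepoint sequence
            }
        }
        return new Uint8Array(output);
    }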

Efficiency and Performance

The JavaScript implementation, benchmarked on a single core of an Intel Pentium G3320 (a low-end desktop) on a modern browser, for 1MB files, has a typical compression speed of 3-14 MB/s. Decompression speed ranges between 20-80 MB/s. The C++ implementation typically achieves around 30-40 MB/s for compression and 300-500 MB/s for decompression. These figures may further improve with subsequent optimizations of the code.

Compression ratio is usually lower than lzma and lz-string on relatively long inputs (longer than 32 KB) and similar or better on shorter ones. On very long (or some pathologically repetitive) inputs, it may be substantially lower, mainly because of the 32 KB window and 31-byte sequence limits.

Compression memory requirement is about <input size> * 2, plus hash table memory bounded by an approximate worst case of 65537 * 64 * 4 (<BucketCount> * <MaxBucketCapacity> * <SequenceLocatorByteSize>) = ~16 MB. The theoretical upper bound on the number of different 4-byte permutations for a 32 KB window is 4 * 32 KB = 128 KB (the actual number is smaller because of overlap), which is about twice the hash table bucket count, yielding a worst-case load factor of about 2 (assuming a uniform hash distribution), which is acceptable. A larger/smaller maximum bucket capacity may be used, increasing/decreasing compression ratio slightly (and decreasing/increasing compression speed, respectively), though experimentally the effect hasn't proven significant enough, given the small window.

Some example comparison results in various browsers (times are in milliseconds; note that this includes binary-to-string conversion time, which may be a significant fraction of the decompression time). Each row gives the compressed size in bytes, followed by compression/decompression time pairs for three browsers (including Firefox 32):

English Bible Excerpt (978453 bytes):
    lzma       244983    2604 / 307     2825 / 615     3527 / 1396
    lz-string  327206     577 / 77       455 / 64       976 / 71
    lz-utf8    367359     109 / 17       103 / 17       257 / 50

Hindi Bible Excerpt (999716 bytes):
    lzma       140104    3738 / 228     3230 / 552     3826 / 1190
    lz-string  157328     213 / 37       309 / 45       300 / 34
    lz-utf8    226721      95 / 16        90 / 15       280 / 40

Japanese Alice in Wonderland (246025 bytes):
    lzma        59711    1405 / 80      1084 / 141     1088 / 272
    lz-string   59472      58 / 14       113 / 11        63 / 13
    lz-utf8     86302      59 / 4         52 / 3         98 / 8

JQueryUI.js (451466 bytes):
    lzma        95942    4198 / 129     1284 / 262     1518 / 585
    lz-string  154464     212 / 36       214 / 19       334 / 36
    lz-utf8    137170      74 / 8         62 / 8        110 / 24

Copyright 2014-2015, Rotem Dan <rotemdan@gmail.com>. Available under the MIT license.