Chris Fleizach
Progress Report: Lossy Compression of Scientific Data with Wavelet Transforms

Introduction

Scientific data gathered from simulation or real measurement usually requires 64-bit floating point numbers to retain accuracy. Unfortunately, these numbers do not lend themselves to traditional compression techniques, such as the run-length encoding used in most general compression schemes. A different approach is to recognize that a reversible signal transformation can represent the same information more compactly. Perhaps the most useful technique for this domain is the wavelet transform, which goes beyond the traditional Fourier transform by effectively capturing both time and frequency information. The end result is that most of the interesting data is concentrated into a smaller range of values. At this point, the data can be thresholded while retaining a high percentage of the energy of the signal. Subsequent reconstruction introduces only minor errors, depending on the level of thresholding and the wavelet used. Because the data has been thresholded, there will be a large number of zeros, which can be compressed using traditional techniques.

The original project specifications included an analysis of compression techniques to determine appropriate settings and useful wavelets. After deciding on a viable method, an implementation was to be done in Matlab and then in C++ for 2-D and 3-D data. A final step was to parallelize some portion of the code.

Wavelet Analyses

There is a wide variety of wavelet processing capabilities available in Matlab. Understanding all of them took a fair amount of reading and experimentation. The goal was to choose the combination of decomposition, wavelet filter and thresholding technique that would result in the best accuracy and compression. The first step was to choose between continuous and discrete wavelet transformations. While continuous transformations provide exact reconstruction, they are slow and encode redundant data.
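The transform-threshold-compress pipeline described above can be sketched in a few lines. This is an illustrative sketch only: it uses the simple Haar filter and a single decomposition level, not the biorthogonal filters and multi-level decompositions evaluated below.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One level of the 1-D Haar transform: the first half of the output holds
// scaled local averages (approximation), the second half scaled local
// differences (detail). Assumes an even-length input.
std::vector<double> haar_forward(const std::vector<double>& x) {
    std::size_t half = x.size() / 2;
    std::vector<double> out(x.size());
    const double s = std::sqrt(2.0);
    for (std::size_t i = 0; i < half; ++i) {
        out[i]        = (x[2 * i] + x[2 * i + 1]) / s;  // low-pass
        out[half + i] = (x[2 * i] - x[2 * i + 1]) / s;  // high-pass
    }
    return out;
}

// Exact inverse of haar_forward.
std::vector<double> haar_inverse(const std::vector<double>& c) {
    std::size_t half = c.size() / 2;
    std::vector<double> out(c.size());
    const double s = std::sqrt(2.0);
    for (std::size_t i = 0; i < half; ++i) {
        out[2 * i]     = (c[i] + c[half + i]) / s;
        out[2 * i + 1] = (c[i] - c[half + i]) / s;
    }
    return out;
}

// The lossy step: zero every coefficient below the threshold. The long runs
// of zeros this creates are what gzip then compresses so effectively.
std::size_t hard_threshold(std::vector<double>& c, double eps) {
    std::size_t zeroed = 0;
    for (double& v : c)
        if (std::fabs(v) < eps) { v = 0.0; ++zeroed; }
    return zeroed;
}
```

Reconstruction after thresholding differs from the original only in the discarded coefficients, which is why small thresholds keep the maximum error small.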
Instead, it has been found that using only a dyadic sampling of the data is remarkably more efficient and nearly as accurate. This is the essence of the discrete transform, which produces an approximation vector and a detail vector (or three detail vectors in the 2-D case).

Next, appropriate wavelet filters were examined by brute force. There are a large number of such filters that tend to be useful in specific domains. For compression purposes, it turned out that a few biorthogonal filters provided the best compression and accuracy. Biorthogonal filters differ from others by requiring different sets of coefficients for decomposition and reconstruction. There are usually two sets for each: one that creates the approximation using a low-pass filter, and one that creates the details with a high-pass filter. Figure 1 shows a
graphical representation of the two filters chosen.

Figure 1: Biorthogonal Filters (3.1 and 5.5) were the most effective

Additionally, wavelets can be recursively decomposed a number of times. Doing so has the advantage of separating more detail out of the approximation vectors, which allows more aggressive thresholding. Each iteration halves the approximation vector and introduces another detail vector. Decomposing three times and five times were both examined.

Testing each wavelet involved decomposing a 1000x1000 grid of vorticity data, thresholding it at a certain value, then storing and compressing it using gzip. The data was then reconstructed and compared to the original to obtain the maximum error between the two.

Thresholding was examined in a few ways. First, various detail coefficient vectors were removed. Then reconstructing with only the approximation vector was tried. Lastly, removing numbers below some small threshold was tested for limits between 1x10^-4 and 1x10^-7. Three methods exhibiting different qualities were chosen for implementation; they are listed in Table 1. Figure 2 shows the original data, the reconstruction and the error between the two for a 75x75 grid using the High type of compression listed in Table 1. Matlab functions have been implemented to provide this functionality automatically. (Note: the nsflow program was modified to automatically include the size of the array in the output file, simplifying some processing.)
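The accuracy metric used in these tests, the maximum pointwise error between the original grid and its reconstruction, can be sketched as follows (the grid is treated as a flat array of samples):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Maximum absolute difference between an original grid and its
// reconstruction, scanning the 2-D grid as a flat array of samples.
double max_error(const std::vector<double>& orig,
                 const std::vector<double>& recon) {
    assert(orig.size() == recon.size());
    double m = 0.0;
    for (std::size_t i = 0; i < orig.size(); ++i)
        m = std::max(m, std::fabs(orig[i] - recon[i]));
    return m;
}
```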
Type        Wavelet                           Max Error    Compression Ratio
Low Comp.   Bior5.5 Level 5 (1e-7 threshold)  7.35x10^-8    25.36
Medium      Bior3.1 Level 5 (1e-5 threshold)  3.54x10^-6    59.91
High        Bior3.1 Level 5 (1e-4 threshold)  3.03x10^-5   128.92

Table 1: Data was a 1000x1000 vorticity grid. The compression ratio compares the gzipped transformed data against the gzipped original data.

Figure 2: Matlab Processing of 75x75 grid with High Compression

C++ Implementation

Implementing a wavelet transform turned out to be more difficult than expected. Although there is a wealth of information on the abstract formulas, there are subtleties involved that make it fairly difficult to get right and to process efficiently. The basic steps in a multi-level 2-D discrete wavelet transform are to 1) convolve all the rows with the low-pass filter and the high-pass filter, producing the approximation and detail vectors respectively; 2) downsample and keep only the
even-indexed columns; 3) convolve the filters on all the columns and keep only the even-indexed rows; 4) at this point there are one approximation and three detail vectors; 5) recursively process the approximation vector until the desired level is reached. The data has now been transformed, and it can be reconstructed using the reconstruction filters.

The hardest part of this process is symmetrically extending the data to accommodate the dyadic sampling that is required to speed up processing. The convolution step requires that the sample size be a power of 2, and if the data is not extended correctly, the results are poor. Although I implemented my own wavelet transform, it did not perform very well. Instead, a modified wavelet library was used that handles the data extension and convolution steps, but it currently has accuracy issues around the edges.

Figure 3 shows the original and reconstructed data, and Figure 4 shows the error between them, from the C++ implementation for a 200x200 u velocity grid, which resulted in a 3.96 compression ratio using the High type of compression. Visually, the two look the same, but the edges are troublesome in the error surface plot. The edges have problems because there is not enough data for the filters to process, and various attempts to symmetrically extend these borders have not yet yielded useful results. Additionally, the compression results are still much lower than what Matlab was able to achieve, but they are still an improvement over traditional compression: to compress the 200x200 grid, bzip2 only achieves a 1.023 compression ratio (compared to 3.96 with the transform). Also note that as the size of the grid increases, the compression ratio will also increase because there will be more zeros.

Figure 3: C++ Processing of 200x200 grid of velocity data
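The five steps above can be sketched for a single decomposition level. As before, the Haar filter stands in for the biorthogonal filters so the convolution stays trivial, and the grid dimensions are assumed even.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Grid = std::vector<std::vector<double>>;

// One level of the separable 2-D DWT: filter and downsample the rows
// (steps 1-2), then the columns (step 3), yielding the approximation LL
// and the three detail subbands LH, HL, HH (step 4). Haar filters stand
// in for the biorthogonal ones here.
void dwt2_level(const Grid& in, Grid& LL, Grid& LH, Grid& HL, Grid& HH) {
    const std::size_t R = in.size(), C = in[0].size();
    const std::size_t hr = R / 2, hc = C / 2;
    const double s = std::sqrt(2.0);

    // Steps 1-2: convolve each row with the low/high-pass filter pair and
    // keep only the even-indexed columns.
    Grid lo(R, std::vector<double>(hc)), hi(R, std::vector<double>(hc));
    for (std::size_t r = 0; r < R; ++r)
        for (std::size_t c = 0; c < hc; ++c) {
            lo[r][c] = (in[r][2 * c] + in[r][2 * c + 1]) / s;
            hi[r][c] = (in[r][2 * c] - in[r][2 * c + 1]) / s;
        }

    // Step 3: repeat down the columns, keeping only the even-indexed rows.
    LL.assign(hr, std::vector<double>(hc));
    LH.assign(hr, std::vector<double>(hc));
    HL.assign(hr, std::vector<double>(hc));
    HH.assign(hr, std::vector<double>(hc));
    for (std::size_t r = 0; r < hr; ++r)
        for (std::size_t c = 0; c < hc; ++c) {
            LL[r][c] = (lo[2 * r][c] + lo[2 * r + 1][c]) / s;
            LH[r][c] = (lo[2 * r][c] - lo[2 * r + 1][c]) / s;
            HL[r][c] = (hi[2 * r][c] + hi[2 * r + 1][c]) / s;
            HH[r][c] = (hi[2 * r][c] - hi[2 * r + 1][c]) / s;
        }
}
```

Step 5 is then just applying dwt2_level again to the LL subband until the desired level is reached.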
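The symmetric extension that causes the edge trouble can be illustrated with whole-point mirroring of a 1-D signal. This is only one of several boundary modes wavelet libraries offer, and the sketch assumes the padding is smaller than the signal length:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Whole-point symmetric extension: mirror the signal across its first and
// last samples so border convolutions see plausible data, e.g.
// {a,b,c,d} with pad = 2 becomes {c,b, a,b,c,d, c,b}.
// Assumes pad < x.size(); getting this step wrong is exactly what produces
// edge artifacts in the reconstruction.
std::vector<double> symmetric_extend(const std::vector<double>& x,
                                     std::size_t pad) {
    const std::size_t n = x.size();
    std::vector<double> out(n + 2 * pad);
    for (std::size_t i = 0; i < pad; ++i)
        out[i] = x[pad - i];                 // left mirror
    for (std::size_t i = 0; i < n; ++i)
        out[pad + i] = x[i];                 // original signal
    for (std::size_t i = 0; i < pad; ++i)
        out[pad + n + i] = x[n - 2 - i];     // right mirror
    return out;
}
```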
Figure 5: C++ Implementation (200x200 Grid) High Compression

Figure 4: C++ Error between original and reconstructed

Further Work

The main reason for a C++ implementation is the chance to experiment with parallelization. In the 2-D implementation, each row is convolved and then each column is convolved, separately, so rows and columns could be sent to different nodes for processing if necessary. A simple two-node implementation will be attempted using XML-RPC. The framework for handling 3-D data has been designed but not yet tested; some simple 3-D datasets will be analyzed.
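Because each row (and later each column) is filtered independently, the planned split can first be sketched in shared memory with two std::thread workers before moving to XML-RPC across nodes. Haar filtering again stands in for the real filter pass:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

using Grid = std::vector<std::vector<double>>;

// Filter rows [r0, r1) in place with one Haar level; rows are independent,
// so disjoint ranges can run concurrently without locking.
void filter_rows(Grid& g, std::size_t r0, std::size_t r1) {
    const double s = std::sqrt(2.0);
    for (std::size_t r = r0; r < r1; ++r) {
        const std::size_t half = g[r].size() / 2;
        std::vector<double> out(g[r].size());
        for (std::size_t c = 0; c < half; ++c) {
            out[c]        = (g[r][2 * c] + g[r][2 * c + 1]) / s;
            out[half + c] = (g[r][2 * c] - g[r][2 * c + 1]) / s;
        }
        g[r] = std::move(out);
    }
}

// Two-worker split of the row pass, mirroring the planned two-node layout:
// one worker takes the top half of the rows, the caller takes the bottom.
void parallel_row_pass(Grid& g) {
    const std::size_t mid = g.size() / 2;
    std::thread top(filter_rows, std::ref(g), std::size_t(0), mid);
    filter_rows(g, mid, g.size());
    top.join();
}
```

The column pass parallelizes the same way; the XML-RPC version would ship each half-grid to a remote node instead of a thread.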