A High-Performance FPGA-Based Implementation of the LZSS Compression Algorithm

2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum

A High-Performance FPGA-Based Implementation of the LZSS Compression Algorithm

Ivan Shcherbakov, Christian Weis, Norbert Wehn
Microelectronic Systems Design Research Group, TU Kaiserslautern, Germany
{shcherbakov, weis, wehn}@eit.uni-kl.de

Abstract: The increasing growth of embedded networking applications has created a demand for high-performance logging systems capable of storing huge amounts of high-bandwidth, typically redundant data. An efficient way of maximizing logger performance is to compress the logged stream in real time. In this paper we present a flexible high-performance implementation of the LZSS compression algorithm capable of processing up to 50 MB/s on a Virtex-5 FPGA. We exploit the independently addressable dual-port block RAMs inside the FPGA to achieve an average performance of 2 clock cycles per byte. To make the compressed stream compatible with the ZLib library [1], we encode the LZSS algorithm output using the fixed Huffman table defined by the Deflate specification [2]. We also demonstrate how changing the amount of memory allocated to various internal tables impacts performance and compression ratio. Finally, we provide a cycle-accurate estimation tool that allows finding a trade-off between FPGA resource utilization, compression ratio and performance for a specific data sample.

I. INTRODUCTION

Compression is a way of representing data using fewer bits. Compression algorithms can be divided into lossy and lossless ones. Lossy compression algorithms discard certain less important parts of the information and are highly tied to the structure of the data. In many applications (e.g. video coding) the trade-off between a quality drop and a significant size reduction is more than acceptable. Lossless algorithms exploit the redundancy and predictability of the compressed information: highly redundant data needs fewer bits to be stored. In the worst case, when no redundancies are found, the compressed block will actually be bigger than the uncompressed one. However, in any case, decompression restores the original data bit-for-bit.

Many lossless compression algorithms are widely used in modern servers and workstations for a variety of tasks. Such algorithms, e.g. LZMA [3], provide high compression ratios, but require tens to hundreds of megabytes of RAM and a fast CPU. This suits most use cases (e.g. backups), where compression ratio is more important than speed.

Another use for lossless compression is the rapidly growing world of embedded networking systems. Keeping a log of inter-node communications significantly simplifies profiling and debugging tasks. Compressing the logged stream in real time relaxes the size and bandwidth requirements for the underlying storage media. Unlike workstation/server applications, compression throughput becomes one of the most important constraints.

Modern FPGAs allow building powerful embedded systems on one chip. A typical high-end FPGA contains tens to hundreds of independent dual-port block RAMs (several kilobytes each), one or more built-in CPUs and a large amount of reconfigurable logic. The logic operates at lower frequencies than the CPU, but allows exploiting massive algorithmic parallelism.
This defines the layout of a typical FPGA-based system-on-chip: a central CPU handling high-level tasks and several accelerators performing highly parallelizable computations. Building a high-performance FPGA-based compressor requires selecting an algorithm capable of exploiting the advantages of the above-mentioned FPGA architecture. Having considered several compression algorithms, we have chosen a subset of the Deflate specification [2] (LZSS [4] + fixed-table Huffman encoding), as it can be efficiently implemented using FPGA logic and block RAMs, keeping the on-chip CPU available for higher-level tasks.

The rest of this paper is organized as follows. Section 2 overviews work related to LZSS-like algorithms and corresponding hardware. Section 3 briefly describes the data format. Section 4 describes the presented hardware implementation. Section 5 provides a comparison with a software compressor and shows various trade-offs between speed, compression ratio and memory use. Section 6 summarizes the results.

II. RELATED WORK

Since the publication of the LZSS algorithm [4], which is based on LZ77 [5], there have been many improvements to both algorithmic and implementation aspects. They can be categorized in the following way:

- Further algorithmic improvements that raise the compression ratio at a cost of more operations and memory, e.g. the LZMA algorithm [3] used in the 7-Zip program.
- Algorithm variations (e.g. [6]) simplifying random access to the compressed data.
- Hardware implementations that rely on content-addressable memories [7] and systolic arrays [8], [9].
- Applications of fast hardware decompression for dynamic FPGA reconfiguration [10].
- FPGA/ASIC-based implementations: [11], [12].

The presented implementation falls into the last category. We have used the approach presented in [11] (an FSM with several independent memories) and significantly optimized the design performance by decomposing and parallelizing several processes (employing the dual-port block RAM architecture), making use of wide internal data buses and advanced caching/prefetching techniques. As we are targeting embedded logging applications, we have optimized for compression speed while keeping a feasible compression ratio, taking the minimum ZLib compression level as a reference point. Nevertheless, the memory sizes and the algorithm parameters are generics that can be easily adjusted to increase the compression ratio at a cost of additional clock cycles and/or extra block RAMs.

III. THE DATA FORMAT

Before describing the hardware architecture, we give an overview of the LZSS algorithm (the ZLib-based variant, which has minor differences from the original LZSS [4]). The algorithm consumes a stream of literals (i.e. bytes) and produces a stream of decompressor commands. There are 2 command types: "output 1 literal" and "copy-paste L literals encountered D literals ago". E.g. compressing the string "snowy snow" results in 7 commands: 6 describing each byte of "snowy " and 1 command copying 4 bytes ("snow") from distance 6. To detect that a string has been encountered in the past, the compressor has to store the last N bytes of the input stream; N is referred to as the dictionary size or sliding window size.

On the bit level, every command has 2 fields: D (log2(N) bits) and L (8 bits). If D is 0, the command is "output byte" and L contains the byte. Otherwise, D contains the copying distance and L contains the copying length minus 3. If the length is less than 3, normal "output byte" commands are emitted instead.
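To make the command format concrete, the following is a minimal software sketch of a decompressor for the D/L stream just described. The struct layout and names are illustrative (the real hardware packs D into log2(N) bits on the wire), but the semantics follow Section III:

```cpp
#include <cstdint>
#include <string>

// Illustrative rendering of the D/L command stream from Section III.
struct Command {
    uint16_t d; // 0 = "output literal", otherwise the copy distance
    uint8_t  l; // the literal byte, or the copy length minus 3
};

// Decompressor view: replaying the commands restores the original stream.
std::string replay(const Command* cmds, size_t count) {
    std::string out;
    for (size_t i = 0; i < count; ++i) {
        if (cmds[i].d == 0) {
            out.push_back(static_cast<char>(cmds[i].l)); // literal command
        } else {
            size_t len = cmds[i].l + 3;          // stored length is len - 3
            size_t src = out.size() - cmds[i].d; // start D literals back
            for (size_t j = 0; j < len; ++j)     // byte-wise: copies may overlap
                out.push_back(out[src + j]);
        }
    }
    return out;
}
// Example: "snowy snow" decodes from six literal commands for "snowy "
// followed by one copy command with D = 6 and L = 4 - 3 = 1.
```

Note that the byte-wise copy loop deliberately handles the case where D is smaller than the copy length, which LZ-style formats rely on for run-length-like repetitions.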
IV. HARDWARE ARCHITECTURE

The LZSS compressor uses handshake interfaces for both the input and the output stream. The compressor consumes 32-bit words (an LSB-first or MSB-first format can be selected) and produces D/L pairs (see Section III) consumed by a fixed-table Huffman coder that produces a stream of packed 32-bit words. The use of stream interfaces allows connecting to high-performance interfaces (e.g. LocalLink [13]) and compressing real-time streaming data on the fly, without separate buffering and compression stages.

The LZSS compressor consists of the main finite state machine, 5 independently addressable dual-port memories and several auxiliary FSMs. Figure 1 illustrates the overall structure.

Fig. 1. Overall structure of the LZSS compressor: filling logic, hash cache, lookahead buffer, comparer, dictionary, the main FSM and the head/base/next memories.

The main task of the compressor is finding previous occurrences of a string (matching). To determine whether a string S has been encountered before, the compressor computes a hash value from its first 3 bytes and looks through the list of all strings in the dictionary with the same hash value. Matching requires comparing the front of the uncompressed stream against several offsets inside the dictionary to find the longest match. To speed up the comparison, the compared data is placed into 2 independently addressable ring buffers:

- Lookahead buffer: contains the front of the input stream (up to 512 bytes).
- Dictionary (a.k.a. sliding window): contains the last N bytes of the input stream that have just been processed.

Having the compared memories in independent block RAMs allows performing one comparison iteration every clock cycle. Furthermore, the data bus width of both memories is 32 bits, which allows comparing 1 to 4 bytes during the first clock cycle and exactly 4 bytes during each following one. E.g. comparing two 50-byte strings takes no more than ⌈(50 - 1)/4⌉ + 1 = 14 clock cycles instead of 100 if both buffers resided in a single byte-addressed memory [11]. Moreover, both ring buffers reside in dual-port block RAMs and are filled in the background, requiring no extra clock cycles from the main FSM. If hash caching is enabled, hash values for every offset of the source stream are computed during the background filling and stored in a separate memory.

The hash table structure is similar to that of ZLib. The following independently addressable tables are maintained:

- Head table: for each value of the hash function it contains the offset in the dictionary buffer of the last string having it.
- Next table: for each offset of the dictionary it contains the relative offset of the previous string having the same hash value.

Matching means finding the longest string in the dictionary that is equal to the beginning of the lookahead buffer. E.g. if the dictionary contains the strings "123" and "1234" and the lookahead buffer starts with "1234", both "123" and "1234" will be candidates and "1234" will be the longest match. If a hash collision occurs (i.e. an unrelated 3-byte sequence produces the same hash value), the corresponding string will also be considered as a candidate and rejected by the byte comparison. The addresses of the candidate strings are obtained by reading the head/next tables. Once the matching is done, a decompressor command is produced and the lookahead buffer/dictionary pointers are advanced. An optional hash table update takes place afterwards.
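The matching process described above can be summarized in software. The sketch below is a behavioral model, not the paper's RTL: the hash function and table sizes are placeholders, and the next table stores absolute positions for clarity, whereas the optimized hardware stores relative offsets (see the improvements listed later in this section). The word-wise comparison loop mimics the 32-bit buses that compare up to 4 bytes per clock cycle:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

constexpr unsigned HASH_BITS = 15;   // example "hash bit count" parameter
constexpr unsigned DICT_SIZE = 4096; // N, the sliding window size (example)

static unsigned hash3(const uint8_t* p) {      // hash of the first 3 bytes
    return ((p[0] << 10) ^ (p[1] << 5) ^ p[2]) & ((1u << HASH_BITS) - 1);
}

// Compare word-by-word, mimicking the 32-bit buses: up to 4 bytes per step.
static int match_len(const uint8_t* a, const uint8_t* b, int max_len) {
    int len = 0;
    while (len + 4 <= max_len) {
        uint32_t wa, wb;
        std::memcpy(&wa, a + len, 4);
        std::memcpy(&wb, b + len, 4);
        uint32_t diff = wa ^ wb;
        if (diff)                               // first mismatch in this word
            return len + __builtin_ctz(diff) / 8; // GCC/Clang; little-endian
        len += 4;                               // all four bytes matched
    }
    while (len < max_len && a[len] == b[len]) ++len; // tail bytes
    return len;
}

struct Matcher {
    // head[h]: position of the most recent string with hash h (-1 = none).
    // next[pos % N]: position of the previous string with the same hash.
    std::vector<long> head = std::vector<long>(1u << HASH_BITS, -1);
    std::vector<long> next = std::vector<long>(DICT_SIZE, -1);

    long find(const uint8_t* buf, long pos, int max_len, int max_iters,
              int* best_len) {
        long best_pos = -1;
        *best_len = 0;
        for (long c = head[hash3(buf + pos)];
             c >= 0 && pos - c <= (long)DICT_SIZE && max_iters-- > 0;
             c = next[c % DICT_SIZE]) {
            // Hash collisions put unrelated strings on the chain; the byte
            // comparison below filters them out naturally.
            int len = match_len(buf + c, buf + pos, max_len);
            if (len > *best_len) { *best_len = len; best_pos = c; }
        }
        return best_pos;
    }

    void insert(const uint8_t* buf, long pos) { // register the string at pos
        unsigned h = hash3(buf + pos);
        next[pos % DICT_SIZE] = head[h];
        head[h] = pos;
    }
};
```

In the hardware, the head, next, dictionary and lookahead memories are separate block RAMs, so the chain read, the table update and the comparison proceed in parallel rather than sequentially as in this model.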

To illustrate how the high compression performance is achieved, we describe a typical state flow of the main FSM:

- Initially, the compressor waits until the lookahead buffer contains at least 262 bytes (the maximum match length plus the minimum match length plus one, following ZLib) and the hash value of its front is available. As the filling runs in the background, this state typically takes only 1 clock cycle. The hash value from the lookahead buffer is routed to the head table address.
- As soon as the data is available, the matching preparation occurs. The value from the head table is used as the first string address. It is also routed to the next table to get the address of the next string with the same hash value. The head and next tables are updated in this cycle to allow finding the currently processed string in the future.
- At the next clock cycle the matching begins. The next table is read in parallel, so the bottleneck here is the actual comparison of the strings (accelerated by the 32-bit buses).
- When the matching is complete, the output is produced. The values for D and L, depending on the matching results, are output to the compressed stream interface. If the sink requests a delay, the main FSM is stalled.
- If a full hash table update can be performed (decided based on the match length), the FSM updates the head/next tables for every byte of the matched string. Every update iteration takes 1 clock cycle.
- When the hash table update is done (or was skipped), the FSM re-enters the initial waiting state.

Depending on the properties of the input data, 30-85% of the matching operations will be unsuccessful and end up producing "output literal" commands, requiring at least 3 clock cycles (plus the matching itself). We have implemented a special hash prefetching mechanism accelerating this scenario. A separate FSM is active during the match preparation and the matching. It buffers the data from the lookahead buffer and the hash cache and uses the available clock cycles to prefetch (or precompute) the hash value at offset 1 in the lookahead buffer. If no match was found (i.e. the lookahead buffer is going to be advanced by 1 byte), the prefetched value is routed to the head table address and the FSM goes directly to the match preparation state, skipping the waiting state and requiring only 2 non-matching cycles instead of 3.

The concept of head/next tables was introduced in ZLib [1] and mentioned in [11]. Originally, both the head and the next table contain absolute string offsets inside the dictionary. Every N bytes (where N is the dictionary size), ZLib rotates the dictionary: the last N bytes are moved up (a total of 64 KB is allocated) and each head/next value is adjusted accordingly (the ones pointing outside the buffer are zeroed). The time overhead is negligible in the slower software implementation; however, it would consume 25-75% of the clock cycles (depending on the hash/dictionary sizes) in the fast hardware implementation. We have made 3 improvements that reduce the clock cycle overhead to 1-2%:

- The next table contains relative addresses. This requires 1 extra adder to compute the absolute address, but eliminates the need to rotate the next table.
- Every record inside the head table contains k extra generation bits, as if the dictionary were 2^k times bigger. The real dictionary size is still used to detect whether a record points outside the dictionary, but the rotation has to be performed 2^k times less often.
- The head table memory is internally split into M sub-memories, each having the size of a single block RAM inside the FPGA. The rotation happens in parallel and requires M times fewer cycles.
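The generation-bit trick can be illustrated with a few lines of arithmetic. This is a hypothetical rendering of the idea with example parameter values, not the paper's RTL: positions are tracked modulo 2^k * D, and a head entry is treated as valid only while it still lies inside the real D-byte window.

```cpp
#include <cstdint>

// Example parameters; both must be powers of two for the mask below.
constexpr uint32_t D = 4096;       // real dictionary size
constexpr uint32_t K = 2;          // generation bits
constexpr uint32_t VIRT = D << K;  // virtual dictionary size, 2^k * D

// 'cur' is the current input position modulo VIRT, 'entry' a head record
// (assumed already written once). The entry is valid only if it still lies
// inside the last D bytes; older entries read as stale, exactly as if the
// dictionary were 2^k times bigger.
bool entry_valid(uint32_t cur, uint32_t entry) {
    uint32_t age = (cur - entry) & (VIRT - 1); // wrap-around distance
    return age < D;                            // stale entries point outside
}
// Rebasing all head entries ("rotation") is now needed only every VIRT
// bytes, i.e. 2^k times less often; splitting the head memory into M
// block-RAM-sized banks lets those rebasing passes run in parallel.
```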
The output interface of the LZSS compressor is connected to a fixed-table pipelined Huffman encoder that produces a ZLib-compatible stream. As the table is fixed, no additional clock cycles or memories are required to build it, and the encoder does not introduce any delays into the stream produced by the LZSS compressor. The cost of the high performance is less efficient compression compared to dynamic Huffman coders; however, this can be compensated by increasing the LZSS compression level.

Our implementation is generic. Various compile-time parameters can be customized to find an optimal trade-off between FPGA resource utilization, compression ratio and speed. The dictionary size, hash bit count, exact hash function, generation bit count and head table division factor can all be customized at compile time. Run-time parameters (e.g. the matching iteration limit) can also be changed. We have provided an interactive estimation tool that compresses a given file using several presets and produces reports regarding the block RAM amount, compression ratio and clock cycle usage.

To maintain high design modularity and decouple the architecture from the low-level details (e.g. hash function, data types and bus sizes), we have used the policy-class-based design approach and the THDL++ language [14], which extends VHDL semantics with object-oriented features. THDL++ code can be compiled to VHDL-93 using the freely available compiler and IDE [15].
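For reference, the fixed literal/length table that the fixed-table encoder mentioned above implements comes straight from the Deflate specification (RFC 1951, section 3.2.6). The sketch below is a software rendering of that mapping, not the paper's pipelined encoder:

```cpp
#include <cstdint>

// Fixed Huffman code for literal/length symbols per RFC 1951, sec. 3.2.6.
struct HuffCode { uint16_t bits; uint8_t len; };

HuffCode fixed_litlen_code(unsigned sym) {      // sym in 0..287
    if (sym <= 143) return { uint16_t(0x030 + sym),         8 }; // 00110000..
    if (sym <= 255) return { uint16_t(0x190 + (sym - 144)), 9 }; // 110010000..
    if (sym <= 279) return { uint16_t(0x000 + (sym - 256)), 7 }; // 0000000..
    return             { uint16_t(0x0C0 + (sym - 280)),     8 }; // 11000000..
}
// Distance symbols 0..29 simply use 5-bit codes equal to their own value.
// Per RFC 1951, codes are packed starting from their most significant bit.
```

Because this mapping is pure combinational logic, a hardware encoder can translate each D/L pair in the cycle it arrives, which is why the fixed-table coder adds no delay to the LZSS output stream.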

V. RESULTS

In this section we evaluate the LZSS compressor design by comparing its performance to a software implementation, provide FPGA utilization figures and show the impact of various design settings on the design size and performance.

Our test system is the ML507 development board based on a Virtex-5 FPGA. We have developed a testbench that receives a data block from the PC over Ethernet, stores it in the DDR2 memory, compresses it and sends the result back. The compression time includes the DMA [13] setup times but excludes the Ethernet transmission time. We have compared a software implementation (ZLib [1] running on the PowerPC processor inside the XC5VFX70T FPGA) with the hardware implementation using parameters optimized for speed (4KB dictionary, 15-bit hash). The clock frequency of the PowerPC was 400 MHz, while the compressor was connected to a 100 MHz clock (post-route analysis reported a maximum clock frequency above 100 MHz).

We have used 2 data sets: a fragment of a Wikipedia text snapshot [16] (referred to as "Wiki") and sample data obtained from an automotive CAN logger (referred to as "X2E"). We have run the test with 10 MB and 50 MB fragments to factor out the DMA setup time. Table I shows the performance comparison (the parameters were identical and the input and output streams were equal for both implementations).

TABLE I. PERFORMANCE EVALUATION

Data sample | SW speed (MB/s) | HW speed (MB/s) | Speedup | Compression ratio
Wiki 50MB   | -               | -               | -       | 1.69
Wiki 10MB   | -               | -               | -       | 1.68
X2E 50MB    | -               | -               | -       | 1.7
X2E 10MB    | -               | -               | -       | 1.7

In addition to the 15-20x performance increase, the use of the DMA engine to transfer the data between the DRAM and the hardware compressor allows running high-level tasks on the CPU in parallel with the compression.

Table II shows that the FPGA utilization in terms of lookup tables (LZSS + fixed-table Huffman) remains insignificant and almost the same (approximately 0.6% of the Virtex-5 FPGA) for all reasonable dictionary and hash sizes.

TABLE II. FPGA UTILIZATION
(LUT and register counts for several hash sizes, the first row using a 15-bit hash, with 16KB, 8KB and 4KB dictionaries, against the totals available in the XC5VFX70T FPGA; the numeric cell values are not recoverable from the transcription.)

To simplify design space exploration we have developed a software estimator tool [17]. The tool consists of a flexible cycle-accurate C++ model and a C# front-end. The C++ model accepts various design parameters (e.g. the window size), compresses reference data blocks and produces various cycle-accurate statistics. The C# front-end allows constructing series of parameter sets (e.g. iterating an arbitrary parameter over a given range), iteratively runs the C++ model and visualizes the obtained results. The rest of this section describes several trade-offs explored by running a 100 MB Wikipedia snapshot [16] through the software estimator.

First of all, increasing the dictionary size improves the compression ratio (Fig. 2). Moreover, the improvement is more significant for larger hash sizes.

Fig. 2. Compressed size (MB) of a 100 MB Wiki fragment [16] as a function of the dictionary size (1K-16K) for several hash sizes.

Increasing the dictionary size slightly slows down the compression. This can be compensated by increasing the hash size (Fig. 3), thus lowering the hash collision probability and reducing the number of matching iterations. However, increasing the hash size raises the memory requirements exponentially: the head table requires 2^H * (log2(D) + G) bits, where H is the hash bit count, D is the dictionary size and G is the number of generation bits.

Fig. 3. Compression speed (MB/s) for a 100 MB Wiki fragment [16] as a function of the dictionary size (2K-16K) for several hash sizes.
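The head table formula is easy to sanity-check numerically; the snippet below (parameter values are just examples) evaluates it for a speed-oriented configuration:

```cpp
#include <cstdio>
#include <cmath>

// Head-table size per Section V: 2^H * (log2(D) + G) bits, where
// H = hash bit count, D = dictionary size, G = generation bits.
unsigned long long head_table_bits(unsigned H, unsigned D, unsigned G) {
    unsigned log2D = (unsigned)std::log2((double)D);
    return (1ULL << H) * (log2D + G);
}

int main() {
    // 15-bit hash, 4KB dictionary, 2 generation bits:
    // 2^15 * (12 + 2) = 458752 bits = 56 KB of head-table storage.
    // Doubling H doubles this figure, which is why the hash size
    // dominates the memory requirements.
    std::printf("%llu bits\n", head_table_bits(15, 4096, 2));
}
```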
Another way of improving the compression efficiency is adjusting the algorithm parameters (e.g. the number of matching attempts before giving up). This can improve the compression by 20% at a cost of an 82% performance decrease (Fig. 4).

Fig. 4. Compressed size and compression speed for a 100 MB Wiki fragment [16] at the minimum and maximum compression levels with 9-bit and 15-bit hashes, over dictionary sizes 1K-16K.

As the other hardware implementations [11], [12] do not provide exact performance results, we have analyzed the impact of the 3 main optimization techniques relative to the design described in [11] by temporarily disabling them and measuring the performance impact. Table III summarizes the results.

TABLE III. COMPRESSION SPEED FOR A 100 MB WIKI FRAGMENT WITHOUT OPTIMIZATIONS

Configuration                          | Window size 4KB | Window size 16KB
A) Original (15-bit hash; 32-bit data) | 49.0 MB/s       | 46.2 MB/s
B) 8-bit data bus as in [11]           | 30.3 MB/s       | 25.9 MB/s
C) Disabled hash prefetching           | 45.2 MB/s       | 45.0 MB/s
D) Reduced generation bits to 0        | -               | 33.8 MB/s
Disabled all 3 optimizations over [11] | 10.2 MB/s       | 21.2 MB/s

The most efficient optimization for small window sizes is the introduction of generation bits, as using k generation bits makes the head-table rotation occur 2^k times less often (with k = 0, rotation happens every D bytes, where D is the dictionary size). Using wide data buses provides a 63-78% performance increase, and hash prefetching increases the performance by an additional 8%.

The overall performance increase due to the described optimizations is 2.2x-4.8x, depending on the window size.

As an indirect metric of LZSS compressor efficiency we have measured the number of clock cycles spent on actually comparing the data from the dictionary with the lookahead buffer (as opposed to the clock cycles spent on updating hash tables, computing read addresses, etc.). Figure 5 shows the state distribution for the 100 MB Wiki fragment with a 16KB dictionary and a 15-bit hash.

Fig. 5. Time spent on different operations (100 MB Wiki fragment): finding match (68.5%), updating hash table (11.6%), producing output (11.0%), waiting for data (8.4%), rotating hash (0.3%), fetching data (0.2%).

Most of the time (68.5%) is spent on reading and comparing the data (up to 4 bytes per cycle from each of the dictionary and the lookahead buffer). Producing the output and prefetching the next hash value in parallel takes 11% of the time. Another 11.6% of the time is spent on inserting every byte of a short match (up to 4 bytes) into the hash table. Finally, 8.4% of the time is spent waiting for the head table to be read when the prefetched hash value is not useful (i.e. when a valid match is found and several bytes are skipped).

VI. CONCLUSION

In this paper we have presented a high-performance, flexible implementation of the LZSS algorithm on a Virtex-5 FPGA. We have exploited the independently addressable dual-port block RAMs and performed several specific FSM and data structure optimizations, resulting in a 15-20x performance increase compared to the optimized software implementation [1]. The compressor design is flexible and allows tuning various parameters to achieve trade-offs between speed, compression ratio and block RAM utilization. An estimation tool available online [17] allows performing design space exploration and finding optimal parameters based on real data samples. We have verified the quality of our design by compressing more than 1 TB of data on the FPGA and comparing the results to a software reference model.

REFERENCES

[1] (2011, Sep.) ZLib compression library. [Online].
[2] Network Working Group. (1996, May) DEFLATE compressed data format specification version 1.3 (RFC 1951). [Online].
[3] (2011, Sep.) LZMA SDK. [Online]. Available: http://www.7-zip.org/sdk.html
[4] (2011, Sep.) LZSS algorithm. [Online].
[5] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 337-343, May 1977.
[6] S. Kreft and G. Navarro, "LZ77-like compression with fast random access," in Data Compression Conference (DCC 2010), March 2010.
[7] P. Rauschert, Y. Klimets, J. Velten, and A. Kummert, "Very fast gzip compression by means of content addressable memories," in TENCON 2004, IEEE Region 10 Conference, vol. D, Nov. 2004.
[8] J.-M. Chen and C.-H. Wei, "VLSI design for high-speed LZ-based data compression," IEE Proceedings - Circuits, Devices and Systems, vol. 146, no. 5, Oct. 1999.
[9] B. Jung and W. Burleson, "A VLSI systolic array architecture for Lempel-Ziv-based data compression," in IEEE International Symposium on Circuits and Systems (ISCAS '94), vol. 3, May-June 1994.
[10] M. Huebner, M. Ullmann, F. Weissel, and J. Becker, "Real-time configuration code decompression for dynamic FPGA self-reconfiguration," in Proceedings of the 18th International Parallel and Distributed Processing Symposium, April 2004.
[11] S. Rigler, W. Bishop, and A. Kennings, "FPGA-based lossless data compression using Huffman and LZ77 algorithms," in Canadian Conference on Electrical and Computer Engineering (CCECE 2007), April 2007.
[12] (2011, Sep.) Gzip compression/gunzip decompression core. [Online].
[13] (2011, Sep.) LocalLink interface. [Online]. Available: UserInterface.htm
[14] I. Shcherbakov, C. Weis, and N. Wehn, "Bringing C++ productivity to the VHDL world: from language definition to a case study," in Forum on Specification and Design Languages (FDL 2011), Sept. 2011.
[15] (2011, Sep.) VisualHDL website. [Online].
[16] (2011, Sep.) Large text compression benchmark. [Online].
[17] (2011, Sep.) Compression performance analyzer. [Online].
