Gzip Compression Using Altera OpenCL. Mohamed Abdelfattah (University of Toronto) Andrei Hagiescu Deshanand Singh
Transcription
1 Gzip Compression Using Altera OpenCL Mohamed Abdelfattah (University of Toronto) Andrei Hagiescu Deshanand Singh
2 Gzip. Widely used lossless compression program. Gzip = LZ77 + Huffman. Big data needs fast compression, at gigabyte-per-second rates: lower disk space in data centers, less power on communication networks.
3 LZ77 Compression Example. "This sentence is an easy sentence to compress." 1. Scan file byte by byte. 2. Look for matches. 3. Replace with a reference to previous occurrence.
9 LZ77 Compression Example. "This sentence is an easy sentence to compress." 1. Scan file byte by byte. 2. Look for matches: (a) match length, (b) match offset. 3. Replace with a reference to previous occurrence.
10 LZ77 Compression Example. Scanning the second "sentence": match length = 2.
11 LZ77 Compression Example. Match length = 3.
12 LZ77 Compression Example. Match length = 8. Match offset = 20 bytes.
13 LZ77 Compression Example. Match length = 8, match offset = 20.
14 LZ77 Compression Example. The second "sentence" is replaced with a reference (marker, length, offset), with match length = 8 and match offset = 20.
15 LZ77 Compression Example. Before: "This sentence is an easy sentence to compress." After: "This sentence is an easy (marker, length, offset) to compress." Saved 5 bytes!
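The three steps of the LZ77 example can be tried out with a small brute-force search. This is only a sketch for the example sentence, not the hardware algorithm described later in the deck; the names `lz77_match` and `lz77_longest_match` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Longest match found in the already-scanned text. */
typedef struct {
    size_t length; /* number of matching bytes (0 if none) */
    size_t offset; /* distance back from pos to the start of the match */
} lz77_match;

/* Brute-force search: try every earlier start position and keep the
 * longest run of equal bytes. Hypothetical helper for illustration. */
lz77_match lz77_longest_match(const char *text, size_t n, size_t pos)
{
    lz77_match best = {0, 0};
    assert(text != NULL && pos <= n);
    for (size_t start = 0; start < pos; start++) {
        size_t len = 0;
        while (pos + len < n && text[start + len] == text[pos + len])
            len++;
        if (len > best.length) {
            best.length = len;
            best.offset = pos - start;
        }
    }
    return best;
}
```

Called at position 25 (the start of the second "sentence"), this reports offset 20, as on the slides. Because the search is greedy it also absorbs the trailing space, so it reports length 9 where the slides count the 8 letters of the word.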
16 Altera OpenCL Compiler for FPGAs. Host code: enqueue buffers, enqueue kernel(s), dequeue buffers. OpenCL single-threaded code:
    kernel void simple(global int *input, int size, global int *output) {
        for (int i = 1; i <= size; i++) {   // i = 1..size
            int x = input[i];               // Load x
            int y = input[i + 1];           // Load y
            int z = x + y;
            output[i] = z;                  // Store z
        }
    }
Altera's OpenCL Compiler targets: Host CPU, PCIe, FPGA Accelerator (Load x, Load y, Store z pipeline), DDRx Memory.
17-19 Altera OpenCL Compiler for FPGAs. Same slide, with markers for loop iterations 1 and 2 shown in the Load x / Load y / Store z pipeline.
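For checking accelerator output on the host, the `simple` kernel can be mirrored in plain C. This is a sketch: the function name `simple_reference` is hypothetical, and the buffer sizes in the comment are an assumption, since the slide does not state them.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side reference for the slide's kernel: output[i] = input[i] + input[i+1].
 * The slide's loop "i = 1..size" is kept as is. Assumption (not on the slide):
 * input holds at least size + 2 ints and output at least size + 1. */
void simple_reference(const int *input, int size, int *output)
{
    assert(input != NULL && output != NULL && size >= 0);
    for (int i = 1; i <= size; i++) {
        int x = input[i];     /* "Load x" */
        int y = input[i + 1]; /* "Load y" */
        output[i] = x + y;    /* "Store z" */
    }
}
```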
20 FPGAs can be VERY Custom. Host CPU: ARM host on FPGA chip. IO Channels into and out of the FPGA Accelerator (Load x, Load y, Store z). PCIe. Different memory types: RDL? QDR? DDRx Memory.
21 Implementation Overview. 1. Shift In New Data. 2. Dictionary Lookup/Update. 3. Match Search & Filtering. 4. Write to output.
22 1. Shift In New Data. Current Window. Input from DDR memory.
23 1. Shift In New Data. Current Window: "old_text". Incoming input, e.g. "sample_text". Cycle boundary.
24 1. Shift In New Data. Current Window: "old_text". Incoming input, e.g. "sample_text". We use text in our example, but it can be anything. Cycle boundary. VEC = 4.
25 1. Shift In New Data. Current Window: "text" (the older bytes have been shifted out). Incoming input, e.g. "sample_text". Cycle boundary.
26 1. Shift In New Data. Current Window: "textsamp". Remaining input: "le_text". Cycle boundary.
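The shift step can be sketched in C as follows. It assumes, as read off the slide example, a current window of 2 * VEC bytes that drops its oldest VEC bytes and takes in VEC new ones each cycle; `window_t` and `shift_in` are illustrative names, not taken from the original design.

```c
#include <assert.h>
#include <string.h>

#define VEC 4 /* bytes consumed per cycle, as in the slide example */

/* Current window: 2 * VEC bytes (older half followed by newer half). */
typedef struct {
    unsigned char bytes[2 * VEC];
} window_t;

/* One "cycle": drop the oldest VEC bytes, move the newer half down, and
 * append VEC new input bytes. input must point at VEC readable bytes. */
void shift_in(window_t *w, const unsigned char *input)
{
    assert(w != NULL && input != NULL);
    memmove(w->bytes, w->bytes + VEC, VEC);
    memcpy(w->bytes + VEC, input, VEC);
}
```

Starting from "old_text" and feeding the first VEC bytes of "sample_text" leaves "textsamp" in the window, matching slides 23 to 26.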
28 2. Dictionary Lookup/Update. Current Window: "textsamp". Dictionaries 0 to 3 buffer the text that we have already processed. 1. Compute hash. 2. Look for match in 4 dictionaries. 3. Update dictionaries.
29-32 2. Dictionary Lookup/Update. The current window yields four substrings, "text", "exts", "xtsa" and "tsam". Each is hashed and looked up in all four dictionaries, giving the possible matches from history (Dictionary 0, 1, 2, 3 in order):
"text": "tan_", "text", "texl", "teen"
"exts": "eate", "ears", "eeps", "ente"
"xtsa": "xant", "xylo", "xely", "xirt"
"tsam": "tan_", "tame", "teal", "teen"
33 2. Dictionary Lookup/Update. Current Window: "textsamp"; hashed substrings "text", "exts", "xtsa", "tsam"; Dictionaries 0 to 3.
34 2. Dictionary Lookup/Update. Each dictionary has four read ports (RD00 to RD03 for Dictionary 0, up to RD30 to RD33 for Dictionary 3) and one write port (W0 to W3). Generate exactly the number of read/write ports that we need, and the width: 256 read ports, 16 write ports, 128 bits.
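A software sketch of the lookup/update step, under stated assumptions: the hash function, the dictionary depth `DICT_ENTRIES`, and the rule that substring i writes to dictionary i are illustrative guesses that are consistent with the port counts on the slide (VEC x VEC reads, VEC writes) but are not confirmed by it. Entries here store only VEC bytes and no position, which a real design would also need in order to compute offsets.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VEC 4             /* substrings per cycle, as in the slide example */
#define DICT_ENTRIES 1024 /* assumed depth; the slides do not give one */

/* entry[d][h] is the VEC-byte string last written to dictionary d at hash h. */
typedef struct {
    unsigned char entry[VEC][DICT_ENTRIES][VEC];
} dictionaries_t;

/* Hypothetical hash over a VEC-byte substring; the real hash is not shown. */
unsigned hash_substring(const unsigned char *s)
{
    uint32_t h = 0;
    for (int i = 0; i < VEC; i++)
        h = h * 31u + s[i];
    return h % DICT_ENTRIES;
}

/* window holds 2 * VEC bytes; lane i is the substring window[i .. i+VEC-1].
 * candidates[i][d] receives what dictionary d holds at lane i's hash, then
 * lane i overwrites its own dictionary (dictionary i) with its substring. */
void lookup_and_update(dictionaries_t *dicts, const unsigned char *window,
                       unsigned char candidates[VEC][VEC][VEC])
{
    unsigned hash[VEC];
    assert(dicts != NULL && window != NULL);
    for (int i = 0; i < VEC; i++)
        hash[i] = hash_substring(window + i);
    /* VEC * VEC reads: every lane reads every dictionary. */
    for (int i = 0; i < VEC; i++)
        for (int d = 0; d < VEC; d++)
            memcpy(candidates[i][d], dicts->entry[d][hash[i]], VEC);
    /* VEC writes: one per dictionary. */
    for (int i = 0; i < VEC; i++)
        memcpy(dicts->entry[i][hash[i]], window + i, VEC);
}
```

With VEC = 16 the same structure needs 16 x 16 = 256 reads and 16 writes of 16 bytes (128 bits) per cycle, which lines up with the numbers quoted on slide 34.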
36 3. Match Search & Filtering. Each incoming substring (current window) has a set of candidate matches (comparison windows). Compare each current window against each of its 4 comparison windows:
"text": "teen", "texl", "text", "tan_"
"exts": "ente", "eeps", "ears", "eate"
"xtsa": "xirt", "xely", "xylo", "xant"
"tsam": "teen", "teal", "tame", "tan_"
37 3. Match Search & Filtering. Comparators compare each byte of the comparison windows "teen", "texl", "text", "tan_" against the current window "text" and produce one match length each. We have another 3 of those.
38 3. Match Search & Filtering. Match reduction selects the best of the match lengths. Best Length: 4.
39-42 3. Match Search & Filtering. Typical C code. Fixed loop bounds, so the compiler can unroll the loop.
43 3. Match Search & Filtering. One bestlength is associated with each current_window ("text", "exts", "xtsa", "tsam" within "textsamp"); the slide shows the values 3, 1, 3, 3.
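The comparator and reduction stages can be sketched in C with fixed loop bounds, in the spirit of the "typical C code" the slides mention. The original code did not survive transcription, so this is a reconstruction under assumptions; `match_length` and `best_match` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define VEC 4 /* bytes per window and candidates per window, as in the example */

/* Count how many leading bytes of candidate equal current. The loop bound
 * is the constant VEC, so a compiler can fully unroll it. */
int match_length(const unsigned char *current, const unsigned char *candidate)
{
    int length = 0;
    int still_matching = 1;
    for (int i = 0; i < VEC; i++) {
        still_matching = still_matching && (current[i] == candidate[i]);
        length += still_matching;
    }
    return length;
}

/* Reduce the VEC candidate lengths to the best one; *best_index reports
 * which comparison window produced it. */
int best_match(const unsigned char *current,
               unsigned char candidates[VEC][VEC], int *best_index)
{
    int best = 0;
    assert(current != NULL && best_index != NULL);
    *best_index = 0;
    for (int d = 0; d < VEC; d++) {
        int len = match_length(current, candidates[d]);
        if (len > best) {
            best = len;
            *best_index = d;
        }
    }
    return best;
}
```

For the current window "text" and the comparison windows "teen", "texl", "text", "tan_", the individual lengths are 2, 3, 4 and 1, and the reduction returns 4, in line with "Best Length: 4" on slide 38.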
44 3. Match Search & Filtering. Best lengths for "textsamp"; cycle boundary. Select the best combination of matches from the set of candidate matches: 1. Remove matches that are longer when encoded than the original. 2. From the remaining set, select the best ones (last-fit, a heuristic for bin-packing). 3. Compute the first valid position for the next step.
45-46 3. Match Search & Filtering. Same steps, with the matches labeled "1 Too short", "2 Overlap" and "4 Last-fit".
47 3. Match Search & Filtering. Same steps; the diagram marks the last-fit match, the cycle boundary, and the first valid position of the next cycle.
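The slides name the three filtering steps but do not spell out how last-fit chooses. The sketch below is one plausible reading, offered as an assumption only: scan lanes from last to first, keep the last usable match, and keep an earlier match only if it ends before the next kept match starts. `MIN_MATCH` = 3 is likewise assumed, on the grounds that a reference costs at least 3 bytes.

```c
#include <assert.h>
#include <stddef.h>

#define VEC 4       /* lanes per cycle, as in the slide example */
#define MIN_MATCH 3 /* assumed: shorter matches cost more encoded than raw */

/* best_len[i]: best match length starting at lane i of this cycle.
 * first_valid: first lane not already covered by the previous cycle.
 * keep[i] is set to 1 for the matches that would be emitted.
 * Returns the first valid lane of the next cycle. */
int filter_matches(const int best_len[VEC], int first_valid, int keep[VEC])
{
    int next_start = -1; /* lane of the kept match after lane i, if any */
    int end = 0;         /* one past the last byte covered by a kept match */
    assert(best_len != NULL && keep != NULL);
    assert(first_valid >= 0 && first_valid <= VEC);
    for (int i = 0; i < VEC; i++)
        keep[i] = 0;
    for (int i = VEC - 1; i >= first_valid; i--) {
        if (best_len[i] < MIN_MATCH)
            continue; /* step 1: too short */
        if (next_start >= 0 && i + best_len[i] > next_start)
            continue; /* step 2: overlaps a later kept match */
        keep[i] = 1;
        if (next_start < 0)
            end = i + best_len[i]; /* the last match sets the reach */
        next_start = i;
    }
    return end > VEC ? end - VEC : 0; /* step 3 */
}
```

With best lengths {1, 3, 2, 4}, only the match in the last lane survives: lanes 0 and 2 are too short, lane 1 overlaps, and the kept match reaches 3 bytes into the next cycle.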
49 4. Writing to Output. Each reference is written as marker, length, offset. Length is limited by VEC (= 16 in our case), so it fits in 4 bits. Offset is limited by 0x40000 (it doesn't make sense to be more), so it fits in 21 bits. Use either 3 or 4 bytes for this. Offset < 2048: MARKER LENGTH OFFSET OFFSET. Otherwise: MARKER LENGTH OFFSET OFFSET OFFSET.
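A sketch of the 3-or-4-byte reference encoding. The slide gives the field order and the 2048 threshold but not the bit layout, so everything below the marker byte is an assumption: 4 bits of (length - 1), one flag bit selecting short or long form, then 11 or 19 offset bits. The `MARKER` value is a placeholder. Note that this layout spends only the 19 bits needed for offsets up to 0x40000, not the 21 bits quoted on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MARKER 0xFFu             /* placeholder; the slides give no value */
#define SHORT_OFFSET_LIMIT 2048u /* from the slide: offset < 2048 is short */
#define MAX_OFFSET 0x40000u      /* from the slide */

/* Writes a reference to out and returns its size in bytes (3 or 4). */
size_t encode_reference(unsigned length, uint32_t offset, uint8_t out[4])
{
    assert(length >= 1 && length <= 16);
    assert(offset <= MAX_OFFSET);
    out[0] = MARKER;
    if (offset < SHORT_OFFSET_LIMIT) {
        /* 16 bits: [length-1 : 4][flag = 0 : 1][offset : 11] */
        uint32_t bits = ((uint32_t)(length - 1) << 12) | offset;
        out[1] = (uint8_t)(bits >> 8);
        out[2] = (uint8_t)(bits & 0xFFu);
        return 3;
    }
    /* 24 bits: [length-1 : 4][flag = 1 : 1][offset : 19] */
    uint32_t bits = ((uint32_t)(length - 1) << 20) | (1u << 19) | offset;
    out[1] = (uint8_t)(bits >> 16);
    out[2] = (uint8_t)((bits >> 8) & 0xFFu);
    out[3] = (uint8_t)(bits & 0xFFu);
    return 4;
}

/* Reads a reference back; returns the number of bytes consumed. */
size_t decode_reference(const uint8_t *in, unsigned *length, uint32_t *offset)
{
    assert(in != NULL && length != NULL && offset != NULL);
    assert(in[0] == MARKER);
    *length = (unsigned)(in[1] >> 4) + 1u;
    if ((in[1] & 0x08u) == 0) { /* flag bit clear: short form */
        *offset = ((uint32_t)(in[1] & 0x07u) << 8) | in[2];
        return 3;
    }
    *offset = ((uint32_t)(in[1] & 0x07u) << 16) | ((uint32_t)in[2] << 8) | in[3];
    return 4;
}
```

For the LZ77 example earlier in the deck (length 8, offset 20), the short form applies, so the 8 matched bytes become 3 encoded bytes, which is the 5-byte saving shown on slide 15.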
50 Results
51 Comparison against CPU/Verilog. Best Gzips out there!
52 Comparison against CPU/Verilog. Best implementation of Gzip on CPU: by Intel Corporation, on an Intel Core i5 (32 nm) processor, 2013. Compression speed: 338 MB/s. Compression ratio: 2.18X.
53 Comparison against CPU/Verilog. Best implementation on ASICs: AHA products group, coming up Q. Compression speed: 2.5 GB/s.
54 Comparison against CPU/Verilog. Best implementation on FPGAs: Verilog, IBM Corporation, Nov ICCAD, Altera Stratix-V A7. Compression speed: 3 GB/s.
55 Comparison against CPU/Verilog. OpenCL design example: Altera Stratix-V A7, developed in 1 month. Compression speed? Compression ratio?
56 Comparison against CPU/Verilog. Compression speed: OpenCL 2.7 GB/s, Verilog on FPGA (IBM) 3 GB/s, ASIC (AHA) 2.5 GB/s, CPU (Intel) 0.3 GB/s.
57 Comparison against CPU. Same compression ratio. 12X better performance/watt.
58 Comparison against Verilog. 10% slower. 12% more resources. Much lower design effort and design time: days instead of months.
59 Thank You
More informationINVITED PAPER: USING OPENCL TO EVALUATE THE EFFICIENCY OF CPUS, GPUS AND FPGAS FOR INFORMATION FILTERING. Doris Chen, Deshanand Singh
INVITED PAPER: USING OPENCL TO EVALUATE THE EFFICIENCY OF CPUS, GPUS AND FPGAS FOR INFORMATION FILTERING Doris Chen, Deshanand Singh Altera Toronto Technology Center Toronto, Ontario, Canada dochen, dsingh@altera.com
More informationFractal Video Compression in OpenCL: An Evaluation of CPUs, GPUs, and FPGAs as Acceleration Platforms
Fractal Video Compression in OpenCL: An Evaluation of CPUs, GPUs, and FPGAs as Acceleration Platforms Doris Chen Altera Toronto Technology Center Toronto, Ontario, Canada e-mail: dochen@altera.com Deshanand
More informationChapter 8: Memory-Management Strategies
Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and
More informationPerformance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference
The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee
More informationSDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center
SDAccel Environment The Xilinx SDAccel Development Environment Bringing The Best Performance/Watt to the Data Center Introduction Data center operators constantly seek more server performance. Currently
More informationInterconnection Network for Tightly Coupled Accelerators Architecture
Interconnection Network for Tightly Coupled Accelerators Architecture Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, Mitsuhisa Sato Center for Computational Sciences University of Tsukuba, Japan 1 What
More informationLab Determining Data Storage Capacity
Lab 1.3.2 Determining Data Storage Capacity Objectives Determine the amount of RAM (in MB) installed in a PC. Determine the size of the hard disk drive (in GB) installed in a PC. Determine the used and
More informationmemory management Vaibhav Bajpai
memory management Vaibhav Bajpai OS 2013 motivation virtualize resources: multiplex CPU multiplex memory (CPU scheduling) (memory management) why manage memory? controlled overlap processes should NOT
More informationHeterogeneous Computing and OpenCL
Heterogeneous Computing and OpenCL Hongsuk Yi (hsyi@kisti.re.kr) (Korea Institute of Science and Technology Information) Contents Overview of the Heterogeneous Computing Introduction to Intel Xeon Phi
More informationSynthesizable FPGA Fabrics Targetable by the VTR CAD Tool
Synthesizable FPGA Fabrics Targetable by the VTR CAD Tool Jin Hee Kim and Jason Anderson FPL 2015 London, UK September 3, 2015 2 Motivation for Synthesizable FPGA Trend towards ASIC design flow Design
More informationMapping-Aware Constrained Scheduling for LUT-Based FPGAs
Mapping-Aware Constrained Scheduling for LUT-Based FPGAs Mingxing Tan, Steve Dai, Udit Gupta, Zhiru Zhang School of Electrical and Computer Engineering Cornell University High-Level Synthesis (HLS) for
More informationWhite Paper The Need for a High-Bandwidth Memory Architecture in Programmable Logic Devices
Introduction White Paper The Need for a High-Bandwidth Memory Architecture in Programmable Logic Devices One of the challenges faced by engineers designing communications equipment is that memory devices
More informationParallel LZ77 Decoding with a GPU. Emmanuel Morfiadakis Supervisor: Dr Eric McCreath College of Engineering and Computer Science, ANU
Parallel LZ77 Decoding with a GPU Emmanuel Morfiadakis Supervisor: Dr Eric McCreath College of Engineering and Computer Science, ANU Outline Background (What?) Problem definition and motivation (Why?)
More informationVirtual Memory. Study Chapters something I understand. Finally! A lecture on PAGE FAULTS! doing NAND gates. I wish we were still
Virtual Memory I wish we were still doing NAND gates Study Chapters 7.4-7.8 Finally! A lecture on something I understand PAGE FAULTS! L23 Virtual Memory 1 You can never be too rich, too good looking, or
More informationJignesh M. Patel. Blog:
Jignesh M. Patel Blog: http://bigfastdata.blogspot.com Go back to the design Query Cache from Processing for Conscious 98s Modern (at Algorithms Hardware least for Hash Joins) 995 24 2 Processor Processor
More informationBits and Bit Patterns
Bits and Bit Patterns Bit: Binary Digit (0 or 1) Bit Patterns are used to represent information. Numbers Text characters Images Sound And others 0-1 Boolean Operations Boolean Operation: An operation that
More informationAnalysis of Parallelization Effects on Textual Data Compression
Analysis of Parallelization Effects on Textual Data GORAN MARTINOVIC, CASLAV LIVADA, DRAGO ZAGAR Faculty of Electrical Engineering Josip Juraj Strossmayer University of Osijek Kneza Trpimira 2b, 31000
More informationUnderstanding Primary Storage Optimization Options Jered Floyd Permabit Technology Corp.
Understanding Primary Storage Optimization Options Jered Floyd Permabit Technology Corp. Primary Storage Optimization Technologies that let you store more data on the same storage Thin provisioning Copy-on-write
More informationROOT I/O compression algorithms. Oksana Shadura, Brian Bockelman University of Nebraska-Lincoln
ROOT I/O compression algorithms Oksana Shadura, Brian Bockelman University of Nebraska-Lincoln Introduction Compression Algorithms 2 Compression algorithms Los Reduces size by permanently eliminating certain
More informationAccelerating String Matching Using Multi-threaded Algorithm
Accelerating String Matching Using Multi-threaded Algorithm on GPU Cheng-Hung Lin*, Sheng-Yu Tsai**, Chen-Hsiung Liu**, Shih-Chieh Chang**, Jyuo-Min Shyu** *National Taiwan Normal University, Taiwan **National
More informationCHAPTER 8 - MEMORY MANAGEMENT STRATEGIES
CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide
More informationLOSSLESS DATA COMPRESSION AND DECOMPRESSION ALGORITHM AND ITS HARDWARE ARCHITECTURE
LOSSLESS DATA COMPRESSION AND DECOMPRESSION ALGORITHM AND ITS HARDWARE ARCHITECTURE V V V SAGAR 1 1JTO MPLS NOC BSNL BANGALORE ---------------------------------------------------------------------***----------------------------------------------------------------------
More information