Gzip Compression Using Altera OpenCL. Mohamed Abdelfattah (University of Toronto) Andrei Hagiescu Deshanand Singh

1 Gzip Compression Using Altera OpenCL Mohamed Abdelfattah (University of Toronto) Andrei Hagiescu Deshanand Singh

2 Gzip
- Widely-used lossless compression program
- Gzip = LZ77 + Huffman
- Big data needs fast compression (gigabyte-per-second)
- Lower disk space in data centers
- Less power on communication networks

3-8 LZ77 Compression Example
Input: "This sentence is an easy sentence to compress."
1. Scan file byte by byte
2. Look for matches
3. Replace with a reference to previous occurrence

9-13 LZ77 Compression Example
Input: "This sentence is an easy sentence to compress."
1. Scan file byte by byte
2. Look for matches
   1. Match length: grows byte by byte (2, 3, ...) until the second "sentence" is fully matched, giving match length = 8
   2. Match offset = 20 bytes (the distance back to the first "sentence")
3. Replace with a reference to previous occurrence

14-15 LZ77 Compression Example
Before: "This sentence is an easy sentence to compress."
After:  "This sentence is an easy (marker, length, offset) to compress."
1. Scan file byte by byte
2. Look for matches: match length = 8, match offset = 20
3. Replace with a reference to previous occurrence: marker, length, offset
Saved 5 bytes! (8 matched bytes replaced by a 3-byte reference)
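To make "a reference to previous occurrence" concrete, here is a minimal C sketch of how a decompressor could expand one (length, offset) pair. The function name `lz77_expand` and its signature are made up for this illustration; they are not taken from the slides or from gzip.

```c
#include <assert.h>
#include <stddef.h>

/* Expand one LZ77 reference: append `length` bytes to `out`, copied from
 * `offset` bytes before the current end of the output.
 * Returns the new output length, or `out_len` unchanged if the reference
 * is invalid or the result would not fit in `cap` bytes. */
size_t lz77_expand(char *out, size_t out_len, size_t cap,
                   size_t length, size_t offset)
{
    assert(out != NULL);
    if (offset == 0 || offset > out_len || out_len + length > cap)
        return out_len;
    for (size_t i = 0; i < length; i++) {
        /* Byte-by-byte copy, so a reference may overlap its own output. */
        out[out_len + i] = out[out_len + i - offset];
    }
    return out_len + length;
}
```

With the 25 bytes "This sentence is an easy " already decoded, calling `lz77_expand(buf, 25, sizeof buf, 8, 20)` appends the second "sentence", using the length and offset from the example.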

16-19 Altera OpenCL Compiler for FPGAs
Host code (runs on the host CPU, connected to the FPGA accelerator over PCIe):
    // host code
    // Enqueue buffer
    // Enqueue kernel(s)
    // Dequeue buffers
OpenCL single-threaded code (compiled for the FPGA by Altera's OpenCL Compiler):
    __kernel void simple(__global const int *input, int size, __global int *output)
    {
        // Buffers must hold at least size + 2 (input) and size + 1 (output) elements.
        for (int i = 1; i <= size; i++) {
            int x = input[i];      // Load x
            int y = input[i + 1];  // Load y
            int z = x + y;
            output[i] = z;         // Store z
        }
    }
Generated hardware: a pipeline of Load x, Load y and Store z units attached to DDRx memory. The animation shows successive loop iterations (1, 2, ...) entering the pipeline one after another.

20 FPGAs can be VERY Custom
- Host: host CPU over PCIe, or ARM host on the FPGA chip
- IO channels into and out of the FPGA accelerator (Load x, Load y, Store z)
- Different memory types: DDRx memory, RDL?, QDR?

21 Implementation Overview
1. Shift In New Data
2. Dictionary Lookup/Update
3. Match Search & Filtering
4. Write to output

22-26 1. Shift In New Data
- The current window is filled with input from DDR memory.
- Example: the current window holds "old_text" and the remaining input is "sample_text". Text is used in our example, but the data can be anything.
- VEC = 4: each cycle boundary shifts VEC new bytes in.
- After one cycle the oldest VEC bytes ("old_") have been shifted out, the window holds "textsamp", and the remaining input is "le_text".
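The shift step can be sketched in plain C as below. This is an illustration, not the authors' kernel code: the names `window_t` and `shift_in` are hypothetical, and the window size of 2 * VEC bytes is read off the 8-byte window shown in the example with VEC = 4.

```c
#include <assert.h>
#include <string.h>

#define VEC 4  /* bytes consumed per cycle (the example in the slides uses 4) */

/* Current window: 2 * VEC bytes, so each of the first VEC positions
 * still has VEC bytes of look-ahead. */
typedef struct {
    unsigned char bytes[2 * VEC];
} window_t;

/* One "cycle": drop the oldest VEC bytes and append VEC new input bytes. */
void shift_in(window_t *w, const unsigned char *input)
{
    assert(w != NULL && input != NULL);
    memmove(w->bytes, w->bytes + VEC, VEC);  /* newer half moves to the front */
    memcpy(w->bytes + VEC, input, VEC);      /* new data fills the back half  */
}
```

Starting from a window holding "old_text", one call with input "sample_text" leaves "textsamp" in the window, as in the example.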

28-33 2. Dictionary Lookup/Update
Current window: "textsamp". With VEC = 4 it yields four substrings: "text", "exts", "xtsa", "tsam".
1. Compute hash (one per substring)
2. Look for match in 4 dictionaries
3. Update dictionaries
Dictionaries buffer the text that we have already processed. Possible matches from history (dictionaries), listed as the entries read from Dictionary 0, 1, 2, 3:
- "text": "tan_", "text", "texl", "teen"
- "exts": "eate", "ears", "eeps", "ente"
- "xtsa": "xant", "xylo", "xely", "xirt"
- "tsam": "tan_", "tame", "teal", "teen"

34 2. Dictionary Lookup/Update
- Each dictionary has one write port and four read ports in the example: Dictionary 0 has W0 and RD00-RD03, Dictionary 1 has W1 and RD10-RD13, Dictionary 2 has W2 and RD20-RD23, Dictionary 3 has W3 and RD30-RD33.
- Generate exactly the number of read/write ports that we need, and the width: 256 read ports, 16 write ports, 128 bits (consistent with VEC = 16: 16 x 16 read ports, 16 write ports, 16 bytes per entry).
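A software model of one lookup/update cycle might look like the following C sketch. Several things here are assumptions rather than facts from the slides: the hash function (the slides do not specify one), the dictionary depth `DICT_SIZE`, and the rule that substring i is written to dictionary i, which is inferred from the one-write-port-per-dictionary diagram.

```c
#include <assert.h>
#include <string.h>

#define VEC 4          /* substrings per cycle, as in the example */
#define DICT_SIZE 256  /* entries per dictionary: arbitrary choice for this sketch */

typedef struct {
    unsigned char entry[DICT_SIZE][VEC];
} dict_t;

/* Hypothetical hash of one VEC-byte substring (not the authors' hash). */
static unsigned hash_substring(const unsigned char *s)
{
    unsigned h = 0;
    for (int k = 0; k < VEC; k++)
        h = h * 31u + s[k];
    return h % DICT_SIZE;
}

/* One cycle: for each of the VEC substrings of the 2*VEC-byte window, read a
 * candidate from every dictionary, then store substring i in dictionary i. */
void dict_lookup_update(dict_t dict[VEC], const unsigned char window[2 * VEC],
                        unsigned char candidates[VEC][VEC][VEC])
{
    unsigned h[VEC];
    assert(dict != NULL && window != NULL && candidates != NULL);

    for (int i = 0; i < VEC; i++)
        h[i] = hash_substring(window + i);

    /* Lookup: VEC * VEC reads in total (substring i, dictionary d). */
    for (int i = 0; i < VEC; i++)
        for (int d = 0; d < VEC; d++)
            memcpy(candidates[i][d], dict[d].entry[h[i]], VEC);

    /* Update: one write per dictionary, after all reads are done. */
    for (int i = 0; i < VEC; i++)
        memcpy(dict[i].entry[h[i]], window + i, VEC);
}
```

The VEC * VEC reads and VEC writes per cycle mirror the port counts on the slide: 4 x 4 reads and 4 writes here, 256 reads and 16 writes for VEC = 16.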

36-38 3. Match Search & Filtering
- Current windows (the substrings): "text", "exts", "xtsa", "tsam".
- Comparison windows (a set of candidate matches for each incoming substring):
  - "text": "teen", "texl", "text", "tan_"
  - "exts": "ente", "eeps", "ears", "eate"
  - "xtsa": "xirt", "xely", "xylo", "xant"
  - "tsam": "teen", "teal", "tame", "tan_"
- Compare each current window against each of its 4 comparison windows. Comparators compare each byte and produce one match length per comparison window.
- Match reduction keeps the best length. For current window "text" the best length is 4. We have another 3 of those, one for each remaining current window.

42 3. Match Search & Filtering
- Typical C code
- Fixed loop bounds: the compiler can unroll the loop

43 3. Match Search & Filtering
One bestlength is associated with each current_window ("text", "exts", "xtsa", "tsam") taken from "textsamp".
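The code shown on the original slides did not survive extraction. As a stand-in, here is a C sketch of a match search with fixed loop bounds. The function names and array shapes are assumptions for this illustration, not the authors' code.

```c
#include <assert.h>
#include <stddef.h>

#define VEC 4  /* example width; the design described in the slides uses 16 */

/* Number of leading bytes that two VEC-byte strings have in common. */
static int match_length(const unsigned char *cur, const unsigned char *cmp)
{
    int length = 0;
    int still_matching = 1;
    /* Fixed trip count and no early exit, so the loop can be fully unrolled. */
    for (int k = 0; k < VEC; k++) {
        if (still_matching && cur[k] == cmp[k])
            length++;
        else
            still_matching = 0;
    }
    return length;
}

/* For each of the VEC current windows, compare against its VEC comparison
 * windows and keep the longest match (the "match reduction" step). */
void best_lengths(const unsigned char window[2 * VEC],
                  unsigned char candidates[VEC][VEC][VEC],
                  int best_length[VEC], int best_index[VEC])
{
    assert(window != NULL && candidates != NULL);
    for (int i = 0; i < VEC; i++) {
        best_length[i] = 0;
        best_index[i] = 0;
        for (int d = 0; d < VEC; d++) {
            int len = match_length(window + i, candidates[i][d]);
            if (len > best_length[i]) {
                best_length[i] = len;
                best_index[i] = d;
            }
        }
    }
}
```

For current window "text" with comparison windows "teen", "texl", "text", "tan_", the per-window lengths are 2, 3, 4, 1, and the reduction reports a best length of 4.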

44-47 3. Match Search & Filtering
Select the best combination of matches from the set of candidate matches:
1. Remove matches that are longer when encoded than original
2. From the remaining set, select the best ones (heuristic for bin-packing): last-fit
3. Compute first valid position for next step
Example for current window "textsamp" (cycle boundary after "text"):
- Best lengths at positions 0, 1, 2, 3: 3, 1, 3, 4
- Candidate matches are labeled too short, overlap, or last-fit
- First valid position next cycle: 3 (the match at position 3 has length 4, so it covers the first 3 bytes of the next cycle)
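One plausible reading of the three filtering steps is sketched below in C. It is an interpretation, not the authors' implementation: the backward scan order, the `MIN_MATCH` threshold of 3 bytes, and the names `select_matches`, `keep` and `first_valid` are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

#define VEC 4        /* positions per cycle, as in the example */
#define MIN_MATCH 3  /* assumed: a shorter match is not worth a 3-byte reference */

/* Decide which positions of this cycle start an emitted match.
 * best_length[i]: best match length starting at position i.
 * first_valid:    first position not already covered by a match selected
 *                 in the previous cycle.
 * keep[i]:        set to 1 if the match starting at position i is selected.
 * Returns the first valid position for the next cycle. */
int select_matches(const int best_length[VEC], int first_valid, int keep[VEC])
{
    int limit = 2 * VEC;       /* start of the nearest later selected match */
    int next_first_valid = 0;

    assert(best_length != NULL && keep != NULL);
    assert(first_valid >= 0 && first_valid <= VEC);

    /* Last-fit: walk from the last position towards the first. */
    for (int i = VEC - 1; i >= 0; i--) {
        int len = best_length[i];
        keep[i] = 0;
        if (i < first_valid) continue;  /* covered by the previous cycle     */
        if (len < MIN_MATCH) continue;  /* too short                         */
        if (i + len > limit) continue;  /* overlaps a later selected match   */
        keep[i] = 1;
        limit = i;
        if (i + len - VEC > next_first_valid)
            next_first_valid = i + len - VEC;
    }
    return next_first_valid;
}
```

With best lengths 3, 1, 3, 4 and `first_valid` = 0, this sketch keeps the matches at positions 0 and 3, rejects position 1 as too short and position 2 as overlapping, and returns 3 as the first valid position for the next cycle. Positions not covered by a kept match would be written out as literal bytes.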

49 4. Writing to Output
Each match is written as: marker, length, offset.
- Length is limited by VEC (=16 in our case): fits in 4 bits
- Offset is limited by 0x40000 (doesn't make sense to be more): fits in 21 bits
- Use either 3 or 4 bytes for this:
  - Offset < 2048: MARKER LENGTH OFFSET OFFSET
  - Otherwise: MARKER LENGTH OFFSET OFFSET OFFSET
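The slide gives the field order and the 3-byte/4-byte split but not the exact bit layout. The C sketch below fills that in with a purely hypothetical packing (8-bit marker, 4 bits of length minus one, one flag bit, then 11 or 19 offset bits), chosen only so that the two forms come out at 3 and 4 bytes. The marker value 0xFF and the name `encode_reference` are also assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MARKER       0xFFu     /* hypothetical marker byte                   */
#define SHORT_OFFSET 2048u     /* below this, the reference takes 3 bytes    */
#define MAX_OFFSET   0x40000u  /* offset limit quoted on the slide           */

/* Write one reference into `out`, which must have room for 4 bytes.
 * Assumed layout after the marker byte, most significant bit first:
 *   4 bits  length - 1
 *   1 bit   form flag (0 = short, 1 = long)
 *   11 bits offset (short form) or 19 bits offset (long form)
 * Returns the number of bytes written: 3 or 4. */
size_t encode_reference(uint8_t *out, unsigned length, uint32_t offset)
{
    assert(out != NULL);
    assert(length >= 1 && length <= 16);
    assert(offset >= 1 && offset <= MAX_OFFSET);

    out[0] = (uint8_t)MARKER;
    if (offset < SHORT_OFFSET) {
        /* 4 + 1 + 11 = 16 bits after the marker; the flag bit stays 0. */
        uint32_t bits = ((uint32_t)(length - 1) << 12) | offset;
        out[1] = (uint8_t)(bits >> 8);
        out[2] = (uint8_t)(bits & 0xFFu);
        return 3;
    }
    /* 4 + 1 + 19 = 24 bits after the marker; the flag bit is set. */
    uint32_t bits = ((uint32_t)(length - 1) << 20) | (1u << 19) | offset;
    out[1] = (uint8_t)(bits >> 16);
    out[2] = (uint8_t)((bits >> 8) & 0xFFu);
    out[3] = (uint8_t)(bits & 0xFFu);
    return 4;
}
```

Under this assumed layout, the match from the LZ77 example (length 8, offset 20) takes the short form and occupies 3 bytes. A real encoder would also need a way to distinguish the marker from literal data, which this sketch leaves out.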

50 Results

51 Comparison against CPU/Verilog
Best Gzips out there!

52 Comparison against CPU/Verilog
Best implementation of Gzip on CPU
- By Intel Corporation
- On Intel Core i5 (32nm) processor
- 2013
- Compression speed: 338 MB/s
- Compression ratio: 2.18X

53 Comparison against CPU/Verilog
Best implementation on ASICs
- AHA products group
- Coming up Q
- Compression speed: 2.5 GB/s

54 Comparison against CPU/Verilog
Best implementation on FPGAs
- Verilog
- IBM Corporation
- Nov ICCAD
- Altera Stratix-V A7
- Compression speed: 3 GB/s

55 Comparison against CPU/Verilog
OpenCL design example
- Altera Stratix-V A7
- Developed in 1 month
- Compression speed?
- Compression ratio?

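56 Comparison against CPU/Verilog
Compression speed:
- OpenCL design example (Stratix-V A7): 2.7 GB/s
- Verilog on FPGA (IBM, Stratix-V A7): 3 GB/s
- ASIC (AHA): 2.5 GB/s
- CPU (Intel Core i5): 0.3 GB/s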

57 Comparison against CPU
- Same compression ratio
- 12X better performance/watt

58 Comparison against Verilog
- 10% slower
- 12% more resources
- Much lower design effort and design time
- Days instead of months

59 Thank You
