Parallelized Progressive Network Coding with Hardware Acceleration

1 Parallelized Progressive Network Coding with Hardware Acceleration. Hassan Shojania, Baochun Li, Department of Electrical and Computer Engineering, University of Toronto.

2-5 Network coding: information is coded at potentially every node. [Figure: a seven-node example network in which source blocks a and b are forwarded along separate edges and an intermediate node transmits the coded block a+b instead of a or b alone.]

6 Randomized network coding [Ho et al. 2003]. [Figure: the original data "h e l l o" is divided into segments, passed through Encode, and recovered by Decode.] The scheme operates in GF(2^8); an encoded block is a random linear combination of the segments, e.g. 166·h + 191·e + 216·l + 109·l + 237·o. The coding coefficients are randomly generated.
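
The encoding step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: gf256_mul is a placeholder for any GF(2^8) multiplication routine, and rand() stands in for whatever random generator is actually used.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    uint8_t gf256_mul(uint8_t a, uint8_t b);   /* GF(2^8) multiply, defined elsewhere */

    /* Produce one coded block from n original segments of k bytes each.
     * coef[] receives the n random coefficients that must travel with the block. */
    void encode_block(const uint8_t *segments, int n, int k,
                      uint8_t *coef, uint8_t *out)
    {
        memset(out, 0, (size_t)k);
        for (int i = 0; i < n; i++) {
            coef[i] = (uint8_t)(rand() & 0xff);        /* random coefficient in GF(2^8) */
            for (int j = 0; j < k; j++)                /* out += coef[i] * segment i    */
                out[j] ^= gf256_mul(coef[i], segments[i * k + j]);
        }
    }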

7-10 To code or not to code on a network node?
- In content distribution applications [Gkantsidis et al. INFOCOM 2005]: simplify the protocols, improve resilience to network dynamics.
- In wireless networks [Katti et al. SIGCOMM 2006]: naturally leverage wireless broadcast channels, alleviate congested locality.
- Do the advantages come for free? High coding complexity [Wang and Li, IWQoS 2006], plus the communication overhead of carrying coefficients.

11 Our contributions

12-18 Coming (almost) free to a desktop near you:
- The first implementation of randomized network coding with a focus on high performance.
- Goal: alleviate the number one challenge of network coding, its complexity, using a combination of three tricks (to be discussed).
- Future use of network coding is then fully justified: trade processor power for coding benefits.

19 A 20x performance boost over a baseline implementation before this work. Example after optimization: 348 Mbps (64 blocks, 32 KB each) on a Power Mac G5 Quad (circa October 2005).

20 Progressive decoding as blocks arrive: b = C^(-1) x^T (recover the original blocks b from the received coded blocks x and the coefficient matrix C).

21-30 Progressive decoding: Gauss-Jordan elimination. Decoding time overlaps with the time needed to receive the blocks. [Figure: the augmented matrix [coding coefficients | data] is reduced step by step as each coded block arrives; once all blocks are in, the coefficient part becomes the identity and the data part holds the original segments "h e l l o".]
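
A sketch of the per-block step of such a progressive Gauss-Jordan decoder is given below. It is a reconstruction from the figure's description rather than the authors' code; gf256_mul, gf256_div and the Decoder layout are hypothetical names introduced here. Each incoming row [coefficients | data] is reduced against the rows received so far, normalized, and back-substituted, so the stored matrix stays in reduced form and decoding finishes as soon as the rank reaches n.

    #include <stdint.h>
    #include <string.h>

    uint8_t gf256_mul(uint8_t a, uint8_t b);   /* GF(2^8) multiply */
    uint8_t gf256_div(uint8_t a, uint8_t b);   /* a * b^(-1) in GF(2^8) */

    typedef struct {
        int n, k, rank;
        int *pivot;      /* pivot column of each stored row */
        uint8_t *coef;   /* n x n coefficients, row-major */
        uint8_t *data;   /* n x k data bytes, row-major */
    } Decoder;

    /* Reduce one incoming coded block (c = n coefficients, x = k data bytes).
     * Returns 1 if the block was innovative, 0 if it was linearly dependent. */
    int decode_block(Decoder *d, uint8_t *c, uint8_t *x)
    {
        int n = d->n, k = d->k;

        /* Eliminate the new row against every pivot row stored so far. */
        for (int r = 0; r < d->rank; r++) {
            uint8_t f = c[d->pivot[r]];
            if (f == 0) continue;
            for (int j = 0; j < n; j++) c[j] ^= gf256_mul(f, d->coef[r * n + j]);
            for (int j = 0; j < k; j++) x[j] ^= gf256_mul(f, d->data[r * k + j]);
        }

        /* Find this row's pivot; an all-zero coefficient row is not innovative. */
        int p = 0;
        while (p < n && c[p] == 0) p++;
        if (p == n) return 0;

        /* Normalize the pivot element to 1. */
        uint8_t piv = c[p];
        for (int j = 0; j < n; j++) c[j] = gf256_div(c[j], piv);
        for (int j = 0; j < k; j++) x[j] = gf256_div(x[j], piv);

        /* Back-substitute into the earlier rows so the matrix stays reduced. */
        for (int r = 0; r < d->rank; r++) {
            uint8_t f = d->coef[r * n + p];
            if (f == 0) continue;
            for (int j = 0; j < n; j++) d->coef[r * n + j] ^= gf256_mul(f, c[j]);
            for (int j = 0; j < k; j++) d->data[r * k + j] ^= gf256_mul(f, x[j]);
        }

        /* Store the new row; decoding completes when rank reaches n. */
        memcpy(&d->coef[d->rank * n], c, (size_t)n);
        memcpy(&d->data[d->rank * k], x, (size_t)k);
        d->pivot[d->rank++] = p;
        return 1;
    }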

31 Hardware acceleration using SIMD vector instructions (x86 SSE2 and PowerPC AltiVec): b = C^(-1) x^T.

32-36 Baseline implementation: the bottleneck.
- Multiplication in GF(2^8): exp[log[x] + log[y]] (sketched below).
- Requires 3 table lookups, i.e. expensive memory accesses.
- This is the basic building block, executed within tight nested loops.
- Impossible to accelerate with SIMD vector instructions.
- Changing to GF(2^16) does not help.
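
For reference, the baseline table-lookup multiply looks roughly like the sketch below; this is a reconstruction from the description above, not the authors' source. The log/exp tables over a generator of GF(2^8)'s multiplicative group are assumed to be precomputed, with the exp table doubled so the summed index never needs a "mod 255".

    #include <stdint.h>

    extern const uint8_t exp_tab[512];   /* exp_tab[i] = g^i, repeated past index 254 */
    extern const uint8_t log_tab[256];   /* log_tab[g^i] = i, log_tab[0] unused       */

    /* Baseline GF(2^8) multiply: three table lookups per byte pair. */
    static inline uint8_t gf256_mul_table(uint8_t x, uint8_t y)
    {
        if (x == 0 || y == 0)
            return 0;
        return exp_tab[log_tab[x] + log_tab[y]];   /* exp[log[x] + log[y]] */
    }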

37-41 Solution: multiply using a loop-based approach.

    result = 0;
    while (x != 0) {
        if ((x & 1) != 0)
            result = result ^ y;    /* accumulate y where the current bit of x is set */
        overflowing = y & 0x80;
        y = y << 1;                 /* double y in GF(2^8) ...                         */
        if (overflowing)
            y = y ^ 0x1d;           /* ... reducing by the field polynomial (0x11d)    */
        x = x >> 1;
    }

- Mainly bit shifting and XORs in a loop.
- Slower than table lookups without acceleration.
- But it can be accelerated with SIMD vector instructions to operate on 16 bytes concurrently in 128-bit registers (see the sketch after this list).
- Challenge: 16-byte alignment of memory allocations is OS-specific.
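
The following is a sketch of how the loop above maps onto SSE2, multiplying 16 data bytes by one coefficient per pass; it is an illustrative reconstruction based on the slide's description, not the authors' code (AltiVec would use the analogous vector operations). SSE2 has no per-byte shift, so the 16-bit shift is masked to keep the bytes independent. The alignment challenge mentioned above is typically handled with posix_memalign on Linux and Mac OS X and _aligned_malloc on Windows.

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdint.h>

    /* Multiply 16 GF(2^8) elements held in one 128-bit register by the scalar
     * coefficient c, using the same shift-and-XOR recurrence as the scalar loop. */
    static __m128i gf256_mul_vec(__m128i v, uint8_t c)
    {
        const __m128i poly = _mm_set1_epi8(0x1d);          /* reduction constant     */
        const __m128i lsb0 = _mm_set1_epi8((char)0xfe);    /* clears bit 0 per byte  */
        const __m128i zero = _mm_setzero_si128();
        __m128i result = _mm_setzero_si128();

        for (int bit = 0; bit < 8; bit++) {
            if (c & 1)
                result = _mm_xor_si128(result, v);         /* accumulate where c has a 1 bit */
            c >>= 1;

            /* overflow mask: 0xff in every byte whose high bit is set */
            __m128i carry = _mm_cmpgt_epi8(zero, v);
            /* per-byte left shift: shift 16-bit lanes, then clear the bit that
             * leaked across each byte boundary */
            v = _mm_and_si128(_mm_slli_epi16(v, 1), lsb0);
            /* conditionally XOR the reduction polynomial where there was overflow */
            v = _mm_xor_si128(v, _mm_and_si128(carry, poly));
        }
        return result;
    }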

42-46 SIMD instructions: available everywhere.
- Intel processors: SSE2 (Streaming SIMD Extensions 2) available since the Pentium 4 (2001).
- AMD processors: SSE2 available since the Athlon 64 and Opteron (2004).
- IBM PowerPC family of processors: AltiVec available in the PowerPC G4 and PowerPC G5 (since 2001).
- Our implementation supports all of the above SIMD instruction sets, running on Mac OS X, Microsoft Windows, and Linux (a possible runtime-dispatch sketch follows this list).
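
One way such multi-platform support is typically wired up is a small runtime dispatch between the SIMD and scalar code paths. The sketch below is illustrative only; the gf256_mul_region_*() names are placeholders for a routine that multiplies a k-byte region by one coefficient and XORs it into a destination row.

    #include <stddef.h>
    #include <stdint.h>

    void gf256_mul_region_scalar(uint8_t *dst, const uint8_t *src, uint8_t c, size_t k);
    void gf256_mul_region_sse2  (uint8_t *dst, const uint8_t *src, uint8_t c, size_t k);

    typedef void (*mul_region_fn)(uint8_t *, const uint8_t *, uint8_t, size_t);

    /* Pick the fastest available implementation once at startup. */
    mul_region_fn pick_mul_region(void)
    {
    #if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
        if (__builtin_cpu_supports("sse2"))    /* GCC/Clang runtime CPU-feature check */
            return gf256_mul_region_sse2;
    #endif
        return gf256_mul_region_scalar;        /* portable fallback */
    }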

47 Speedup with SIMD acceleration: decoding. [Plot: speedup versus block size (bytes) for Quad Xeon and Quad G5 at n = 64, 128, and 256 blocks.]

48 Decoding performance with SIMD acceleration. [Plot: decoding rate in MBytes/second versus block size (bytes) for Quad Xeon and Quad G5 at n = 64, 128, and 256 blocks.]

49 Parallelized implementation to utilize multi-core processors: b = C^(-1) x^T.

50 Partitioning the decoding of coded blocks. [Figure: rows of n coding coefficients c'_1..c'_3 and k bytes of coded data x'_1..x'_3; the data portion is split into two halves of k/2 bytes, one per thread, while each thread works with the full n coefficients.]

51 Speedup of threaded decoding. [Plot: speedup versus block size (bytes) for Quad Xeon and Quad G5 at n = 64, 128, and 256 blocks.]

52 Performance of threaded decoding. [Plot: decoding rate in MBytes/second versus block size (bytes) for Quad Xeon and Quad G5 at n = 64, 128, and 256 blocks.]

53 Partitioning both coefficients and coded blocks. [Figure: in the more aggressive scheme, the n coefficients of each row are also split into two halves of n/2, so each thread reduces both its half of the coefficients and its k/2 bytes of coded data.]
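
A minimal sketch of the first partitioning scheme using POSIX threads: both threads apply the same row operation, but to disjoint k/2-byte halves of the coded data. The names and structure are illustrative; a real implementation would keep a persistent pool of worker threads rather than creating and joining a thread per row operation.

    #include <pthread.h>
    #include <stdint.h>

    uint8_t gf256_mul(uint8_t a, uint8_t b);   /* GF(2^8) multiply */

    /* One elimination task: dst[j] ^= f * src[j] over columns [col_lo, col_hi). */
    typedef struct {
        uint8_t *dst;
        const uint8_t *src;
        uint8_t f;
        int col_lo, col_hi;
    } RowOp;

    static void *row_op_worker(void *arg)
    {
        RowOp *op = (RowOp *)arg;
        for (int j = op->col_lo; j < op->col_hi; j++)
            op->dst[j] ^= gf256_mul(op->f, op->src[j]);
        return NULL;
    }

    /* Apply one row operation to a k-byte data row, split across two threads. */
    void row_op_parallel(uint8_t *dst, const uint8_t *src, uint8_t f, int k)
    {
        pthread_t t;
        RowOp lo = { dst, src, f, 0, k / 2 };     /* thread 1: first half  */
        RowOp hi = { dst, src, f, k / 2, k };     /* caller:   second half */
        pthread_create(&t, NULL, row_op_worker, &lo);
        row_op_worker(&hi);
        pthread_join(t, NULL);
    }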

54 Performance of aggressive threading. [Plot: performance (MB/sec) and speedup versus the number of blocks, at 1024 bytes per block, for Quad Xeon, Dual Xeon, and Quad G5.]

55 Coding performance on various platforms (blocks of 4 KB each; the measured encoding and decoding rates in MB/s did not survive the transcription):
- Quad PowerPC G5, 2.5 GHz: Mac OS X, AltiVec, 4 threads, 1 MB L2 cache
- Quad P4 Xeon, 2.8 GHz: Linux, SSE2 (thread count and L2 cache size not recovered)
- Dual Opteron (AMD), 2.4 GHz: Linux, SSE2, 2 threads, 1 MB L2 cache
- Dual P4 Xeon, 3.6 GHz: Linux, SSE2, 2 threads, 2 MB L2 cache
- iMac Intel Core Duo, 1.83 GHz: Mac OS X, SSE2, 2 threads, 2 MB (shared) L2 cache
- Intel Core Duo, 1.66 GHz: Windows XP, SSE2, 2 threads, 2 MB (shared) L2 cache

56-62 The lunch is almost free. Network coding is almost free with current processors:
- 1248 Mbps for 16 blocks of 32 KB each.
- 348 Mbps for 64 blocks of 32 KB each.
- Encode and decode as you progress.
- Supports all modern processors (Intel, AMD, PowerPC) and OSes (Windows, Mac OS X, Linux).
- Don't forget Moore's Law!

63 To revisit this presentation (PDF or Flash): iqua.ece.toronto.edu. Hassan Shojania, Baochun Li, Department of Electrical and Computer Engineering, University of Toronto.
