Efficiency Considerations of Cauchy Reed-Solomon Implementations on Accelerator and Multi-Core Platforms

1 Efficiency Considerations of Cauchy Reed-Solomon Implementations on Accelerator and Multi-Core Platforms
SAAHPC, June, Knoxville, TN
Kathrin Peter, Sebastian Borchert, Thomas Steinke
Zuse Institute Berlin (ZIB)

2 Outline
- Motivation
- The Reed-Solomon algorithm
- Platforms and implementations
- Reed-Solomon throughput and efficiency
- Conclusions

3 Motivation I: Fault-tolerance
Storage systems: Mean Time To Data Loss (MTTDL) for 100k disk deployments
- RAID-5 is a non-starter with 100k disks: MTTDL ~ 9 days!
- RAID-D2 (8+2P stripes): MTTDL ~ 100 years
- RAID-D3 (8+3P stripes): MTTDL ~ 130 million years!
Source: IBM, Almaden Research Center Storage Systems, SC'09

4 Motivation II: Application Level Fault-Recovery
Mean Time To Interrupt (MTTI) for Petascale+ class compute configurations: O(1 day)
- application level fault-recovery, application level checkpoint-restart
- example: Charm++ provides an in-memory distributed checkpoint scheme
  - memory footprint doubled
Source: F. Cappello, A. Geist, B. Gropp, S. Kale, B. Kramer, M. Snir: Toward Exascale Resilience

5 Scope & Limitations
- objective: investigating alternative processing platforms for RS encoding (decoding)
- focus on one particular step of the overall processing pipeline
- aspects ignored include
  - application (producer): data injection bandwidth
  - disk I/O bandwidth, disk grouping
- project is not aiming to design a storage system
- no disk and data path configuration options considered here

6 The Reed-Solomon algorithm
- Non-binary, cyclic block code (I. Reed, G. Solomon, 1960)
- Applications:
  - Reliable data transmission
  - Reliable data storage: en-/decoding in the disk (RAID) controller
- [Figure: data disks, one of them crashed; (re)calculation on read / write]
- Requirement: fast encoding

7 Advantages of the Reed-Solomon Coding
- Flexibility in the coding schema
- A k + m RS code means:
  - k data blocks
  - m check blocks
  - up to m errors (lost blocks) can be tolerated
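The trade-off behind the k + m notation can be made concrete with a few lines of arithmetic. The following is a minimal sketch; the function name `rs_scheme_properties` and the returned keys are illustrative assumptions and do not come from the slides.

```python
from fractions import Fraction


def rs_scheme_properties(k: int, m: int) -> dict:
    """Basic properties of a k+m coding schema (k data blocks, m check blocks)."""
    if k <= 0 or m < 0:
        raise ValueError("k must be positive and m non-negative")
    return {
        "total_blocks": k + m,
        # Any m of the k+m blocks may be lost and the data is still recoverable.
        "tolerated_losses": m,
        # Extra storage relative to the user data.
        "storage_overhead": Fraction(m, k),
        # Fraction of the stored blocks that hold user data.
        "code_rate": Fraction(k, k + m),
    }


if __name__ == "__main__":
    # 5+3 is the schema used throughout this study; 8+2 and 8+3 appear on slide 3.
    for k, m in [(5, 3), (8, 2), (8, 3)]:
        print(f"{k}+{m}:", rs_scheme_properties(k, m))
```

Exact fractions are used instead of floats so that the ratios can be compared without rounding concerns.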

8 Encoding Principle for (k+m) RS Schema
- Encoding is a matrix-vector multiplication
- Galois field multiplication is expensive, which motivates Cauchy Reed-Solomon
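To illustrate encoding as a matrix-vector multiplication over a Galois field, here is a minimal sketch in Python. It is not the implementation benchmarked in the talk. Assumptions made for illustration: the field is GF(2^8) with reduction polynomial 0x11D, the coding matrix is a Cauchy matrix with entries 1/(x_i + y_j) where x_i = i and y_j = m + j, and each block is represented by a single byte. All function names are hypothetical.

```python
def gf_mul(a: int, b: int, poly: int = 0x11D) -> int:
    """Multiply two GF(2^8) elements: shift-and-XOR with reduction by poly."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:  # degree 8 reached: reduce modulo the field polynomial
            a ^= poly
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse via a^254 (since a^255 == 1 for a != 0)."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^8)")
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def cauchy_matrix(k: int, m: int):
    """m x k Cauchy matrix; x_i = i and y_j = m + j are assumed, distinct values."""
    if k + m > 256:
        raise ValueError("k + m must not exceed the field size")
    # Addition in GF(2^8) is XOR, so x_i + y_j is i ^ (m + j), which is never 0.
    return [[gf_inv(i ^ (m + j)) for j in range(k)] for i in range(m)]


def encode(data, matrix):
    """Check blocks c_i = sum_j matrix[i][j] * data[j], with sum = XOR."""
    checks = []
    for row in matrix:
        acc = 0
        for coeff, d in zip(row, data):
            acc ^= gf_mul(coeff, d)
        checks.append(acc)
    return checks


if __name__ == "__main__":
    coding_matrix = cauchy_matrix(k=5, m=3)   # the 5+3 schema of this study
    data_blocks = [0x12, 0x34, 0x56, 0x78, 0x9A]  # one byte per block (toy input)
    print(encode(data_blocks, coding_matrix))
```

The inner loop shows why this step is costly: every coefficient-data pair needs a full field multiplication, which is the cost the Cauchy variant replaces with XORs.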

9 The Cauchy Variant of Reed-Solomon
- Cauchy Reed-Solomon: work of J. S. Plank et al.
- GF(2): only XOR operations
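The reduction to XOR operations can be sketched as follows: multiplication by a fixed field element is linear over GF(2), so it can be written as a w x w binary matrix, and applying that matrix needs only XOR. This is a minimal illustration, not the code from the talk. Assumptions: w = 8, reduction polynomial 0x11D, and hypothetical function names. Practical implementations typically apply the XORs to whole data packets rather than to the bits of a single byte.

```python
W = 8  # word size in bits (assumed for this sketch)


def gf_mul(a: int, b: int, poly: int = 0x11D) -> int:
    """Reference GF(2^8) multiplication (shift-and-XOR with reduction)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def bit_matrix_columns(e: int):
    """Columns of the W x W binary matrix for 'multiply by e'.

    Column c is the product e * x^c, stored as a W-bit integer.
    """
    return [gf_mul(e, 1 << c) for c in range(W)]


def mul_by_xor(columns, d: int) -> int:
    """Multiply by the encoded element using XOR only.

    Every set bit c of d selects column c; the selected columns are XORed.
    """
    acc = 0
    for c in range(W):
        if (d >> c) & 1:
            acc ^= columns[c]
    return acc


if __name__ == "__main__":
    cols = bit_matrix_columns(0x53)
    # Both values agree: the bit matrix reproduces the field multiplication.
    print(mul_by_xor(cols, 0xCA), gf_mul(0x53, 0xCA))
```

The bit matrices depend only on the coding matrix, so they can be computed once up front, leaving only XORs in the encoding loop.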

10 Platforms Used in this Study
- GPGPU: NVIDIA Tesla C1070/SGI XE500, Tesla C870/Sun Ultra27; CUDA 2.3, CUDA 3.0
- FPGA: SGI RC 100/SGI Altix 450; Mitrionics SDK 2.0, RASClib 2.2, Xilinx ISE 9.2
- Cell BE: IBM PowerXCell8i/IBM QS22; IBM CBE SDK 3.1
- SIMD Processor: ClearSpeed CSX e620/Sun X4600M2; ClearSpeed's Cn Compiler, CSAPI v

11 Memory Hierarchy
[Figure: data path from data source to data processing: Host RAM, Global Device Memory, Local Memory; outputs: Data and Check blocks; platform labels: GPU, ClearSpeed, CBE]
Link bandwidths:
- QPI: 32 GB/s
- XDR: 25 GB/s
- PCIe x16: 8 GB/s
- NUMAlink4: 6 GB/s

12 General Implementation Strategy
- 5+3 Reed-Solomon schema, Cauchy RS
- input data volumes: Mbytes to saturate the complete data path
- co-processor model (except CBE and x86) requires overlapping of data processing & communication

13 Platform Specific Optimizations
- x86
  - SSE
  - parallelization: OpenMP
  - NUMA: mem affinity (numactl)
- GPGPU
  - transfer models: synchronous xfer (block), asynchronous stream
  - kernel is called as a 2D grid with 1D thread pool
- FPGA
  - XOR tree w/ constants
  - 128 bit wide I/O
  - double buffering via RASClib
  - 1/5 resource utilization (Flip Flops, Slices, BRAM)
- CellBE
  - 8 SPUs, 512 byte blocks
  - double buffering
  - SPU NUMA: 8-16 SPUs

14 Metrics Used for Performance Evaluation
- raw throughput performance
  - Reed-Solomon rate: RS rate := size of input data set / total time
  - host memory-to-host memory performance (includes data transfers)
- normalization
  - relative RS rate := RS rate / link bandwidth
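The two metrics translate directly into code. The sketch below uses hypothetical function names, and the numbers in the usage part are made-up illustration values, not measurements from the study. Only the 8 GB/s PCIe x16 link bandwidth is taken from slide 11.

```python
def rs_rate(input_bytes: float, total_seconds: float) -> float:
    """RS rate := size of input data set / total time (host memory to host memory)."""
    if total_seconds <= 0:
        raise ValueError("total time must be positive")
    return input_bytes / total_seconds


def relative_rs_rate(rate: float, link_bandwidth: float) -> float:
    """Relative RS rate := RS rate / link bandwidth (both in the same unit)."""
    if link_bandwidth <= 0:
        raise ValueError("link bandwidth must be positive")
    return rate / link_bandwidth


if __name__ == "__main__":
    # Hypothetical run: 1 GB of input encoded in 0.5 s, including data transfers.
    rate = rs_rate(input_bytes=1e9, total_seconds=0.5)
    # PCIe x16 link bandwidth of 8 GB/s, as listed on slide 11.
    efficiency = relative_rs_rate(rate, link_bandwidth=8e9)
    print(f"RS rate: {rate / 1e6:.0f} MByte/s, relative RS rate: {efficiency:.0%}")
```

Normalizing by the link bandwidth is what allows platforms with very different host connections to be compared on one efficiency scale.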

15 RS Rates (Comparing Apples with Oranges)
[Figure: bar chart "Best Reed-Solomon Rates and Kernel Rates", 5+3 RS schema; series: overall RS Rate, Kernel Rate; axis: RS Rate [MByte/s]; platforms: ClearSpeed (2007) CSX600 (96x), FPGA (2006) XCV4LX, GPU (2009) T10 (32x), GPU (2007) G80 (64x), CBE (2008) PowerXCell8i (8x), x86 (2009) X5570 (8x)]

16 RS Rates (Comparing Apples with Oranges)
[Figure: bar chart "Best Reed-Solomon Rates and Kernel Rates", 5+3 RS schema; series: overall RS Rate, Kernel Rate; axis: RS Rate [MByte/s]; platforms: ClearSpeed (2007) CSX600 (96x), FPGA (2006) XCV4LX, GPU (2009) T10 (32x), GPU (2007) G80 (64x), CBE (2008) PowerXCell8i (8x), x86 (2009) X5570 (8x)]
Reference data:
- Curry et al. (2008): 13+3 RS schema on GTX 260, RS rate: 1.4 GB/s
- Brinkmann et al. (2009): X-8 RS schema on 8800 GTS, RS rate: 1.0 GB/s

17 Reed-Solomon Efficiencies
[Figure: bar chart "Overall RS & Kernel Efficiencies"; series: overall RS Efficiency, Kernel Efficiency; axis: Efficiency [%]; platforms: ClearSpeed (2007) CSX600 (96x), FPGA (2006) XCV4LX200, GPU (2009) T10 (32x), GPU (2007) G80 (64x), CBE (2008) PowerXCell8i (8x), x86 (2009) X5570 (8x)]

18 Reed-Solomon on x86: Performance & Scaling
[Figure: Reed-Solomon rate on Intel Nehalem; axes: RS Rate [MB/s] vs. # Threads; reference line: QPI bandwidth limit; optimization levels: SSE, OpenMP, NUMA]

19 RS on CBE PowerXCell8i: Performance & Scaling
[Figure: Reed-Solomon rate on PowerXCell8i; axes: RS Rate [MB/s] vs. # Threads; reference line: XDR bandwidth limit]

20 Overall Results

21 Performance & Efficiency Summary
Ranking according to sustained Reed-Solomon rate:
1. Cell BE, x86 Nehalem
2. GPGPU, FPGA
3. ClearSpeed
Categories according to Reed-Solomon efficiency:
- Category 50+: Cell BE
- Category 40: x86 Nehalem, FPGA, GPGPU-C1060
- Category 20: GPGPU-C870, ClearSpeed

22 Limitations of the Study
- we measured the performance of the encoding step only, for a fixed 5+3 RS schema
- performance of the decoding step can be considered similar
- the total data processing workflow includes additional steps
  - application (producer): data injection bandwidth
  - permanent storage: disk I/O bandwidth

23 Conclusion
- Reed-Solomon encoding is feasible using non-ASIC technology
  - algorithmic improvements: Cauchy Reed-Solomon
  - technology improvements: energy efficient accelerators
- Reed-Solomon application scenarios:
  1. non-critical requirements (power, cooling): x86 platform is a convenient solution
  2. data intensive processing environments: FPGA integrated into data path

24 Acknowledgement
Thanks to
- Michael Peick, Johannes Bock (initial GPU & FPGA version)
- Mathias Foquet-Lapar, SGI (Tesla C1070 on SGI's AEP system)
- Willi Homberg, FZ Jülich (QS22 system JUICEnext)
