A Real Time Controller for E-ELT

Size: px

Start display at page:

Download "A Real Time Controller for E-ELT"

Georgina McLaughlin
5 years ago
Views:

/ Observatoire de Paris Project #671662 funded by European

1 A Real Time Controller for E-ELT Addressing the jitter/latency constraints Maxime Lainé, Denis Perret LESIA / Observatoire de Paris Project # funded by European Commission under program H2020-EU coordinated in H2020-FETHPC-2014

2 Green Flash RTC prototypes or E-ELT AO system European project: Horizon 2020 Research and innovation program 3 years project (a year has passed already) Low cost and low energy consumption RTC Different kind of accelerators characterization GPU FPGA XeonPhi Public/private sector partnership

6k 16 bits 500Hz 800 x 800 16 bits 1000Hz SCAO Pyramid 240 x 240 16 bits 1000Hz 40.

3 E-ELT Numbers & Methods Case Method Dimension Encoding Frequency Size Throughput MCAO SCAO Shack- Hartmann Shack- Hartmann 1.6k x 1.6k 16 bits 500Hz 800 x bits 1000Hz SCAO Pyramid 240 x bits 1000Hz 40.96Mb (5,12Mo) 10.24Mb (1,28Mo) ~1Mb (~125Ko) Gb/s Gb/s 1 Gb/s Sequential Slopes/MVM processing Segmented Slopes/MVM processing

4 E-ELT Numbers & Methods Case Method Dimension Encoding Frequency Size Throughput MCAO SCAO Shack- Hartmann Shack- Hartmann 1.6k x 1.6k 16 bits 500Hz 800 x bits 1000Hz SCAO Pyramid 240 x bits 1000Hz 40.96Mb (5,12Mo) 10.24Mb (1,28Mo) ~1Mb (~125Ko) Gb/s Gb/s 1 Gb/s MCAO Case: 40,96 Mb with 40Gb/s network => 1ms latency

5 Latency and jitter constraints Latency typical optic fiber: 4,9µs/Km Closest city 130km (Antofagasta) implies 0.630ms latency (+transfer time) Jitter Under 10% of overall latency in order to be manageable Too much jitter Too much frame skips Correction stability hard to reach MCAO 2ms/iter, SCAO 1ms/iter The nearer the better The lesser the better

6 Usual Telescope RTC SPARTA Design for VLT scales RTC BOX: x86 CPU FPGA DSP Co-processing Cluster: X86 CPU At ELT scales: needs for a super-calculator in the observatory

7 E-ELT GreenFlash RTC Prototype Sensor(s) Low latency Low jitter DM 40 GbE network High bandwidth Low latency Real Time Controller 40 GbE network High bandwidth Supervisor Telemetry High framerate High bandwidth Low latency Fast storage high throughput High bandwidth High throughput

8 Legacy GPU programming main { setup(); while(run){ recv( ); cudamemcpy(, HostToDevice); computing_kernel<<<>>>( ); cudamemcpy(, DeviceToHost); send( ); } } PCIe GPU GPU RAM 10GbE NIC CPU CPU RAM

9 Legacy GPU Programming cudamemcopy() overhead times (5.12Mo in, 64Ko out) Kernel launches overhead times Both cases : jitter of 20 to 30 µsec

10 Legacy GPU Programming cudamemcopy() overhead times (5.12Mo in, 64Ko out) Kernel launches overhead times Both cases : jitter of 20 to 30 µsec (40 µsec sometimes)

11 Legacy GPU programming MCAO : image 40.96Mb, commands 512Kb Over 40GbE network transfer takes almost 1ms Same for cudamemcpy() operations camera integration transfer i i+1 i+2 i-1 i i+1 transfer handling data / exec handling sequential processing segmented processing Leaves not enough to no time for computations Dec 21st 2016

12 GPUDirect + Custom FPGA NIC Allows third party PCI-e device p2p access Goals : Negates latency overhead and reduce jitter induced by cudamemcpy cudamemcopy() overhead times (5,12Mo in, 64Ko out) Linux Kernel module: expose CUDA buffers phy@

13 Persistent CUDA kernel Exports computation loop on GPU Goals : Reduce kernel launch jitter & start computations as soon as data arrive Kernel launches overhead times Uses memory polling for data arrival detection

14 Persistent kernel + GPUDirect main { setup(); persistent_kernel<<<>>>( ); while(run){ waitgpu( ); startdmatransfer( ); } } persistent_kernel( ){ while(run){ pollmemory( ); notifycpu( ); } } PCIe GPU notify GPU RAM start CPU Problem: GPU can t command FPGA Mapped CPU memory in CUDA for GPU/CPU notification 10GbE FPGA NIC CPU RAM

15 Persistent kernel + GPUDirect FPGA writes/reads directly to/from GPU memory Using only writes would be better though DMA Host ram CPU app Camera control UDP Offload Engine Latency measurement Camera protocol handler DMC protocol handler FPGA NIC measures answers DMA DMA DMA DMA PCI-e start GPU ram Pixels ring buffer DM com buffer FPGA control Sync thread GPU polling kernel compute kernels notify

16 GPUDirect & FPGA NIC - RTT FPGA PLDA XPressG5 GPU Tesla C2070 OS Debian wheezy Camera EVT HS-2000M 10GbE network µsec SCAO Pyramid case: 240 x 240 pixels, encoded on 16b No GPUDirect GPUDirect + persistent kernel Results are coherent with expectations iterations

17 Persistent kernel + GPUDirect If it scales to MCAO case camera integration i i+1 i+2 transfer transfer handling data / exec handling sequential processing segmented processing i-1 i i+1

$5) cumemhostregister( ) using CU_MEMHOSTREGISTER_IOMEMORY flag main { setup(); persistent_kernel <<<>>>( ); } persistent_kernel( ){ while(run){ pollmemory( );$

18 IOMemory mapping (CUDA 7.5) cumemhostregister( ) using CU_MEMHOSTREGISTER_IOMEMORY flag main { setup(); persistent_kernel <<<>>>( ); } persistent_kernel( ){ while(run){ pollmemory( ); actual_computation; startdmatransfer( ); } } PCIe GPU GPU RAM Mapping FPGA addresses into CUDA memory space allows DMA control fromgpu start CPU 10GbE FPGA NIC CPU RAM

19 IOMemory mapping UDP Offload Engine Latency measurement Camera protocol handler DMC protocol handler FPGA NIC measures answers DMA DMA DMA DMA DMA PCI-e start Host ram GPU ram Pixels ring buffer DM com buffer CPU app Camera control FPGA control Meas. Comp. GPU polling kernel compute kernels Little to no improvements, but CPU free for other kind of computations

20 Conclusion / Perspectives Using GPUDirect and a persistent kernel allow efficient data delivery to the Real Time Controller GPUDirect approach can be applied to Supervisor using interruptions from FPGA for CPU execution control (kernel launch) Simulation setup to benchmark ELT SCAO/MCAO scales thoroughly Test with new hardware: PLDA ExpressKUS FPGA and Nvidia K40/80 & P100 GPUs. (and Arria10 FPGAs in a near future) Develop computation modules on FPGA to further reduce GPU load e.g.: Slopes computation for segmented processing

21 Thank you for your attention (Questions?) Project # funded by European Commission under program H2020-EU coordinated in H2020-FETHPC-2014

A Real Time Controller for E-ELT

A Real Time Controller for E-ELT Addressing the jitter/latency constraints Maxime Lainé, Denis Perret LESIA / Observatoire de Paris Project #671662 funded by European Commission under program H2020-EU.1.2.2