Down selecting suitable manycore technologies for the ELT AO RTC. David Barr, Alastair Basden, Nigel Dipper and Noah Schwartz

1 Down selecting suitable manycore technologies for the ELT AO RTC David Barr, Alastair Basden, Nigel Dipper and Noah Schwartz

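2 AO RTC Complexity. RTC for AO workshop 27/01/2016
[Figure: RTC complexity in GFLOPS (log axis, up to 1.E+05) against year, with points labelled SPHERE, VLT AOF, IFS SCAO, IFS LTAO, MOS and E-ELT EPICS.]
[Table: columns System, Telescope, Type, Channels, WFS sub-aps, Frequency (Hz). Recoverable entries: SCAO, 1 channel, 74x...; IFS, E-ELT, LTAO, 6 channels, 74x...; MOS, E-ELT, MOAO, 10 channels, 74x74, 250 Hz.]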

3 Typical RTC and hardware. [Diagram labels: wavefront sensor camera; real-time control computer.]

4 Typical RTC and hardware: Tilera?

5 Typical RTC and hardware: Xeon Phi?

6 Tilera: WF pixel processing. Pixel calibration. Sub-aperture processing: simple centre of gravity for a Shack-Hartmann WFS. Tested for two scenarios: full frame and pipelining.
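
The two pixel-processing steps named on this slide can be sketched as follows. This is a minimal illustration in plain Python, not the code that ran on the Tilera; the function names, the dark-frame and flat-field form of the calibration, and the choice to measure the centroid from the geometric centre of the sub-aperture are all assumptions.

```python
def calibrate(pixels, dark, flat):
    """Pixel calibration (assumed form): subtract a dark frame, apply a flat-field gain."""
    return [[(p - d) * f for p, d, f in zip(prow, drow, frow)]
            for prow, drow, frow in zip(pixels, dark, flat)]


def centre_of_gravity(subap):
    """Return the (x, y) centroid of one sub-aperture, in pixels,
    measured from the geometric centre of the sub-aperture."""
    total = sum(sum(row) for row in subap)
    if total <= 0:
        return 0.0, 0.0  # no light: report zero offset instead of dividing by zero
    ny, nx = len(subap), len(subap[0])
    # Intensity-weighted sums of the pixel coordinates.
    sx = sum(v * x for row in subap for x, v in enumerate(row))
    sy = sum(v * y for y, row in enumerate(subap) for v in row)
    return sx / total - (nx - 1) / 2, sy / total - (ny - 1) / 2


# Usage: a 6x6 sub-aperture with a single bright pixel at column 4, row 1.
spot = [[0.0] * 6 for _ in range(6)]
spot[1][4] = 10.0
offset = centre_of_gravity(spot)
```

In a full-frame run this function would be applied to every valid sub-aperture once the whole image has arrived.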

7 Tilera: Tile Gx-36. Multiple 10 Gbps Ethernet ports. 9, 16, 36 or 72 cores (tiles) at 1.2 GHz. Uses a C/C++ compiler: abstraction, portability. Zero Overhead Linux (ZOL) mode: ZOL mode prevents Linux system-level calls on specific cores.

8 Full frame: mean execution time
74x74 sub-apertures (16x16 pixels each), detector 1200 x 1200: 1764 µs
74x74 sub-apertures (10x10 pixels each), detector 800 x 800 (e.g. E-ELT MOS single channel): 734 µs
74x74 sub-apertures (6x6 pixels each), detector 500 x 500 (e.g. E-ELT IFS SCAO): 265 µs

9 Full frame: stability. Detector approx. 500 x 500. Execution time 265±6 µs, σ = 1.28 µs.
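
Stability figures like these can be derived from a series of per-frame timings. The sketch below shows one way to do it in Python. It assumes that σ is the population standard deviation and that the ± value is half the peak-to-peak range; the slide does not say how either was defined, and the helper names are hypothetical.

```python
import statistics
import time


def jitter_stats(samples_us):
    """Return (mean, sigma, half peak-to-peak range) of timings in microseconds."""
    mean = statistics.fmean(samples_us)
    sigma = statistics.pstdev(samples_us)
    half_range = (max(samples_us) - min(samples_us)) / 2
    return mean, sigma, half_range


def time_calls(func, repeats=1000):
    """Time repeated calls of func and return the durations in microseconds."""
    durations = []
    for _ in range(repeats):
        t0 = time.perf_counter_ns()
        func()
        durations.append((time.perf_counter_ns() - t0) / 1000.0)
    return durations


# Usage with three made-up timings (not measurements from the slides).
mean, sigma, half_range = jitter_stats([264.0, 265.0, 266.0])
```

Timings taken from Python on a general-purpose OS will be far noisier than those reported for the Tilera; the sketch only illustrates the statistics.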

10 Pipelining. To achieve the best performance, pixel processing is started as soon as a row of sub-apertures has arrived. For a detector of approx. 500 x 500, the WF processing delay is <50 µs.
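
The pipelining idea can be sketched with a Python generator that consumes detector rows as they arrive and emits centroids once each row of sub-apertures is complete. This is a sketch under stated assumptions, not the Tilera implementation: the names `pipelined_centroids` and `cog` are hypothetical, sub-apertures are assumed square and contiguous, and pixel calibration is omitted.

```python
def cog(subap):
    """Centre of gravity of one sub-aperture, relative to its geometric centre."""
    total = sum(sum(row) for row in subap)
    if total <= 0:
        return 0.0, 0.0
    ny, nx = len(subap), len(subap[0])
    sx = sum(v * x for row in subap for x, v in enumerate(row))
    sy = sum(v * y for y, row in enumerate(subap) for v in row)
    return sx / total - (nx - 1) / 2, sy / total - (ny - 1) / 2


def pipelined_centroids(detector_rows, subap_px, n_subaps_x):
    """Yield a list of centroids for each completed row of sub-apertures."""
    band = []
    for row in detector_rows:        # rows arrive one at a time from the camera
        band.append(row)
        if len(band) == subap_px:    # a full row of sub-apertures has arrived
            yield [cog([r[i * subap_px:(i + 1) * subap_px] for r in band])
                   for i in range(n_subaps_x)]
            band = []                # start collecting the next band


# Usage: a 4x4 detector split into 2x2 sub-apertures of 2x2 pixels each.
rows = [[1, 0, 0, 1],
        [0, 0, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1]]
results = list(pipelined_centroids(iter(rows), subap_px=2, n_subaps_x=2))
```

Because the generator yields after every band, only the processing of the last row of sub-apertures remains once the final pixels arrive, which is why the delay after readout can be much shorter than the full-frame execution time.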

11 Company stability and direction. EZchip has been bought by Mellanox. Facebook has bought some cards for testing and evaluation. The future of the Tilera cards seems stable.

12 Matrix Vector Multiplication. [Diagram labels: wavefront sensor camera; real-time control computer.] We are only looking at the matrix-vector multiplication (MVM) for the control calculation. For E-ELT first-light instruments, the MVM has the steepest increase in computational complexity, scaling as O(D^4), where D is the telescope diameter. MVM is a memory-bandwidth-limited routine.
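
A small Python sketch helps show why the MVM is memory-bandwidth limited and where the O(D^4) scaling comes from. The assumptions are labelled in the comments: single-precision matrix elements, two slopes per sub-aperture, and one actuator per sub-aperture on a full square grid. None of these figures are taken from the slides.

```python
def mvm(control_matrix, slopes):
    """Plain matrix-vector multiply: one actuator command per matrix row."""
    return [sum(m * s for m, s in zip(row, slopes)) for row in control_matrix]


def arithmetic_intensity(n_rows, n_cols, bytes_per_element=4):
    """Floating-point operations per byte of matrix data read.

    Assumes 4-byte (single-precision) elements; each element is read once
    and used for one multiply and one add, so there is no reuse to exploit.
    """
    flops = 2 * n_rows * n_cols
    bytes_read = n_rows * n_cols * bytes_per_element
    return flops / bytes_read


def matrix_elements(n_subaps_across):
    """Control-matrix size for an n x n sub-aperture grid.

    Assumes two slopes per sub-aperture and one actuator per sub-aperture
    (hypothetical round numbers). Both counts grow as n^2, so the matrix
    grows as n^4, i.e. as D^4 at fixed sub-aperture size.
    """
    n_slopes = 2 * n_subaps_across ** 2
    n_actuators = n_subaps_across ** 2
    return n_slopes * n_actuators


commands = mvm([[1, 2], [3, 4]], [1, 1])
growth = matrix_elements(80) / matrix_elements(40)  # doubling the grid size
```

With only half a floating-point operation per byte read, the time per MVM is set almost entirely by how fast the matrix can be streamed from memory.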

13 Xeon Phi: mean performance (MVM). Connects via PCIe (accelerator card, similar to GPUs). Easy to program (similar to CPUs). Good performance: large number of cores (60) and high memory bandwidth (320 GB/s).

14 Xeon Phi: stability (MVM). Good scalability: multiple Xeon Phis allow a speed-up by approx. [value not recoverable]. Poor stability, due to how the data transfer over PCIe is handled. More details in Barr et al., MNRAS 2015.

15 Memory bandwidth
[Table: advertised bandwidth (GB/s), achieved bandwidth (GB/s) and percentage achieved for Dual Xeon E (CPU), NVIDIA K40 (1), Xeon Phi 5110p, K80 (2) and Next Gen. Xeon Phi (3). Most values are not recoverable; surviving entries are "2x", "~500", "79.5 %" and "52.0%", plus Low 250 and High >400 for the next-generation Xeon Phi.]
(1) Reguly I. Z. et al., PMAM 2014. (2) Deakin T. et al. (2015). (3) Intel datasheet.
The next-generation Xeon Phi is moving to an integrated CPU, removing the need to transfer data over PCIe. Assumption for next gen.: same achievable memory bandwidth. Low: 50% of memory bandwidth achievable. High: Intel's benchmark of >400 GB/s.
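
Since the MVM is bandwidth limited, a lower bound on its execution time follows directly from the matrix size and the achievable bandwidth. The sketch below applies this to the Low and High cases. The matrix dimensions (8000 slopes by 4000 actuators) and the 4-byte element size are assumptions chosen for illustration, not values from the slides; the 500 GB/s advertised figure is the "~500" entry that survives in the table.

```python
def achievable_bandwidth(advertised_gb_s, fraction):
    """Bandwidth expected in practice, as a fraction of the advertised figure."""
    return advertised_gb_s * fraction


def mvm_time_us(n_slopes, n_actuators, bandwidth_gb_s, bytes_per_element=4):
    """Time to stream the control matrix from memory once, in microseconds."""
    matrix_bytes = n_slopes * n_actuators * bytes_per_element
    return matrix_bytes / (bandwidth_gb_s * 1e9) * 1e6


# Assumed matrix size: 8000 x 4000 single-precision elements (128 MB).
N_SLOPES, N_ACTUATORS = 8000, 4000

low_bw = achievable_bandwidth(500, 0.5)   # Low case: 50% of ~500 GB/s
high_bw = 400                             # High case: Intel's >400 GB/s benchmark

low_us = mvm_time_us(N_SLOPES, N_ACTUATORS, low_bw)
high_us = mvm_time_us(N_SLOPES, N_ACTUATORS, high_bw)
```

Under these assumed dimensions the estimate gives about 512 µs for the Low case and 320 µs for the High case, close to the next-generation MVM times quoted in the overall performance estimate.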

16 Xeon Phi: next gen. performance. [Figure not recoverable.]

17 Xeon Phi: next gen. performance. [Table: columns 40 x 40 (µs) and 74 x 74 (µs); rows Xeon Phi 5110p, Next Gen. Xeon Phi (Low, High) and two truncated "K" entries. Values not recoverable.]

18 Xeon Phi: power consumption
[Table: columns Processor, Release, Power Max (Watts); rows Intel Xeon Phi 5110p, Intel Xeon Phi (Next Gen.), Intel Xeon E5-2650V3, NVIDIA K40, NVIDIA K80, Tile-Gx36, Tile-Mx. Values not recoverable except Tile-Mx: >30 (?).]
Next-generation Xeon Phi: moving to Intel Atom cores, reducing power (figure in W not recoverable) while increasing performance.

19 Tilera: power consumption. (Same processor table as on the previous slide.) Next-generation Tilera: moving to ARM processors, known for low power (~300 mW per core).

20 Overall performance estimate
Example: SCAO, E-ELT first-light instrument. Valid sub-apertures: ~4K (74x74). Detector approx. 500x500 (6x6 per sub-aperture). Latency requirement: 1500 µs.
Image processing and centre of gravity (Tilera): current gen. 50 µs; next gen. <50 µs.
MVM: current gen. single Xeon Phi 1140 µs (dual Xeon Phi 850 µs); next gen. Low 500 µs, High 320 µs.
Time available for rest of loop: ~950 µs out of the 1500 µs budget.
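
The budget arithmetic behind this slide can be written out explicitly. The sketch below subtracts the pixel-processing and MVM times from the 1500 µs latency requirement for each case; all input figures come from the slide, while the function and dictionary names are hypothetical. For the next-generation Tilera the "<50 µs" figure is taken as 50 µs, so those results are lower bounds on the remaining time.

```python
LATENCY_REQUIREMENT_US = 1500


def remaining_us(pixel_us, mvm_us, requirement_us=LATENCY_REQUIREMENT_US):
    """Time left for the rest of the control loop, in microseconds."""
    return requirement_us - pixel_us - mvm_us


# (pixel processing on Tilera, MVM on Xeon Phi), both in microseconds.
cases = {
    "current gen., single Xeon Phi": (50, 1140),
    "current gen., dual Xeon Phi": (50, 850),
    "next gen., Low": (50, 500),
    "next gen., High": (50, 320),
}

budget = {name: remaining_us(pixel, mvm) for name, (pixel, mvm) in cases.items()}
```

The next-generation Low case leaves 950 µs, which matches the "~950 µs" figure on the slide.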

21 Conclusions
Tilera: programmability/portability similar to CPU; very good stability; next gen. will have more memory BW and cores. ELT ready!
Xeon Phi: programmability/portability similar to CPU; poor stability, mainly due to data transfer; next gen.: more memory bandwidth and no transfer (i.e. better stability). Should be ELT ready (needs testing).

22 Thanks for listening. Any questions?

Reducing adaptive optics latency using Xeon Phi many-core processors

doi:10.1093/mnras/stv1813. David Barr,1,2 Alastair Basden,3 Nigel Dipper3 and Noah Schwartz1. 1 UK Astronomy Technology Centre, Royal