CASPER AND GPUS
Moderator: Danny Price. Scribe: Richard Prestage.


1 CASPER AND GPUS
Frameworks: MPI, heterogeneous large systems
Pipelines: hashpipe, psrdada, bifrost, htgs
Data transport: DPDK, libvma, NTOP
Applications: correlators, beamformers, spectrometers, FRB
Hardware configurations and future hardware roadmaps

2 USEFUL LINKS
hashpipe - psrdada - bifrost - htgs - DPDK - libvma - NTOP -

3 APPLICATIONS
FRB searching (Dan): building systems for GBT, Arecibo and FAST, using Heimdall. Building the whole FPGA/switch/GPU processing engine.
Have they built the whole ultimate CASPER backend? Not yet. There is a SETI GPU, an FRB GPU, etc.
In Heimdall, dedispersion is the hardest computational task, but overall the system is still swamped by the number of candidates.
Beamformers: Max Planck beamformer on MeerKAT (commensal backend). Packet capture and beamforming in bifrost.
DiFX (reported by Jonathan Weintroub): used some aspect of MPI to move the existing DiFX X-engine onto GPU? [From discussions with Arash after the meeting: he did need to hand-port FFTW to cuFFT, and some aspects of the X-engine to CUDA kernels.]
Dan: use GPU correlators for ~2**8 antennas. Not needed for a small number of antennas (e.g. VLBI).
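Dan's point about antenna count follows from how the X-engine scales: the number of antenna pairs grows roughly as the square of the number of antennas. The sketch below is only an illustration in plain Python, not DiFX or any CASPER correlator code; the function names `n_baselines` and `x_engine` are hypothetical, and the input is assumed to be complex voltages for a single frequency channel.

```python
from itertools import combinations_with_replacement


def n_baselines(n_ant, include_autos=True):
    """Number of antenna pairs an X-engine must compute."""
    if include_autos:
        return n_ant * (n_ant + 1) // 2
    return n_ant * (n_ant - 1) // 2


def x_engine(samples):
    """Cross-multiply and accumulate one frequency channel.

    samples[t][i] is the complex voltage of antenna i at time step t
    (an assumed layout, chosen for readability rather than speed).
    Returns a dict mapping (i, j) with i <= j to the accumulated visibility.
    """
    n_ant = len(samples[0])
    vis = {}
    for i, j in combinations_with_replacement(range(n_ant), 2):
        vis[(i, j)] = sum(s[i] * s[j].conjugate() for s in samples)
    return vis


# 2**8 antennas give 32896 pairs (autos included); 2 antennas give only 3.
print(n_baselines(2 ** 8), n_baselines(2))
```

With a handful of antennas the pair count is small enough that a CPU handles it; at ~2**8 antennas the tens of thousands of pairs per channel are what make a GPU worthwhile.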

4 DATA TRANSPORT
DPDK, etc.: zero-copy operations that bypass kernel space. Goes from NIC to GPU memory, saving one hop.
RDMA direct to GPU with InfiniBand; RoCE is similar over Ethernet, a layer above RDMA. All still have to go through system memory.
DPDK has to have an Intel NIC (or clone); can get 80 Gb/sec into GPU (2x 40 Gb NICs). [Edit: DPDK does support some Mellanox / Broadcom / Cisco / Chelsio chipsets.]
libvma: the equivalent for Mellanox NICs; 40 Gb/sec per NIC.
Using SPEAD packets. Would like a SPEAD reader using DPDK for psrdada, bifrost, etc.
Dan: the bottleneck into PCs is packets/sec, not bits/sec, so want giant packets (jumbo frames = 9 kB packets).
NICs now support interrupt coalescing: the NIC will wait for e.g. 10 packets before it interrupts the CPU. Dave's hashpipe uses this.
Kernel tuning parameters are critical; need a CASPER memo for this. Danny: maybe one exists.
Application code also needs to be bound to the correct processor. Threads need to be locked to the correct core.
Dan: action item for a group to get together to identify memo(s) of required reading before attempting to develop HPC code. Group to consist of: John Ford, Dave MacMahon, Danny Price. Topic: how to do high-speed data transport.
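The packets/sec argument can be made concrete with some rough arithmetic. This is a back-of-envelope sketch only: it ignores Ethernet/IP/UDP header overhead, the 1500-byte comparison size is an assumption (the standard Ethernet MTU, not a number from the discussion), and the function names are hypothetical.

```python
def packets_per_second(link_gbps, packet_bytes):
    """Packets per second needed to fill a link with packets of one size."""
    return link_gbps * 1e9 / (packet_bytes * 8)


def interrupts_per_second(link_gbps, packet_bytes, packets_per_interrupt=1):
    """CPU interrupts per second when the NIC coalesces several packets."""
    return packets_per_second(link_gbps, packet_bytes) / packets_per_interrupt


if __name__ == "__main__":
    for size in (1500, 9000):
        pps = packets_per_second(40, size)
        irq = interrupts_per_second(40, size, packets_per_interrupt=10)
        print(f"{size} B packets at 40 Gb/s: {pps:,.0f} packets/s, "
              f"{irq:,.0f} interrupts/s with 10-packet coalescing")
```

At the same bit rate, 9 kB packets cut the packet rate by a factor of six relative to 1500-byte packets, and coalescing 10 packets per interrupt cuts the interrupt rate by a further factor of ten.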

5 HOW TO DO HIGH-SPEED DATA TRANSPORT: A READING LIST FOR THE CURIOUS CASPERITE
Digital signal processing using stream high performance computing: A 512-input broadband correlator for radio astronomy. J Kocz, LJ Greenhill, BR Barsdell. arxiv:
A Scalable Hybrid FPGA/GPU FX Correlator. J Kocz, LJ Greenhill, BR Barsdell. Journal of Astronomical Instrumentation, 2014
The Breakthrough Listen Search for Intelligent Life: A Wideband Data Recorder System for the Robert C. Byrd Green Bank Telescope. D MacMahon, DC Price, M Lebofsky. arxiv:
An Efficient Real-time Data Pipeline for the CHIME Pathfinder Radio Telescope X-Engine. A Recnik, K Bandura, N Denman. arxiv:

6 HARDWARE CONFIGURATIONS
Danny: Breakthrough uses 4U servers from SuperMicro with dual Xeons, capturing raw voltages to disk. After observations, the data are played back through NVIDIA 1080 gaming cards, one per node.
Typically BTL/GBT use one GPU per box. Others are using 2/4 GPUs per box.
CHIME correlator uses AMD. Code written in OpenCL.
Dan: NVIDIA is into supercomputing; AMD is selling chips to gamers.
Can run OpenCL on NVIDIA. CUDA gives you cuFFT, cuBLAS and the Thrust library. Does AMD have equivalents?
The number of PCI Express lanes the CPU can support is important. AMD CPU + NVIDIA GPU may be beneficial.
POWER8/9 have BlueLink connections. May develop NICs which use BlueLink. IBM has shown a lot of dedication to making the interconnect to the GPU as fast as possible.
Vendors: very cheap 10/40 Gb transceivers from FiberStore (fs.com). They also sell 100 Gb switches.

7 PIPELINES
HTGS does the inverse of bifrost. Bifrost binds a thread to an operation. HTGS defines nodes in a graph, and the nodes are bound to CPU threads. The aim is to overlap data transport and computation and get a hybrid, multicore pipeline. Uses an explicit graph representation throughout.
Hashpipe: originally developed for GUPPI (Paul D.), generalized by Dave MacMahon. Not as sophisticated as bifrost/htgs. Provides support for metadata. Does not support forking ring buffers. Simple and straightforward, well documented, CASPER tutorials available.
PSRDADA: similar to hashpipe. Low level.
Simple and conservative: use hashpipe or PSRDADA.
Bifrost: in use in a single instrument.
HTGS: just starting prototyping use in GB. Unique in using a graph representation maintained through analysis and execution. Can also use multiple GPUs: formulate a subgraph and encapsulate it into an execution pipeline graph, bound to a GPU.
Should put a link to Tim's thesis from the CASPER website. Link to paper is link.springer.com/article/ /s
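The common idea behind these pipelines, threads connected by bounded buffers so that capture and processing overlap, can be sketched in a few lines. This is not the hashpipe, PSRDADA, bifrost or HTGS API: a bounded `queue.Queue` stands in for a shared-memory ring buffer, the names `capture`, `process` and `run_pipeline` are hypothetical, and Python's global interpreter lock means this shows the structure only, not the performance.

```python
import queue
import threading

SENTINEL = None  # marks end of stream


def capture(out_q, n_blocks):
    """Stand-in for a packet-capture thread: produce fixed-size data blocks."""
    for i in range(n_blocks):
        block = list(range(i * 4, i * 4 + 4))
        out_q.put(block)  # blocks when the buffer is full (back-pressure)
    out_q.put(SENTINEL)


def process(in_q, results):
    """Stand-in for a compute thread: reduce each block to its total power."""
    while True:
        block = in_q.get()
        if block is SENTINEL:
            break
        results.append(sum(x * x for x in block))


def run_pipeline(n_blocks=8, depth=4):
    """Run capture and processing concurrently through a bounded buffer."""
    ring = queue.Queue(maxsize=depth)
    results = []
    threads = [
        threading.Thread(target=capture, args=(ring, n_blocks)),
        threading.Thread(target=process, args=(ring, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline())
```

The bounded depth is the important design choice: if processing falls behind, the capture side stalls (or, in a real-time system, drops data) rather than consuming unbounded memory.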

8 GPU ROADMAP
Vega for AMD is coming out next week. Volta for NVIDIA.
Volta has tensor cores: 4x4 matrix multiplications with 16-bit inputs and 32-bit outputs (designed for AI training / inferencing).
CUDA 9 formalized some of the threading models: can write CUDA kernels that work on a thread block.
No announcement on the GTX line, but NVIDIA will probably announce a Volta GTX soon. Consumer cards will stick with DDR RAM.
An SLI bridge can communicate between cards.
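To give a feel for the tensor-core operation described above (4x4 matrix multiply, 16-bit inputs, 32-bit outputs), here is a CPU emulation in plain Python. It is a sketch under stated assumptions: inputs are rounded to IEEE half precision, and the running sum is rounded to single precision after every addition, which is an assumption of this sketch and may not match the hardware's internal rounding. The helper names are hypothetical, and nothing here is CUDA API.

```python
import struct


def to_fp16(x):
    """Round a value to the nearest IEEE half-precision number."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a value to the nearest IEEE single-precision number."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mma_4x4(a, b, c):
    """Emulate D = A*B + C with 16-bit inputs A, B and 32-bit accumulation."""
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                # The product of two half-precision values is exact here;
                # only the running sum is rounded (to single precision).
                acc = to_fp32(acc + to_fp16(a[i][k]) * to_fp16(b[k][j]))
            d[i][j] = acc
    return d


if __name__ == "__main__":
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    b = [[0.1 * (4 * i + j) for j in range(4)] for i in range(4)]
    zeros = [[0.0] * 4 for _ in range(4)]
    # Multiplying by the identity exposes the 16-bit input quantization.
    print(mma_4x4(identity, b, zeros)[0])
```

The point for signal-processing use is the input precision: values such as 0.1 are quantized to 16 bits on the way in, while the accumulated output keeps 32-bit precision.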


10 FPGA ROADMAP
Latest generation is UltraScale+. Some chips are in production.
Lots more memory on chip: 10s of Mbytes -> Gbits.
26 Gbit links; 100 Gb Ethernet on eval boards.
$7k for a VCU118 eval board with a $20k chip on it. Not engineered for industrial applications.
HBM (high bandwidth memory): super-high-bandwidth DRAM, connects over the substrate.
FPGAs with high-speed ADCs/DACs on chip: eight 3 Gsps ADCs/DACs. Not generally available yet; will be out at the end of the year.
Working on 7nm chips; no date for availability yet.
Dan: for performance/$, use the latest generation family, but a medium-size chip.
Can buy VCU118 boards in bulk. Power to the FPGA is throttled to 60W (?). May be a problem for full utilization, but looks encouraging. Full investigation not complete.
