Microsecond Latency, Real-Time, Multi-Input/Output Control using GPU Processing

1 Microsecond Latency, Real-Time, Multi-Input/Output Control using GPU Processing
Nikolaus Rath (Columbia University), March 20th, 2013

2 Outline
1 Motivation
2 GPU Control System Architecture
3 Performance

4 Fusion keeps the Sun Burning
Nuclear fusion is the process that keeps the sun burning.
Very hot hydrogen atoms (the "plasma") collide to form helium, releasing lots of energy.
It would be great to replicate this on earth: plenty of fuel is available, and there is no risk of nuclear meltdown.
Challenges: heat things to millions of degrees (not so hard), and keep them confined (very hard).
Reaction: ²H + ³H → ⁴He (3.5 MeV) + n (14.1 MeV)

5 At Millions of Degrees, Small Plasmas Evaporate Away

6 Magnetic Fields Constrain Plasma Movement to One Dimension

7 Closed Magnetic Fields Can Confine Plasmas

8 Tokamaks Confine Plasmas Using Magnetic Fields
Orange, Magenta, Green: magnetic field generating coils
Violet: plasma; Blue: single magnetic field line
(example) 1 meter radius, 1 million °C, Ampere current

9 Self Generated Fields Cause Instabilities
Electric currents (which generate magnetic fields) flow not just in the coils, but also in the plasma itself.
The plasma thus modifies the fields that confine it... sometimes in a self-amplifying way, which produces an instability.
Typical shape: rotating, helical deformation. Timescale: 50 microseconds.

10 Only High-Speed Feedback Control Can Preserve Confinement
Sensors detect deformations due to plasma currents.
Control coils dynamically push back: feedback control.

11 Outline
1 Motivation
2 GPU Control System Architecture
3 Performance

12 Real-Time Performance is Determined By Latency and Sampling Period
[Figure: sample packets (S) travel from the digitizer through the GPU processing pipelines to the analog output, with the latency and the sampling period marked.]
Latency is the response time of the feedback system.
Sampling period determines smoothness.
Algorithmic complexity limits latency, not sampling period.
Need both latency and sampling period on the order of a few microseconds.
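
The relation between latency and sampling period on this slide can be made concrete with a little arithmetic. The following is a minimal sketch in C, not code from the talk. It assumes that each pipeline works on one sample packet at a time, so the number of packets in flight is the latency divided by the sampling period, rounded up. The function name and the example numbers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the talk): number of sample packets
 * that must be in flight at once if each packet takes latency_ns to
 * process and a new packet arrives every period_ns.
 * Assumes one packet per pipeline. */
static uint32_t pipelines_needed(uint32_t latency_ns, uint32_t period_ns)
{
    assert(period_ns > 0);
    /* Integer ceiling division: a partially used pipeline still counts. */
    return (latency_ns + period_ns - 1) / period_ns;
}
```

With an assumed latency of 10 µs and a sampling period of 4 µs, `pipelines_needed(10000, 4000)` evaluates to 3. A more complex algorithm raises the latency and therefore the pipeline count, while the sampling period can stay the same.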

13 Control Algorithm is Implemented in One Kernel
[Figure: CPU and GPU timelines for two approaches. Traditional: the CPU reads input data, sends it to GPU memory, starts GPU kernel A, waits for it, reads the results from GPU memory, and processes them; it then sends new data to GPU memory, starts GPU kernel B, waits for it, reads the results, and writes the output data. Single kernel: the CPU sends parameters to GPU memory, starts one GPU kernel, and waits for it; the kernel itself reads the data, processes it, and writes the output data.]

14 Redundant PCIe Transfers have to be Avoided To Reduce Latency
Traditional: data bounces through host RAM.
The PCIe bus has multi-GB/s throughput, but transfer setup takes several µs.
Okay if data chunks are big, so that transfer and processing take long.
Bad if the setup latency is longer than the transfer time.
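
To see why the setup cost matters for small packets, one can model a transfer as a fixed setup time plus the payload time. This is only a sketch under stated assumptions: the function name is hypothetical, and the setup time and throughput used in the note below are illustrative values, not measurements from the talk.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cost model (not from the talk) for one PCIe transfer:
 * a fixed setup cost plus the time needed to move the payload at a
 * given throughput, expressed in bytes per nanosecond. */
static uint64_t transfer_time_ns(uint64_t bytes, uint64_t setup_ns,
                                 uint64_t bytes_per_ns)
{
    assert(bytes_per_ns > 0);
    /* Payload time, rounded up to whole nanoseconds. */
    uint64_t payload_ns = (bytes + bytes_per_ns - 1) / bytes_per_ns;
    return setup_ns + payload_ns;
}
```

One sample set from a 96-channel, 16-bit digitizer is 192 bytes. With an assumed 2 µs setup and 4 bytes/ns (about 4 GB/s), `transfer_time_ns(192, 2000, 4)` gives 2048 ns, of which only 48 ns is payload. For a 1 MiB chunk the same call gives 264144 ns, and the setup cost is negligible. Each bounce through host RAM pays the setup cost again.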

15 Redundant PCIe Transfers have to be Avoided To Reduce Latency
New: peer-to-peer transfers eliminate the need for a bounce buffer.
Good performance even for small amounts of data.
Can be implemented in software (kernel).
The required peer-to-peer capable root complex is present in most mid- to high-end mainboards.

16 Peer-to-peer PCIe transfers are set up by sharing BARs
[Figure: the GPU (with its GPU memory), the A/D module, and the D/A module each have BARs; the A/D and D/A modules contain DMA controllers that write to and read from a BAR of the GPU. The BARs are initialized from the BIOS by the CPU.]
PCIe devices communicate via BARs in the PCI address space.
The GPU can map part of its memory into a BAR.
AD/DA modules can transfer to/from arbitrary PCI addresses.
The CPU establishes communication by telling the AD/DA modules about the GPU BAR.
This required some trickery in the past, but is officially supported as of CUDA 5.

17 Example: Userspace
    /* Allocate buffer with extra space for 64 KiB alignment */
    CUdeviceptr dev_addr;
    cuMemAlloc(&dev_addr, size + 0xFFFF);

    /* Prepare mapping */
    CUDA_POINTER_ATTRIBUTE_P2P_TOKENS tokens;
    cuPointerGetAttribute(&tokens, CU_POINTER_ATTRIBUTE_P2P_TOKENS, dev_addr);

    /* Align to 64 KiB */
    dev_addr += 0xFFFF;
    dev_addr &= ~(CUdeviceptr)0xFFFF;

    /* Call custom kernel module to get bus address; fd refers to the open device file */
    struct rdma_info s;
    s.dev_addr = dev_addr;
    s.p2ptoken = tokens.p2pToken;
    s.vaspacetoken = tokens.vaSpaceToken;
    s.size = size;
    ioctl(fd, RDMA_TRANSLATE_TOKEN, &s);
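
The add-then-mask alignment step in the userspace example can be checked in isolation. The sketch below is plain C with no CUDA dependency; the helper name and the macro are hypothetical and only illustrate the idiom on ordinary 64-bit integers.

```c
#include <assert.h>
#include <stdint.h>

#define ALIGN_64K ((uint64_t)0x10000)

/* Hypothetical helper (not from the talk): round addr up to the next
 * multiple of 64 KiB, leaving it unchanged if it is already aligned.
 * Same add-then-mask idiom as in the slide. */
static uint64_t align_up_64k(uint64_t addr)
{
    assert(addr <= UINT64_MAX - (ALIGN_64K - 1)); /* no wrap-around */
    return (addr + (ALIGN_64K - 1)) & ~(ALIGN_64K - 1);
}
```

The aligned address is never more than 0xFFFF bytes past the original one, which is why allocating `size + 0xFFFF` bytes leaves room for `size` usable bytes after alignment.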

18 Example: Kernelspace
    long rtm_t_dma_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
    {
        nvidia_p2p_page_table_t *page_table;
        // ...
        switch (cmd) {
        case RDMA_TRANSLATE_TOKEN: {
            /* Fetch the request (tokens, device address, size) from userspace */
            COPY_FROM_USER(&rdma_info, varg, sizeof(struct rdma_info));
            /* Pin the GPU memory and obtain its page table */
            nvidia_p2p_get_pages(rdma_info.p2ptoken, rdma_info.vaspacetoken,
                                 rdma_info.dev_addr, rdma_info.size,
                                 &page_table, rdma_free_callback, tdev);
            /* Report the bus address of the first page back to userspace */
            rdma_info.bus_addr = page_table->pages[0]->physical_address;
            COPY_TO_USER(varg, &rdma_info, sizeof(struct rdma_info));
            return 0;
        }
        // Other ioctls
        }
    }

19 Userspace Continued
    /* Call custom kernel module to get bus address; fd refers to the open device file */
    rtm_t_rdma_info s;
    s.dev_addr = dev_addr;
    ioctl(fd, RTM_T_TRANSLATE_TOKEN, &s);

    /* Retrieve bus address */
    uint64_t bus_addr;
    bus_addr = s.bus_addr;

    /* Send bus address to digitizer */
    init_rtm_t(bus_addr, other, stuff, here);

    // Start GPU kernel
    // Kernel polls input buffer
    // Wait for kernel to complete

20 Outline
1 Motivation
2 GPU Control System Architecture
3 Performance

21 The HBT-EP Plasma Control System was Built with Commodity Hardware
Hardware:
Workstation PC
NVIDIA GeForce GTX 580
D-TACQ ACQ196 A-D Converter (96 channels, 16 bit)
2 D-TACQ AO32CPCI D-A Converters (2 x 32 channels, 16 bit)
Standard Linux host system (no real-time kernel required!)

22 P2P Transfers Reduce Latency by 50%
[Figure: latency [µs] versus sampling period [µs] for GPU RAM and host RAM.]
Optimal latency when using host memory: 16 µs
Optimal latency when using GPU memory: 10 µs
A 50% difference does not mean having to wait twice as long; it is the difference between things blowing up or not.
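
As a quick check on the two latencies quoted on this slide, the following C sketch (not from the talk; both function names are hypothetical) expresses the 6 µs gap relative to each baseline. Results are in tenths of a percent so that the arithmetic stays in integers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: how much longer `slow` is than `fast`,
 * relative to `fast`, in units of 0.1%. */
static uint32_t permille_longer(uint32_t slow, uint32_t fast)
{
    assert(fast > 0 && slow >= fast);
    return (slow - fast) * 1000u / fast;
}

/* Hypothetical helper: how much shorter `fast` is than `slow`,
 * relative to `slow`, in units of 0.1%. */
static uint32_t permille_shorter(uint32_t slow, uint32_t fast)
{
    assert(slow > 0 && slow >= fast);
    return (slow - fast) * 1000u / slow;
}
```

For 16 µs and 10 µs, `permille_longer(16, 10)` is 600 (the host-memory path takes 60% longer) and `permille_shorter(16, 10)` is 375 (the GPU-memory path is 37.5% shorter). The rounded 50% in the slide title lies between these two ways of expressing the same gap.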

23 GPU Beats CPU in Computational and Real-Time Performance even in the Microsecond Regime
Performance tested with repeated matrix application.
GPU beats CPU down to 5 µs.
Missed samples counted in 1000 runs.
Missed samples with GPU: none; with CPU: up to 2.5%.
[Figures: sampling period [µs] versus matrix size for CPU and GPU; count versus missed samples [%] for GPU and CPU.]
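
"Matrix application" can be read as multiplying the vector of input samples by a fixed matrix to obtain the vector of outputs. The talk does not show the benchmark code, so the following is only a plain CPU reference sketch in C with a hypothetical function name and a row-major layout chosen for illustration; it is not the GPU kernel used in the measurements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU reference (not the talk's GPU kernel): one control
 * step as a dense matrix-vector product, out = m * in. The matrix is
 * stored row-major with `rows` rows (outputs) and `cols` columns
 * (inputs). */
static void apply_matrix(const float *m, const float *in, float *out,
                         size_t rows, size_t cols)
{
    assert(m && in && out);
    for (size_t r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (size_t c = 0; c < cols; c++)
            acc += m[r * cols + c] * in[c];
        out[r] = acc;
    }
}
```

For the converters listed earlier (96 input channels, 2 x 32 output channels), a full input-to-output mapping would be a 64 x 96 matrix; that sizing is an inference from the hardware list, not a figure stated in the talk.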

24 Summary
1 The advantages of GPUs are not restricted to large problems requiring long calculations.
2 Even when processing kB-sized batches under microsecond latency constraints, GPUs can be faster than CPUs, while at the same time offering better real-time performance.
3 In these regimes, data transfer overhead becomes the dominating factor, and using peer-to-peer transfers improves performance by more than 50%.
4 A GPU-based real-time control system has been developed at Columbia University and tested for the control of magnetically confined plasmas. Contact us for details.

25 Outline
4 Backup Slides

26 Latency and Sampling Period are Measured Experimentally by Copying Square Waves
[Figure: control input, control output, and sample clock traces (Volt versus time [µs]) for one shot, with markers A and B.]
Control algorithm set up to copy input to output 1:1.
Blue trace is the input square wave.
Green trace is the output square wave.
Output lags behind input by the control system latency.
Red trace is the sampling interval (sampling on downward edge).
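
The measurement described on this slide amounts to finding how far the output edge trails the input edge. The sketch below shows one way this could be computed from recorded traces. It assumes the traces have already been thresholded to 0/1 values and were recorded at a fixed sample spacing; the function names and this preprocessing are assumptions, not the procedure used in the talk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: index of the first low-to-high transition in a
 * thresholded trace (values 0 or 1), or -1 if there is none. */
static long first_rising_edge(const int *trace, size_t n)
{
    assert(trace);
    for (size_t i = 1; i < n; i++)
        if (trace[i - 1] == 0 && trace[i] == 1)
            return (long)i;
    return -1;
}

/* Hypothetical helper: latency estimate in ns, taken as the lag between
 * the first rising edges of the input and output traces multiplied by
 * the spacing of the recorded samples. Returns -1 if either trace has
 * no rising edge. */
static long latency_ns(const int *in, const int *out, size_t n,
                       long ns_per_sample)
{
    long e_in = first_rising_edge(in, n);
    long e_out = first_rising_edge(out, n);
    if (e_in < 0 || e_out < 0)
        return -1;
    return (e_out - e_in) * ns_per_sample;
}
```

If the input rises at recorded sample 2 and the output at sample 4, with 500 ns between recorded samples, the estimate is 1000 ns.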

27 Plasma Physics Results: Dominant Mode Amplitude Reduced by up to 60%
[Figure: amplitude versus frequency [kHz] for No FB, g=144, and g=577.]

28 Self Generated Fields Cause Instabilities
Electric currents (which generate magnetic fields) flow not just in the coils, but also in the plasma itself.
The plasma thus modifies the fields that confine it... sometimes in a self-amplifying way, which produces an instability.
Typical shape: rotating, helical deformation. Timescale: 50 microseconds.

29 Feedback Control uses Measurements to Determine Control Signals
[Figure: feedback loop. Input → Controller → control signal / control output → Actuators → physical interaction → System → output → physical interaction → Sensors → measurements / control input → Controller.]
Goal: keep the system in a specific state.
If the system is perfectly known, the required control signals can be calculated (open-loop control).
In practice, measurements are needed to determine effects and update the signals: feedback control.
A control system acquires measurements, performs computations, and generates control output to manipulate the system state.
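
The loop of measuring, computing, and actuating can be illustrated with a toy simulation. This is a minimal sketch under strong assumptions: a scalar state that grows by a fixed factor per step and a purely proportional controller. The model, the function name, and all parameter values are hypothetical and far simpler than the plasma control algorithm.

```c
#include <assert.h>

/* Hypothetical toy model (not from the talk): each step multiplies the
 * state by `growth` (> 1 means a self-amplifying perturbation) and adds
 * the control signal. The controller is proportional:
 * u = -gain * measurement. Returns the state after `steps` steps. */
static double simulate(double x0, double growth, double gain, int steps)
{
    assert(steps >= 0);
    double x = x0;
    for (int i = 0; i < steps; i++) {
        double measurement = x;         /* sensors: read the system state */
        double u = -gain * measurement; /* controller: compute response   */
        x = growth * x + u;             /* actuators act on the system    */
    }
    return x;
}
```

With a growth factor of 1.1 and no feedback (`gain` of 0), a perturbation of 1.0 grows to roughly 6.7 after 20 steps. With a gain of 0.5 the state shrinks by a factor of about 0.6 per step and decays towards zero.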

30 Data Passthrough Establishes 8 µs Lower Latency Limit
[Figure: latency [µs] versus sampling period [µs] for GPU RAM and host RAM.]
Control system uses the same buffer to write input and read output.
No GPU processing, so no difference between host and GPU memory.
Jump: 4 µs required for A-D conversion and data push.
Offset: 4 µs required for data pull and D-A conversion.
