Intro to SKARAB for programmers


1 Intro to SKARAB for programmers (and how to use HMC!) Jason Manley 2017 CASPER workshop

2 Hardware

3 Hardware
- Virtex-7 690T FPGA.
- 4 mezzanine sites per SKARAB: 2 in front, 2 in back.
- 16 SERDES links per site.
- Designed to the early PowerMX standard.
- Fans are over-provisioned and normally run at around 20-30% of rated speed.

4 Hardware
- Mezzanine cards allow memory capacity to be traded off against IO capacity. Four cards per SKARAB.
- Only one type of off-chip memory is currently available on SKARAB: HMC. It replaces both the QDR/SRAM and the DRAM found on previous CASPER boards.
- The 40G mezzanine card offers 4x 40G QSFP Ethernet ports and can drive optics or copper.
- No more complicated, flaky PHY chips that need firmware loaded to function properly.
- An ADC card is now also available, with other cards to follow.

5 Hardware: HMC Mezzanine card
- 1x HMC device per card.
- HMC is 2GiB or 4GiB.
- Two independent interfaces per card: 2x half-width (8-lane) links at 10Gbps per lane.
- Each link is bi-directional.
- Up to 160Gbps throughput per card.
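The 160Gbps figure follows directly from the link parameters on this slide. A minimal sketch of the arithmetic, assuming raw lane rates with no protocol overhead (the function name is hypothetical, not part of any CASPER tool):

```python
def link_throughput_gbps(links, lanes_per_link, gbps_per_lane):
    """Raw SERDES throughput in one direction, ignoring flit/protocol overhead."""
    return links * lanes_per_link * gbps_per_lane


# SKARAB HMC mezzanine card: 2 half-width links, 8 lanes each, 10Gbps per lane.
card_gbps = link_throughput_gbps(links=2, lanes_per_link=8, gbps_per_lane=10)
print(card_gbps)
```

The same function applied to a fully populated HMC (4 links, 16 lanes, 15Gbps per lane) gives 960Gbps per direction, so the 1.9Tbps quoted for the device itself appears to count both directions.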


6 Hardware: QSFP 40G mezzanine card
- Quad 40G QSFP Ethernet card.
- PHY-less (purely passive), but does have a small microprocessor for SFP management (power, temperature etc).
- Able to drive optics directly.
- Tested with passive cables up to 7m. AOC (Active Optical Cables) recommended for anything 5m and over.
- Does not currently work in breakout mode with spider/octopus cables (turning one 40G port into 4x 10G ports).

7 Compared to existing CASPER hardware

                        ibob      ROACH     ROACH-2    SKARAB
Logic cells             53K       94K       476K       693K
DSP slices
BRAM capacity           4.2Mb     8.8Mb     38Mb       53Mb
SRAM capacity           2x18Mb    2x36Mb    4x144Mb
SRAM bandwidth          9Gbps     43Gbps    200Gbps
DDR capacity (max)      -         1x8Gb     1x16Gb
DDR bandwidth (total)   -         38Gbps    50Gbps
Ethernet ports          2x 10G    4x10G     8x10G      < 16x40G

SKARAB off-chip memory is HMC: < 8x 32Gib capacity, 8x 30Gbps R+W bandwidth.

8 Hardware
- Uses the JASPER flow, not the traditional CASPER flow.
- Python now forms the backend for managing busses and yellowblocks.
- Backend is Xilinx Vivado, not ISE (hard break at Virtex-6/ROACH-2; no overlapping tool support). Recall Wesley's JASPER/Vivado talk on Monday.
- SKARAB incorporates all the lessons learnt from SKA-SA's sizable deployments of ibob/BEE2, ROACH-1 and ROACH-2.
- After compiling a bitstream, interacting with a SKARAB from a network-attached control computer, using any of the standard tools, is the same as working with any previous CASPER hardware.
- But it is quite different under the hood...

9 Remotely controlling SKARABs
- Previous CASPER boards (ibobs, BEE2s, ROACH1s, ROACH2s) all had out-of-band management ports (100Mbps or 1G Ethernet ports, separate from the 10G data ports).
- SKARAB can do everything in-band: data, management and (re)programming.
- Eventually over any network interface, but currently only over the 1G port or the first 40G port. Work in progress!
- SKARAB does not have a separate management processor. It uses a lightweight on-FPGA softcore MicroBlaze.
- The MicroBlaze is reloaded whenever the FPGA is reprogrammed, so the process must be robust and carefully managed to avoid losing comms to boards.
- Simpler setup and maintenance: just a power cable and a network cable to each SKARAB.
- Network appliance: no need to manage boot servers, Linux filesystems etc.
- Entire platform can be managed remotely, including upgrading all firmware over the network.
- Designed for large-scale deployments (MeerKAT, with an eye on SKA).

10 SKARAB startup sequencing
- Onboard flash memory ships with two bitstreams pre-loaded (space for up to four): a golden image and a multiboot image.
- Both are exactly the same bitstream. SKARAB tries to boot the multiboot image quickly; if that fails, it falls back to the golden image more slowly.
- You can load your own images here if you want, but that's not the idea.
- Most large CASPER deployments have a control computer on the network to configure the FPGA boards. SKARAB is designed to work in this environment: the host computer stores your various bitstreams.
- So, when SKARAB boots, it loads the flash image and asks for a DHCP lease. The server then knows about the new SKARAB board on the network, and can load whichever DSP gateware image is needed, configure registers and set it to work.
- Default is DHCP on all network ports at startup. (SKARAB wants a DHCP server. Hard-coding IP addresses in your bitstreams is no longer so easy.)
- Hostname support, for example, skarab
- LLDP support (boards announce themselves to switches).
- MAC addresses are based on serial number and network port. First 40G port has hostname skarab , with MAC 06:50:02:03:02:01
- After loading a DSP bitstream, network interfaces flap and a new DHCP transaction ensues. Depending on your DHCP server and network (switch), it can take a few seconds to bring the link back up.

11 What's working?

Working:
- Basic JASPER toolflow
- Polling sensors (power, temp, fans etc)
- HMC Mezzanine cards
- First 40G ethernet port
- 1G ethernet port
- Remote reprogramming and control
- Remote updates (flash firmware)
- DHCP, LLDP, ARP, PING and other network services
- Python casperfpga interfaces (mostly; WIP)

Not (yet) working:
- Legacy CASPER toolflow (and never will)
- Automatic fan speed control
- Retrieval of logs for hardware errors
- Arbitrary combinations of Ethernet and HMC cards
- Onboard USB JTAG bridge
- Fast (~1 second) remote reloading of FPGA gateware
- Large wishbone bus (timing implications; WIP)
- Comprehensive DRC during compile

12 Tips for designs
- Keep to the UDP port compiled in to your yellowblock for all your high-speed traffic. Otherwise you can overwhelm the MicroBlaze with traffic, which is especially problematic while trying to reprogram.
- Yellowblock default is port 7148 (SPEAD default at SKA-SA).
- Don't ever use 7778 decimal (0x1e62), which is for controlling the MicroBlaze, or 29000 decimal (0x7148), which is used for reprogramming.
- In the event of a network failure at startup, SKARAB will try indefinitely to get a DHCP lease.
- LEDs on the front panel indicate DHCP success on the golden image (useful for basic/visual debugging).
- Check for updates regularly. Development is very fluid at the moment, and nothing is stable yet.
- Current bus architecture limitations prevent very large numbers of attachments (~50 slaves is ok).
- Good news is that V7 seems to have much better routing resources, especially when building large BRAMs. Timing is much easier for large FFTs and snapshot blocks than on V6. Large designs easily meet timing at 240MHz.
- You'll get to play with all this stuff during Adam's SKARAB tutorials.
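The port rules above are easy to get wrong because the default data port (7148 decimal) and the reprogramming port (0x7148, which works out to 29000 decimal) share the same digits. A minimal sketch of a guard that a control script could run before configuring a data stream; the function and constant names are hypothetical and not part of casperfpga:

```python
# Ports the slide says must never carry user traffic.
RESERVED_UDP_PORTS = {
    0x1E62: "MicroBlaze control",  # 7778 decimal
    0x7148: "reprogramming",       # 29000 decimal
}

DEFAULT_DATA_PORT = 7148  # yellowblock default (decimal)


def check_data_port(port):
    """Return the port unchanged if it is safe for high-speed data traffic."""
    if not 0 < port < 65536:
        raise ValueError(f"{port} is not a valid UDP port")
    if port in RESERVED_UDP_PORTS:
        raise ValueError(f"port {port} is reserved for {RESERVED_UDP_PORTS[port]}")
    return port


print(check_data_port(DEFAULT_DATA_PORT))
```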


13 HMC memory
- What is Hybrid Memory Cube? Stacked DRAM on a chip, with a built-in management layer.
- Designed and optimised for very high throughput, not low latency. Perfect for radio astronomy (RA) instrumentation!
- HMC takes care of itself, including error detection on memory cells and IO operations.
- No need to deal with refreshes, bank management etc in the FPGA controller anymore.
- HMC contains smarts: it has buffers and a small ALU (you can build an accumulator inside the memory!).
- External interface is high-speed serial (SERDES) links.
- HMC supports up to 4 sets of bidirectional 16-lane links, with each lane operating at up to 15Gbps. That's up to 1.9Tbps. It's FAST!
- Micron is already on 3rd-generation HMC. SKARAB uses 2nd generation at lower speeds.

14 Accessing HMC memory
- Yellowblock packages your instructions (read/write) into flits. A flit is a packet containing a header (instruction) and data (see HMC datasheet for details).
- Fortunately, all of this is abstracted away for the user: the yellowblock makes HMC look like a conventional memory interface.
- Each HMC yellowblock offers two dual-ported interfaces. Simultaneous read and write operations are combined into a single flit.
- Memory is organised into Vaults, Banks and DRAMs. The controller allows you to map these arbitrarily into your address bits. By default, SKARAB's implementation optimises for linear reads and writes:

  Address bits:  a26 ... a8  | a7 a6 a5 a4 | a3 a2 a1 a0
  Mapped to:     D19 ... D0  | B3 B2 B1 B0 | V3 V2 V1 V0

- Yellowblock accesses 256 bits at a time, and presents a 256-bit bus.
- One clock cycle per read and/or write request.
- No need for burst reads or writes: truly random access is possible.
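The default address mapping can be sketched in a few lines. This is an illustration only, assuming the bit positions read from the table on this slide (vault in a3..a0, bank in a7..a4, DRAM field in everything from a8 upward); the function name is hypothetical:

```python
ADDR_BITS = 27   # a26 .. a0
WORD_BYTES = 32  # each address selects one 256-bit word


def decode_address(addr):
    """Split a yellowblock word address into (vault, bank, dram) fields."""
    if not 0 <= addr < (1 << ADDR_BITS):
        raise ValueError("address out of range")
    vault = addr & 0xF        # a3..a0
    bank = (addr >> 4) & 0xF  # a7..a4
    dram = addr >> 8          # a8 and above
    return vault, bank, dram


# Consecutive addresses step through the vaults first, then the banks.
for a in (0, 1, 15, 16, 256):
    print(a, decode_address(a))
```

With the vault field in the lowest bits, a linear sweep touches all 16 vaults before revisiting any of them, which is consistent with the mapping being optimised for linear reads and writes. The 27-bit word address times 32 bytes per word covers the full 4GiB device.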


16 HMC vaults and links
- There are 16 vaults per HMC device. Four are co-located with each link (collection of SERDES lanes).
- They are interconnected on-chip using a switched network, so any link can access any vault.
- Naturally, accessing co-located memory is faster than hopping through the switches to reach memory located on other links.
- Mapping is as you'd expect:
  Link 1: vaults 0, 1, 2, 3
  Link 2: vaults 4, 5, 6, 7
  Link 3: vaults 8, 9, 10, 11
  Link 4: vaults 12, 13, 14, 15
- SKARAB has links 2 and 3 connected. Thus, half the memory can be accessed locally, incurring minimum latency.
- Accessing remote vaults (0-3 and 12-15) incurs additional latency, but the switching network is a full crossbar (no reduction in bandwidth).
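The vault-to-link mapping is simple enough to express directly. A small sketch, using only the numbers on this slide (the helper names are hypothetical), that reports which vaults are local to the links SKARAB has connected:

```python
VAULTS_PER_LINK = 4
NUM_VAULTS = 16
SKARAB_CONNECTED_LINKS = (2, 3)  # links wired to the FPGA on SKARAB


def home_link(vault):
    """Return the link (1-4) that a vault is co-located with."""
    if not 0 <= vault < NUM_VAULTS:
        raise ValueError("vault must be in 0..15")
    return vault // VAULTS_PER_LINK + 1


def is_local(vault):
    """True if the vault can be reached without crossing the on-chip switch."""
    return home_link(vault) in SKARAB_CONNECTED_LINKS


local_vaults = [v for v in range(NUM_VAULTS) if is_local(v)]
print(local_vaults)
```

A latency-sensitive design could use a check like this to keep its working set in vaults 4-11, at the cost of using only half the device.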

17 HMC: More on vaults
- To increase throughput, data must be striped over multiple vaults. Maximum throughput requires you to use all vaults.
- Each vault has a buffer for transactions. If you keep accessing the same vault continuously, operations will queue and performance will degrade. NB for matrix transpose (corner-turner).
- Vaults operate semi-autonomously and respond as quickly as they can. Latency, throughput and order of operations are thus not guaranteed.
- You can issue a request to vault 1 and then another to vault 2, and get the response from vault 2 first and the reply from vault 1 some time later.
- Performance is heavily dependent on your access patterns.
- To keep track of your read requests, you issue a 9-bit tag with each one. Responses carry your tags so you can sort them out again. This can complicate things enormously.
- Data is also cached in the HMC, so if you issue the same read request twice, the second response comes back very quickly, possibly before many earlier read requests.
- Typical latency: ~80 FPGA clock cycles (230MHz) in VACC applications.
- Typical out-of-order: ranges from 0 to ~230, depending on access patterns and speeds.
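The tag mechanism can be modelled in software to see how out-of-order responses are put back in request order. This is a behavioural sketch only, assuming tags are handed out sequentially modulo 2^9; it is not the gateware reorder block, and the class name is hypothetical:

```python
class ReorderBuffer:
    """Release read responses in the order their requests were issued."""

    TAG_BITS = 9

    def __init__(self):
        self.num_tags = 1 << self.TAG_BITS
        self.next_issue = 0    # tag for the next request
        self.next_release = 0  # oldest tag not yet released
        self.outstanding = 0   # requests issued but not yet released
        self.pending = {}      # tag -> data received out of order

    def issue(self):
        """Allocate a tag for a new read request."""
        if self.outstanding == self.num_tags:
            raise RuntimeError("all tags are in flight")
        tag = self.next_issue
        self.next_issue = (tag + 1) % self.num_tags
        self.outstanding += 1
        return tag

    def receive(self, tag, data):
        """Store one response and return any data now releasable in order."""
        self.pending[tag] = data
        released = []
        while self.next_release in self.pending:
            released.append(self.pending.pop(self.next_release))
            self.next_release = (self.next_release + 1) % self.num_tags
            self.outstanding -= 1
        return released


rb = ReorderBuffer()
t0, t1, t2 = rb.issue(), rb.issue(), rb.issue()
print(rb.receive(t1, "b"))  # held back: t0 has not arrived yet
print(rb.receive(t2, "c"))
print(rb.receive(t0, "a"))  # t0 arrives, everything drains in order
```

The buffer depth needed in practice depends on how far out of order responses arrive, which the slide puts at up to roughly 230 for some access patterns.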


18 HMC yellowblock
- HMC controller automatically performs POST upon startup.
- After POST, HMC monitors itself. A 6-bit error code is reported in the event of a failure during operation.
- Checks include:
  flit (SERDES comms) errors
  ECC in DRAM core
  buffer overruns
  internal logic errors
- For best performance: linear access, simultaneous read and write flits.
- Higher-level HMC blocks available in the DSP library:
  wideband, programmable delay line
  corner-turner (matrix transpose)
  vector accumulator (buffered, with backpressure)


19 HMC conclusions & considerations
- Latency through the chip is not guaranteed.
- Throughput is not guaranteed, and depends on access patterns.
- No SKARAB support yet for special instructions (just basic read and write).
- Most applications will need a reorder block after the HMC to deal with out-of-order responses.
- If you're doing reads and writes, issue these instructions simultaneously.

20 40G ethernet core, forty_gbe
- Yellowblock interface is exactly like the 10G ethernet core, but with 256b interfaces instead of 64b interfaces.
- 40G core now does proper RX CRC checking (uses a lot of HW resources, though).
- No longer managed by the tcpborphserver and tgtap software processes on the PPC. The MicroBlaze softcore manages all network services.
- Features in place already:
  DHCP with auto-renew and hostname support based on serial number
  LLDP reporting and discovery
  ARP
  Ping
  Multicast TX and RX, including subscription to multiple sequential addresses
  IGMPv2 signalling
- As with the 10G core, multicast RX uses a bitmask arrangement: you can only subscribe to contiguous chunks of 2^N addresses.
- Current status, limitations and work in progress:
  At the moment, the 40G yellowblock is hard-coded for the first QSFP port on the third mezzanine site.
  The 40G yellowblock currently pulls in the MicroBlaze infrastructure, so all designs must contain a 40G core, even if you're not using it!
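The 2^N restriction on multicast subscriptions follows from matching addresses against a base and a bitmask. A minimal sketch of that matching rule, assuming the base address must be aligned to the block size (an assumption implied by a bitmask scheme, not stated on the slide); the function names and example addresses are hypothetical:

```python
import ipaddress


def multicast_block(base, count):
    """Return (base_int, mask_int) for a block of `count` multicast addresses."""
    if count < 1 or count & (count - 1):
        raise ValueError("count must be a power of two")
    base_addr = ipaddress.IPv4Address(base)
    if not base_addr.is_multicast:
        raise ValueError("base must be a multicast address")
    base_int = int(base_addr)
    if base_int % count:
        raise ValueError("base must be aligned to the block size")
    mask_int = 0xFFFFFFFF & ~(count - 1)
    return base_int, mask_int


def matches(addr, base_int, mask_int):
    """True if addr falls inside the subscribed block."""
    return (int(ipaddress.IPv4Address(addr)) & mask_int) == base_int


base_int, mask_int = multicast_block("239.2.0.64", 4)
print(hex(mask_int))
print(matches("239.2.0.67", base_int, mask_int))
print(matches("239.2.0.68", base_int, mask_int))
```

Under this scheme a request for, say, three addresses has to be rounded up to a block of four, and the receiver then discards the unwanted stream.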

21 40G Ethernet and HMC resources
Hardware resources for 40G ethernet and HMC cores:

          Total available   Per 40G port   Per HMC mezzanine card
Slices                      (3.1%)         (13.1%)
BRAM                        (1.7%)         116 (7.9%)
DSP                         (0%)           4 (0.1%)

22 Questions & Comments Jason Manley
