CO405H. Department of Computing, Imperial College London. Computing in Space with OpenSPL, Topic 14: Networking DFEs

1 CO405H Computing in Space with OpenSPL, Topic 14: Networking DFEs
Oskar Mencer, Georgi Gaydadjiev
Department of Computing, Imperial College London
http:// http://
CO405H course page: http://cc.doc.ic.ac.uk/openspl16/
WebIDE: http://openspl.doc.ic.ac.uk
OpenSPL consortium page: http://

2 DFEs in the network
Networks operate on data streams
This seems like a natural use scenario for DFEs
Some problems are specific to network processing
Possible solutions
Applications: Long Int. Multiplication, RSA Cryptography, Dynamic Programming, Laplace Heat Equation, Viterbi Decoder, Sound Synthesis, Neural Networks

3 The networking DFE Card
QSFP {TOP, MID, BOT}: 4 10G serial links each
[Figure: card block diagram. The Reconfigurable Logic connects to QSFP TOP, QSFP MID, QSFP BOT, PTP, QMEM (QDR II SRAM (18MB) blocks), LMEM (DDR3 DRAM (8-16GB) blocks) and PCI Express x16 (electrical), x8 (logical)]
JDFE is pin compatible but has 8 serial links:
  4 to the switch fabric (ports selectable via JunOS)
  4 directly connected to the reconfigurable chip

4 The Software Stack
The user application is written using the SLiC API
  Statically linked to libslic.a
SLiC relies on MaxelerOS's MaxRT
  Linked via libmaxeleros.so, a shared library
MaxRT communicates with the driver
The driver exposes functionality through the file-system: /dev/maxeler0 and /proc/maxeler/
  MaxRT performs ioctls on the device file
  MaxRT also uses file operations on the /proc/maxeler/dev0/ files
The driver communicates with the Hardware via Slave IO
  Effectively memory-mapped IO
  The driver writes to a special memory address and the Hardware receives the data
The Hardware communicates directly with the User application using DMA
  This means the hardware accesses the System RAM directly
  The user app reads/writes from/to System RAM

5 Links to/from CPU
The manager can create a special entity for exchanging data with the CPU
All communication with the CPU is done over PCI Express
Generally, there is only one way to exchange data with the CPU: Memory Access
From the CPU point of view:
  To send data to the DFE → write it to a special memory address (pointer)
  To read data from the DFE → the DFE writes the data directly into a given buffer, so the CPU simply polls that memory region and the data will appear there at some point
From the DFE point of view:
  Receiving data from the CPU → data will appear at the output of a special manager block, and go over a link into a kernel
  Sending data to the CPU → send the data out on a link that is connected to the special manager block
[Figure: DFE, System RAM, CPU]

6 Link Interfaces (flow control)
Links have 2 types of interfaces:
  1. PUSH: Valid / Stall semantic
  2. PULL: Read / (empty / almost empty) semantic
The direction of the data determines the arrow direction:
  Input = data coming into the block
  Output = data going out
[Figure: Source and Sink; Input and Output of a Manager Node]

7 Computer Networks (Packets and Frames)
IO is connected in the Manager code
addethernetstream( ) will create all the necessary components for you to be able to receive packets from the network
The terms Packets and Frames are used interchangeably, but we prefer the term Frame
A frame is any data that is presented to the user alongside the following metadata:
  SOF: Start Of Frame indicator
  EOF: End Of Frame indicator
  MOD: number of valid bytes in the End Of Frame word

8 10G Network Traffic Through a Serial Link
10Gbps → 1 bit every 0.1ns → 64 bits every 6.4ns
6.4ns period → 156.25 MHz
[Figure: 10Gbps (e.g., Fiber) → SFP Module → 10GHz Deserializer → MAC → Kernel inside the Manager; the Kernel receives MOD, EOF, SOF and Data at 156.25 MHz]
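
The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name `word_clock_mhz` is made up for illustration and is not part of any Maxeler API.

```python
from fractions import Fraction


def word_clock_mhz(line_rate_gbps, word_bits):
    """Clock (in MHz) needed to carry line_rate_gbps using word_bits-wide words."""
    bit_time_ns = Fraction(1) / Fraction(line_rate_gbps)  # 10 Gbps -> 0.1 ns per bit
    word_time_ns = bit_time_ns * word_bits                # 64 bits -> 6.4 ns per word
    return Fraction(1000) / word_time_ns                  # period in ns -> frequency in MHz


if __name__ == "__main__":
    print(float(word_clock_mhz(10, 64)))
```

Exact fractions are used so that the 6.4 ns period does not pick up floating-point rounding.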

9 Standard Ethernet Interfaces
The most commonly used 10G Ethernet interfaces vary simply by data width:
  64 bits @ 156.25 MHz
  32 bits @ 312.5 MHz (it's really 322.265625 MHz)
This is because: 32b * 312.5MHz = 64b * 156.25MHz = 10Gbps
64-bit interface: Data = 64 bits, SOF = 1 bit, EOF = 1 bit, MOD = BitsToAddress(64/8) = 3 bits
32-bit interface: Data = 32 bits, SOF = 1 bit, EOF = 1 bit, MOD = BitsToAddress(32/8) = 2 bits
Example with a 69-bit wide bus: SOF (1 bit), EOF (1 bit), MOD (3 bits), Data (64 bits)
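
The field widths above can be reproduced as follows. This is a sketch under one assumption: `bits_to_address` is our own reading of BitsToAddress as "bits needed to index n positions", not the library's actual implementation.

```python
def bits_to_address(n):
    """Bits needed to address n distinct positions (0 .. n-1). Assumed meaning."""
    return max(1, (n - 1).bit_length())


def framed_bus_width(data_bits):
    """Total wires for a framed link: SOF + EOF + MOD + Data."""
    mod_bits = bits_to_address(data_bits // 8)  # one MOD value per byte position
    return 1 + 1 + mod_bits + data_bits


if __name__ == "__main__":
    print(framed_bus_width(64), framed_bus_width(32))
```

With this definition the 64-bit interface comes out at the 69 wires shown in the example.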

10 Network Traffic: User Point of View
Example of a 59 byte frame over a 64 bit link running at 156.25 MHz
MOD indicates how many bytes are valid in the last word of the frame
MOD is only valid when EOF=1
[Figure: timing diagram of SOF, EOF, MOD and Data (64 bits = 8 bytes) against Time (cycles)]

11 Hello World!
[Figure: the 12 bytes of "Hello World!" (numbered 1 to 12) sent over a 64-bit link in two cycles, the first word carrying "Hello Wo" and the second "rld!"; rows show SOF, EOF, MOD and Data (64 bits) against Time (cycles)]
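
A minimal software model of the framing shown on the last two slides. The function name, the dictionary layout and the zero padding are illustrative choices. Two things are assumptions: that a completely full last word is reported as MOD = 0 (the slides do not say), and the byte order within a word.

```python
def frame_to_words(payload, word_bytes=8):
    """Split a frame into bus words, each tagged with SOF, EOF and MOD."""
    if not payload:
        raise ValueError("a frame needs at least one byte")
    words = []
    for start in range(0, len(payload), word_bytes):
        chunk = payload[start:start + word_bytes]
        eof = start + word_bytes >= len(payload)
        words.append({
            "sof": start == 0,
            "eof": eof,
            # MOD is only meaningful on the EOF word. Assumption: a full last
            # word wraps to 0, because MOD has only 3 bits on a 64-bit link.
            "mod": len(chunk) % word_bytes if eof else 0,
            # Unused byte lanes of the last word are padded with zeros here.
            "data": chunk.ljust(word_bytes, b"\x00"),
        })
    return words


if __name__ == "__main__":
    for word in frame_to_words(b"Hello World!"):
        print(word)
```

For the 59 byte frame of the previous slide this gives 8 words, the last one with MOD = 3.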

12 Clock Domains
Simply put: a set of blocks in which data can move at a certain rate
Typical Clock Domains:
  PCI Express (Gen 2.0 x8): 250MHz
  Network (64 bits): 156MHz
  LMEM: 400MHz
  QMEM: 550MHz
  Stream Clock (default): 100MHz
The higher the clock frequency, the faster data can move
Data can move between clock domains, but it might not appear contiguous at the destination clock domain
[Figure: the same data shown at 100MHz and at Network (156MHz) against Time]
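
To see why data stops being contiguous after a clock-domain crossing, here is an idealized sketch. It assumes the source produces one word per cycle, ignores FIFO latency, and treats a word as available once its source cycle has finished. The function name is made up for illustration.

```python
from fractions import Fraction
from math import floor


def crossing_valid_pattern(src_mhz, dst_mhz, dst_cycles):
    """Valid flag seen on each destination cycle (1 = a word is delivered)."""
    ratio = Fraction(src_mhz) / Fraction(dst_mhz)
    pattern, consumed = [], 0
    for k in range(1, dst_cycles + 1):
        produced = floor(k * ratio)   # words the source has finished by this tick
        valid = produced > consumed   # deliver a word only if one is waiting
        consumed += valid
        pattern.append(int(valid))
    return pattern


if __name__ == "__main__":
    print(crossing_valid_pattern(100, 156.25, 12))
```

Moving from 100MHz to 156.25MHz, only 16 of every 25 destination cycles carry data, so gaps appear in the stream.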

13 Clock Domains (cont)
Every Manager Block is associated with a Clock Domain
Some blocks, like Kernels and State Machines, are flexible and can be assigned to a specific clock domain by the User
When blocks that belong to different Clock Domains are connected, the Manager automatically inserts a Dual-Clock FIFO to help with the domain transition
[Figure: PCI Express (250MHz), LMEM (400MHz), K1 (300MHz) and K2 (100MHz) connected across clock domain boundaries]

14 Clock Domains and Throughput
Consider a Kernel that generates data on every clock cycle
For a fixed length of time: the higher the clock frequency, the more data the kernel will generate
[Figure: output of K1 (100MHz) and K2 (200MHz) against Time (ns)]

15 Store and Forward
The act of storing a complete frame before forwarding it downstream for further processing
Common for checksum verification
Less common: clock-domain transitioning
[Figure: Input and Output frames against Time, with SOF and EOF marked]

16 Store and Forward: Frame Continuity
A store and forward block can convert a non-continuous frame into a continuous one
This property is important when connecting directly to Ethernet MACs, since those can only work with continuous frames
[Figure: a frame (SOF to EOF) at 100MHz, at Network (156MHz), and after S&F, against Time]
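
A software sketch of store and forward, assuming one list entry per clock cycle: `None` for an idle cycle, or a `(data, eof)` tuple for a valid word. Input and output share one clock here for simplicity. The real block is a hardware component; this only illustrates the buffering idea.

```python
from collections import deque


def store_and_forward(cycles):
    """Return the per-cycle output: None, or (data, eof) once a frame is complete."""
    pending = []     # words of the frame still being received
    ready = deque()  # words of complete frames, waiting to be sent
    out = []
    for item in cycles:
        if item is not None:
            pending.append(item)
            if item[1]:                # EOF: the whole frame is now stored
                ready.extend(pending)
                pending = []
        out.append(ready.popleft() if ready else None)
    while ready:                       # drain what is left after the input ends
        out.append(ready.popleft())
    return out


if __name__ == "__main__":
    gappy = [("a", False), None, ("b", False), None, ("c", True)]
    print(store_and_forward(gappy))
```

The gaps between a, b and c disappear at the output, at the cost of delaying the frame until its EOF has arrived.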

17 Kernel Latency
The pipeline depth from a specific input to a specific output
For HPC applications: typically in the 1,000s
For Networking applications: typically in the 100s
For Ultra-Low-Latency applications: typically in the 10s
The numbers in (brackets) are the node latencies

18 Stream Hold
Data flowing through a graph will only be valid at certain times
If we're interested in the 4th data item relative to the start of frame, we can store it for later use by using a stream hold
The streamhold will only remember the value that was at its input when valid & sof = true
Only when valid & sof = true was stream.offset(data, 3) interesting
streamhold(stream.offset(data, 3), valid & sof)
[Figure: graph with data → offset +3 → Hold, controlled by valid & sof; timing rows for Valid, SOF, Input of StreamHold and Output of StreamHold against Wall Time (cycles), with data a b c d e f and held value d]
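
The behaviour described above can be modelled on plain Python lists. This is a sketch, not the kernel API: `offset` and `stream_hold` are stand-ins written for this example. It assumes the hold outputs the new value in the same cycle in which it stores it, and outputs `None` before the first store.

```python
def offset(stream, n, fill=None):
    """Model of stream.offset(data, n): at position i, see the element n ahead."""
    size = len(stream)
    return [stream[i + n] if 0 <= i + n < size else fill for i in range(size)]


def stream_hold(stream, store, initial=None):
    """Remember the input whenever store is true, and keep outputting it."""
    held, out = initial, []
    for value, keep in zip(stream, store):
        if keep:
            held = value
        out.append(held)
    return out


if __name__ == "__main__":
    data = list("abcdef")
    valid = [1, 1, 1, 1, 1, 1]
    sof = [1, 0, 0, 0, 0, 0]
    store = [v & s for v, s in zip(valid, sof)]
    print(stream_hold(offset(data, 3), store))
```

With SOF on item a, the 4th item d is captured and held for the rest of the frame.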

19 Kernel behavior: Pull in, Push out
PULL type inputs
PUSH type outputs
FIFO queues at each input and output of a kernel
FIFOs can normally hold up to 512 data words before being full
[Figure: Kernel with a Pull Input (Read, Data, Empty) and a Push Output (Data, Valid, Full/stall)]

20 Kernel Flow Control: Stalling
Input: read attempt, but there is no data available
  Kernel will stop everything until there is more data!
Output: write attempt, but the output buffer is full
  Kernel will stop everything until there is more space!
[Figure: kernel K with a Pull Input and a Push Output]
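
The two stall conditions can be sketched as a single clock tick of a latency-free kernel sitting between two FIFOs. The function and parameter names are hypothetical. The default capacity of 512 words follows the previous slide.

```python
from collections import deque


def kernel_tick(in_fifo, out_fifo, out_capacity=512, fn=lambda x: x):
    """Run one clock tick. Returns True if the kernel ran, False if it stalled."""
    if not in_fifo:                     # read attempt, but no data available
        return False
    if len(out_fifo) >= out_capacity:   # write attempt, but output buffer is full
        return False
    out_fifo.append(fn(in_fifo.popleft()))
    return True


if __name__ == "__main__":
    inputs, outputs = deque([1, 2]), deque()
    print([kernel_tick(inputs, outputs) for _ in range(3)])
```

A stalled tick leaves both FIFOs untouched, which matches "the kernel will stop everything".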

21 Kernel Flushing
I have only 2 data items: A and B
[Figure: empty kernel pipeline between Input and Output]

22 Kernel Flushing
A goes in, B is still outside
[Figure: A inside the pipeline]

23 Kernel Flushing
Both A and B are processed
Input has now gone empty!
[Figure: A and B inside the pipeline]

24 Kernel Flushing
A problem!
No more inputs
Kernel is stalling!
[Figure: A and B stuck inside the pipeline]

25 How long will a kernel run for?
We normally specify the number of expected data items
This is called the RunCycleCounter and it's part of the kernel's Flushing Logic
[Figure: kernel with Run Cycle Count = 2]

26 Kernel Flushing With Cycle Count = 2
We got 2 data items, Flushing logic kicks in!
[Figure: Run Cycle Count = 2; A and B inside the pipeline]

27 Kernel Flushing With Cycle Count = 2
Flushing.
[Figure: Run Cycle Count = 2; A reaches the Output, B still inside]

28 Kernel Flushing With Cycle Count = 2
Still Flushing.
[Figure: Run Cycle Count = 2; B moving towards the Output]

29 Kernel Flushing With Cycle Count = 2
And we're done!
[Figure: Run Cycle Count = 2; pipeline empty]
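
The flushing sequence on the preceding slides can be replayed with a toy pipeline model. The pipeline depth of 3 and all names are assumptions made for illustration. The slides do not give the real depth or describe how the flushing logic is built.

```python
from collections import deque


def run_blocking_kernel(inputs, depth=3, run_cycle_count=None):
    """Simulate a kernel with a blocking input. Returns (outputs, stalled)."""
    pending = deque(inputs)
    pipeline = deque([None] * depth)   # None marks an empty pipeline stage
    outputs, consumed = [], 0
    while True:
        flushing = run_cycle_count is not None and consumed >= run_cycle_count
        if flushing:
            if all(stage is None for stage in pipeline):
                return outputs, False  # everything has drained: done
            pipeline.appendleft(None)  # flushing logic pushes a bubble
        elif pending:
            pipeline.appendleft(pending.popleft())
            consumed += 1
        else:
            return outputs, True       # blocking input with no data: stalled
        leaving = pipeline.pop()
        if leaving is not None:
            outputs.append(leaving)


if __name__ == "__main__":
    print(run_blocking_kernel(["A", "B"]))
    print(run_blocking_kernel(["A", "B"], run_cycle_count=2))
```

Without a cycle count, A and B stay stuck inside the pipeline. With a cycle count of 2 both reach the output.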

30 There is, however, a problem
This doesn't work with networking
How many data items will the kernel expect? Unknown
We might receive 1. We might receive 2. An infinite amount or nothing at all!
What do we do?

31 Non-Blocking Inputs
Non-blocking inputs solve this problem
They never stall the kernel, even when there's no data available!
They do this by adding a Valid bit to every incoming word
32 bits → 33 bits = 32 bits + 1 valid bit

32 Kernel Flushing: Non-Blocking Input
[Figure: every pipeline stage holds an invalid word (v=0)]

33 Kernel Flushing: Non-Blocking Input
[Figure: A (v=1) enters; the remaining stages hold v=0]

34 Kernel Flushing: Non-Blocking Input
[Figure: B (v=1) enters behind A (v=1); the remaining stages hold v=0]

35 Kernel Flushing: Non-Blocking Input
[Figure: an invalid word (v=0) enters behind B (v=1); A (v=1) reaches the Output]

36 Kernel Flushing: Non-Blocking Input
[Figure: invalid words (v=0) keep entering; B (v=1) reaches the Output]

37 Kernel Flushing: Non-Blocking Input
The kernel keeps running forever with valid=0 until more real data arrives!
Things will further improve with Custom Kernels, a new class currently under development
[Figure: every pipeline stage holds v=0]
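
The non-blocking behaviour can be replayed with a toy pipeline model in which an invalid word enters whenever nothing has arrived. The depth of 3 and the names are again illustrative assumptions.

```python
from collections import deque


def run_nonblocking_kernel(arrivals, depth=3):
    """arrivals: per-tick input, a value or None when nothing arrived.

    Returns the (data, valid) pair seen at the output on each tick.
    """
    pipeline = deque([(None, 0)] * depth)
    trace = []
    for item in arrivals:
        word = (item, 1) if item is not None else (None, 0)
        pipeline.appendleft(word)   # never stalls: an invalid word enters instead
        trace.append(pipeline.pop())
    return trace


if __name__ == "__main__":
    for tick in run_nonblocking_kernel([None, "A", "B", None, None, None, None]):
        print(tick)
```

A and B leave the pipeline without any cycle count, because the invalid words that follow them keep the pipeline moving.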

38 Network DFE Simulated System
MaxelerOS comes with a Simulation environment
It aims to be cycle-accurate when compared to the hardware environment
In reality, the timing of dynamic events is completely different, but the in-kernel simulation is very accurate
SLiC is always the same, so it's trivial for an application to switch between hardware and simulation
The hardware is simulated inside the MaxelerOS Sim Daemon
[Figure: on the CPU, Application → normal SLiC → Simulation MaxRT → MaxelerOS Simulation Daemon]

39 Simulated Networks
Simulation uses TUN/TAP devices to create virtual NICs in Linux
These NICs simulate a network device that has a direct connection to a physical port on the Simulated DFE
Linux can send and receive packets through the simulated NIC, and those go to/from the simulated DFE's port
[Figure: Simulation Daemon containing the Simulated DFE (TOP, MID, BOT) and the Simulated Network, with Simulated NICs on the Linux side]

40 Simulation in Practice
The simulated DFE is completely invisible from Linux's point of view
The best way to think about it is that the DFE is a different computer entirely that happens to be connected to the same network as the Simulated NIC
This means that a Linux program that uses standard sockets can send data back and forth to the Simulated DFE using standard network protocols (TCP/UDP/ICMP etc.)
You need to make sure:
  To assign the Simulated NIC an IP address
  To assign the Simulated DFE an IP address which is in the same network as the Simulated NIC
[Figure: Simulated DFE port TOP connected to the Simulated NIC, tap0 in Linux]
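
A minimal sketch of the two checks above from the Linux side, using only the Python standard library. Every address, prefix and port is hypothetical and must be replaced by whatever was assigned to tap0 and to the simulated DFE. The function names are invented for illustration.

```python
import ipaddress
import socket


def same_network(nic_cidr, dfe_ip):
    """True if dfe_ip lies in the network configured on the simulated NIC."""
    nic = ipaddress.ip_interface(nic_cidr)
    return ipaddress.ip_address(dfe_ip) in nic.network


def send_udp(payload, nic_cidr, dfe_ip, port):
    """Send one UDP datagram to the simulated DFE with a standard socket."""
    if not same_network(nic_cidr, dfe_ip):
        raise ValueError("DFE address is not in the simulated NIC's network")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (dfe_ip, port))


if __name__ == "__main__":
    # Hypothetical addressing: tap0 = 172.16.50.10/24, DFE TOP port = 172.16.50.1
    print(same_network("172.16.50.10/24", "172.16.50.1"))
```

Nothing here is specific to the simulator. As far as the program is concerned, the DFE is just another host on the network.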

41 Metadata
Metadata: data about data
Links in the Manager are designed to stream data from one manager entity to another; metadata is essential for contextualizing the data
Most common examples:
  SOF, EOF, MOD: indicate how to interpret the data as a frame
  UDP/TCP Socket: tells which remote connection the data belongs to
[Figure: Remote 1, Remote 2 and Remote 3 on the Network connect to MaxTCP, which outputs Data with Socket = 3]

42 Framed Link Fields with Metadata
Ultimately, a link is just a collection of wires: data is indistinguishable from metadata from the Hardware's point of view
When the link connects to the destination block, it is viewed as one wide bus, in our example 83 wires
The individual fields are sliced out of this bus:
  First 64 wires are the data
  Next 14 wires are the socket number
  etc.
Fields: SOF (1 bit), EOF (1 bit), MOD (3 bits), Socket (14 bits), Data (64 bits)
The link width is 83 bits, which includes both data and metadata. That's 83 wires going into the manager block.
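
Slicing fields out of one wide bus can be sketched with integer shifts and masks. The positions of data (wires 0 to 63) and socket (the next 14 wires) follow the slide. Placing MOD, EOF and SOF above them in that order is an assumption, as are all the names.

```python
# Field layout from the least significant wire upwards.
# Assumption: MOD, EOF and SOF follow the socket number in this order.
FIELDS = [("data", 64), ("socket", 14), ("mod", 3), ("eof", 1), ("sof", 1)]
LINK_WIDTH = sum(width for _, width in FIELDS)


def pack(values):
    """Combine the named fields into one bus value."""
    bus, shift = 0, 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        bus |= value << shift
        shift += width
    return bus


def unpack(bus):
    """Slice the individual fields back out of the bus value."""
    values, shift = {}, 0
    for name, width in FIELDS:
        values[name] = (bus >> shift) & ((1 << width) - 1)
        shift += width
    return values


if __name__ == "__main__":
    word = {"data": 0x48656C6C6F20576F, "socket": 3, "mod": 0, "eof": 0, "sof": 1}
    print(LINK_WIDTH, unpack(pack(word)) == word)
```

The hardware sees only the 83-bit number. The field names exist only in the description of the link.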

43 Summary
Network streams can be handled by DFE kernels
Networking DFEs have specific requirements
There are challenges with flow control
Flushing kernels and non-blocking inputs can help
Networking DFE kernels care about latency
