Ethernet transport protocols for FPGA


Ethernet transport protocols for FPGA
Wojciech M. Zabołotny, Institute of Electronic Systems, Warsaw University of Technology
Previous version available at: https://indico.gsi.de/conferencedisplay.py?confid=3080
Joint CBM/Panda DAQ developments "kick-off" meeting, 19-20.02.2015

FPGA and Ethernet: is it a good combination? In 2011 an article titled "Designed to fail: Ethernet for FPGA-PC communication" was published on an FPGA-related blog (http://billauer.co.il/blog/2011/11/gigabit-ethernet-fpga-xilinx-data-acquisition/). In spite of this, multiple efficient Ethernet-based solutions for FPGA communication have been proposed...

Why Ethernet at all? There are many other protocols that can be used with gigabit transceivers, but Ethernet has some clear advantages:
- A standard computer (PC, embedded system, etc.) may be used as the remote end
- Long-distance connections are possible
- Relatively cheap infrastructure (e.g. network adapters, switches)

Where's the problem? Ethernet is inherently unreliable. It offers high throughput, but also high latency, especially if we consider the software-related latency on the computer side of the link. To ensure reliability we must introduce an acknowledge/retransmission scheme, buffering all unacknowledged data. To achieve high throughput we need to work with multiple packets in flight, which makes the management of acknowledgement packets relatively complex...

Standard solutions... The standard solution for reliable data transfer over a packet network is TCP/IP. Because it is designed to operate in public wide-area networks, it contains many features that are not needed in our scenario but seriously increase resource consumption. We don't need:
- Protection against session hijacking
- Solutions aimed at coexistence of different protocols in the same physical network

Alternative solution: simplified TCP/IP. Article: Gerry Bauer, Tomasz Bawej et al., "10 Gbps TCP/IP streams from the FPGA for High Energy Physics", http://iopscience.iop.org/1742-6596/513/1/012042. A seriously limited TCP/IP protocol. Unfortunately the sources of that solution are not open, so it was not possible to investigate it...

Can we get things simpler? The aim of the protocol is to ensure that data are transferred from the FPGA to the computer (PC, embedded system...) with the following requirements:
- Reliable transfer
- Minimal resource consumption in the FPGA
- Minimal CPU usage in the computer
- Possibly low latency

Typical use case? The proposed solution may be used in price-optimized data acquisition systems:
- Front End Boards (FEBs) based on small FPGAs
- Standard network cabling and switches used to concentrate data
- A standard embedded system may be used to concentrate the data and send it further via a standard network...
[Diagram: multiple FEBs connected through switches to embedded PCs, which feed a TCP/IP based processing grid; an independent channel is used for control]

Simplified use case. [Diagram: each FEB connected to its own NIC in the embedded system; an independent channel is used for control] If we have a separate NIC for each FEB, the task is simplified even further:
- Bandwidth is guaranteed for each connection (as long as the CPU power in the embedded system is sufficient)
- We mainly need to transfer data, although simple control commands are welcome...

To keep the FPGA resource consumption low:
- We need to minimize the acknowledgement latency
- We need to keep the protocol as simple as possible

Main problem: acknowledgement latency. If we need to reliably transfer data via an unreliable link (e.g. Ethernet), we have to use an acknowledge/retransmit scheme. Another approach could be based on forward error correction, but it is less reliable. If the transfer speed is R_transm and the maximum acknowledgement latency is t_ack, then the necessary memory buffer in the transmitter must be bigger than: S_buf = R_transm * t_ack. If we are going to use a small FPGA with small internal memory, we need to minimize the latency.
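As a quick illustration of this formula (a sketch only; the 10 Gb/s rate and the ~3 microsecond acknowledgement latency are the figures reported later in this presentation):

```python
def min_buffer_bytes(rate_bps: float, ack_latency_s: float) -> int:
    """Minimum transmitter buffer: the data sent while waiting for an ACK."""
    bits_in_flight = rate_bps * ack_latency_s
    return int(round(bits_in_flight / 8))

# 10 Gb/s link with ~3 microsecond acknowledgement latency
print(min_buffer_bytes(10e9, 3e-6))  # -> 3750 bytes
```

So with a fast acknowledgement the buffer fits easily in FPGA block RAM; with a multi-millisecond software latency it would grow to megabytes, beyond what a small FPGA can hold internally.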

Standard protocols and latency. To minimize latency, we must give up routing of packets: the first node should acknowledge reception of the packet. Routability, which is a big advantage of standard protocols, is therefore useless for us. Standard protocols are handled by a complex networking stack in the Linux kernel, which leads to relatively high acknowledgement latency, and routing increases the latency even more... [Diagram: FEB, router, receiver] The use of protocols designed for routing (e.g. IP) is therefore not justified!

Finally chosen solution:
- Our own Layer 3 protocol
- Our own protocol handler implemented as a Linux kernel driver
- A memory-mapped buffer to communicate with the data processing application
- An optimized Ethernet controller IP core in the FPGA, communicating directly with the Ethernet PHY

Alternative solution. Article: B. Mindur and Ł. Jachymczyk, "The Ethernet based protocol interface for compact data acquisition systems", http://iopscience.iop.org/1748-0221/7/10/t10004. A similar but more complex solution (multiple streams with different priorities); communication with user space goes via netlink. The sources are not available, so it was difficult to evaluate the solution...

Short description of the FADE-10g protocol. Packets sent from the PC to the FPGA: data acknowledgements and commands. Packets sent from the FPGA to the PC: data for DAQ, and command responses/acknowledgements (where possible, these are included in the data packets).

Structure of the packets: command packet (PC to FPGA)
- TGT (6 bytes), SRC (6 bytes)
- Protocol ID (0xfade) and protocol version (0x0100): 4 bytes
- Command code (2 bytes), command sequence number (2 bytes)
- Command argument (4 bytes)
- Padding (to 78 bytes)
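The layout above can be sketched with Python's struct module. This is an illustration only: the field order and sizes follow the slide, big-endian byte order is assumed, and the MAC addresses and command values below are made-up examples, not part of the protocol.

```python
import struct

PROTO_ID = 0xfade
PROTO_VER = 0x0100
CMD_PACKET_LEN = 78  # command packets are padded to 78 bytes

def build_cmd_packet(tgt_mac: bytes, src_mac: bytes,
                     cmd_code: int, cmd_seq: int, cmd_arg: int) -> bytes:
    # TGT(6) SRC(6) proto ID(2) proto ver(2) code(2) seq(2) arg(4)
    hdr = struct.pack("!6s6sHHHHI", tgt_mac, src_mac,
                      PROTO_ID, PROTO_VER, cmd_code, cmd_seq, cmd_arg)
    return hdr.ljust(CMD_PACKET_LEN, b"\x00")  # pad to 78 bytes

pkt = build_cmd_packet(b"\x01\x02\x03\x04\x05\x06",
                       b"\x0a\x0b\x0c\x0d\x0e\x0f",
                       cmd_code=0x0001, cmd_seq=1, cmd_arg=0)
print(len(pkt))          # -> 78
print(pkt[12:14].hex())  # -> fade (the EtherType field)
```

Note how the protocol ID 0xfade sits exactly where the EtherType field of an Ethernet frame goes, which is what lets the kernel dispatch these frames to a dedicated handler.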

Structure of the packets: ACK packet (PC to FPGA)
- TGT (6 bytes), SRC (6 bytes)
- Protocol ID (0xfade) and protocol version (0x0100): 4 bytes
- ACK code 0x0003 (2 bytes)
- Packet repetition number (2 bytes)
- Packet number in the data stream (4 bytes)
- Transmission delay (4 bytes)
- Padding (to 78 bytes)

Structure of the packets: command response (FPGA to PC)
- TGT (6 bytes), SRC (6 bytes)
- Protocol ID (0xfade) and protocol version (0x0100): 4 bytes
- Response ID (0xa55a): 2 bytes
- Filler (0x0000): 2 bytes
- Command response (12 bytes): command code (2 bytes), command sequence number (2 bytes), user-defined response (8 bytes)
- Padding to 78 bytes (may be shortened to 64 bytes)

Structure of the packets: data packet (FPGA to PC)
- TGT (6 bytes), SRC (6 bytes)
- Protocol ID (0xfade) and protocol version (0x0100): 4 bytes
- Data packet ID (0xa5a5): 2 bytes
- Embedded command response (12 bytes)
- Packet repetition number (2 bytes)
- Number of the packet in the data stream (4 bytes)
- Data (1024 8-byte words = 8 KiB)

Structure of the packets: last data packet (FPGA to PC)
- TGT (6 bytes), SRC (6 bytes)
- Protocol ID (0xfade) and protocol version (0x0100): 4 bytes
- Data packet ID (0xa5a6): 2 bytes
- Embedded command response (12 bytes)
- Packet repetition number (2 bytes)
- Number of the packet in the data stream (4 bytes)
- Data (1024 8-byte words = 8 KiB), where the last word encodes the number of used data words (always less than 1024)
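The data packet headers from the last two slides can be parsed as sketched below. This is an illustrative reconstruction from the slides: network byte order is assumed throughout, including for the word count in the last word, so the exact encoding should be verified against the published sources.

```python
import struct

DATA_WORDS = 1024                    # 1024 8-byte words = 8 KiB payload
HDR_FMT = "!6s6sHHH12sHI"            # TGT SRC protoID ver pktID cmdresp rep num
HDR_LEN = struct.calcsize(HDR_FMT)   # 36 bytes

def parse_data_packet(frame: bytes):
    (tgt, src, proto, ver, pkt_id,
     cmd_resp, rep_num, stream_num) = struct.unpack_from(HDR_FMT, frame)
    assert proto == 0xfade
    data = frame[HDR_LEN:HDR_LEN + DATA_WORDS * 8]
    last = (pkt_id == 0xa5a6)        # 0xa5a5 = data packet, 0xa5a6 = last one
    if last:
        # in the last packet, the final word encodes the number of used words
        used = struct.unpack("!Q", data[-8:])[0]
        data = data[:used * 8]
    return stream_num, rep_num, last, data

# example: a "last data packet" carrying 3 used 8-byte words
payload = b"\x11" * 24 + b"\x00" * (8192 - 32) + struct.pack("!Q", 3)
frame = struct.pack(HDR_FMT, b"\x00" * 6, b"\x00" * 6, 0xfade, 0x0100,
                    0xa5a6, b"\x00" * 12, 0, 7) + payload
num, rep, last, data = parse_data_packet(frame)
print(num, last, len(data))  # -> 7 True 24
```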

Meaning of repetition numbers. To minimize latency, the protocol tries to detect lost packets as soon as possible. We assume that packets are delivered, and ACKs are transmitted, in order. If the FPGA receives an ACK not for the last unacknowledged packet but for one of the following packets, it immediately knows that the previous one(s) were lost and must be retransmitted. However, if the loss of the next packet is also detected, it is necessary to retransmit only those unconfirmed packets which have not been retransmitted yet. With multiple packets in flight and multiple retransmissions, things get complicated. The repetition numbers are a simple way to avoid unnecessary retransmissions of already retransmitted packets: we retransmit only the packets whose repetition number is lower than the one in the received ACK packet.
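The decision rule can be summarized in a toy sketch. This models only the rule as stated on the slide, not the actual FPGA bookkeeping; the exact comparison and the way repetition counters are updated should be checked against the published sources.

```python
def packets_to_retransmit(in_flight, acked_num, ack_rep):
    """in_flight maps packet number -> repetition number of each
    unacknowledged packet. An out-of-order ACK for packet `acked_num`
    (carrying repetition number `ack_rep`) implies that earlier
    unacknowledged packets were lost; per the slide's rule, we resend
    only those whose repetition number is lower than the ACK's."""
    return sorted(n for n, rep in in_flight.items()
                  if n < acked_num and rep < ack_rep)

# packets 10 and 11 still unacknowledged with repetition number 0;
# an ACK arrives for a retransmitted copy of packet 12 (repetition 1):
print(packets_to_retransmit({10: 0, 11: 0}, acked_num=12, ack_rep=1))
# -> [10, 11]
```

Packets that were already resent with a repetition number equal to or above the ACK's are skipped, which is exactly the "avoid retransmitting what was already retransmitted" behavior the slide describes.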

Short description of the FPGA core. [Block diagram: a data input interface (dta, dta_we, dta_ready) feeds the Descriptor Manager and the Packet Buffers Memory; the Packet Sender and the Packet Receiver, with an acknowledge-and-commands FIFO between them, connect to the Ethernet PHY and the Ethernet link]

Short description of the protocol handler:
- The packet should be serviced and acknowledged as soon as possible
- We want to use the standard network device driver (to assure wide hardware compatibility)
- The right method to install our handler is dev_add_pack, so our packet handler is called as soon as a packet is received
- A dedicated kernel module servicing packets with Ethernet type 0xfade has been implemented

Transfer of data to the user space. To avoid the overhead associated with copying data to user space, the driver uses a kernel buffer mapped into user space. Data are copied directly from the sk_buff structure to that buffer, using the skb_copy_bits function. Synchronization of pointers between the kernel driver and the user-space application is assured via an ioctl command.

Synchronization of the application with the data flow:
- The application may declare the amount of data that should be available when the application is woken up
- The application uses the poll function to wait for data (or a timeout)
- Control commands sleep until the command is executed and confirmed; it is therefore suggested to split the application into threads

Suggested organization of an application. [Flowchart:]
- Open /dev/l3_fpgax
- Prepare the buffer, set the wakeup threshold
- Connect to the FPGA, reset the FPGA
- Split the application into control and DAQ threads
- Control thread: send the START command, then other necessary commands, and finally the STOP command
- DAQ thread: wait for data using poll, check for data availability or the EOD marker, process data; repeat until EOD
- Join the threads, disconnect from the FPGA, close /dev/l3_fpgax

Synthesis results. For the KC705 with 32 packet buffers: LUT usage 3.04%, BRAM usage 16.5%. For the KC705 with 16 packet buffers: LUT usage 3.02%, BRAM usage 9.32%. Internal logic works at 156.25 MHz.

System used for testing:
- Intel(R) Core(TM) i5-4440 CPU @ 3.10 GHz (however, during the tests it operated at 800 MHz to 1 GHz)
- Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection
- Intel Corporation Ethernet Server Adapter X520-2

Results:
- Throughput: 9.80 Gb/s
- CPU load: 29%-42% (with each word checked) on the single core handling the reception
- Acknowledgement latency: ~3 µs
Comparison with TCP/IP (iperf, PC to PC):
- Throughput: 9.84 Gb/s (TCP/IP used longer frames for MTU=9000; packets captured by Wireshark reached 26910 bytes!)
- CPU load: 42% (reception only!)
- Acknowledgement latency: ~8 µs

Conclusion... Advantages of the solution:
- Simple, open-source solution that may be easily adjusted and extended (BSD/GPL license)
- Small resource consumption in the FPGA, low CPU load in the receiving computer
- A standard NIC may be used at the receiving side
Disadvantages of the solution:
- Small developer base
- Not yet fully tested and mature

Availability of sources. Current sources are available at: http://opencores.org/project,fade_ether_protocol (please use the experimental_jumbo_frames_version). Projects are prepared for the KC705 (10 Gb/s), the AFCK (10 Gb/s, multiple links) and the Atlys (1 Gb/s).

Thank you for your attention!