Networking with Xilinx Embedded Processors
Goals 2: Identify the major components in a processor-based networking system and how they interact. Understand how to match hardware and software network components to your project requirements, from low cost to high performance. Learn how to create a viable networking system in minutes using Xilinx tools.
Agenda 3: Processor-driven network essentials (a little theory to start off). Hardware (describe key components and look for bottlenecks). Demo 1: Base System Builder and Web Server. Software (stacks, raw API, sockets). Performance (hardware and software influence). Demo 2: Performance in 3 parts.
Network Essentials Topics 5 Network Nodes OSI Network Model Design Considerations Overhead
Basic Networking System 6 (diagram): an embedded system speaks TCP/IP over Ethernet (802.3, 802.11), connecting through routers to the Internet.
PowerPC Network Node 7 (block diagram): a PowerPC processor inside the FPGA, with IOCM/DOCM on-chip memories, connects via its ISPLB/DSPLB buses to the PLB, which hosts an external DDR controller (16Mx16 DDR), BRAM, and a PLB-OPB bridge. The OPB (with arbiter) hosts the Ethernet MAC (wired to an external Ethernet PHY), timer, interrupt controller, and UART (RS-232). A DCM derives the 75 MHz and 66 MHz bus clocks from the 100 MHz input clock; JTAG ports and a reset switch complete the board.
Ethernet in the OSI Model 8: Software layers (IETF) — 7. Application, 6. Presentation, 5. Session (application layers: TELNET, HTTP, SMTP); 4. Transport (UDP/TCP); 3. Network (IP, ARP, routing). Hardware layers (IEEE 802) — 2. Data Link: LLC (multiplexing/demultiplexing) and the Ethernet MAC; 1. Physical: the Ethernet PHY.
General Design Considerations 9: Processor cycles — general rule of thumb: 1 MHz of CPU per 1 Mbit/s of traffic; out-of-the-box solutions may achieve only 20% of this rate. FPGA hardware acceleration — co-processing frees the CPU to execute control code. Memory bandwidth — match the data bus width to your memory width; multiple masters on the same bus require arbitration. Interrupt processing — too many interrupts create livelock.
General Design Considerations 10: Physical layer — 10/100/1000 Mbps, full/half duplex; line rate, distance, and cost determine the transmission media. Data link layer — MII/GMII/RGMII/SGMII interface to the PHY; an interrupt line is required for high performance. Software layers — use only the hardware layers for maximum performance; sockets simplify coding but add overhead; eliminating per-byte overhead gives the biggest performance boost.
TCP/IP Stack Overheads 11: Per-byte overhead — data buffer copies, checksum calculation. Per-packet overhead — interrupt overhead, buffer management, protocol processing; here there is more opportunity for hardware to affect system performance. Per-connection overhead.
Hardware Topics 13 OSI Hardware Layers Performance Bottlenecks Direct Memory Access / Data Realignment Engine Checksum offload Memory bandwidth
Ethernet Interfaces 14: Ethernet has a set of medium-independent interfaces that separate the Physical and Data Link layers; the interface sits between the PHY and the MAC. MII: 10/100 Mbps, 4-bit data bus. GMII: 1 Gbps or tri-mode*, 8-bit. RGMII: tri-mode, 4-bit DDR. SGMII: tri-mode, serial. 1000 Base-X: 1 Gbps, serial. XGMII: 10 Gbps, 32-bit. (*Tri-mode switches to the MII protocol for 10/100 Mbps.)
Ethernet MAC Responsibilities 15: Transmission — package the Ethernet frame and communicate with the Physical layer over the correct interface; handle flow control (pause frames) and collisions; generate and append the FCS on the Ethernet frame; handle the timing of the inter-frame gap and back-off. Reception — receive and extract the Ethernet frame; check the destination address and ignore frames not addressed to this device (the MAC can also be set to accept all frames: promiscuous mode); check the FCS and protocol for errors.
PowerPC Network Node 16 (block diagram): the same system as before, with a DMA + DRE engine added to the Ethernet MAC and a BRAM memory controller added on the PLB. BRAM = Block RAM (memory blocks within the FPGA).
Simple DMA Engine 17: Transfers data between memory locations without processor involvement; data blocks are contiguous. Diagram: a single buffer descriptor (Start: 0x1000, Len: 0x4000, Dest: 0x100000) tells the engine to move the block at 0x1000-0x4FFF to 0x100000-0x104FFF, bypassing the cache.
Scatter/Gather DMA Engine 18: Transfers data between memory locations without processor involvement, handling multiple data blocks. Diagram: three buffer descriptors gather a scattered TCP header (Start: 0x1000, Len: 0x2000, Dest: 0x100000), data (Start: 0x5000, Len: 0x2000, Dest: 0x102000), and CRC (Start: 0x12000, Len: 0x1000, Dest: 0x104000) into one contiguous region at 0x100000-0x104FFF.
Data Realignment Engine 19: DMA engines often require data to start on particular boundaries (e.g., 64- or 128-byte aligned), but TCP data can start on any byte boundary. If no DRE is available, the CPU must align the data by performing a buffer copy, which eliminates the advantage gained from DMA.
PowerPC Network Node 20 (block diagram): the same system, now with checksum offload (CSO) added to the Ethernet MAC alongside the DMA + DRE engine.
Checksum Offload Implementation 21: The TCP checksum is an integer computed by one's-complement summation over the packet's 16-bit words, used to detect errors incurred during packet transmission. The processor incurs a severe penalty for the checksum computation, but the functionality can easily be offloaded into FPGA fabric: a simple state machine on the TX path (between the LocalLink data stream, TX FIFO, and GMAC) computes the checksum and inserts it into the outgoing frame. The software stack needs only a minor corresponding change, and many TCP/IP stacks have pre-built support for checksum offload.
PowerPC Network Node 23 (block diagram): the external DDR controller is replaced by a multi-port memory controller (MPMC2), giving the processor bus and the Ethernet DMA engine independent ports (Port 1, Port 2) into the DDR.
Xilinx Platform Studio (XPS) Web Server Demo 25: Build an embedded web server. Build the hardware and BSP in XPS / Base System Builder; the hardware design uses the basic PowerPC networking system (processor: 300 MHz, bus: 100 MHz, link speed: 100 Mbps). The software design uses lwIP with the socket API. Demonstrate the application (host connected over JTAG, Ethernet, and RS-232).
How Sockets Work (TCP) 26: Client calls: socket, bind, connect, write/read, close. Server calls: socket, bind, listen, accept, write/read, close. Step 1 — Request connection: client 214.226.8.24:5280 contacts server 228.209.0.115:80.
How Sockets Work (TCP) 27: Step 2 — Establish connection: the server-side endpoint for the accepted connection is shown as 228.209.0.115:7250.
How Sockets Work (TCP) 28: Step 3 — Read/write: data is exchanged over the established connection.
How Sockets Work (TCP) 29: Step 4 — Inform the application of connection events.
Software Topics 31 TCP/IP Stacks Stack APIs Software Performance
Light-weight Internet Protocol (lwIP) 32: Developed in the open-source community: http://savannah.nongnu.org/projects/lwip. Directly supported by Libgen. Features: compact code size relative to RTOS TCP/IP stacks — 90KB (raw mode). Requirements: sockets require the Xilinx MicroKernel (Xilkernel) because the stack is multi-threaded; a HW timer must be available.
TCP/IP Transport Protocols 33: Transmission Control Protocol (TCP) — connection-oriented, endpoint-to-endpoint reliable delivery; example applications: FTP, HTTP, NFS. User Datagram Protocol (UDP) — connectionless, delivery is not guaranteed; example applications: SNMP, TFTP, BOOTP.
Xilinx-compatible Commercial TCP/IP Stacks 34 (for MicroBlaze and PowerPC) — Operating System / Vendor: VxWorks — Wind River; Linux — LynuxWorks, MontaVista, Wind River; μClinux — LynuxWorks, Petalogix; Nucleus Plus — Mentor/ATI; ThreadX — Express Logic; μC/OS-II — Micrium; OSE — ENEA; Integrity — Green Hills; Neutrino — QNX; eCos — Mind; Treck (standalone stack).
Socket Interface 35: An API originally developed at Berkeley for the BSD Unix operating system. Provides an abstraction for programmers to establish the parameters governing network transmissions, application portability, and programmer portability — at the cost of a buffer copy from user to kernel space. The lwIP implementation allows sequential programs to utilize the stack, but requires a scheduler to manage execution contexts.
Standard Socket Operation 36: Application pseudocode: create_socket(); listen(); accept(); while (read(buffer)) { process_data(); } close(); Received data flows PHY → MAC → FIFO → lwIP stack → user data buffer, with a scheduler and timer interrupt driving the multi-threaded stack.
Raw API 37: The lowest-level interface to the stack, with the least overhead — it can bypass the buffer copy required by sockets. Callback mode allows lwIP to alert the application to events, while the application program retains control; a single-threaded application does not require a scheduler. It is more complex to implement, and a hardware timer is still required for lwIP.
Raw Mode Operation 38: Application pseudocode: while (1) { if (condition == true) Retrieve_Data(); } The lwIP ISR fires on the MAC interrupt and invokes the registered callback — callback { condition = true; return; } — and the application loop then pulls the data from the FIFO itself.
Checksum Offload Software Hooks 39: Currently supported by the OPB Ethernet Media Access Controller and the PLB Tri-mode Ethernet Media Access Controller. Only available in Raw API mode; activate software support in Platform Settings.
CSO Software Platform Parameters 40 Tx/Rx protocols can be selected for CSO separately
Summary of Performance Issues 42: Zero-copy API and checksum offload; high memory bandwidth (multi-port memory controller); keeping the CPU from touching the payload (DMA engine with realignment).
Zero Copy API 43: A special socket API implementation (the Xilinx lwIP target is later this year). The stack allocates buffer space for the application, and the application accesses the stack's buffers via pointers, eliminating buffer copies.
Per-Byte vs. Per-Packet Overheads 44: Both checksum offload and elimination of the buffer copy must be done to gain the maximum benefit — most time is spent in copy and checksum, so checksum offload only or zero copy only yields a partial gain, while checksum offload plus zero copy yields the full benefit. (Measurements: http://www.cs.duke.edu/ari/publications/tcpgig.pdf)
Jumbo Frames 45: Extend Ethernet packets up to 9000 bytes. Why 9000 bytes? The 32-bit CRC is not effective beyond 12000 bytes, and 9000 bytes is large enough for an 8KB application datagram plus headers (e.g., NFS). Jumbo support is BRAM-intensive — the EMAC must allocate sufficient buffer space — and requires a larger FPGA to implement (e.g., the ML405 board).
Performance Demo 47: Demonstrate increasing levels of TCP/IP performance: (1) basic XPS/BSB design (the web server platform); (2) add S/G DMA + DRE with CSO (Raw API mode); (3) gigabit, where available (ML405 required). Download the FPGA bitstream and start the server, start the client on the host, and stats print in HyperTerminal. (iperf server on the board, iperf client on the host; JTAG, Ethernet, and RS-232 connections.)
Web Server Platform 48 (block diagram): the base PowerPC network node of slide 7 — Ethernet MAC on the OPB, no DMA or checksum offload.
100 Mbps Performance Platform 49 (block diagram): the base system plus S/G DMA + DRE and checksum offload (CSO) on the Ethernet MAC.
1000 Mbps Performance Platform 50 (block diagram): the DMA + CSO system with the external DDR controller replaced by a multi-port memory controller (MPMC2), giving the processor and the Ethernet DMA independent memory ports.
Key Takeaways 51 Start simple, build from there XPS BSB makes creating a basic design easy Tailor your hardware to the performance you require Each hardware device consumes FPGA resources Match your software stack to the hardware For high performance, the stack must provide both a zero-copy API and checksum offload capabilities FPGA implementations defer design decisions Hardware flexibility allows you to customize your system after real data is available
What's Next? 52: Contact your FAE. Get the Xilinx tools — ISE WebPACK can be downloaded free, and EDK is often bundled with Avnet development boards during Avnet Speedway promotions. Get a development board and create your own network design, or attend an Avnet Speedway workshop in your area.
Thanks for coming! Any questions?
Supplementary Material
Request Resources 55: Create a socket (returns a socket descriptor): sd = socket(pf, type, protocol). pf: protocol family (Internet). type: connection type — datagram (UDP), stream (TCP), or raw (custom). protocol: specific protocol (NULL). Both client (214.226.8.24) and server (228.209.0.115) request a TCP socket descriptor.
Create a Connection Endpoint 56: Tie the socket to a local address: retcode = bind(sd, localaddr, addrlen). sd: socket descriptor. localaddr: struct pointer holding the IP address and port. addrlen: structure size. Client: 214.226.8.24:5280; server: 228.209.0.115:6100.
Server: Activate the Socket 57: The server must listen for incoming connections: retcode = listen(sd, qlength). sd: socket descriptor. qlength: queue length, for simultaneous requests. (Client: 214.226.8.24:5280; server: 228.209.0.115:6100.)
Client: Request a Virtual Circuit 58: Establish a virtual circuit: retcode = connect(sd, destaddr, addrlen). sd: socket descriptor. destaddr: struct pointer holding the server's IP address and port. addrlen: structure size. Client 214.226.8.24:5280 sends SYN to server 228.209.0.115:6100.
Server: Complete a Virtual Circuit 59: The server accepts the incoming connection: new_sd = accept(sd, addr, addrlen). sd: socket descriptor. addr: struct pointer filled in with the peer's IP address and port. addrlen: structure size. The server replies SYN, ACK; the client answers ACK, and the applications are connected (server connection endpoint 228.209.0.115:7349).
Send Data 60 Transmit Data over a virtual circuit retcode = write(sd, buffer, buflen); Socket descriptor Buffer length Local data buffer IP: 228.209.0.115:7349 IP: 214.226.8.24 :5280 IP: 228.209.0.115:6100 Data / ACK
Receive Data 61 Receive data over a virtual circuit retcode = read(sd, buffer, buflen); Socket descriptor Buffer length Local data buffer IP: 228.209.0.115:7349 IP: 214.226.8.24 :5280 IP: 228.209.0.115:6100 Data / ACK
Client: Terminate the Virtual Circuit 62: Shut down the circuit gracefully: retcode = close(sd). sd: socket descriptor. The client sends FIN; the server responds FIN, ACK and informs its application.
Server: Terminate the Virtual Circuit 63: Shut down the circuit gracefully: retcode = close(sd). sd: socket descriptor. The server's close completes the exchange with a final ACK.
Thanks again for coming! Enjoy the rest of the day!
MACs Provide Frame Format 65: Based on user input, the MAC formats standard, VLAN, jumbo, and pause frames.
Transmission Overhead 66: Data interpretation is layer specific. Each layer prepends its own header: the transport layer (TCP) adds a 24-byte header to the data, and the network layer (IP) adds another 24-byte header to the result.
Ethernet Frames 67: When using protocol stacks, minimize the overhead-to-data ratio by maximizing the frame data; eliminate 48 bytes of overhead by using a custom application to drive the lower layers. Frame layout: Preamble (8 bytes), Dest Addr (6 bytes), Src Addr (6 bytes), Frame Type (2 bytes), Frame Data (46-1500 bytes), CRC (4 bytes). 48 bytes of protocol headers + 26 bytes of Ethernet framing = 74 bytes of overhead.
Tx Performance With Checksum Offload 68 (bar chart, Gigabit System Reference Design v2 (GSRD2); Zero Copy API, 1MB TCP window, Treck TCP/IP stack with jumbo frames): checksum in SW vs checksum in HW — 1500-byte packets: 158 → 355 Mbps; 9000-byte packets: 491 → 785 Mbps; the chart labels the speedups 1.8X and 2.2X.