Hardware Acceleration for Measurements in 100 Gb/s Networks

Similar documents
Real-Time and Resilient Intrusion Detection: A Flow-Based Approach

Detecting Anomalies in Netflow Record Time Series by Using a Kernel Function

Fault-Tolerant Storage Servers for the Databases of Redundant Web Servers in a Computing Grid

Change Detection System for the Maintenance of Automated Testing

Tacked Link List - An Improved Linked List for Advance Resource Reservation

A Practical Evaluation Method of Network Traffic Load for Capacity Planning

Multimedia CTI Services for Telecommunication Systems

BoxPlot++ Zeina Azmeh, Fady Hamoui, Marianne Huchard. To cite this version: HAL Id: lirmm

Traffic Grooming in Bidirectional WDM Ring Networks

FIT IoT-LAB: The Largest IoT Open Experimental Testbed

Experimental Evaluation of an IEC Station Bus Communication Reliability

The New Territory of Lightweight Security in a Cloud Computing Environment

Service Reconfiguration in the DANAH Assistive System

Setup of epiphytic assistance systems with SEPIA

Quality of Service Enhancement by Using an Integer Bloom Filter Based Data Deduplication Mechanism in the Cloud Storage Environment

Simulations of VANET Scenarios with OPNET and SUMO

Mokka, main guidelines and future

Very Tight Coupling between LTE and WiFi: a Practical Analysis

A Resource Discovery Algorithm in Mobile Grid Computing based on IP-paging Scheme

Relabeling nodes according to the structure of the graph

CTF3 BPM acquisition system

Linux: Understanding Process-Level Power Consumption

How to simulate a volume-controlled flooding with mathematical morphology operators?

Robust IP and UDP-lite header recovery for packetized multimedia transmission

Scalewelis: a Scalable Query-based Faceted Search System on Top of SPARQL Endpoints

Blind Browsing on Hand-Held Devices: Touching the Web... to Understand it Better

SIM-Mee - Mobilizing your social network

Towards Bandwidth Estimation Using Flow-Level Measurements

Malware models for network and service management

Every 3-connected, essentially 11-connected line graph is hamiltonian

QuickRanking: Fast Algorithm For Sorting And Ranking Data

Taking Benefit from the User Density in Large Cities for Delivering SMS

Comparator: A Tool for Quantifying Behavioural Compatibility

An FCA Framework for Knowledge Discovery in SPARQL Query Answers

DANCer: Dynamic Attributed Network with Community Structure Generator

UsiXML Extension for Awareness Support

HySCaS: Hybrid Stereoscopic Calibration Software

Study on Feebly Open Set with Respect to an Ideal Topological Spaces

Assisted Policy Management for SPARQL Endpoints Access Control

Comparison of spatial indexes

Future of DDoS Attacks Mitigation in Software Defined Networks

YAM++ : A multi-strategy based approach for Ontology matching task

Zigbee Wireless Sensor Network Nodes Deployment Strategy for Digital Agricultural Data Acquisition

A Generic Architecture of CCSDS Low Density Parity Check Decoder for Near-Earth Applications

Open Digital Forms. Hiep Le, Thomas Rebele, Fabian Suchanek. HAL Id: hal

Search-Based Testing for Embedded Telecom Software with Complex Input Structures

YANG-Based Configuration Modeling - The SecSIP IPS Case Study

Generic Design Space Exploration for Reconfigurable Architectures

Managing Risks at Runtime in VoIP Networks and Services

KeyGlasses : Semi-transparent keys to optimize text input on virtual keyboard

Linked data from your pocket: The Android RDFContentProvider

IMPLEMENTATION OF MOTION ESTIMATION BASED ON HETEROGENEOUS PARALLEL COMPUTING SYSTEM WITH OPENC

BugMaps-Granger: A Tool for Causality Analysis between Source Code Metrics and Bugs

Teaching Encapsulation and Modularity in Object-Oriented Languages with Access Graphs

MARTE based design approach for targeting Reconfigurable Architectures

A Methodology for Improving Software Design Lifecycle in Embedded Control Systems

An Experimental Assessment of the 2D Visibility Complex

Structuring the First Steps of Requirements Elicitation

Kernel perfect and critical kernel imperfect digraphs structure

A 64-Kbytes ITTAGE indirect branch predictor

A Voronoi-Based Hybrid Meshing Method

Extended interface ID for virtual link selection in GeoNetworking to IPv6 Adaptation Sub-layer (GN6ASL)

Computing and maximizing the exact reliability of wireless backhaul networks

Stream Ciphers: A Practical Solution for Efficient Homomorphic-Ciphertext Compression

Deformetrica: a software for statistical analysis of anatomical shapes

Application of RMAN Backup Technology in the Agricultural Products Wholesale Market System

Design Methodology of Configurable High Performance Packet Parser for FPGA

Scan chain encryption in Test Standards

X-Kaapi C programming interface

Real-Time Collision Detection for Dynamic Virtual Environments

Development of an ATCA IPMI controller mezzanine board to be used in the ATCA developments for the ATLAS Liquid Argon upgrade

From Microsoft Word 2003 to Microsoft Word 2007: Design Heuristics, Design Flaws and Lessons Learnt

ASAP.V2 and ASAP.V3: Sequential optimization of an Algorithm Selector and a Scheduler

Moveability and Collision Analysis for Fully-Parallel Manipulators

Modularity for Java and How OSGi Can Help

Application of Artificial Neural Network to Predict Static Loads on an Aircraft Rib

Regularization parameter estimation for non-negative hyperspectral image deconvolution:supplementary material

Modelling and simulation of a SFN based PLC network

Efficient implementation of interval matrix multiplication

Framework for Hierarchical and Distributed Smart Grid Management

Catalogue of architectural patterns characterized by constraint components, Version 1.0

A Measurement-Based Model of Energy Consumption for PLC Modems

The Proportional Colouring Problem: Optimizing Buffers in Radio Mesh Networks

Growing Neural Gas A Parallel Approach

MUTE: A Peer-to-Peer Web-based Real-time Collaborative Editor

Reverse-engineering of UML 2.0 Sequence Diagrams from Execution Traces

Using a Medical Thesaurus to Predict Query Difficulty

Natural Language Based User Interface for On-Demand Service Composition

Technical Overview of F-Interop

An Efficient Numerical Inverse Scattering Algorithm for Generalized Zakharov-Shabat Equations with Two Potential Functions

The SANTE Tool: Value Analysis, Program Slicing and Test Generation for C Program Debugging

LaHC at CLEF 2015 SBS Lab

THE COVERING OF ANCHORED RECTANGLES UP TO FIVE POINTS

NP versus PSPACE. Frank Vega. To cite this version: HAL Id: hal

SDLS: a Matlab package for solving conic least-squares problems

A N-dimensional Stochastic Control Algorithm for Electricity Asset Management on PC cluster and Blue Gene Supercomputer

QAKiS: an Open Domain QA System based on Relational Patterns

Hierarchical Multi-Views Software Architecture

Design and Evaluation of Tile Selection Algorithms for Tiled HTTP Adaptive Streaming

Comparison of radiosity and ray-tracing methods for coupled rooms

Transcription:

Hardware Acceleration for Measurements in 100 Gb/s Networks Viktor Puš To cite this version: Viktor Puš. Hardware Acceleration for Measurements in 100 Gb/s Networks. Ramin Sadre; Jiří Novotný; Pavel Čeleda; Martin Waldburger; Burkhard Stiller. 6th International Conference on Autonomous Infrastructure (AIMS), Jun 2012, Luxembourg, Luxembourg. Springer, Lecture Notes in Computer Science, LNCS-7279, pp.46-49, 2012, Dependable Networks and Services.. HAL Id: hal-01529784 https://hal.inria.fr/hal-01529784 Submitted on 31 May 2017 HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d enseignement et de recherche français ou étrangers, des laboratoires publics ou privés. Distributed under a Creative Commons Attribution 4.0 International License

Hardware Acceleration for Measurements in 100 Gb/s Networks Viktor Puš CESNET, association of legal entities, Zikova 4, 160 00 Prague, Czech Republic pus@cesnet.cz Abstract. The paper presents the NetCOPE experimental platform for evaluation of 100 Gb/s networks. The platform consists of acceleration boards with FPGAs, FPGA firmware and control software. While the NetCOPE for 10 Gb/s networks is a mature platform, multiplying the throughput ten times brings major challenges in the platform design. Keywords: FPGA, 100 Gb/s, hardware acceleration, Ethernet, PCI Express 1 Introduction With the growing amount of Internet traffic, it is inevitable that current 10 Gb/s lines, often used in data centers and backbones, will be eventually replaced by faster technologies. Standard for 100 Gb/s Ethernet was proposed as IEEE standard 802.3ba in 2008 and ratified in June 2010. While it is still waiting for wider deployment, preparing and evaluating the tools and algorithms for this technology is a major research topic. While theoretical works in this area certainly have significant impact, we argue that experimental research will also bring many valuable results. We aim to create and evaluate general measurement platform for 100 Gb/s networks, which can not be examined by means of formal description and analysis. Our work falls into the category of experimental computer science [4]. We expect the first application to employ a limited set of operations, but the platform allows to create wide selection of complex applications. The NetCOPE platform is a team work of several researchers, author s contribution is the overall design of the platform firmware and the design of several firmware modules, such as the packet header parser HFE-M and the packet filter with TCAM. The following chapters introduce the hardware, firmware and software design of the experimental platform together with several selected challenges in the 100 Gb/s application design. 2 Platform Design Our experimental platform employs Virtex-6 FPGA [2] to receive, process and transmit the 100 Gb/s traffic. FPGA is placed on the COMBO family acceleration boards [1]. Daughter card communicates with the mother card via several

high-speed channels and is equipped with the 100 Gb/s CXP connector. The mother card is connected to the host PC via the PCI-Express Gen2 x8 slot. The Linux operating system running on the host PC is supplemented by the device drivers to support configuration accesses as well as high-speed DMA transfers. Structure of the platform is shown in Fig. 1. Host PC User space Kernel space Tools Config SZE Libraries Drivers PCAP PCI-Express Acceleration card Mother card SRAM DRAM Virtex-6 Daughter card CXP 100 Gb/s Ethernet Fig. 1. Structure of the experimental platform. This arrangement requires smart partitioning of the platform functions between firmware and software. Fig. 2 shows the basic arrangement of firmware modules in the FPGA. Mother card is responsible for packet reception, basic processing (compute frame checksum, check MTU, etc.) and eventually packet transmission. Valid packets are passed to the following modules, where additional processing, such as packet parsing and filtering is done. Filtered packets are then sent to the host RAM by the DMA controller via the PCI Express interface. Additional modules can be connected as plug-ins into the system. The software part of the platform can perform various functions such as routing, traffic storage, monitoring, etc. 3 Challenges While the platform design is straightforward and logical, there are several design considerations that must be engineered carefully. This chapter presents some of the most interesting research topics. 3.1 On-chip Communication The100Gb/sEthernetiscapableofpackettransmissionattherateof150million packets per second (6.6 ns/packet) for the shortest 64 B packets. Due to the minimal 20 B inter-frame gap, more data is effectively transmitted with long packets. The example of data bus capable of transmitting the full 100 Gb/s data stream is 512bits wide and runs at 196MHz.

PCI-Express Mother card Virtex-6 PCI-Express bridge DMA controller Header parser Packet filter Input buffer Output buffer Daughter card CXP 100 Gb/s Ethernet Fig. 2. Firmware structure. It is important to use the bus bandwidth effectively. For example, if the first byte of the packet is always aligned to the lowest byte of the data bus, then 65B packet occupies two clock cycles on the bus, limiting its throughput to 50.78%. However, if the packet is allowed to start at eight different positions of the data word (every eight bytes), then the second data word of 65B packet can already carry 56 bytes of the next packet and only seven bytes are unused. Protocol effectivity is 90.27% in the worst case. 3.2 Frame Checksum The Ethernet requires each frame to be followed by the 32-bit CRC frame checksum (FCS) to detect transmission errors. FCS is checked when frame is received and appended when frame is sent. FCS computing is complicated by the fact that the packet may start at different positions of the data word and can also end at any position in the data word. Adding leading or trailing zeros is not acceptable, as it would affect the FCS value. We resolve the unaligned frame starts by using carefully selected CRC initial vector. Our solution for the unaligned frame ends is similar to [5]: unaligned last word of the frame is computed in the tree hierarchy of smaller CRC modules. 3.3 Header Parsing High-speed header parsing is another interesting research topic. Key issue is that the protocol fields appear at various offsets in the packet. This is due to shim protocols (VLAN, MPLS) and variable header lengths (IPv4/IPv6 option headers). Our Modular Header Field Extractor (HFE-M) employs pipelined architecture to determine protocol headers offset. Each protocol parsing module has the protocol start offset as the input and the next protocol type and start offset as

the output. These messages are passed among protocol parsing modules which are connected in the expected order of protocol headers. Once the types and offsets of present protocols are known, extracting the fields of interest from the packet is relatively simply done by multiplexers. 3.4 Packet Filtering The PCI-Express bus supported by the card has the theoretical throughput 32 Gb/s. Therefore most applications require hardware preprocessing of incoming packets. One suitable operation may be packet shortening, which is useful for applications processing only packet headers. Other possibility is to filter out uninteresting traffic and pass only the selected packets into the host RAM. Our first version of packet filtering engine uses the TCAM to classify packets. Our implementation of TCAM in FPGA is inspired by [3]. It uses Look-Up Tables to match the input to the stored value. 6-input LUT matches 6 bits in 1 cycle or 10 bits in two different TCAM rows in two cycles if one bit is used to select the row. For example, TCAM storing 128 IPv6 addresses occupies 2816 LUTs if new matching must be done in each clock cycle, or 1664 LUTs if the matching may take two clock cycles. 4 Conclusion Our experimental work is targeted at post 10 Gb/s network traffic processing. We aim to create universal extensible platform for experiments with high-speed network algorithms and technologies. The platform is currently being intensively developed, using the experience from our previous 1 Gb/s and 10 Gb/s experimental platforms. Acknowledgments This work is supported by the CESNET Large Infrastructure project no. LM2010005 funded by the Ministry of Education, Youth and Sports of the Czech Republic. References 1. COMBO-LXT. CESNET, http://www.liberouter.org/card combolxt.php. 2. Xilinx Virtex 6 FPGA Family. Xilinx, Inc., http://www.xilinx.com/products/silicon-devices/fpga/virtex-6/index.htm. 3. Jean-Louis Brelet. An overview of multiple cam designs in virtex family devices. Xilinx Applixation Note 201, Xilinx Inc., 1999. 4. National Research Council Committee on Academic Careers for Experimental Computer Scientists. Academic Careers for Experimental Computer Scientists and Engineers. The National Academies Press, 1994. 5. M. Walma. Pipelined cyclic redundancy check (crc) calculation. In Computer Communications and Networks, 2007. ICCCN 2007. Proceedings of 16th International Conference on, pages 365 370, aug. 2007.