Optimizing latency in Xilinx FPGA Implementations of the GBT. Jean-Pierre CACHEMICHE


Optimizing latency in Xilinx FPGA Implementations of the GBT
Steffen MUSCHTER (Stockholm University), Christian BOHM (Stockholm University), Sophie BARON (CERN), Jean-Pierre CACHEMICHE (CPPM), Csaba SOOS (CERN)

Overview
- Introduction
- GBT general information (reminder)
- GBT implementation in an FPGA
- GBT test design
- GBT optimization
- Test results
- Outlook

Introduction
Motivation:
- The GBT protocol is designed for wide use across the entire LHC.
- The focus of the Starterkit was on resource utilization.
- Until now, no studies of GBT latency in FPGAs have been available, yet latency is one of the important parameters, especially for use in trigger systems.
What we have done:
- We analyzed the GBT protocol in terms of latency.
- We optimized the GBT protocol for low latency with different setups.
- We tested the applied changes extensively on a Virtex 6 and started tests on a Stratix II GX FPGA.

GBT - general information (reminder)*
[Figure: block diagram of the GBT link between the counting room and the detector area. GBT and FPGA nodes are connected through optical transceivers (downlink and uplink), each carrying timing and trigger, DAQ, and slow-control data.]
*Frederic Marin, TWEPP 2009

GBT - general information (reminder)*
SLHC frame: 120 bits @ 40 MHz = 4.8 Gbps

Field   Bits   Content
H       4      Header
SC      4      Slow control: GBT control 2 bits (80 Mb/s), slow control 2 bits (80 Mb/s)
D       80     Data (3.2 Gbps)
FEC     32     Forward Error Correction

- User field: 82 bits = 3.28 Gbps
- Line efficiency: 68%
*Frederic Marin, TWEPP 2009
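The frame figures above follow from simple arithmetic, sketched here in Python. The dictionary and variable names are illustrative only. The 82-bit user field is read as the 80 data bits plus the 2 slow-control bits that are not reserved for GBT control.

```python
# Field widths of the 120-bit frame, in bits, as listed on the slide.
FRAME_FIELDS = {"header": 4, "slow_control": 4, "data": 80, "fec": 32}
FRAME_RATE_HZ = 40_000_000  # one frame per 40 MHz clock period

frame_bits = sum(FRAME_FIELDS.values())

# Assumed reading of "user field": data bits plus the 2 user slow-control bits.
user_bits = FRAME_FIELDS["data"] + 2

line_rate_bps = frame_bits * FRAME_RATE_HZ
user_rate_bps = user_bits * FRAME_RATE_HZ
efficiency = user_bits / frame_bits

print(f"frame size:      {frame_bits} bits")
print(f"line rate:       {line_rate_bps / 1e9:.2f} Gb/s")
print(f"user bandwidth:  {user_rate_bps / 1e9:.2f} Gb/s")
print(f"line efficiency: {efficiency:.1%}")
```

Each of the two 2-bit slow-control groups carries 2 x 40 MHz = 80 Mb/s, which matches the per-field rates quoted above.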

GBT implementation in an FPGA
- Data is transferred along a chain of distinct components.
- Each component needs a certain number of clock cycles to operate properly.
- Finding these numbers is essential for reducing the latency.

GBT test system - overview
The Virtex 6 test system uses:
- the complete GBT protocol
- one pseudo-random number generator
- two spy memories
- one MicroBlaze for controlling and monitoring

GBT test system
- ML605 evaluation board from Xilinx with a Virtex 6 LX240T, speed grade -1 (for implementing the test design)
- ML507 evaluation board from Xilinx (as clock source for the gigabit transceiver tx-pll)

GBT test system - hardware setup
The Virtex 6 test system also uses:
- a gigabit transceiver wired to SMA connectors
- two independent clock sources

GBT test system - clock setup
[Figure: clock distribution. Labels: 125 MHz clock from ML507, tx-pll, MMCM (1)*, internal logic, 125 MHz clock from ML605, rx-pll, 250 MHz recovered clock, MMCM (2), GBT-rx.]
*MMCM: Mixed-Mode Clock Manager

GBT test system - performance
- The transmission rate used for testing is 5 Gb/s due to clock availability (this corresponds to the maximum rate with speed grade -1).
- A high-quality 125 MHz clock is used to feed the tx-pll.
  - Jitter: 108 ps (p-p)*
- Output signal at 5 Gb/s with jitter 60 ps (p-p)*
*measured with a LeCroy SDA 813Zi
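The effect of running at 5 Gb/s instead of the nominal 4.8 Gb/s can be worked out from the frame and word widths. The Python sketch below assumes the 40-bit word clock and the 120-bit frame clock scale linearly with the line rate. The function name is illustrative only.

```python
FRAME_BITS = 120  # GBT frame width
WORD_BITS = 40    # word width on the serializer side of the mux/demux

def derived_clocks_mhz(line_rate_mbps):
    """Return (word clock, frame clock) in MHz for a line rate in Mb/s."""
    return line_rate_mbps / WORD_BITS, line_rate_mbps / FRAME_BITS

for rate_mbps in (4800, 5000):
    word_mhz, frame_mhz = derived_clocks_mhz(rate_mbps)
    print(f"{rate_mbps / 1000:.1f} Gb/s: word clock {word_mhz:.2f} MHz, "
          f"frame clock {frame_mhz:.2f} MHz")

# Ratio between the 5 Gb/s line rate and the 125 MHz reference clock.
print("line rate / reference clock:", 5000 // 125)
```

At 4.8 Gb/s this gives 120 MHz and 40 MHz. At the 5 Gb/s test rate the same domains run at 125 MHz and about 41.67 MHz, so the word clock matches the available 125 MHz reference.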

GBT optimization
- The first test in a Virtex 6 showed a latency of 21 clock cycles.
- An evaluation was necessary to find which components contribute most.
- The latency was estimated by looking at the event statements and counters (an exact latency was not needed).

GBT optimization
Results of the estimation:
- The transmitter chain needs min. 9 clock cycles.
- The receiver chain needs min. 8 clock cycles.
- Total latency: min. 17 clock cycles without serializer and deserializer (GTX).
- The transceiver adds min. 1 clock cycle, depending on the instantiation and the type of FPGA.
- The estimate does not exactly match the experimental results (due to a non-optimized GTX configuration).
- The mux (120 to 40 bits) and demux (40 to 120 bits) were identified as the largest sources of latency.
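The estimate can be checked against the measured 21 clock cycles with a few lines of Python. This is only a bookkeeping sketch: the dictionary keys are illustrative, and the cycle counts are the minimum values quoted above.

```python
# Minimum cycle counts from the estimation (protocol logic only).
estimated_chain_cycles = {"transmitter chain": 9, "receiver chain": 8}
GTX_MIN_CYCLES = 1    # minimum contribution quoted for the transceiver
MEASURED_CYCLES = 21  # latency of the first Virtex 6 test

protocol_cycles = sum(estimated_chain_cycles.values())
estimate_with_gtx = protocol_cycles + GTX_MIN_CYCLES
unexplained = MEASURED_CYCLES - estimate_with_gtx

print(f"protocol estimate:  {protocol_cycles} cycles")
print(f"with GTX (minimum): {estimate_with_gtx} cycles")
print(f"gap to measurement: {unexplained} cycles")
```

The remaining gap of a few cycles is consistent with the remark that the GTX configuration was not yet optimized.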

GBT optimization
Mux 120 to 40 bits and demux 40 to 120 bits:
- implemented using dual-port RAM modules
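As a purely functional illustration of the 120-to-40-bit mux and the 40-to-120-bit demux, the Python sketch below splits a frame into three words and reassembles it. It does not model the dual-port RAM or any timing. The function names are hypothetical, and sending the most significant word first is an assumption, not something the slides specify.

```python
WORD_BITS = 40
WORDS_PER_FRAME = 3
FRAME_BITS = WORD_BITS * WORDS_PER_FRAME
WORD_MASK = (1 << WORD_BITS) - 1

def mux_120_to_40(frame):
    """Split a 120-bit frame into three 40-bit words (MSB word first, assumed)."""
    if not 0 <= frame < (1 << FRAME_BITS):
        raise ValueError("frame must fit in 120 bits")
    return [(frame >> (WORD_BITS * i)) & WORD_MASK
            for i in reversed(range(WORDS_PER_FRAME))]

def demux_40_to_120(words):
    """Reassemble three 40-bit words into one 120-bit frame."""
    if len(words) != WORDS_PER_FRAME:
        raise ValueError("expected exactly three words")
    frame = 0
    for word in words:
        frame = (frame << WORD_BITS) | (word & WORD_MASK)
    return frame

example_frame = (0x5 << 116) | 0xABCDEF
words = mux_120_to_40(example_frame)
assert demux_40_to_120(words) == example_frame
print([hex(w) for w in words])
```

In hardware, the three words leave at the faster word clock while a whole frame arrives at the slower frame clock, which is why this stage needs buffering and contributes to the latency.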

GBT optimization
- Mux 120 to 40 bits and demux 40 to 120 bits: counters optimized; latency is now 2 clock cycles for each module.
- Pattern search: data is not directly processed; clocking could be bypassed.
- Decoder: does not use a clock, like the encoder.
- Scrambler and descrambler: code was optimized, which allowed us to remove one event statement.

GBT optimization
Resulting latency approximation:
- Transmitter chain: 3 clock cycles
- Receiver chain: 4 clock cycles

GBT optimization
The utilization in percent refers to a design using a V6LX240T with 301440 Slice Registers and 150720 Slice LUTs.

Not optimized GBT-tx
Links             1           4            8            12
Slice Registers   260 (0%)    1016 (0%)    2024 (0%)    3032 (1%)
Slice LUTs        350 (0%)    1363 (0%)    2460 (1%)    3622 (2%)

Optimized GBT-tx
Links             1           4            8            12
Slice Registers   93 (0%)     345 (0%)     681 (0%)     1017 (1%)
Slice LUTs        337 (0%)    1363 (0%)    2460 (1%)    3571 (2%)

GBT optimization

Not optimized GBT-rx
Links             1           4            8             12
Slice Registers   414 (0%)    1650 (0%)    3298 (1%)     4946 (1%)
Slice LUTs        2147 (1%)   8409 (5%)    15149 (10%)   24000 (15%)

Optimized GBT-rx
Links             1           4            8             12
Slice Registers   326 (0%)    1298 (0%)    2594 (0%)     3890 (1%)
Slice LUTs        2366 (1%)   9097 (6%)    17747 (11%)   26249 (17%)

The optimization uses far fewer registers but slightly more LUTs.
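To put numbers on the register and LUT trade-off, the single-link figures from the GBT-tx and GBT-rx utilization tables can be compared directly. The Python sketch below only restates those table entries. The helper name is illustrative.

```python
# (not optimized, optimized) resource counts for one link, from the tables.
ONE_LINK = {
    "GBT-tx": {"Slice Registers": (260, 93), "Slice LUTs": (350, 337)},
    "GBT-rx": {"Slice Registers": (414, 326), "Slice LUTs": (2147, 2366)},
}

def percent_change(before, after):
    """Relative change in percent; negative values mean fewer resources."""
    return 100.0 * (after - before) / before

for block, resources in ONE_LINK.items():
    for name, (before, after) in resources.items():
        print(f"{block} {name}: {before} -> {after} "
              f"({percent_change(before, after):+.1f}%)")
```

For one link, the receiver trades roughly a fifth of its registers for about a tenth more LUTs, while the transmitter saves registers without a LUT penalty.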

GBT optimization
Results so far:
- Latency of the test design: 8 clock cycles including the GTX (protocol: 7 clock cycles).
- The design runs at 5 Gb/s instead of 4.8 Gb/s due to the clocking of the ML605 development board.
- The optimized code does not increase the utilization when instantiated.
- Not a single error during two days of operation => bit error rate better than 10^-15.
Next step:
- Try to bring the test design latency below 8 clock cycles => only possible by removing the dual-port RAM modules.
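The quoted bit error rate limit can be related to the test duration with a short calculation. The Python sketch below assumes that every bit of a single 5 Gb/s link, in one direction, was checked for the full two days. The 95% bound uses the standard zero-error estimate -ln(1 - CL) / N.

```python
import math

LINE_RATE_BPS = 5e9            # test line rate
DURATION_S = 2 * 24 * 3600     # two days of error-free operation

bits_checked = LINE_RATE_BPS * DURATION_S

# With zero observed errors, 1/N gives the order of magnitude of the limit.
naive_limit = 1.0 / bits_checked

# Upper limit at 95% confidence: solve (1 - p)^N = 0.05 for p, small-p form.
limit_95 = -math.log(1.0 - 0.95) / bits_checked

print(f"bits checked:    {bits_checked:.3g}")
print(f"1/N limit:       {naive_limit:.2g}")
print(f"95% upper limit: {limit_95:.2g}")
```

Under these assumptions about 8.6 x 10^14 bits were transferred, so the quoted limit is best read as an order-of-magnitude statement. A strict 95% confidence bound from this run alone would be a few times 10^-15.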

GBT optimization
Using registers instead of dual-port RAM => more utilization but less latency

GBT optimization
- The use of registers depends on the performance of the FPGA (e.g. Virtex 6 LX240T, speed grade -1, fmax 300 MHz).
- Utilization should not increase because of the previously made optimizations.

GBT optimization
Utilization studies for the optimization using registers between the 120 MHz and 40 MHz clock domains

Not optimized GBT-tx
Links             1           4            8            12
Slice Registers   260 (0%)    1016 (0%)    2024 (0%)    3032 (1%)
Slice LUTs        350 (0%)    1363 (0%)    2460 (1%)    3622 (2%)

Optimized GBT-tx
Links             1           4            8            12
Slice Registers   246 (0%)    974 (0%)     1946 (0%)    2918 (0%)
Slice LUTs        381 (0%)    1505 (0%)    2771 (1%)    3859 (2%)

GBT optimization

Not optimized GBT-rx
Links             1           4            8             12
Slice Registers   414 (0%)    1650 (0%)    3298 (1%)     4946 (1%)
Slice LUTs        2147 (1%)   8409 (5%)    15149 (10%)   24000 (15%)

Optimized GBT-rx
Links             1           4            8             12
Slice Registers   519 (0%)    2070 (0%)    4138 (1%)     6194 (2%)
Slice LUTs        1849 (1%)   7396 (4%)    14949 (9%)    22413 (14%)

The optimization was successfully tested at 5 Gb/s for 2 days without a single error => bit error rate better than 10^-15.

Results
Final results:
- The startup test design latency was 21 cycles including the GTX.
- The optimized test design with RAM modules had a latency of 8 cycles (7 cycles for the protocol alone).
- The optimized test design with registers had a latency of 6 cycles (5 cycles for the protocol alone).
- The optimized code does not increase the utilization when instantiated.
- All optimizations were tested carefully with a Virtex 6.
- Not a single error during two days of operation with either optimization => bit error rate better than 10^-15 for both optimizations.

Outlook
- The optimization will be available from the GBT-FPGA team soon.
- Three different instantiations are planned:
  - Optimized code with 7 clock cycles
    - using dual-port memories
    - using only registers (if one needs to spare resources)
  - Optimized code using only registers with 5 clock cycles
- Tests of the optimization on Stratix FPGAs and Virtex 5:
  - The first tests have already started and seem promising.
- The optimized code will be ready for distribution by the end of the year.
  - It can be accessed from the GBT-FPGA website: https://espace.cern.ch/gbt-project/gbt-fpga/default.aspx

Thank you

GBT test system - clock setup (2)
Add jitter to the internal logic to test the design for stability.
[Figure: clock distribution with loopback. Labels: 125 MHz clock from ML507, tx-pll, GPIO (SMA), user clock (SMA), MMCM (1)*, internal logic, 125 MHz clock from ML605, rx-pll.]
*MMCM: Mixed-Mode Clock Manager

GBT test system - performance (2)
- The clock for the internal logic has increased jitter due to the loopback through normal GPIOs.
  - Jitter: 387 ps (p-p)*
- Both clock setups were tested with the high-quality clock and with this clock signal to test the stability of the protocol.
*measured with a LeCroy SDA 813Zi

GBT optimization
[Figure: RTL schematic of the scrambler before and after optimization.]