IBM Network Processor, Development Environment and LHCb Software

LHCb Readout Unit Internal Review, July 24th 2001
Niko Neufeld, CERN

Outline
- IBM NP4GS3 architecture
- A Readout Unit based on the NP4GS3
- Dataflow in the NP-based RU
- Sub-event building algorithms
- Software development environment
- Performance of sub-event building
- Calibration of simulation results with measurements on reference kit hardware

IBM NP4GS3
- 4 full-duplex Gigabit Ethernet MACs
- 16 processors / 2 hardware threads @ 133 MHz: 2128 MIPS, to handle up to 4.1 x 10^6 packets/s
- 128 kB on-chip input buffer, up to 128 MB DDR RAM output buffer
- 2 x switch interface (DASL) @ 4 Gb/s
- Embedded PPC 405
- In production since the beginning of this year (R1.1)
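The headline figures above are mutually consistent; a small sketch (illustrative arithmetic only, using just the numbers quoted on the slide):

```python
# Aggregate instruction rate of the NP4GS3 picoprocessor complex:
# 16 processors, each clocked at 133 MHz (assuming one instruction
# per processor per cycle, which reproduces the quoted 2128 MIPS).
processors = 16
clock_mhz = 133
mips = processors * clock_mhz  # -> 2128

# Instruction budget per packet at the quoted 4.1e6 packets/s:
budget = mips * 1e6 / 4.1e6
print(f"{mips} MIPS -> ~{budget:.0f} instructions per packet")
```

Roughly 500 instructions per packet is the budget the sub-event building code has to live within at full packet rate.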

Readout Unit based on NP4GS3
- 1 or 2 mezzanine cards, each containing:
  - 1 network processor and all memory needed for the NP
  - connections to the external world: PCI bus, DASL (switch bus)
  - connections to the physical network layer, JTAG, power and clock
- PHY connectors
- L0 trigger-throttle output
- Power and clock generation
- LHCb standard ECS interface (CC-PC) with separate Ethernet connection
(Board block diagram shown on the slide.)

Data flow in the NP4GS3
[Diagram: frames enter and leave via the DASL switch interfaces; frame data can be accessed on the way in (ingress event building) and on the way out (egress event building). A wrap path, for a single NP only, connects ingress directly to egress.]

Sub-event merging software
Sub-event building is the main task of the software running on the NP4GS3 when it is used in a Readout Unit. The two locations where frames can be manipulated (ingress and egress) suggest two different algorithms, with different advantages and possible fields of application:
- Ingress event building for high frequencies, up to 1.5 MHz, and small frames (~64 bytes)
- Egress event building for large frames (up to 9000 bytes), or fragments spanning multiple frames
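The NP code itself is picocode assembler; as a language-neutral illustration of the merging task, here is a hypothetical Python sketch (the structures and header layout are invented for illustration, not the real LHCb transport format):

```python
# Illustrative N:1 sub-event building: fragments belonging to the same
# event, arriving from several input sources, are concatenated behind a
# single transport header. Hypothetical format: 4-byte event number
# followed by a 2-byte source count.
def build_subevent(fragments):
    """fragments: list of (event_number, payload bytes), one per source."""
    event_numbers = {evt for evt, _ in fragments}
    assert len(event_numbers) == 1, "fragments must belong to one event"
    evt = event_numbers.pop()
    header = evt.to_bytes(4, "big") + len(fragments).to_bytes(2, "big")
    return header + b"".join(payload for _, payload in fragments)

sub = build_subevent([(42, b"aaaa"), (42, b"bb"), (42, b"cccccc")])
```

On the NP the same concatenation is done in place on the frame buffers, which is why the memory layout (linear on ingress, chunked on egress) shapes the two algorithms.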

Ingress Event Building
- On-chip memory (no wait cycles!)
- Memory buffers organized via descriptors
- Only 128 kB of memory
- Multiple frames are more difficult, because they have to be sent in order over the DASL
- Code is more streamlined, because copying is done on linear memory

Egress Event Building
- 64 MB of external buffer space (accessed over a 128-bit-wide bus); two copies of 64 MB each, for increased throughput
- Only contested resource is the memory bus
- Multiple frames are handled easily
- For larger fragments only part of the data (ideally 1/8) needs to be read
- Memory is external (wait cycles!)
- Buffer data structure is awkward (chunks of 2 x 58 bytes)
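A sketch of why the egress buffer layout is awkward: frame data lives in "twins", pairs of 58-byte chunks, so copying is not a linear memory operation. The chunk size is from the slide; the pairing/linkage scheme below is a simplification for illustration:

```python
# Map a fragment payload onto the egress data store's twin-chunk layout:
# chunks of 58 bytes, grouped in pairs ("2 x 58 bytes"). The last twin
# may be only partially filled.
CHUNK = 58

def to_twins(payload: bytes):
    chunks = [payload[i:i + CHUNK] for i in range(0, len(payload), CHUNK)]
    # pair consecutive chunks into twins
    return [tuple(chunks[i:i + 2]) for i in range(0, len(chunks), 2)]

twins = to_twins(bytes(300))  # a 300-byte fragment
# 300 bytes -> 6 chunks of <= 58 bytes -> 3 twins
```

Contrast this with the ingress store, where copies run over linear on-chip memory and the code stays streamlined.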

Software Development Environment
- Development software consists of: assembler, debugger, simulator and profiler (the debugger can either run on the simulator or connect to real hardware via an Ethernet-to-JTAG interface (RISC Watch) attached to the NP4GS3)
- Rich set of documentation
- Many examples, such as complete routing software, are available

User interface of simulator and remote debugger
[Screenshot: thread control; code view with breakpoints and single stepping; coprocessor registers; view of memory buffers. The same interface is used for simulator and real-hardware debugging, for either processor version.]

Performance for 4:1 Egress Event-Building
[Plot: maximum allowed fragment rate (kHz) versus average fragment size (bytes), showing the limit imposed by the NP processing power, the limit imposed by the single Gb output link, and the range of possible Level-1 trigger rates.]
Only 16 out of 32 threads were simulated, hence the NP-power curve is highly pessimistic; it should improve by at least 50% with 32 threads. At any frame size the performance of the RU is limited only by the output link speed of 125 MB/s.
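The wire-speed curve on that plot follows directly from the link speed; a sketch of the arithmetic (simplified: header overhead is ignored):

```python
# For N:1 event building, each input fragment of a given average size
# contributes once to the merged sub-event on the single output link,
# so the link bandwidth caps the fragment rate.
LINK_BYTES_PER_S = 125e6  # single Gigabit Ethernet output, ~125 MB/s

def max_fragment_rate_khz(avg_fragment_size, sources=4):
    """Highest sustainable fragment rate (kHz) given the output link."""
    subevent_bytes = sources * avg_fragment_size
    return LINK_BYTES_PER_S / subevent_bytes / 1e3

rate = max_fragment_rate_khz(500)  # 4 x 500-byte fragments -> 62.5 kHz
```

For small fragments (e.g. 4 x 100 bytes, ~312 kHz) the cap sits well above the possible Level-1 trigger rates, which is the point of the slide.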

Performance for 3:1 Ingress Event-Building
[Plot: maximum acceptable L0 trigger rate (MHz) versus average input fragment payload (bytes), showing the limit imposed by the NP performance, the limit imposed by the single Gb output link, and the maximum Level-0 trigger rate.]
Only 16 out of 32 threads were simulated. Overall performance is limited by the single output link speed only; the NP performance is sufficient for any fragment size.

Detailed Profiler Information
[Plots: cycles per dispatch and stall cycles for thread 10, broken down by cause (GPR, bus, instruction, data, coprocessor, execution), and coprocessor stalls by unit (CLP, DS, TSE0, TSE1, CAB, ENQ, CHK, STR, POL, CNTR, SEM).]
Details of stall ("enforced NOP") cycles: each stall is assigned to the coprocessor which controls access to the resource in question (and which is causing the stall in that case).

Reference Kit Hardware
Our reference kit has 4 x 2-port Gigabit Ethernet I/O cards.

Test Set-up
Tigon2 NIC features:
- Gigabit Ethernet network interface card (optical/copper)
- Fully programmable
- Baseline for final-stage event building ("smart NICs")
- Up to 620 kHz fragment rate
- 1 µs resolution timer
Set-up: 4 x Tigon2 NIC 1000SX connected via 4 x 1000SX full-duplex ports to the IBM NP4GS3 R1.1 reference kit; RISC Watch (JTAG via Ethernet) attached to the PPC 405.
Measurement procedure:
- Download code into the NP4GS3 via RISC Watch (JTAG)
- Send a special frame to the NP4GS3 to trigger a synchronization frame being sent to the Tigons
- The Tigons start sending fragments, adding their internal timestamp to each frame (1 µs resolution)
- The NP4GS3 processes the frames and adds its internal timestamp (1 ms resolution)
- The Tigons receive the frames and calculate the elapsed time
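The bookkeeping behind the procedure above can be sketched as follows (the timestamps are invented; only the method, elapsed time over fragment count for a pipelined run, is from the slide):

```python
# Per-fragment cost from Tigon timestamps: the NICs stamp each fragment
# on send and again on return (1 us resolution). Because the system is
# pipelined, the meaningful figure is total elapsed time divided by the
# number of fragments, not any individual round trip.
def per_fragment_us(send_ts_us, recv_ts_us):
    assert len(send_ts_us) == len(recv_ts_us)
    elapsed = recv_ts_us[-1] - send_ts_us[0]
    return elapsed / len(send_ts_us)

# hypothetical run: 1000 fragments sent back-to-back at 1 us spacing,
# replies arriving 4500 us after the first send
cost = per_fragment_us(list(range(0, 1000)), list(range(4500, 5500)))
```

This is also why times below the Tigon's own intrinsic overhead cannot be resolved by the set-up, as noted on the results slide.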

Calibrating the simulation by measurements
- Event building with a single thread enabled and 4 sources
- Event building with a single source and multiple threads (16) enabled
With the release 1.1 version of the NP4GS3 we unfortunately cannot easily run multiple sources on multiple threads (no semaphore coprocessor).

Results from Measurements (Ingress Event-Building)
Round-trip time (Tigon-out to Tigon-in): 1.7 µs per fragment (frame of 60 bytes). Simulation says that 600 ns/fragment are used in the NP4GS3; this is not really measurable with our set-up.

                        Measurement [µs/fragment]   Simulation [µs/fragment]
1 source,  1 thread     6.6 (4.9)                   4.9
4 sources, 1 thread     4.5 (2.8)                   3.2
1 source,  16 threads   1.7 (0.0)                   0.5

NOTE: since the system is pipelined, times smaller than the intrinsic overhead of the Tigon (i.e. 1.7 µs) cannot be measured accurately! We can trust the timing results obtained with the simulator.
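The numbers in parentheses are the raw measurements with the Tigon's intrinsic ~1.7 µs round-trip overhead subtracted, which is what should be compared against the simulation. A quick consistency check of that subtraction:

```python
# Corrected measurement = raw measurement - Tigon overhead (1.7 us).
# Scenario labels and numbers are taken from the slide.
TIGON_OVERHEAD_US = 1.7

raw_measured_us = {
    "1 source, 1 thread":   6.6,
    "4 sources, 1 thread":  4.5,
    "1 source, 16 threads": 1.7,
}
corrected = {k: round(v - TIGON_OVERHEAD_US, 1)
             for k, v in raw_measured_us.items()}
# corrected -> 4.9, 2.8 and 0.0 us/fragment respectively
```

The corrected single-source, single-thread figure (4.9 µs) matches the simulation exactly; the 16-thread case falls below the measurable floor, consistent with the note above.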

Conclusion
The software development environment makes the full power of this complex chip available to the software developer. Two sub-event merging codes, one for large fragments at rates of a few 100 kHz and one for small fragments at rates of 1 MHz, have been developed and benchmarked using simulation. The results for the more demanding of the two (ingress) have been verified using the IBM NP4GS3 reference kit hardware. The simulation results show that the performance requirements on a readout unit are fully met by an NP4GS3-based implementation.

Future Work
- With the upgrade of the NP4GS3 reference kit to version 2.0 of the processor, verify (again) the code for egress and ingress event building
- With more Tigon NICs and faster fragment-generation code, test also e.g. 7-to-1 multiplexing
- Develop and test a layer-2 switching application (much simpler than event building due to static routing tables)
- Develop code for the communication between the Linux-operated embedded PPC 405 and the CC-PC

NP4GS3 Architecture
[Block diagram: 128 kB on-chip buffer; up to 64 MB of external memory on a 128-bit-wide bus; 4 Gb/s switch interface; accessible from PCI.]

EPC Block-Diagram
[Block diagram of the IBM NP4GS3 Embedded Processor Complex (EPC).]

Performance for 4:1 Egress Event-Building
[Plot: fragment rate (MHz, normalized to 16 threads) and output bandwidth (MB/s) versus number of active threads, for average payloads of 100, 200 and 500 bytes, with the maximum wire speed at the output marked.]
Throughput as a function of the number of active threads and the payload. The throughput is normalized to the one achieved with 16 threads for comparability. Contention for resources such as locks and memory prevents linear scaling. Limit of Gigabit Ethernet: 125 MB/s.

Performance for 4:1 Ingress Event-Building
[Plot: fragment rate (MHz, normalized to 16 threads) and output bandwidth (MB/s) versus number of active threads, for average payloads of 1, 30 and 100 bytes, with the maximum wire speed at the output marked.]
Fragment rate and output bandwidth as a function of active threads for various amounts of payload (a 36-byte transport header is always included). At 64 bytes the rate is 1.9 MHz; for all practical working points the performance is beyond wire speed.