Improving Packet Processing Performance of a Memory-Bounded Application


Improving Packet Processing Performance of a Memory-Bounded Application. Jörn Schumacher, CERN / University of Paderborn, Germany, jorn.schumacher@cern.ch. On behalf of the ATLAS FELIX Developer Team

The LHC ring (27 km circumference) and its four main experiments: ALICE, ATLAS, CMS, LHCb. 2

ATLAS: general-purpose particle detector designed to observe phenomena involving high-energy particle collisions. Channels: 100 × 10^6; weight: 7000 t; collaborators: >3000. Raw data produced: ~60 TB/s; data recorded to disk: 1-2 GB/s (after filtering). Data collection, selection, processing, monitoring, etc. require tens of thousands of distributed applications, processing in quasi-realtime. 3

In this talk: (1) integration of a new distributed event-based system in the ATLAS experiment; (2) experience in the analysis and optimization of packet-processing software. 4

ATLAS DAQ today (timeline 2015-2025 spanning Run 2 to Run 4, marker at Today): 40 MHz front-ends (FE) in the detector cavern connect via detector-specific optical links to ReadOut Drivers (RODs, custom electronic components) in the service cavern. At 100 kHz and ~200 GB/s, ~1800 common optical links feed the ReadOut System (ROS, the data buffer; ~100 servers). An Ethernet network of COTS PCs connects to the datacenter with the ~1500-server High-Level Trigger Farm. 5

Custom electronics: challenging maintenance and operation. Advantages in using COTS components (PC technology). 6

Processing in Software: "In an event-based mode of interaction, components communicate by generating and receiving event notifications [...]. An event notification service [...] mediates between the components of an event-based system (EBS) and conveys notifications from producers [...] to consumers [...]" (Still missing today?) "[...] The notification service decouples the components so that producers are unaware of any consumers and consumers rely only on the information published, but not on where or by whom it is published. [...] The event-based style carries the potential for easy integration of autonomous, heterogeneous components into complex systems that are easy to evolve and scale." 7

An Event Distribution System for ATLAS. Requirements: simple operation and maintenance (PC technology over custom-designed electronics); scalability (switched networks over point-to-point links); interface to radiation-hard detector links; heterogeneous workloads. FELIX: the FrontEnd LInk eXchange. 8

ATLAS DAQ (timeline marker at 2018): 40 MHz front-ends (FE) in the detector cavern connect partly to new systems based on a common link technology and partly to the remaining RODs (custom electronic components). In the service cavern, a 40 GbE / InfiniBand HPC network feeds software Readout hosts alongside the existing ROS; an Ethernet network of COTS PCs connects to the datacenter with the ~1500-server High-Level Trigger Farm. 9

ATLAS DAQ (timeline marker at 2023): 40 MHz front-ends connect over ~10,000 common optical links to ~200 systems. In the service cavern, an HPC network carries less than 10 TB/s to software Readout hosts; an HPC/Ethernet network of COTS PCs connects to the datacenter with the High-Level Trigger Farm. 10

Layered layout: front-ends (FE) in the detector cavern; the central data distribution layer and Readout hosts in the service cavern, connected by an HPC network; and, roughly 100 m above, the surface datacenter reached over another HPC network. 11

FELIX: detector-to-network data routing. Scalable architecture; routing of multiple traffic types (physics events, detector control, configuration, calibration, monitoring); industry-standard links, so data processors/handlers can be software on PCs (less custom electronics, more COTS components); reconfigurable data path, multicast, cloning, QoS; automatic failover and load balancing. 12

CERN GBT Link Technology (GigaBit Transceiver): point-to-point link technology developed at CERN that progressively replaces detector-specific links in ATLAS. Designed for typical High-Energy Physics environments (high radiation, magnetic fields, etc.). Typical raw bandwidth 4.8 Gbps or 9.6 Gbps, with optional Forward Error Correction. Supports variable-width virtual links (e-links) for mixed traffic types. 13

Development Platform: HiTech Global PCIe development board (Xilinx Virtex-7, PCIe Gen-2/3 x8, 24 bi-directional links, with custom TTCfx FMC; http://hitechglobal.com/boards/pcie-cxp.htm); SuperMicro X10DRG-Q (2x Haswell CPU, up to 10 cores each, 6x PCIe Gen-3 slots, 64 GB DDR4 memory; http://supermicro.com/products/motherboard/xeon/c600/x10drg-q.cfm); Mellanox ConnectX-3 VPI (FDR/QDR InfiniBand, 2x 10/40 GbE; http://www.mellanox.com/page/products_dyn?product_family=119&mtag=connectx_3_vpi). 14

FELIX Architectural Overview: an FPGA card (with TTC FMC) terminates 24-48 GBT optical links, routes e-links, and buffers data in large buffers per group of e-links; it transfers data to host memory over PCIe Gen-3 DMA (64 Gb/s) with MSI-X interrupts, managed by a custom device driver (DMA, IRQ, configuration). The FELIX application processes the data on the CPU and sends it out through a NIC with 2-4 40-Gb/s ports (64 Gb/s over PCIe Gen-3) onto the network. 15

CPU data processing pipeline: read blocks from file descriptors; block decode; statistics; extract meta information; route (load balancing, error handling); send to network. 16

The same pipeline, annotated: DMA transfers are programmed and fixed-size blocks, encoded for transfer over PCIe, are read from file descriptors; block decode reconstructs variable-sized chunks for transmission over the network; all e-link streams (elink 0, elink 1, elink 2, ...) share the same pipeline, as sketched below. 17
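
To make the stage structure concrete, here is a rough, hypothetical sketch of the per-block loop (the file-descriptor read, the 1 KiB block size, and the stage stubs are assumptions for illustration; the real FELIX pipeline reads DMA'd blocks through the custom driver and performs the decoding described on the following slides):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <unistd.h>   // read()

constexpr std::size_t kBlockSize = 1024;   // assumed fixed block size

struct Chunk { const uint8_t* data; std::size_t len; std::uint16_t elink; };

// Stubs for the individual pipeline stages.
std::size_t decode_block(const uint8_t*, Chunk*, std::size_t) { return 0; }
int  route(const Chunk&)              { return 0; }   // load balancing, error handling
void send_to_network(int, const Chunk&) {}

// Per-block processing loop: read a fixed-size block from the driver's file
// descriptor, decode it into chunks, then route and send each chunk.
void process_stream(int fd) {
    std::array<uint8_t, kBlockSize> block;
    std::array<Chunk, 256> chunks;
    while (read(fd, block.data(), block.size()) ==
           static_cast<ssize_t>(block.size())) {
        std::size_t n = decode_block(block.data(), chunks.data(), chunks.size());
        for (std::size_t i = 0; i < n; ++i) {
            send_to_network(route(chunks[i]), chunks[i]);
        }
    }
}
```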

Fixed Block Decoding: a stream of fixed-size blocks (each with a header, H) transmitted over PCIe is decoded into a stream of reconstructed chunks with meta information for further processing; packets are then analyzed, routed, and transmitted to network destinations. 18

Optimizations (single-threaded). Target: 0.125 µs per block to process PCIe Gen-3 x8 data at full rate. Isolated benchmarks of the packet processing code with in-memory input data. Measurements done on an Intel Core i7-3770 CPU, 4 cores @ 3.40 GHz. 19
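
The 0.125 µs figure is the per-block time budget at full PCIe Gen-3 x8 rate; a rough back-of-the-envelope check, assuming 1 KiB blocks and roughly 8 GB/s of usable PCIe Gen-3 x8 bandwidth (both values are assumptions, not stated on the slide):

```latex
t_{\text{block}} \approx \frac{\text{block size}}{\text{PCIe bandwidth}}
                 = \frac{1024~\text{B}}{8 \times 10^{9}~\text{B/s}}
                 \approx 0.13~\mu\text{s}
```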

Optimizations (single-threaded), measured on an Intel Core i7-3770 CPU, 4 cores @ 3.40 GHz: emplace instead of push_back (construction of objects in-place in containers); moving support data structures to class scope to avoid re-initialization on each algorithm call; pre-reserving memory for std::vector on initialization; NUMA-aware memory allocations (using libnuma); data prefetching using GCC's __builtin_prefetch; compiler option tuning. A sketch of these techniques follows below. 20
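
A condensed, hypothetical C++ sketch of what these techniques look like in combination (the class and member names, the 1 KiB block size, and the trivial chunk parsing are assumptions for illustration, not the FELIX code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>
#include <numa.h>   // libnuma, link with -lnuma

struct Chunk {
    Chunk(const uint8_t* d, std::size_t l, std::uint16_t e)
        : data(d), len(l), elink(e) {}
    const uint8_t* data;
    std::size_t    len;
    std::uint16_t  elink;
};

class BlockDecoder {
public:
    explicit BlockDecoder(int numa_node) {
        chunks_.reserve(1024);                          // pre-reserve std::vector capacity
        scratch_ = static_cast<uint8_t*>(               // NUMA-aware allocation on the node
            numa_alloc_onnode(kBlockSize, numa_node));  //   closest to the PCIe card
    }
    ~BlockDecoder() { numa_free(scratch_, kBlockSize); }

    // Decode one fixed-size block into variable-sized chunks.
    const std::vector<Chunk>& decode(const uint8_t* block) {
        chunks_.clear();                       // reuse the class-scope container instead of
                                               //   constructing a new one on every call
        for (std::size_t off = 0; off < kBlockSize;) {
            __builtin_prefetch(block + off + 64);   // GCC builtin: hint that an upcoming
                                                    //   cache line will be read
            std::size_t len = 16;              // stand-in for the parsed chunk length
            chunks_.emplace_back(block + off, len,  // emplace: construct in place,
                                 std::uint16_t{0}); //   no temporary + copy
            off += len;
        }
        return chunks_;
    }

private:
    static constexpr std::size_t kBlockSize = 1024;
    uint8_t*           scratch_ = nullptr;     // class-scope support buffer
    std::vector<Chunk> chunks_;                // class-scope output container
};
```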

Optimizations (single-threaded): a new data structure, used only for chunks that fit into a single block, gives a >10x speed-up for small chunk sizes. It avoids keeping track of past blocks altogether and is especially useful for small chunk sizes (where it is much more likely that a chunk fits in a block). Downside: 2x more code. A sketch of the idea follows below. Measurements done on an Intel Core i7-3770 CPU, 4 cores @ 3.40 GHz. 21

Multi Threading (chunk size = 16 bytes): the throughput plot shows a linear scale-up with the number of threads, then saturation; the PCIe Gen-3 x8 rate and the HyperThreading region are marked. Measurements done on an Intel Xeon E5645, 2x6 cores @ 2.40 GHz. 22
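
A minimal sketch of how the decoding work can be spread over worker threads, assuming blocks can be processed independently (the thread pool and work-sharing scheme below are illustrative choices, not the FELIX implementation):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Stub for the single-threaded per-block pipeline stage (decode, route, send).
void process_block(const uint8_t* /*block*/) {}

// Spread the blocks over n_threads workers; each worker pulls the next
// unprocessed block. Throughput grows roughly linearly with the thread count
// until a shared resource (memory bandwidth, PCIe) saturates.
void run_workers(const std::vector<const uint8_t*>& blocks, unsigned n_threads) {
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&blocks, &next] {
            for (std::size_t i = next.fetch_add(1); i < blocks.size();
                 i = next.fetch_add(1)) {
                process_block(blocks[i]);
            }
        });
    }
    for (auto& w : workers) w.join();
}
```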

Memory Performance: each line in the benchmark plot represents a different memory access pattern (16, 32, 64, 128, 256 byte accesses; read vs. write; SSE in different versions, AVX; scanning vs. random access), with the L1, L2, and L3 cache and RAM regions marked. 23

Memory Performance: on the same plot, the FELIX working set falls in the RAM region ("We are here"). 24

Memory Performance (RAM region): the memory access pattern is sequential reads (scanning), but with gaps, since only the trailers are read, not the actual data. 25

Roofline Analysis: performance is measured in CPU operations per cycle; operational intensity is the number of useful instructions performed per byte read, and measures the value of data transfers. 26
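
For reference, the standard roofline model (generic formulation; the symbols below are not taken from the slides) bounds the attainable performance P by the lower of the compute roof and the memory roof, where I is the operational intensity and B_mem the memory bandwidth (in bytes per cycle when performance is expressed in operations per cycle):

```latex
P \;\leq\; \min\left(P_{\text{peak}},\; I \cdot B_{\text{mem}}\right),
\qquad
I \;=\; \frac{\text{useful operations}}{\text{bytes read from memory}}
```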

Roofline Analysis: the horizontal roof is the number of instructions the CPU can compute per cycle. Measurements done on an Intel Core i7-3770 CPU, 4 cores @ 3.40 GHz. 27

Roofline Analysis: the measured points lie in the region where performance is limited by memory speed. Measurements done on an Intel Core i7-3770 CPU, 4 cores @ 3.40 GHz. 28

Roofline Analysis: further measurement plots on the same Intel Core i7-3770 CPU, 4 cores @ 3.40 GHz. 29-30

Roofline Model: critique. It is hard to obtain accurate data and a lot of guessing is involved, so the result can only be seen as a first-order approximation. But it is a good way to visualize the results, which can then be confirmed with a performance analysis tool like Intel VTune. 31

Summary: FELIX is a new central event distribution layer for the ATLAS experiment. We presented the analysis and optimization of the application's bottleneck, packet processing. 32