Modeling and Validating Time, Buffering, and Utilization of a Large-Scale, Real-Time Data Acquisition System


Modeling and Validating Time, Buffering, and Utilization of a Large-Scale, Real-Time Data Acquisition System. Alejandro Santos, Pedro Javier García, Wainer Vandelli, Holger Fröning. The 2017 International Conference on High Performance Computing & Simulation (HPCS 2017).

Outline: Motivation; Introduction to ATLAS and the ATLAS DAQ; Model construction and implementation; Input data; Results; Related work; Outlook.

Motivation. Future upgrades of the ATLAS Trigger and Data Acquisition (TDAQ) system involve a roughly 30-fold increase of the input data rate: the current DAQ system handles an input bandwidth of ~160 GB/s, while the future system will need to handle ~5 TB/s. The upgrade is scheduled to be implemented around the year 2025. Since it is not always possible to predict future technologies, we need a methodologically sound approach to study the effect of candidate design techniques. [Figure: bar chart of data rates (TB/s), Year 2016 vs. Year 2025.]
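As a quick sanity check (our arithmetic, not on the slide), the quoted 30-fold increase applied to the current input bandwidth is consistent with the stated target:

$160~\mathrm{GB/s} \times 30 = 4.8~\mathrm{TB/s} \approx 5~\mathrm{TB/s}$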

The plan. We created a simulation model of the existing ATLAS Data Acquisition system, implemented in a discrete event simulation framework. It follows the data acquisition process and outputs metrics for the buffering system. We validate this simulation model against ATLAS operational data from the year 2016. Once we gain confidence in the model, we will use it to explore possible designs of the future system.

ATLAS. The ATLAS experiment at CERN is one of the four major experiments at the Large Hadron Collider, a circular proton accelerator 27 km in circumference, located about 100 meters underground. ATLAS observes the proton collisions delivered by the LHC; data are produced at very high rates for periods of ~20 hours, and each collision produces ~1.6 MB of data.

ATLAS Data Acquisition. A heterogeneous system, mixing custom and COTS hardware as well as real-time and quasi-real-time operation. Its main function is to extract, transport, process, filter and store data; we cannot store all of the produced data. It is mission critical, since we cannot stop and restart the LHC on demand. High efficiency is vital when looking for rare processes, such as the production of Higgs bosons, which have a probability of only about 1 in 10^13 of being created in a collision.

ATLAS Data Acquisition. [Diagram: the current ATLAS Trigger and Data Acquisition System, showing the data path from the detector sensors through the first level trigger and the Readout System (ROS) to the second level trigger, the HLT.]

ATLAS Data Acquisition. A two-level filtering system, where each level is a trigger: the first level is implemented in custom hardware, the second with an off-the-shelf computing system connected by a multi-gigabit Ethernet network. Between the two levels there is a buffering service, the Readout System (ROS). We are interested in studying this buffering component because it is a key component of the future ATLAS data acquisition system.

ATLAS Data Acquisition. The Readout System (ROS) buffers data ("fragments") between the trigger levels; a fragment is a subset of each collision event, and the ROS is composed of ~100 computers. The Data Collection Manager (DCM) manages the network access for one single HLT machine, with one DCM per HLT machine (~2200). The HLT Supervisor is a single machine controlling the entire HLT process. The HLT runs the second-level algorithms that select collision data on ~2200 multicore machines with ~40k computer cores, where each core executes one application instance.

Simulation Model. We implemented a simulation model of the existing ATLAS data acquisition system, validated with real operational data; the long-term goal is to extend the model to the future ATLAS data acquisition system. The model is a simplified version of the ATLAS TDAQ system: it assumes an ideal network, with infinite capacity and no packet loss, and currently only accounts for data processing latency, which is the main contribution to the total latency.
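In other words, writing the total latency of serving a fragment request as a sum of contributions (notation ours, for illustration), the model keeps only the processing term:

$T_{\mathrm{total}} = T_{\mathrm{proc}} + T_{\mathrm{net}} + T_{\mathrm{other}}, \qquad T_{\mathrm{model}} = T_{\mathrm{proc}}$

where $T_{\mathrm{net}}$ and $T_{\mathrm{other}}$ (messaging, intermediate software, event building) are neglected; the validation results later attribute roughly 10 ms to these neglected terms.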

Simulation Model. Implemented in the OMNeT++ framework, a robust and user-friendly discrete event simulation framework that is widely accepted by the scientific community. Simulation building blocks are defined as modules implemented as C++ classes, and configuration is done in dedicated configuration files. It is a modular environment with graphical and command-line interfaces for running many simulations in parallel. [OMNeT++] A. Varga, "OMNeT++", in Modeling and Tools for Network Simulation. Springer, 2010, pp. 35-59.
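For illustration, a minimal OMNeT++ simple module could look like the sketch below. This is our own example written against the standard OMNeT++ 5.x API, not code from the ATLAS model; the module, parameter and gate names are hypothetical, and a matching NED declaration of the "processingDelay" parameter and the "out" gate (not shown) would be required.

#include <omnetpp.h>
#include <cstring>

using namespace omnetpp;

// A ROS-like buffer: stores incoming "fragment" messages and serves one
// buffered fragment per request after a configurable processing delay.
class RosBuffer : public cSimpleModule
{
  protected:
    cQueue buffer;               // fragments waiting to be collected
    simtime_t processingDelay;   // per-request data processing latency

    virtual void initialize() override {
        processingDelay = par("processingDelay").doubleValue();
    }

    virtual void handleMessage(cMessage *msg) override {
        if (strcmp(msg->getName(), "fragment") == 0) {
            buffer.insert(msg);  // buffer the incoming fragment
        } else {                 // treat any other message as a data request
            delete msg;
            if (!buffer.isEmpty()) {
                cMessage *frag = check_and_cast<cMessage *>(buffer.pop());
                sendDelayed(frag, processingDelay, "out");
            }
        }
    }
};

Define_Module(RosBuffer);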

Simulation Methodology. Multiple simulations are executed over two hours of input data: 24 simulations, each covering 5 minutes of input data, executed independently. Each simulation covers 60 simulated seconds, takes about 6 hours of wall-clock time and uses ~2 GB of RAM. The simulations run on four machines with Intel Xeon E5645 CPUs at 2.4 GHz and 24 GB of RAM.

Input data. The implemented simulation is driven by ATLAS operational data recorded in the year 2016: the input parameters are derived from this data, and the results are also validated against it. Input values are averaged over five minutes to suppress instantaneous fluctuations; values that are already averaged over a different time interval need to be normalized. Some values are derived from the data, for example individual fragment sizes are obtained as the bandwidth divided by the rate. The simulation assumes that the conditions are constant over each five-minute period. [Table: simulation input parameters.]
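As a sketch of how such derived quantities could be computed (illustrative values and variable names, not the actual ATLAS monitoring interface):

#include <iostream>

int main() {
    // Example 5-minute averages, as they might be read from the archive.
    double avgBandwidth = 2.0e8;   // ROS output bandwidth [bytes/s]
    double avgRate      = 1.0e5;   // fragment request rate [Hz]

    // A per-fragment size is obtained as bandwidth divided by rate.
    double fragmentSize = avgBandwidth / avgRate;   // [bytes]

    // A value averaged over a different interval (e.g. one hour) is
    // rescaled before being combined with the 5-minute quantities.
    double eventsPerHour = 3.6e8;
    double eventRate     = eventsPerHour / 3600.0;  // [Hz]

    std::cout << "fragment size: " << fragmentSize << " bytes, "
              << "event rate: " << eventRate << " Hz\n";
    return 0;
}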

Simulation Results. Single-simulation results for the ROS output bandwidth show an average simulation error of ~5%. A group of ROSes suffers from large TCP retransmissions, and the main outlier is a ROS with a small fragment size.

Simulation Results. The number-of-fragments results are within ~5% of the real data, with a systematic bias that corresponds to ~10 ms of latency missing from the simulation. The latencies missing in the simulation are two network messages, three software applications, and event building.
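One way to connect the fragment-count bias to a missing latency (our reasoning via Little's law, not spelled out on the slide): if fragments are requested at rate $\lambda$ and a latency $\Delta T$ is absent from the model, the simulated buffer occupancy is lower by

$\Delta N = \lambda \cdot \Delta T$

so a deficit of ~10 ms per request translates directly into a systematic offset in the simulated number of buffered fragments.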

Simulation Results. Each point corresponds to the average result of one simulation. The real bandwidth data resolution is limited to one hour, which is why the real data appears as a step function. There is an outlier at minute ~70: the input data rate changed due to external conditions, while the simulation assumes a constant rate.

Simulation Results. Each point corresponds to the average result of one simulation, showing the average number of fragments across the whole ROS system. The bias is present in all simulations.

Related Work. Packet-level network simulations of the ATLAS data acquisition can be found in: T. Colombo, H. Fröning, P. J. García, and W. Vandelli, "Modeling a large data-acquisition network in a simulation framework", 2015; M. Bonaventura, D. Foguelman, and R. Castro, "Discrete event modeling and simulation-driven engineering for the ATLAS data acquisition network", 2016. Simulations related to the initial ATLAS TDAQ design can be found in: J. Vermeulen, S. Hunt, C. Hortnag, F. Harris, A. Erasov, R. Dankers, and A. Bogaerts, "Discrete event simulation of the ATLAS second level trigger", 1998; R. Cranfield, P. Golonka, A. Kaczmarska, K. Korcyl, J. Vermeulen, and S. Wheeler, "Computer modeling the ATLAS trigger/DAQ system performance", 2004. The overall concern of those works was the understanding of system design choices, such as queuing and packet-loss effects. In this work, we study the overall performance of a specific system component, validated with real data.

Conclusion, Future Work. We presented a simulation model of an existing data acquisition system that reproduces with high accuracy real data from previous operation of the system. The simulation errors are understood, motivating the next stage of the work: the inclusion of the missing latencies and further validation of the results over longer periods of time. The current model provides a very good basis for this study, and the simulation can be used to study the behavior of the architecture of the future ATLAS data acquisition system.