Automated load balancing in the ATLAS high-performance storage software

Automated load balancing in the ATLAS high-performance storage software
Fabrice Le Goff (CERN) and Wainer Vandelli (CERN), on behalf of the ATLAS Collaboration
May 25th, 2017

The ATLAS Experiment

ATLAS Trigger and Data Acquisition System

Data Logger System

[Diagram: two servers, each with two HBAs (Host Bus Adapter), connected through redundant controllers (Controller1, Controller2) to disk enclosures and expansions.]

Transient storage system to:
- Decouple online and offline operations
- Cope with disruptions of the permanent storage service or of its connection

Scale-out system, currently:
- 4 locally-attached storage arrays, 2 servers each
- 500 HDs, 430 TB, 8 GB/s
- Fully redundant: no data loss in 2016

Software

Distributed in-house application (C++). Tasks:
- Receive selected event data
- Write data to disks
- Compute a data checksum: file-by-file Adler32 (a sketch follows below)

Data-driven: events are distributed in classes called streams.
- One file per stream
- New streams appear roughly every minute
- The stream distribution may vary rapidly
- The workload is not uniform at all and cannot be fairly distributed

Multi-threaded through a task-oriented framework.

[Figure: typical stream bandwidth distribution.]

Another, independent application sends the data to permanent storage.
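The talk only states that an Adler32 checksum is computed file-by-file; as a hedged illustration, here is a minimal C++ sketch of how such an incremental per-file checksum could be kept with zlib's adler32 routine (the FileChecksum class and the chunked-update structure are assumptions, not the actual ATLAS code):

```cpp
// Minimal sketch of incremental, per-file Adler-32 with zlib.
// FileChecksum and its interface are illustrative assumptions.
#include <zlib.h>
#include <cstdint>
#include <vector>

class FileChecksum {
public:
    // zlib convention: seed the running value with adler32(0L, Z_NULL, 0).
    FileChecksum() : adler_(adler32(0L, Z_NULL, 0)) {}

    // Fold the next chunk of event data into the running checksum,
    // so the file never has to be re-read once it is closed.
    void update(const std::vector<unsigned char>& chunk) {
        adler_ = adler32(adler_, chunk.data(),
                         static_cast<uInt>(chunk.size()));
    }

    // Final checksum once the file is complete.
    std::uint32_t value() const { return static_cast<std::uint32_t>(adler_); }

private:
    uLong adler_;  // zlib's running Adler-32 state
};
```

Adler32 is a natural fit here: unlike a cryptographic hash, it is cheap enough to compute inline at multi-GB/s writing rates.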

Threading Model

[Diagram: input threads IT1...ITN receive events belonging to streams (S0, S1, S2, ...); a Writing Manager keeps a file-to-thread bookkeeping table (e.g. S0 handled by OT1, S1 by OT2) and creates writing tasks, each writing one event to one file; output threads OT1...OTM write the events (E0, E1, E2, ...) into the files assigned to them (File A, File B, File C).]

The workload distribution is controlled by the file-to-thread assignment policy. The application's performance can be limited by a single thread.

Current Assignment Policy

Round-robin: each new file is assigned to the next thread in a circular thread buffer (a minimal sketch follows below).
- Simple implementation, very low overhead
- Deterministic behavior, but events arrive in no specific order, so the assignment of files to threads is effectively non-deterministic

The application's instantaneous performance is therefore not predictable: the assignment of the major streams to the same thread degrades the application's performance.
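For illustration only (the class and method names are hypothetical, not the actual ATLAS implementation), the round-robin policy described above fits in a few lines:

```cpp
// Illustrative round-robin file-to-thread policy: each new file goes
// to the next output thread in a circular buffer.
#include <cstddef>

class RoundRobinPolicy {
public:
    explicit RoundRobinPolicy(std::size_t nThreads)
        : nThreads_(nThreads), next_(0) {}

    // Output-thread index for the next new file: O(1) and stateless
    // with respect to the file's stream, hence blind to the workload.
    std::size_t assign() {
        const std::size_t t = next_;
        next_ = (next_ + 1) % nThreads_;
        return t;
    }

private:
    std::size_t nThreads_;
    std::size_t next_;
};
```

Because the policy never looks at which stream a file belongs to, several major streams can land on the same thread purely by chance, which is exactly the degradation quantified on the next slide.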

Problem

Modification of the operational conditions: higher throughput, different stream distribution.

                     2015          2016
Peak throughput      1.4 GB/s      3.2 GB/s
Stream distribution  S1: 80 %      S1': 70 %
                     S2: 6 %       S2': 7 %
                     S3: 3 %       S3': 5 %

Random assignment of the major streams to the same thread will now degrade the application's performance. A synthetic test confirmed the performance degradation:

Conditions                   Writing rate   Performance loss
No joint assignment          865 MB/s       reference
S1' and S2' together         797 MB/s       -8 %
S1', S2' and S3' together    760 MB/s       -12 %

Weighted Assignment Policy

A new workload distribution strategy was needed to restore performance and predictability.

Requirements:
- Data-driven: e.g. it cannot assume any pattern in the stream distribution
- Responsive: it must cope with the rapid evolution of the stream distribution
- Low CPU and memory footprint

Idea (a sketch follows below):
- Compute a load for each thread: the amount of data it processed within a last-n-seconds sliding window
- Assign each new file to the thread with the lowest load
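A minimal sketch of that idea, assuming a deque of (timestamp, bytes) samples for the sliding window; the SlidingLoad class and all names are illustrative, and only the least-loaded-thread selection itself comes from the slides:

```cpp
// Hedged sketch: per-thread load as the bytes written during the last
// n seconds, kept in a sliding window of (timestamp, bytes) samples.
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

using Clock = std::chrono::steady_clock;

class SlidingLoad {
public:
    explicit SlidingLoad(std::chrono::seconds window) : window_(window) {}

    // Record bytes just processed by this thread.
    void addBytes(std::uint64_t bytes) {
        samples_.push_back({Clock::now(), bytes});
        total_ += bytes;
        prune();
    }

    // Bytes processed within the last window.
    std::uint64_t load() {
        prune();
        return total_;
    }

private:
    void prune() {
        const auto cutoff = Clock::now() - window_;
        while (!samples_.empty() && samples_.front().when < cutoff) {
            total_ -= samples_.front().bytes;
            samples_.pop_front();
        }
    }

    struct Sample { Clock::time_point when; std::uint64_t bytes; };
    std::chrono::seconds window_;
    std::deque<Sample> samples_;
    std::uint64_t total_ = 0;
};

// Assign a new file to the output thread with the lowest current load.
std::size_t assignToLeastLoaded(std::vector<SlidingLoad>& threads) {
    std::size_t best = 0;
    for (std::size_t t = 1; t < threads.size(); ++t)
        if (threads[t].load() < threads[best].load())
            best = t;
    return best;
}
```

The window length trades responsiveness against noise; as the next slide notes, a typical value is 5 seconds.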

Weighted Assignment Policy: Step 1

[Figures: thread load vs. time with assignments, and a zoom on the assignments with the problematic ones highlighted in red.]

The real-time load is ineffective for assignments that arrive close together in time. Reducing the sliding-window length helps, but the window cannot be too small or the load becomes too sensitive to local fluctuations (typical length: 5 seconds).

Another component had to be added (see the sketch below):
- Compute a load for the streams: the same sliding-window amount of processed data, per class of streams
- Add the stream load to the thread load upon assignment
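A sketch of this refinement, building on the hypothetical SlidingLoad class from the previous sketch; charging the stream's recent throughput to the chosen thread through addBytes, so that the charge ages out of the window like real writes do, is an assumption about how the bookkeeping could be arranged:

```cpp
// Hedged sketch: pick the least-loaded thread, then immediately charge
// the assigned stream's recent load to that thread so that back-to-back
// decisions see each other. Reuses the SlidingLoad sketch above.
#include <cstddef>
#include <map>
#include <string>
#include <vector>

std::size_t assignWeighted(std::vector<SlidingLoad>& threadLoad,
                           std::map<std::string, SlidingLoad>& streamLoad,
                           const std::string& stream) {
    // Thread whose sliding-window load is currently lowest.
    std::size_t best = 0;
    for (std::size_t t = 1; t < threadLoad.size(); ++t)
        if (threadLoad[t].load() < threadLoad[best].load())
            best = t;

    // Charge the stream's recent throughput to the chosen thread right
    // away: a thread that just received a heavy stream is proportionally
    // less likely to be picked again, as the next slide describes.
    threadLoad[best].addBytes(streamLoad.at(stream).load());
    return best;
}
```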

Weighted Assignment Policy: Step 2

[Figures: thread load vs. time with assignments, and a zoom on the assignments.]

Decisions are now reflected immediately: the likelihood that a thread is selected again right after a decision is inversely proportional to the load of the stream it was just assigned.

Testing

Test in a controlled environment with an emulated data flow:
- Stream distribution and upstream event processing time emulated from 2016 monitoring data
- No wrong decision over runs of 40+ hours

Policy        Writing rate   Performance gain
Round-robin   865 MB/s       reference
Weighted      882 MB/s       +2 %

Test on the actual ATLAS TDAQ infrastructure:
- Used during ATLAS commissioning tests and cosmic data-taking sessions

Conclusion

The transient storage system of ATLAS TDAQ is a key component enabling the decoupling of online and offline operations. Its workload is heavily unbalanced and cannot be fairly distributed, and in 2016 a new strategy was required to handle recent changes in the operational conditions.

The new workload distribution strategy is sensitive and self-adaptive to fast-evolving operational conditions and to modifications of the event selection process. Validated in both test and production environments, it proved to make better use of the parallel processing capabilities of modern CPUs for our workload. This development will be part of the 2017 data-taking session.

2015 Real-time Streams Writing Rate

Figure: Instantaneous stream bandwidth for data collected on 28/10/2015. All streams are shown and each line represents a different stream. The highest line, labeled "Global", is the sum of all streams, representing the total bandwidth of selected event data.

2015 Stream Bandwidth Distribution

Figure: Stream bandwidth distribution for data collected on 28/10/2015. Each bar represents the fraction of the total bandwidth carried by one stream over the considered period.

2015 Stream Bandwidth Distribution

Figure: Bandwidth distribution between different streams for data collected on 28/10/2015. The four highest-bandwidth streams are shown separately and all other streams are summed together as "Other streams".

2016 Real-time Streams Writing Rate

Figure: Instantaneous stream bandwidth for data collected on 24 and 25/10/2016. All streams are shown and each line represents a different stream. The highest line, labeled "Global", is the sum of all streams, representing the total bandwidth of selected event data.

2016 Stream Bandwidth Distribution

Figure: Stream bandwidth distribution for data collected on 24 and 25/10/2016. Each bar represents the fraction of the total bandwidth carried by one stream over the considered period.