System level traffic shaping in disk servers with heterogeneous protocols

Eric Cano and Daniele Francesco Kruse

European Organization for Nuclear Research CERN, CH-1211 Genève 23, Switzerland

E-mail: eric.cano@cern.ch, daniele.francesco.kruse@cern.ch

Abstract. Disk access and tape migrations compete for network bandwidth in CASTOR's disk servers, over various protocols: RFIO, Xroot, root and GridFTP. As there are a limited number of tape drives, it is important to keep them busy all the time, at their nominal speed. With potentially hundreds of user read streams per server, the bandwidth for the tape migrations has to be guaranteed at a controlled level, rather than left to the fair share the system gives by default. Xroot provides a prioritization mechanism, but using it implies moving exclusively to the Xroot protocol, which is not possible in a short to mid-term time frame, as users are using all protocols equally. The greatest commonality of all these protocols is no more than the usage of TCP/IP. We therefore investigated the Linux kernel traffic shaper to control TCP/IP bandwidth. The performance and limitations of the traffic shaper have been understood in a test environment, and a satisfactory working point has been found for production. Notably, the negative impact of TCP offload engines on traffic shaping, and the limitations on the length of the traffic shaping rule list, were discovered and measured. A suitable working point has been found and the traffic shaping is now successfully deployed in the CASTOR production systems at CERN. This system level approach could easily be transposed to other environments.

1. Introduction

1.1. CASTOR's disk staging and access patterns
The CERN Advanced STORage manager (CASTOR) [1] is used to archive to tape the physics data of past and present physics experiments at CERN. The current size of the tape archive is more than 91 PB. In 2012 alone, about 30 PB of raw data were stored. CASTOR organises data in files, which are stored in a disk staging area before migration to tape or after being recalled from tape. Each file is stored complete on one file system of one of the disk servers. The staging area is also used as a cache/work area where files are kept as long as possible; a garbage collector manages deletions. This double usage leads to a double access pattern, where the tape system competes with user access (interactive or batch processing) for access to data in the disk servers.

A central scheduler, the stager, organises all data transfers to and from the staging areas. The tape system is scheduled with a freight train approach, where long lists of file transfers are sent to the tape servers for execution. The time delay between a scheduling decision and the actual transfer can be minutes. The precise location of the file on one disk server is pre-determined at scheduling time.

The stager limits the number of concurrent user accesses to disk servers with a slot system, but this does not guarantee an efficient transfer from the user. Some experiments' frameworks

open many files during the whole analysis session, where the actual data transfer for a given file is just a burst in the middle of a long, mostly idle session. This led operators to choose a high slot quota, and disk servers can end up with around 300 user connections. With all of those active, tape migrations can go as low as 1 MB/s.

1.2. Competition between end users and the tape system for disk server bandwidth
With the time delay between decision and actual transfer, the stager cannot guarantee an efficient data transfer for the tape servers. Yet an efficient tape transfer is fundamental, as the tape drives are a scarce resource, shared by all users, and central for efficient migration to new tape media [2]. Disk pools are dedicated to a given experiment or task, so an inefficient access pattern has a controlled scope; however, tape drives stuck waiting for data from disk servers have a global effect. We therefore needed a means to guarantee the bandwidth served to the tape drives and to give them high priority.

In addition, the disk servers contain a good number of disk spindles, but many of them are still connected to the network with a single 1 Gbit/s interface: the network is then the resource being competed for. Today's tape drives typically achieve 250 MB/s [3][4]. The tape servers buffer files in memory, and read/write up to 3 files in parallel if they are small enough, so we need to guarantee between 250 MB/s / 3 ≈ 83 MB/s and 250 MB/s from any disk server. This cannot be achieved fully with a single 1 Gbit/s interface, but with a fair share as low as 1 MB/s, we had a long way to go.

The CASTOR disk servers are accessed using several protocols: RFIO, XRoot, root and GridFTP. Tape servers use RFIO, but so do some user accesses. RFIO has no means to reserve bandwidth for a given user. On the other hand, the tape servers are dedicated machines. Their IP addresses are hence an appropriate criterion to discriminate between tape servers and other accesses.

2. Traffic shaping as a means to favour the tape servers
As no application level means, neither central nor at the disk server level, can guarantee bandwidth to the tape system, we investigated the system level. All the CASTOR infrastructure runs on Linux [5]. The Linux kernel, beyond the classic IP chains, contains many powerful network control features, allowing detailed routing and traffic control [6]. The traffic control delays network packets on the output of an interface. As a consequence, this methodology only allows control for one direction out of two (migrations to tape, and not tape recalls), but securing the data to tape is the most important.

2.1. Traffic control objectives
We aimed to guarantee 90% of the bandwidth to the tape servers in case of mixed disk and tape traffic. We also want to still provide all of the bandwidth to either class of traffic in the absence of the other.

2.2. Traffic control configuration
By default the Linux kernel uses a priority FIFO queuing discipline called pfifo_fast. This queuing discipline prioritises IP packets based on the type of service (TOS) bits. In order to retain a behaviour as close as possible to the system defaults, we used the priority (prio) queuing discipline, mimicking this default. The best effort TOS FIFO of this queuing discipline (which is where all the normal TCP traffic goes) is in turn fed into a class based queuing (cbq) queuing discipline. Our cbq configuration contains two classes: one taking all traffic by default and getting 10% of the bandwidth (with the possibility to borrow more from other idle classes). A second class gets the remaining 90%. Finally, a classifier assigns packets bound to tape servers to this second, high priority class. The classifier is a list of rules, each of which can select a single IP address or a contiguous IP range. In our case there are no significant bidirectional data streams, so TCP packets carrying the ACK bit are usually small. Those packets are also added to the high priority class in order to avoid penalizing incoming data transmission. Finally, the ssh protocol is given high priority as well. The detailed final configuration can be seen in the appendix.
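Once the rules are in place, the resulting hierarchy and the per-class traffic split can be inspected with the standard tc statistics commands; a minimal sketch, assuming the interface and handles used in the appendix script:

# Show the queuing discipline hierarchy (the prio qdisc 10: with cbq 101: below it)
tc -s qdisc show dev eth0

# Per-class counters (bytes, packets, borrows, overlimits) for the 90% (101:10)
# and default 10% (101:20) cbq classes
tc -s class show dev eth0

# List the u32 classification rules attached to the cbq queuing discipline
tc filter show dev eth0 parent 101:

The borrow counters reported for the cbq classes make it easy to confirm that the default class can still take the full link bandwidth when the tape class is idle.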

2.3. Test environment
The production environment does not provide a stable enough workload to produce reliable results. The workload was hence simulated with one machine identical to the production disk servers (with a single 1 Gbit/s interface) and two client computers simulating respectively the tape and user clients, with a variable number of streams for each. The kernel used was Linux 2.6.18-274.17.1.el5, and the test disk server had dual Intel Xeon E51500 @ 2.00 GHz processors with an Intel 80003ES2LAN Gigabit card (one link out of two was used). The three machines were dedicated to the test, and the bandwidth was measured through the increase of the byte transfer statistics in /proc/net/dev over 2 s intervals.
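A minimal sketch of such a measurement (the interface name, the 2 s window and the conversion to MB/s are illustrative assumptions; only the transmit direction, i.e. the shaped direction, is sampled):

#!/bin/sh
# Sample the transmitted-bytes counter of the interface twice, 2 s apart,
# and print the average outgoing throughput in MB/s.
IF=eth0
T1=$(sed -n "s/^ *$IF: *//p" /proc/net/dev | awk '{print $9}')
sleep 2
T2=$(sed -n "s/^ *$IF: *//p" /proc/net/dev | awk '{print $9}')
echo "$T1 $T2" | awk '{printf "%.1f MB/s\n", ($2 - $1) / 2 / 1e6}'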

3. Results

Figure 1. Default bandwidth distribution (stacked): bandwidth in MB/s for user and tape traffic against the number of user (top) and tape (bottom) streams.

During the tests, various parameters and queuing disciplines were tried. Most notably, both borrowing and the limitation to 90% could not be achieved with other queuing disciplines. Additionally, using the TCP segmentation offload (TSO) engine together with the traffic shaper led to unstable bandwidth and a ratio close to the non-traffic-shaped one. Turning off the TSO costs more in processing time: the kernel and CPU see packets of the network MTU (the standard Ethernet MTU of 1500 B in our case) and have to process and filter proportionally more packets. With the TSO on, a network capture showed packets of 64 kB.

The bandwidth ratio for various numbers of tape and user streams, for the default set-up (where we get the expected fair share distribution) and with traffic control (TSO off), is shown in figures 1 and 2. The values are averages over several runs of the test. The result is not exactly the expected 90% ratio: with only one tape stream we only reach a ratio of 65%, and with at least two the ratio becomes a stable 84%.

Figure 2. Traffic shaped bandwidth distribution (stacked): bandwidth in MB/s for user and tape traffic against the number of user (top) and tape (bottom) streams.

Figure 3. Time per packet (µs) against the number of classification rules.

The number of hosts listed in the tape host list also had an impact on performance, especially with the raised packet rate in the absence of TSO. We artificially increased the number of IP prioritization rules up to 2000 to measure their cost and assess the practical limit on the number of IP addresses. The average time per packet was deduced from the measurements, making the assumption that the majority of packets were a full 1500 B; this measurement is shown in figure 3. The measurements show an initial plateau where the time to process the classification rules is hidden by the time to send the packet over the wire (1500 B / 1 Gbit/s = 12 µs). We then have a clean linear progression. The slope yields a processing time per rule of 200 ns. This gives a budget of about 60 rules, not taking into account other per-packet processing overhead; the measurements show that the limit is in our case closer to 50 rules at most. With faster interfaces, the affordable number of rules decreases linearly, to the point of becoming impractical (about 5 only) for a 10 Gbit/s interface.

4. Deployment to production
At CERN we had 122 tape servers at deployment time, and 11 network range based rules were enough to discriminate tape traffic. The traffic control script, as shown in the appendix, has been turned into a system service and deployed on live systems (an illustrative wrapper is sketched at the end of this section). The traffic control rules can be turned on and off at any moment without disrupting the network data flow.

Long term logs of the tape activity showed that before applying the traffic shaping, many days saw a significant fraction (10-30%, sometimes more) of the data volume transferred to tape at less than 10% of the maximum speed. This fraction dropped to 0.5% on the worst days after the traffic shaping was applied. No degradation of performance from the user's perspective was documented.
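The following sysvinit-style wrapper illustrates how the shaping can be switched on and off as a service (a sketch only: the script path and service name are hypothetical, not the actual CERN deployment):

#!/bin/sh
# Hypothetical /etc/init.d/traffic-shaping wrapper (illustrative sketch)
case "$1" in
  start)
    # Apply the shaping rules from the appendix script
    /usr/local/sbin/castor-traffic-shaping.sh
    ;;
  stop)
    # Deleting the root qdisc restores the kernel default (pfifo_fast)
    # without disturbing established connections
    tc qdisc del dev eth0 root 2> /dev/null
    # Optionally hand segmentation back to the offload engine
    /sbin/ethtool -K eth0 tso on
    ;;
  restart)
    $0 stop
    $0 start
    ;;
  *)
    echo "Usage: $0 {start|stop|restart}"
    exit 1
    ;;
esac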

5. Conclusions
During a crunch in operations, the Linux kernel's powerful network control features proved invaluable to work around system overload. Some features had to be understood in a test environment, due to imprecise documentation, but the expected result was met. This solution remains short term and only covers migrations to tape, not recalls from tape. For the longer term, we are currently looking into an additional tape staging area, allowing fully controlled scheduling, and into distributed storage such as a reliable array of inexpensive nodes (RAIN) like Ceph, where traffic would be evened out over many nodes, eliminating hotspots [7].

Appendix: the configuration script

# Turn off TCP segmentation offload: the kernel sees the details
# of the packets it is routing
/sbin/ethtool -K eth0 tso off

# Flush the existing rules (gives an error when there are none)
tc qdisc del dev eth0 root 2> /dev/null

# Duplication of the default kernel behaviour
tc qdisc add dev eth0 parent root handle 10: prio bands 3 \
  priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1

# Creation of the class based queuing
tc qdisc add dev eth0 parent 10:1 handle 101: cbq bandwidth 1gbit avpkt 1500
tc class add dev eth0 parent 101: classid 101:10 cbq weight 90 split 101: \
  defmap 0 bandwidth 1gbit \
  prio 1 rate 900mbit maxburst 20 minburst 10 avpkt 1500
tc class add dev eth0 parent 101: classid 101:20 cbq weight 10 split 101: \
  defmap ff bandwidth 1gbit \
  prio 1 rate 100mbit maxburst 20 minburst 10 avpkt 1500

# Prioritize ACK packets
tc filter add dev eth0 parent 101: protocol ip prio 10 u32 \
  match ip protocol 6 0xff \
  match u8 0x05 0x0f at 0 \
  match u16 0x0000 0xffc0 at 2 \
  match u8 0x10 0xff at 33 \
  flowid 101:10

# Prioritize SSH packets
tc filter add dev eth0 parent 101: protocol ip prio 10 u32 \
  match ip sport 22 0xffff flowid 101:10

# Prioritize network ranges of tape servers
tc filter add dev eth0 parent 101: protocol ip prio 10 u32 match ip \
  dst <Network1>/<bits1> flowid 101:10
tc filter add dev eth0 parent 101: protocol ip prio 10 u32 match ip \
  dst <Network2>/<bits2> flowid 101:10
# <etc...>

Listing 1. Traffic control setting script

References
[1] CASTOR homepage http://cern.ch/castor
[2] Kruse D F 2013 The Repack Challenge Unpublished paper presented at CHEP 2013, Amsterdam
[3] IBM enterprise tape drives http://ibm.com/systems/storage/tape/drives/
[4] Oracle StorageTek T10000C Tape Drive http://www.oracle.com/us/products/servers-storage/storage/tapestorage/t10000c-tape-drive/overview/index.html
[5] Scientific Linux CERN http://cern.ch/linux/scientific.shtml
[6] Linux advanced routing and traffic control http://www.lartc.org/lartc.pdf
[7] Weil S A 2007 Ceph: Reliable, scalable, and high-performance distributed storage http://ceph.com/papers/weil-thesis.pdf