Prague Tier-2 storage experience


To cite this article: Jiří Horký et al 2011 J. Phys.: Conf. Ser. 331 052011

Jiří Horký, Jiří Chudoba, Tomáš Kouba
Institute of Physics, AS CR, Na Slovance 22, 182 21 Prague, Czech Republic
E-mail: koubat@fzu.cz

Abstract. The main computing and storage facilities for LHC computing in the Czech Republic are situated at the Prague Tier-2 site. We have participated in grid activities since the beginning of the European DataGrid project. The recent years have seen significant growth of our storage capacities. In this article we present the decisions behind our storage solutions, a comparison of the hardware configurations, and the performance tuning of the software. Our storage facilities range from enterprise-level SAN infrastructure to cheap in-house built storage servers. We compare these solutions from the performance, configuration, management and monitoring points of view.

1. About Prague Tier-2
The Prague Tier-2 site was established in 2002 and has participated in grid activities from the very beginning. The first 34 computers were equipped with two Pentium III processors and 1 GB RAM each, and the first storage array contained 15 SCSI disks with a total capacity of 1 TB. In the middle of 2010 we have 336 computing nodes with 2630 CPU cores (approximately 20500 HEPSpec units of computing power). The disk storage for grid usage is about 350 TB available via the DPM system [1] and 70 TB available via NFS.

2. Overview of storage procurements
Our Tier-2 center usually buys new computing equipment every year in one big tender. Over the last four years we have obtained various storage solutions and currently we run the following big (for their year of purchase) storage solutions:

Year  Product                Infrastructure  Type of disks  Number and size of disks
2007  HP EVA 6000            FC              Fibre Channel  64 * 0.5 TB + 49 * 1 TB
2009  Overland Ultamus 4800  FC              SATA           144 * 2 TB
2010  Nexsan Satabeast       FC              SATA           126 * 2 TB

2.1. HP EVA 6000
This disk array was the first SAN solution installed in our infrastructure. The front-end machines are HP DL360s. The schema of this SAN is depicted in figure 1. There are four main parts of this SAN:
- Front-end machines (ha1, ha2, ha3, sam3, storage5, goliasx98)
- Fibre channel switches (fcswr071, fcswr072)
- Disk array controllers (two HSV210 units)
- Storage Management Appliance (SMA, labelled evickat in figure 1)

Figure 1. First SAN infrastructure.

The front-ends serve the following purposes:
- Virtual machine images (on host machines ha1, ha2, ha3) - using 1.3 TB
- SAM cache [7] for the D0 experiment (host sam3) - using 6 TB
- NFS server with user home directories and experiment software areas (storage5) - using 18 TB
- DPM disk server (goliasx98) - using 27 TB
The rest of the raw capacity is used as redundant space for ensuring reliability via HP's VRAID5 technology.

We are very satisfied with this storage solution. It is stable and reliable: we replaced only 6 disks over 3 years and had one power failure in one of the two FC switches (neither of these problems caused downtime of the disk array). With 113 disks and 6 failures in 3 years this corresponds to an MTBF (mean time between failures) of 494940 hours (113 disks * 3 years * 8760 hours / 6 failures), which is less than the 1 million hours MTBF estimated by the manufacturer, but still a better value than for the other (following) disk array procurements. The management interface runs on a Windows server (part of the array itself, SMA evickat in figure 1) and is well arranged and intuitive.

2.2. Overland Ultamus 4800
This system consists of two full boxes and one expander box. It was purchased in a 2009 tender oriented on DPM storage for HEP experiments and tape storage for backup archives. After a good experience with the reliability and performance of an FC SAN we decided to go down this road again. This time the front-end machines are IBM x3650 servers: two of them export disk space for DPM and one is the front-end for the backup system.

As another part of this SAN infrastructure (depicted in figure 2) we have a smaller disk array, an Ultamus 1200 (13 * 2 TB disks), which serves as a backup cache for the tape storage.
- Front-end servers for the DPM system (dpmpool1, dpmpool2) - using 114 TB
- Front-end server for the Legato backup system (backup) - using 6 TB from the Ultamus 1200 as a prestage cache, with a tape robot as the archiving medium
- Fibre channel switches (fcswr081, fcswr082)
- Disk array boxes (ultamus4800down, ultamus4800up, ultamus1200)

Figure 2. Second SAN infrastructure.

This array is not as reliable as the HP EVA. We have had no data loss, but there have been more issues during its lifetime:
- the first delivered box was not working
- a few firmware upgrades were needed (mainly for confusing false alarms)
- the management card loses network connection once in a few months and a complete reboot of the array is needed
- the management web interface is not comfortable (it runs on port 9292 and uses pop-up windows)

2.3. Nexsan Satabeast
The third SAN infrastructure in our computing center is built on Nexsan hardware. We have 3 storage boxes with 42 disks each and four front-end servers. Two of these servers are IBM x3650 M2 and two are HP DL360 G6. Three of the four front-ends work as DPM disk nodes and publish 5/6 of the entire disk space; the rest is published via one NFS front-end.
- Front-end servers for DPM (dpmpool4, dpmpool5, dpmpool6) - using 170 TB
- Front-end server for NFS (storage7) - using 34 TB
- Fibre channel switches (fcswl021, fcswl022)
- Disk array boxes (Satabeast1, Satabeast2, Satabeast3)
The reliability of this storage system is satisfying: only two disks were replaced during the ten months of the production period (this includes the first few weeks of burn-in tests).

Figure 3. Third SAN infrastructure.
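
For illustration, publishing the space of such a front-end through DPM amounts to registering each of its mounted filesystems into a pool on the DPM head node. A minimal sketch using the standard DPM administration commands follows; the pool name, host name and mount point are hypothetical examples, not our actual configuration:

    # Run on the DPM head node (or with DPM_HOST pointing to it).
    # Register one filesystem of a disk node into an existing pool;
    # "atlas_pool", the host name and /data1 are placeholder values.
    dpm-addfs --poolname atlas_pool --server dpmpool4.example.org --fs /data1

    # List the pool configuration and the free/used space per filesystem.
    dpm-qryconf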

3. How we measure storage performance
3.1. The main test
When deciding what storage to buy, we try to find a solution that gives as much disk space as possible for the lowest possible price, but with reasonable performance. This golden rule is far too general: it is difficult to say what reasonable performance means for our workload, although we know what kind of I/O to expect (mainly D0, ALICE and ATLAS jobs). It is also important to formalize the measurement properly, so that we can give the competing companies clear instructions on what to measure for us. We decided to use the iozone [2] tool. After several tenders we chose the iozone benchmark for its stability, reliability and portability; it is also a widely used tool, so rough performance results can be found for several storage systems (a sketch of a typical invocation follows the requirements list below). The evaluated disk arrays can have very different amounts of storage per front-end server, which also means a different amount of storage available via a given network connection. Our requirements are therefore compiled from different requests, but are still simple enough:
- Read speed of at least 6.4 MB/s per TB usable (as reported by iozone)
- Write speed of at least 4.8 MB/s per TB usable (as reported by iozone)
- Sufficient network connectivity on every front-end server: a server presenting less than 10 TB usable must have 2 * 1Gb, 10-20 TB usable must have 4 * 1Gb, and more than 20 TB usable must have a 10Gb NIC
- Health monitoring: SNMP (or a Nagios sensor)
- Hot-plug disks
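
A sketch of the kind of iozone invocation used for such a throughput measurement; the paper does not quote the exact command line, so the thread count, file sizes and paths below are illustrative only. The total file size should comfortably exceed the RAM of the front-end so that the array, not the page cache, is measured:

    # Sequential write (-i 0) and read (-i 1) tests in throughput mode
    # with 4 threads (-t 4), 16 GB per thread (-s 16g), 1 MB records
    # (-r 1024k); -e includes fsync in the write timing. The directory
    # /data/iozone is an example mount point on the tested front-end;
    # -F names one working file per thread.
    iozone -i 0 -i 1 -t 4 -s 16g -r 1024k -e \
        -F /data/iozone/f1 /data/iozone/f2 /data/iozone/f3 /data/iozone/f4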

Results for the storage solutions already mentioned:

Product       Network         Read performance   Write performance  Monitoring  Hot-plug
EVA 6000      4*2Gb + 2*1Gb   not measured       not measured       SNMP        Yes
Ultamus 4800  2*10Gb          8.65 MB/s per TB   4.48 MB/s per TB   SNMP        Yes
Satabeast     4*10Gb          9.42 MB/s per TB   8.75 MB/s per TB   SNMP        Yes

Notes:
- 2Gb means two 1Gb interfaces bonded together
- the performance is measured on the front-end servers

4. A bunch of small inexpensive 1U storage servers
Once these critical requirements had been established and formalized in tender rules, we realized we could try to meet them with an in-house built storage server. Recently there have been good options to buy a 1U chassis, a simple motherboard with a C7 or Atom CPU, and a SATA controller with enough (at least 4) SATA channels.

4.1. First try
As the first attempt, in 2009 we purchased two instances of:
- VIA Epia SN 18000EG: 3 GB RAM, 1x VIA C7 1800 MHz (only 32-bit, not advisable for DPM)
- 4 GB CF card as the system disk
- 4x 1 TB (WD Caviar Black) in RAID5 - /data and /var
- no hot-plug
- only one 1Gb NIC
The performance numbers are 38.8 MB/s per TB for reading and 24.5 MB/s per TB for writing. This relative performance delivers much more data per terabyte than a traditional big array with high throughput, which shows that this approach is very promising.

4.2. Second try
We were satisfied with the performance of the VIA solution, but we wanted the new servers to support the 64-bit instruction set so that we could install DPM from gLite 3.2 on them. We therefore purchased three servers with the following configuration:
- MSI IM-945GC, 2 GB RAM, Intel Atom 230 (64-bit, suitable for DPM) - the only MSI mini-ITX board with an Atom and 4 SATA ports available at the time of purchase
- 4 GB CF card as the system disk
- 4x 1.5 TB (WD Caviar Green) in RAID5 - /data and /var
- 2x 1Gb NIC
- still no hot-plug (no suitable 1U chassis available)
Two of these servers work as NFS servers for experiment data (ATLAS, Auger) and one as a DPM disk node. The performance is not as good as with the VIA chipset but still better than our tender requirements: 30.0 MB/s per TB for reading and 8.18 MB/s per TB for writing. The latter number suffers because the 82801GB/GR/GH controller chip performs worse when writing to all four disks. In this case, our request for a 64-bit platform with 4 SATA ports resulted in a motherboard with a poor controller, so the performance went down compared to the previous year, but it is still above our tender requirements.
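
The per-TB figures quoted in sections 3 and 4 are the measured aggregate iozone throughput of a front-end normalised to the usable capacity it presents. A minimal sketch of that conversion and of the check against the tender thresholds, with made-up numbers:

    # Made-up measurement: a front-end presenting 25 TB usable that
    # sustains 240 MB/s aggregate read and 130 MB/s aggregate write.
    usable_tb=25
    read_mbs=240
    write_mbs=130
    awk -v u="$usable_tb" -v r="$read_mbs" -v w="$write_mbs" 'BEGIN {
        printf "read:  %.2f MB/s per TB usable (required: 6.4)\n", r/u
        printf "write: %.2f MB/s per TB usable (required: 4.8)\n", w/u
    }'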

Figure 4. Strange behavior of the WD Caviar Green WD15EADS-00P8B0.

4.3. Management challenges
There were no problems installing and managing these nodes; we simply put them into our existing infrastructure. We install them from PXE with a minimal kickstart, configure them via CFengine [3] and monitor them with Nagios and its check_linux_raid.pl plugin. There were no problems with the performance of software RAID on the C7/Atom CPUs. There is one issue with new 2 TB disks: they sometimes claim to use 0.5 kB blocks internally although they in fact use 4 kB blocks, so it is important to set up the software RAID correctly and to align its metadata properly. To ensure this, we use the following command:

    mdadm --create /dev/md1 -e 1.0 --level=5 --raid-devices=4 /dev/sd{a,b,c,d}1

4.4. Problems encountered
During the performance measurement and production run we encountered one big problem on the servers with the MSI chipset: the performance was sometimes very low. We suspected bad sectors on the disks or one completely bad disk. The problematic behavior can be seen in figure 4 (generated by the zcav utility). It turned out that the problem is in the disks: the performance of a disk was significantly degraded (e.g. the reading performance was about 5 kB/s) when more than one thread accessed it. This behavior was not easily predictable, but it appeared more often when the I/O load was heavy. According to our experience (and the user discussion at [4]), the whole series is bad.
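
The symptom can be reproduced with plain dd: read throughput of an affected drive collapses as soon as a second reader accesses it concurrently. The following is only a rough sketch of such a check, not the procedure used for figure 4; /dev/sdb and the offset are example values, and the test only reads, so the data on the drive is untouched:

    DEV=/dev/sdb    # example: the suspected WD Caviar Green drive

    # Baseline: one sequential reader, bypassing the page cache.
    dd if=$DEV of=/dev/null bs=1M count=4096 iflag=direct

    # Two concurrent readers at different offsets; on the affected
    # drives the reported rate drops dramatically (we observed reads
    # of about 5 kB/s under concurrent access).
    dd if=$DEV of=/dev/null bs=1M count=4096 iflag=direct &
    dd if=$DEV of=/dev/null bs=1M count=4096 skip=262144 iflag=direct &
    wait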

4.5. Conclusion and Future work
We are well satisfied with the reliability and performance of the hardware, but two obstacles prevent us from deploying it more broadly:
- missing hot-plug support
- missing remote management (BMC)
Recently, new products have been presented that should solve these issues: a chassis with hot-plug support for SATA disks [5] and a motherboard with better I/O throughput and support for an additional BMC card [6]. We therefore hope to address the mentioned issues this year and deploy this kind of storage solution more widely.

Acknowledgments
This work was supported by the Academy of Sciences of the Czech Republic.

References
[1] Jean-Philippe Baud, Disk Pool Manager, https://svnweb.cern.ch/trac/lcgdm/wiki/DPM
[2] Don Capps and William D Norcott, IOzone, http://www.iozone.org/
[3] Mark Burgess, CFengine, http://www.cfengine.org/
[4] Western Digital, WD community web, http://community.wdc.com/t5/desktop/WD15EADS-00P8B0-Really-slow-Or-am-I-just-crazy/td-p/1547
[5] mini-itx.com, Travla TE-1160, http://www.mini-itx.com/store/?c=33#p3819
[6] mini-itx.com, ASUS Hummingbird, http://www.mini-itx.com/store/?c=65#hummingbird
[7] Fermi National Accelerator Laboratory, The SAMGrid Project, http://projects.fnal.gov/samgrid/WhatisSAM.html