Tier-2 and Tier-3 Analysis: Experiences at the Tier-2 Sites. Giacinto Donvito, INFN-BARI


Outline: ALICE examples, ATLAS examples, CMS examples

ALICE examples
At the moment ALICE Tier-2s do not support interactive analysis: it is not foreseen by the ALICE computing model for Tier-2s, and there is no purely local (non-grid) access to the Tier-2 batch queues (CEs) and SEs. The idle time of the INFN-Bari resources can, however, be used for non-grid jobs submitted by the local community for testing and debugging code under development. Interactive analysis is done with PROOF at the Analysis Facilities (--> D. Berzano): CAF (mainly), SKAF, LAF, GSIAF. This is very efficient for small datasets (< 1 M events). The ALICE INFN Tier-2 federation is considering the possibility of providing AF services; a prototype of a virtual AF exists at the Torino Tier-2.

ALICE examples
The ALICE grid/analysis solutions offer a good degree of interactive functionality. Data access goes via xrootd, which allows data on any SE to be read from the ROOT prompt, e.g. TGrid::Connect("alien://"); TFile::Open("alien:///alice/cern.ch/..."); this is very useful to test new code. Grid job submission and output retrieval are automated by the AliEn plugin (--> M. Masera): everything is done from the ROOT prompt, including data selection, software version selection, job submission and splitting (data/SE driven), output merging and local copy.
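As a concrete illustration of this interactive access, the following ROOT macro is a minimal sketch: it assumes the standard AliEn ROOT plugin is installed and a valid grid token is available, and the catalogue path is a hypothetical placeholder rather than a real dataset.

  // open_alien_file.C - minimal sketch of interactive AliEn/xrootd access from ROOT
  #include <cstdio>
  #include "TGrid.h"
  #include "TFile.h"

  void open_alien_file() {
     // Authenticate against the AliEn catalogue (uses the existing grid token)
     TGrid::Connect("alien://");
     // Open a catalogue entry directly from the ROOT prompt; the data are
     // streamed via xrootd from whichever SE holds the file
     // (the path below is purely illustrative)
     TFile *f = TFile::Open("alien:///alice/cern.ch/user/x/xyz/AliESDs.root");
     if (!f || f->IsZombie()) { printf("open failed\n"); return; }
     f->ls();   // inspect the file content before writing real analysis code
  }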

ATLAS requirements
Grid access (Ganga, pathena); interactive facilities (coding and debugging); local (fast) job submission; efficient I/O from analysis jobs, which often means better CPU efficiency and therefore better use of the available resources; a fast and reliable SRM interface for WAN transfers; DDM tools (DQ2, efficient FTS channels).

Overview of the Milan farm
Local and grid facilities are separated, with different batch servers. Two CEs and ~450 cores are accessible via grid. A balanced pool of 3 UIs serves interactive use and grid job submission, plus one dedicated UI for submission to a local PBS cluster with three dedicated WNs. A pool of three nodes runs PROOF in clustered mode (xproof). A common GPFS file system serves the home, local and grid (with StoRM interface) storage areas; the software area is served to all the local and grid nodes by two NFS servers.

ATLAS example: tests on PROOF performance
The data were then copied to the local Storage Element using the various Grid tools. The tests were carried out in Milan, with important contributions from UD/TS and PI, using 4 different types of analysis code. (Dario Barberis: Tier-3 Task Force)

StoRM in Milan
Two frontends balanced via DNS; four GridFTP interfaces, also balanced via DNS and used mainly by FTS channels; a single backend.

GPFS file system
Two master clusters of NSD servers: 20 machines serving all the available storage, connected via fibre channel. Different file systems are used for the grid areas, the local storage and the users' home directories. Client machines are organized in several slave clusters according to their function (UIs, grid worker nodes, local worker nodes, PROOF nodes...): this is more stable and easier to configure, since the parameters are common to all the machines in a cluster and not all the clients need to see all the exported file systems. The tests on both single-node and clustered PROOF show that, with real analysis code, GPFS is not a bottleneck. N.B. the s/w area is exported via NFS, not GPFS.
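To give an idea of what a single-node check of this kind can look like, here is a minimal ROOT sketch (not the actual Milan test code): it simply times a full sequential read of a tree on the GPFS mount. The file path and tree name are hypothetical placeholders.

  // gpfs_io_check.C - minimal sketch of a single-node read test on a GPFS-mounted file
  #include <cstdio>
  #include "TFile.h"
  #include "TTree.h"
  #include "TStopwatch.h"

  void gpfs_io_check(const char *path = "/gpfs/atlas/local/test/AOD.root") {
     TStopwatch timer;
     TFile *f = TFile::Open(path);                 // plain POSIX open on the GPFS mount
     if (!f || f->IsZombie()) return;
     TTree *t = (TTree *)f->Get("CollectionTree"); // tree name is a placeholder
     if (!t) return;
     Long64_t nbytes = 0;
     for (Long64_t i = 0; i < t->GetEntries(); ++i)
        nbytes += t->GetEntry(i);                  // forces the actual read from storage
     timer.Stop();
     printf("read %lld bytes in %.1f s (%.1f MB/s)\n",
            nbytes, timer.RealTime(), nbytes / 1e6 / timer.RealTime());
  }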

CMS example: motivations
Provide an environment supporting: local use of CMSSW (main releases plus on-demand installations), local running of small tasks and code development; submission of jobs to the distributed infrastructure; storage of a limited amount of data for code validation and for specific studies (e.g. detector studies); storage of the personal data needed for interactive analysis; interactive activities (analysis with ROOT, work environment: mail, browser, document editing,...); backup of critical files against hardware failure or accidental deletion. (Claudio Grandi, INFN Bologna, CMS-Italia, Pisa, 29 January 2010)

Overview of the INFN-BARI farm
~10 different VOs supported; >1064 cores in ~140 WNs; ~500 TB of Lustre storage with SRM/GridFTP and xrootd access; 2 multi-VO frontend machines plus 1 CMS-dedicated frontend; a PhEDEx node; 1 LCG-CE plus 1 CREAM-CE.

General overview of the existing farm configuration at INFN-BARI (logical view)

Network infrastructure
One of the most important pieces of the infrastructure is the network. We tried to find a solution that maximizes the bandwidth while keeping the cost as low as possible: the goal is to provide ~1 Gbit/s per port without using a completely flat topology. We took into account the typical behaviour of each node in the farm: WNs usually read from the storage servers, while the storage servers provide data to the network, so WNs and storage servers can be placed on the same edge switch in order to make better use of the links between the core switch and the edge ones.

General overview of the existing farm configuration at INFN-BARI (network view: monitoring network, border router, department network)

Storage overview of the INFN-BARI computing centre
The storage area is unified in order to reduce the manpower required to keep the farm running: a single infrastructure (the Lustre cluster) holds both experiment data and home directories, even though the underlying storage servers may differ.

Features of this infrastructure: interactive jobs
A classic interactive cluster requires ad-hoc configuration to find the right size of the dedicated resource as the number of users grows, and it can happen that the cluster gets overloaded when many users are working at the same time. We tried another solution: the user can submit local batch jobs, but can also use the available WNs as an interactive resource; the batch manager chooses the right CPU on which to execute the job. This guarantees that the user has a dedicated CPU for the whole duration of the work, there is only one cluster to configure and maintain (which reduces the required manpower), and the cluster can grow in size dynamically depending on the users' requests.

Features of this infrastructure: interactive jobs
This feature is obtained by means of interactive jobs on Torque (LSF has similar functionality). The Maui configuration is tuned a bit in order to guarantee high priority to these jobs, and a home-made daemon makes sure that the queue time of an interactive job is always less than 1 minute. Interactive jobs can be suspended and recovered later using screen, there is no hard limit on the number of concurrent interactive sessions, and multi-CPU interactive jobs can also be run.


Features of this infrastructure
The /store/user/crab_name data are available as a local file system with POSIX access: users can read/write/modify those data, or publish locally produced data on DBS/PhEDEx under the standard /store/ paths. Default ACLs are used on the storage areas, e.g. setfacl -d -m g:cms:rw /lustre/cms/store and setfacl -d -m u:defilippis:rw /lustre/cms/store/user/ndefilip/. It is possible to use CRAB local PBS submission for fast, high-priority tasks using the Tier-2 CPUs. Lustre caches accessed files in the client memory, so running many times on the same files does not require reading the data from the disk servers each time. It is also possible to mount the /lustre storage on one's own desktop (e.g. via sshfs). We take a daily backup (on a separate disk server) of a small critical area (~50 GB per user).
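Both access paths mentioned above (POSIX through the Lustre mount, or through the site xrootd service) can be used directly from ROOT. The sketch below is only illustrative: the file name and the xrootd host name are hypothetical placeholders, not the actual Bari endpoints.

  // store_user_access.C - minimal sketch of POSIX vs xrootd access to /store/user data
  #include "TFile.h"

  void store_user_access() {
     // Same dataset read through the POSIX mount of the Lustre file system...
     TFile *fPosix = TFile::Open("/lustre/cms/store/user/ndefilip/ntuple.root");
     // ...or through the site xrootd door (host name is a placeholder)
     TFile *fXrootd = TFile::Open("root://xrootd.example.ba.infn.it//store/user/ndefilip/ntuple.root");
     if (fPosix)  fPosix->ls();
     if (fXrootd) fXrootd->ls();
  }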

Statistics of the infrastructure
88 registered local users; > 4500 local jobs executed per day, of which > 1500 for CMS; 2000 logins per month; 5 disk servers with 4x1 Gbit/s each; 190 TB on-line; 600 concurrent jobs.


Future work
Using the same infrastructure we will try to exploit PROOF (this will be of interest for both ALICE and CMS local users). Jobs will be submitted to the Torque batch system and act as pilot jobs: they will connect to the PROOF master and execute the task. The files will be accessed with the standard access patterns: Lustre for CMS and xrootd for ALICE (see the sketch below). We are also working on reliability and fault tolerance for some highly critical storage areas.
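A minimal sketch of what a user task routed through such a PROOF setup could look like follows; since the pilot-based setup is still future work, the master address, tree name, dataset path and selector are hypothetical placeholders.

  // proof_task.C - minimal sketch of a PROOF-based task over locally stored files
  #include "TProof.h"
  #include "TChain.h"

  void proof_task() {
     // Connect to the PROOF master that the Torque pilot jobs have joined as workers
     TProof *p = TProof::Open("proof-master.example.infn.it");
     if (!p) return;

     // Build a chain over files reachable with the standard access patterns
     // (Lustre POSIX paths for CMS, xrootd URLs for ALICE)
     TChain ch("Events");                       // tree name is a placeholder
     ch.Add("/lustre/cms/store/user/ndefilip/ntuple_*.root");
     ch.SetProof();                             // route Process() through the PROOF cluster
     ch.Process("MySelector.C+");               // user-provided TSelector, compiled on the workers
  }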

Acknowledgements
ALICE: Andrea Dainese, Antonio Franco, Nico Di Bari. ATLAS: David Rebatto, Roberto Agostino Vitillo. CMS: Claudio Grandi, Giacinto Donvito, Vincenzo Spinoso.