CMS Grid Computing at TAMU: Performance, Monitoring and Current Status of the Brazos Cluster

CMS Grid Computing at TAMU: Performance, Monitoring and Current Status of the Brazos Cluster
Vaikunth Thukral
Department of Physics and Astronomy, Texas A&M University

Outline
- Grid computing with CMS: PhEDEx and CRAB
- Our local computing center: Brazos/T3_US_TAMU
- Performance and monitoring
  - Data transfers
  - Data storage
  - Jobs
- Summary

Introduction to Grid Computing
- Cluster: multiple computers in a local network
- The Grid: many clusters connected by a wide area network
- Resources are expanded for thousands of users, who gain access to more distributed computing and disk
- CMS Grid: tiered structure (mostly about size and location)
  - Tier 0: CERN
  - Tier 1: a few national labs
  - Tier 2: larger university installations for national use
  - Tier 3: for local use (our type of center)

Next: define PhEDEx and CRAB, the CMS tools for managing data and running jobs on the grid
- Jobs: breaking the data analysis up into lots of parallel pieces

PhEDEx: Physics Experiment Data Export
- CMS data is spread around the world
- PhEDEx transports tens of terabytes of data to A&M per month

CRAB: CMS Remote Analysis Builder
- Jobs are submitted to the grid using CRAB
- CRAB decides how and where these tasks will run
- The same tasks can run anywhere the data is located
- Output can be sent anywhere you have permissions

Advantages of Having a CMS Tier 3 Computing Center at TAMU
- We don't have to compete for resources
- CPU priority: even though we bought only a small number of CPUs, we can periodically run on many more CPUs in the cluster at once
- Disk space: we can control what data is kept here
- With a standardized Tier 3 on a cluster, we can run the same way here as everywhere else
- Physicists don't do system administration

T3_US_TAMU as Part of Brazos
- The Brazos cluster was already established at Texas A&M
- We added our own CMS grid computing center within the cluster
- It is named T3_US_TAMU, as per CMS conventions

T3_US_TAMU added CPU and disk to Brazos as our way of joining
Disk:
- Brazos has a total of ~150 TB of storage space
- ~30 TB is assigned to our group
- Space is shared amongst group members
- N.B. another 20 TB is in the works
CPU:
- Brazos has a total of 307 compute nodes / 2656 cores
- 32 nodes / 256 cores were added by T3_US_TAMU
- Since we can run one job per core, we can run 256 jobs at any one time, and more when the cluster is underutilized or by prior agreement
- 184,320 (256 x 24 x 30) dedicated CPU hours/month
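As a quick check of that arithmetic, a minimal Python sketch; the 24-hour day and 30-day month are taken from the slide's own formula, and the 8-cores-per-node figure is only implied by the 32 nodes / 256 cores quoted above:

```python
# Quick check of the dedicated-resource arithmetic quoted above.
# 256 cores come from the slide (32 nodes x 8 cores each is implied);
# the 24 h x 30 d "month" matches the slide's 184,320 figure.
cores_added = 256
hours_per_day = 24
days_per_month = 30

dedicated_cpu_hours = cores_added * hours_per_day * days_per_month
print(f"Dedicated CPU hours per month: {dedicated_cpu_hours:,}")  # 184,320
```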

Grid Computing at Brazos: Summary
- The Tier 3 is fully functional on the cluster
- Instructions on how to use it can be found at: http://collider.physics.tamu.edu/tier3/
- We are updating our "Best Practices" on how to bring over data and run jobs: http://collider.physics.tamu.edu/tier3/best_practice

Grid Computing at Brazos: Summary (cont.)
- Next: describe how well the system is working by showing results from our online monitoring pages(*)
(*) Thanks to Dr. Joel Walker for leading this effort

Three Main Topics
Tier 3 functionality:
- Data transfers (PhEDEx)
- Data storage: PhEDEx datasets + local user storage
- Running jobs (CRAB)
We need to test and monitor all of these:
- CMS provides some monitoring tools
- We have designed additional Brazos-specific custom monitoring tools

PhEDEx at Brazos
PhEDEx performance is continually tested in different ways:
- LoadTests
- Transfer quality
- Transfer rate

PhEDEx at Brazos (cont.)
- LoadTest: acts as a test of the handshake between TAMU and linked sites in Taiwan, Europe, the US, etc.

PhEDEx at Brazos (cont.)
- Transfer quality: monitors whether the transfers we have requested are actually coming across successfully
- Transfers from Italy, Taiwan, the UK, etc.

PhEDEx Transfers
PhEDEx data transfer performance (peak at 320 MB/s):
- Network and client settings have been optimized
- 20-fold increase in average transfer speeds from January to June: from ~10 MB/s to ~200 MB/s
- Other T3 sites average between 50 and 100 MB/s

PhEDEx Transfers (cont.)
- Can download from multiple locations at once and for extended periods of time
- Capable of transferring large volumes consecutively
- In principle, could download up to 25 TB in one day!
- Did 10 TB just yesterday
- Last month we brought over ~45 TB
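As a rough cross-check of the 25 TB/day claim, a minimal sketch; the 300 MB/s sustained rate assumed here is not from the slides, it simply sits between the ~200 MB/s average and the 320 MB/s peak quoted on the previous slide:

```python
# Rough cross-check of "could download up to 25 TB in one day".
# The 300 MB/s sustained rate is an assumption between the quoted
# ~200 MB/s average and 320 MB/s peak.
sustained_mb_per_s = 300.0
seconds_per_day = 24 * 3600

tb_per_day = sustained_mb_per_s * seconds_per_day / 1e6  # MB -> TB (decimal units)
print(f"~{tb_per_day:.1f} TB/day at {sustained_mb_per_s:.0f} MB/s sustained")  # ~25.9 TB/day
```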

Data Storage and Monitoring
We monitor PhEDEx and user files:
- HEPX user output files
- PhEDEx dataset usage
Note that this is important for self-imposed quotas: we need to know that we are staying below our 30 TB allocation (it will expand to 50 TB soon). We will eventually send email if we get near our limit.
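A minimal sketch of the kind of quota check that could drive such an email alert; the path, the addresses, the thresholds, and the use of `du` are illustrative assumptions, not the group's actual monitoring code:

```python
# Illustrative quota-watch sketch: reports usage under a storage area and
# emails a warning when it approaches the allocation. The path, addresses,
# and thresholds are hypothetical placeholders.
import subprocess
import smtplib
from email.message import EmailMessage

STORAGE_PATH = "/fdata/hepx"   # hypothetical group storage area
QUOTA_TB = 30.0                # current self-imposed allocation
WARN_FRACTION = 0.9            # warn at 90% of quota

def used_tb(path):
    """Return disk usage under `path` in TB, using `du -sb` (bytes)."""
    out = subprocess.check_output(["du", "-sb", path], text=True)
    return int(out.split()[0]) / 1e12

def main():
    usage = used_tb(STORAGE_PATH)
    print(f"{STORAGE_PATH}: {usage:.2f} TB of {QUOTA_TB:.0f} TB used")
    if usage > WARN_FRACTION * QUOTA_TB:
        msg = EmailMessage()
        msg["Subject"] = f"Storage warning: {usage:.1f}/{QUOTA_TB:.0f} TB used"
        msg["From"] = "monitor@example.edu"   # placeholder addresses
        msg["To"] = "admins@example.edu"
        msg.set_content(f"Usage under {STORAGE_PATH} is near the allocation.")
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)

if __name__ == "__main__":
    main()
```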

Running CRAB Jobs on Brazos
We have set up two fully functional submission modes for the convenience of users:
- condor_g: run jobs locally
- glite: can submit from anywhere in the world!
More as needed:
- PBS: in the process of making this work
We have created standard test jobs:
- CRAB Admin Test Suite (CATS?): these test both condor_g and glite, output to FNAL and to Brazos, big and small outputs, as well as large numbers of jobs (a typical command-line workflow is sketched below)
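For orientation, a minimal sketch of what a CRAB2-style command-line workflow for one of these test submissions might look like; the config file name is a placeholder, and the exact client options can differ between CRAB versions:

```python
# Illustrative driver for a CRAB2-style test submission, wrapping the
# standard `crab` command-line client with subprocess. The config file
# name is a placeholder.
import subprocess

def run(cmd):
    print("$", " ".join(cmd))
    subprocess.check_call(cmd)

# Create the task from a config that selects the scheduler (condor_g or
# glite), then submit, poll status, and retrieve output when jobs finish.
run(["crab", "-create", "-cfg", "crab_cats_test.cfg"])  # placeholder config name
run(["crab", "-submit"])
run(["crab", "-status"])
run(["crab", "-getoutput"])
```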

Current Status of CATS
- Validation test jobs (CATS): all work
- Working on automating these to run periodically
- Then they will be added to the monitoring site

Already Running LOTS of CRAB Jobs
[Charts: CRAB usage by different groups, and HEPX group usage in June, shown as fractions and CPU hours]
- The A&M group has used a large amount of CPU
- Not all of the CPU available to us has been used by members of the Aggie family
- We have provided a lot of CPU to outside users on CMS

More on CPU Used and Jobs Run
CRAB performance:
- Job success rates have improved
- A larger number of use cases accommodates more jobs
- The current trend indicates exponentially growing usage, and we can still use MANY more CPU hours
[Charts: jobs per month and CPU hours per month, January through June]

Future Plans and Upgrades for Brazos/Monitoring
- Add the ability to run jobs via PBS
- Add more monitoring of jobs and CPU
- Add more disk
- Automate the running of CATS regularly and report results on the monitoring page
- Automate the checking of the monitoring to send mail on a failure or critical condition (disk space nearly full, jobs failing, PhEDEx transfers failing, etc.)

Summary
- Grid computing is a central part of CMS analysis around the world
- Our own grid computing center gives us high-priority access to disk space and CPU, which is an important competitive advantage in the search for supersymmetry and the Higgs
- T3_US_TAMU at Brazos is fully functional and has already provided useful resources to the group
- We are constantly working to improve the monitoring of the system
- More resources will be added as we max out the current ones

BACKUP SLIDES

Resources
Future expansion:
- Adding 20 TB more to the current disk space
- Can continue to get more as needs increase
- Possibly upgrade to a Tier 2 site
The Brazos Team

Operation Procedures
Submitting jobs at T3_US_TAMU:
- Many use cases have been tested extensively
- Some cases work better than others
- The best way to run tasks on Brazos: http://collider.physics.tamu.edu/tier3/best_practice
- An example

Operation Procedures
Important configuration parameters:
- Schedulers
- Datasets
- White lists
Scheduler options at T3_US_TAMU:
- condor_g: quick, suited for local submissions
- glite: slower, suited for grid submissions
- PBS: currently being tested
(A sketch of how these parameters can appear in a CRAB configuration follows.)
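A minimal sketch of how these parameters might be laid out in a CRAB2-style crab.cfg, generated here with Python's configparser; the dataset, CMSSW config, and site names are placeholders, and the parameter names should be checked against the CRAB documentation for the version in use:

```python
# Illustrative generation of a CRAB2-style crab.cfg. Section and key names
# follow the common CRAB2 layout; dataset, pset, and site values are placeholders.
import configparser

cfg = configparser.ConfigParser()
cfg["CRAB"] = {
    "jobtype":   "cmssw",
    "scheduler": "glite",                               # or "condor_g" for local submissions
}
cfg["CMSSW"] = {
    "datasetpath":            "/SomeDataset/SomeEra/AODSIM",  # placeholder dataset
    "pset":                   "analysis_cfg.py",               # placeholder CMSSW config
    "total_number_of_events": "-1",
    "events_per_job":         "10000",
}
cfg["USER"] = {
    "return_data":     "0",
    "copy_data":       "1",
    "storage_element": "T3_US_TAMU",                    # send output to Brazos (placeholder)
}
cfg["GRID"] = {
    "se_white_list": "T3_US_TAMU",                      # restrict where jobs run/read data
}

with open("crab.cfg", "w") as f:
    cfg.write(f)
```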

Optimization: Testing CRAB
Construct test jobs:
- These jobs test every aspect of the process
- We currently run 8 test jobs
Particular cases tested (the sketch below shows how such a matrix can be enumerated):
- Scheduler type
- Output file size (small/large)
- Local output destination
- Remote output destination
- Number of jobs (small/large)
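A minimal sketch of how such a test matrix could be enumerated; which three binary dimensions combine to give the 8 jobs on the slide is an assumption (here: 2 schedulers x 2 output sizes x 2 destinations):

```python
# Illustrative enumeration of a CATS-style test matrix. The choice of which
# three binary dimensions combine into the 8 test jobs is an assumption.
from itertools import product

schedulers   = ["condor_g", "glite"]
output_sizes = ["small", "large"]
destinations = ["Brazos", "FNAL"]

test_jobs = [
    {"scheduler": s, "output_size": o, "destination": d}
    for s, o, d in product(schedulers, output_sizes, destinations)
]

for i, job in enumerate(test_jobs, start=1):
    print(f"test job {i}: {job}")

print(f"total: {len(test_jobs)} test jobs")  # 8
```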

Monitoring Tools Provided by the Grid
- SAM tests
- PhEDEx webpage