Big Computing and the Mitchell Institute for Fundamental Physics and Astronomy. David Toback

Big Computing and the Mitchell Institute for Fundamental Physics and Astronomy
Texas A&M Big Data Workshop, October 2011
January 2015, Texas A&M University Research Topics Seminar

Outline
- Overview of Particle Physics and Astronomy and the Mitchell Institute
- Collider Physics Computing Needs
- Dark Matter Computing Needs
- Phenomenology
- Requirements and Lessons Learned
- Status and Future

Overview of Particle Physics and Astronomy and the Mitchell Institute
- The Mitchell Institute covers both Particle Physics and Astronomy
- Both theory and experiment in each, including String Theory
- Many of these groups need significant computing, although in many different ways
- Our primary workhorse has been the Brazos Cluster (brazos.tamu.edu)
- Big usage from 4 users/groups:
  - CMS Experiment at the Large Hadron Collider (LHC)
  - Dark Matter search using the CDMS Experiment
  - High Energy Phenomenology
  - Other (mostly the CDF experiment at Fermilab, and the Astronomy group)

CMS Experiment at the LHC
- Collider physics at CERN/Fermilab has often been among the big computing drivers in the world (it brought us the WWW)
- The LHC was a driving force in creating Grid computing
- The LHC experiments have a 3-tiered, distributed computing model (T1, T2 and T3 for CMS, of which A&M is a member institution) that requires tens of petabytes of storage and millions of CPU hours
- Not well suited to supercomputing at A&M because of its requirements:
  - Ability to run on the Grid as an international collaboration: jobs and data come from around the world, which raises firewall issues for external users
  - Automated data distribution, and local jobs regularly accessing remote databases
- The computing need here is high THROUGHPUT, not high PERFORMANCE: just run LOTS of independent jobs on multi-TB datasets (a minimal sketch of this pattern follows below)
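The "high throughput, not high performance" point boils down to embarrassingly parallel work: each job reads its own slice of the dataset and never talks to any other job. Below is a minimal sketch of that pattern in Python, assuming hypothetical file names and a stand-in analyze_file() function; the real jobs are submitted to the Brazos batch system and the Grid rather than run on one machine.

# Minimal sketch of the high-throughput pattern described above: many
# independent jobs, each processing its own slice of a multi-TB dataset.
# File names and analyze_file() are hypothetical placeholders.
from concurrent.futures import ProcessPoolExecutor

def analyze_file(path):
    # Stand-in for one independent analysis job: no communication with any
    # other job, so throughput scales with the number of cores available.
    n_events = 100_000
    n_selected = sum(1 for i in range(n_events) if hash((path, i)) % 50 == 0)
    return path, n_selected

if __name__ == "__main__":
    # One entry per data file; in practice thousands of files spread over the cluster.
    dataset = [f"/fdata/mitchell/dataset/part_{i:04d}.dat" for i in range(8)]
    with ProcessPoolExecutor(max_workers=4) as pool:
        for path, n in pool.map(analyze_file, dataset):
            print(f"{path}: {n} events selected")

Because the jobs are independent, adding cores (or Grid slots) increases throughput without any changes to the jobs themselves.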

Dark Matter Searches with the CDMS Experiment
- Much smaller experiment (~100 scientists)
- Smaller-scale computing, but many of the same issues, as we interface with other national labs:
  - Stanford Linear Accelerator Center (SLAC)
  - Fermi National Accelerator Laboratory (FNAL or Fermilab)
  - Sudbury Neutrino Observatory (SNOLAB)

Particle Theory / Phenomenology
- Particle phenomenologists at the Mitchell Institute do calculations, but also run VERY large simulations of collider data to see what can be discovered at the LHC
- Again, they just need high throughput and lots of memory
- Jobs don't need to talk to each other

Overview of Brazos and Why We Use It (and not somewhere else)
- We bought into the stakeholder model on Brazos:
  - 309 compute nodes / 2,096 cores in the cluster; the Institute owns 700 cores and can run on other cores opportunistically(!)
  - ~200 TB of disk; the Institute owns about 100 TB and can use extra space if available
  - Can get ~1 TB/hour from Fermilab, 0.75 TB/hour from SLAC
  - Links to places around the world (CERN-Switzerland, DESY-Germany, CNAF-Italy, UK, FNAL-US, Pisa, Caltech, Korea, France, etc.)
- Big wins for us:
  - Well suited for high throughput (lots of parallel jobs looking at millions of separate collisions to see if they look like a new particle was produced)
  - Accepts jobs from the Open Science Grid (OSG)
  - Excellent support from admins Almes (PI), Dockendorf and Johnson
- Fun use case: typically bring big data copies from CERN, run on them to make locally processed data, then delete the local copy of the raw data (a minimal sketch of this cycle follows below)
- More detail on how we run at: http://collider.physics.tamu.edu/mitchcomp
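A minimal sketch of that copy-in / process / delete cycle, assuming hypothetical paths and a stand-in process_raw() step; in practice the stage-in is a Grid transfer from CERN or Fermilab at roughly the rates quoted above, not a local shutil.copy.

# Minimal sketch of the stage-in / process / clean-up cycle described above.
# The transfer step, paths, and process_raw() are hypothetical placeholders.
import shutil
from pathlib import Path

def stage_in(remote_copy: Path, local_raw: Path) -> None:
    # Placeholder for the real wide-area transfer of a raw data block.
    shutil.copy(remote_copy, local_raw)

def process_raw(local_raw: Path, processed: Path) -> None:
    # Placeholder for the analysis that turns raw data into small, locally kept output.
    processed.write_text(f"summary of {local_raw.name}\n")

def run_cycle(remote_copy: Path, scratch: Path, output: Path) -> None:
    local_raw = scratch / remote_copy.name
    stage_in(remote_copy, local_raw)
    try:
        process_raw(local_raw, output)
    finally:
        # Delete the raw copy once the processed data exists, freeing the disk.
        local_raw.unlink(missing_ok=True)

if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        remote_dir, scratch = base / "remote", base / "scratch"
        remote_dir.mkdir(); scratch.mkdir()
        remote = remote_dir / "raw_block_0001.dat"
        remote.write_bytes(b"\x00" * 1024)   # stand-in for a multi-GB raw block
        run_cycle(remote, scratch, base / "ntuple_0001.txt")
        print("processed output:", (base / "ntuple_0001.txt").read_text().strip())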

Online Monitoring
- Constantly interrogate the system: Disks up? Jobs running? Small data transfers working?
- Run short dummy jobs for various test cases; both run local jobs and accept automated jobs from outside
- Automated alarms go to the first-line-of-defense team (not the admins) as well as to the admin team: send email and turn the monitoring page red (a minimal sketch of such a check follows below)
- More detail about our monitoring at http://hepx.brazos.tamu.edu/all.html
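A minimal sketch of the kind of periodic check described above, with hypothetical probes, thresholds, and alarm handling; the real monitoring also submits short dummy jobs to the batch system and accepts automated test jobs from outside, which is not shown here.

# Minimal sketch of a periodic health check; probes and thresholds are
# hypothetical placeholders, not the actual Brazos monitoring code.
import shutil
import subprocess

CHECKS = {
    # name -> callable returning True if healthy
    "disk_mounted": lambda: shutil.disk_usage("/").free > 10 * 1024**3,  # >10 GB free, stand-in for "disks up"
    "batch_alive":  lambda: subprocess.run(["true"]).returncode == 0,     # stand-in for a short dummy job
}

def probe_safely(probe):
    try:
        return bool(probe())
    except Exception:
        return False

def run_checks():
    return [name for name, probe in CHECKS.items() if not probe_safely(probe)]

if __name__ == "__main__":
    failed = run_checks()
    if failed:
        # In the real system this would email the first-line team and turn the status page red.
        print("ALARM:", ", ".join(failed))
    else:
        print("all checks passed")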

Fun Plots About How Well We've Done
[Plots: cumulative CPU-hours and CPU-hours per month]
- Picked up speed with the new operating system and sharing rules

More Fun Numbers
Top 5 Months
1. 1,750,719 core hours - January 2015
2. 1,702,351 core hours - December 2014
3. 670,049 core hours - November 2014
4. 584,213 core hours - April 2013
5. 503,757 core hours - May 2013
Top 5 Users of the Month
1. 1,487,169 core hours - K. Colletti (CDMS)
2. 1,234,387 core hours - K. Colletti (CDMS) - Dec 2014
3. 476,798 core hours - K. Colletti (CDMS) - Nov 2014
4. 439,777 core hours - S. Wu (Pheno) - Dec 2014
5. 382,759 core hours - A. Perloff (CMS) - Apr 2013
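As a rough cross-check of what these numbers mean (assuming the January figure covers the full month, 31 days x 24 hours = 744 hours): 1,750,719 core-hours / 744 hours is roughly 2,350 cores busy around the clock, more than three times the 700 cores the Institute owns on Brazos. That is only possible because of the opportunistic sharing described earlier.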

Some Lessons Learned
- Monitoring how quickly data gets transferred can tell you if there are bad spots in the network, locally as well as around the world
  - Found multiple bad/flaky boxes in Dallas using perfSONAR (a minimal rate-check sketch follows below)
- Monitoring how many jobs each user has running tells you how well the batch system is doing fair-share and load balancing
  - Much harder than it looks, especially since our users are very "bursty": they don't know exactly when they need to run, and when they do need to run they have big needs NOW (telling them to plan doesn't help)
- Experts who know both the software and the admin side are a huge win
  - Useful to have users interface with local software experts (my students) as the first line of defense before bugging the admins
- National labs are much better set up for collaborative work, since they trust collaborations
  - Upside to working at the lab: much more collective disk and CPU, and important data stored locally
  - Downside: no one gets much of the disk or CPU (most of our users could use both, but choose to work locally if they can)
- Balancing security with the ability to get work done is difficult, and different sites strike that balance differently
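A minimal sketch of the transfer-rate bookkeeping mentioned in the first bullet, using a timed local copy as a stand-in for a real wide-area transfer; the threshold, file size, and paths are hypothetical (the real checks rely on Grid transfer tools and perfSONAR).

# Minimal sketch of a transfer-rate check: time a copy and flag slow links.
# Threshold, test file, and paths are hypothetical placeholders.
import shutil
import tempfile
import time
from pathlib import Path

THRESHOLD_MB_S = 50.0   # flag anything slower than this (hypothetical)

def timed_copy(src: Path, dst: Path) -> float:
    """Return the observed transfer rate in MB/s."""
    start = time.monotonic()
    shutil.copy(src, dst)
    elapsed = time.monotonic() - start
    return (src.stat().st_size / 1e6) / max(elapsed, 1e-9)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "testfile.dat"
        src.write_bytes(b"\x00" * 50_000_000)   # 50 MB stand-in for a test dataset
        rate = timed_copy(src, Path(tmp) / "copy.dat")
        status = "OK" if rate >= THRESHOLD_MB_S else "SLOW - possible bad spot in the network"
        print(f"transfer rate: {rate:.1f} MB/s ({status})")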

Conclusions
- Mitchell Institute physicists effectively use the Brazos cluster for our high-throughput needs
- More disk would be good, but the ability to get data here quickly has ameliorated that (and been the envy of our colleagues)
- More CPU/supercomputing would be good, but the ability to accept jobs from the Grid is more important; running opportunistically has worked fine
- The amount of red tape to get jobs in, and to allow our non-A&M colleagues to run, has been significant (but not insurmountable)
- Bottom line: we've been happy with the Brazos cluster (thanks, admins!) as it helped us discover the Higgs boson
- Looking forward to when the LHC turns back on in March!