Grid Computing: dealing with GB/s dataflows

Grid Computing: dealing with GB/s dataflows. Jan Just Keijser, Nikhef (janjust@nikhef.nl) and David Groep, NIKHEF. 21 March 2011. Graphics: Real Time Monitor, Gidon Moont, Imperial College London, see http://gridportal.hep.ph.ic.ac.uk/rtm/

LHC Computing. The Large Hadron Collider is the world's largest microscope, looking at the fundamental forces of nature at scales down to 10^-15 m (atom, nucleus, quarks). It has a 27 km circumference and is located at CERN, Genève. It produces ~20 PByte of data per year and requires ~60,000 modern PC-style computers.

ATLAS Trigger Design. Level 1: hardware-based, online; accepts 75 kHz, latency 2.5 μs; 160 GB/s. Level 2: 500-processor farm; accepts 2 kHz, latency 10 ms; 5 GB/s. Event Filter: 1600-processor farm; accepts 200 Hz, ~1 s per event; incorporates alignment and calibration; 300 MB/s. From: The ATLAS trigger system, Srivas Prasad
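A quick way to sanity-check these figures is to divide each level's output bandwidth by its accept rate, which gives the implied event size at that stage. The short Python sketch below is my own illustration (not part of the original talk) and uses only the rates and bandwidths quoted above, with decimal GB:

```python
# Back-of-the-envelope check of the trigger numbers above:
# implied event size = output bandwidth / accepted event rate.
levels = [
    # name,          accept rate (Hz), output bandwidth (bytes/s)
    ("Level 1",      75_000,           160e9),
    ("Level 2",       2_000,             5e9),
    ("Event Filter",    200,           300e6),
]

for name, rate_hz, bandwidth_bps in levels:
    event_size_mb = bandwidth_bps / rate_hz / 1e6
    print(f"{name:12s}: {rate_hz:7.0f} Hz x {event_size_mb:4.1f} MB/event "
          f"= {bandwidth_bps / 1e9:6.2f} GB/s")
```

The implied event sizes come out at roughly 1.5 to 2.5 MB per event, consistent across the three levels.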

Signal/background ratio: 10^-9. Data volume = (high rate) × (large number of channels) × (4 experiments) ≈ 20 PetaBytes of new data per year: a stack of CDs holding one year of LHC data would be ~20 km tall, higher than Concorde's cruising altitude (15 km) or Mt Blanc (4.8 km), and approaching the altitude of a high-altitude balloon (30 km). Compute power = (event complexity) × (number of events) × (thousands of users) ≈ 60,000 processors.
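The CD-stack comparison is easy to reproduce. The sketch below assumes a CD capacity of ~700 MB and a thickness of ~1.2 mm (both my assumptions, not figures from the slide) and lands in the same order of magnitude as the ~20 km quoted:

```python
# Rough estimate of the "stack of CDs" comparison above.
data_per_year_bytes = 20e15      # ~20 PByte of new data per year (from the slide)
cd_capacity_bytes   = 700e6      # assumed capacity of one CD
cd_thickness_m      = 1.2e-3     # assumed thickness of one CD

n_cds = data_per_year_bytes / cd_capacity_bytes
stack_height_km = n_cds * cd_thickness_m / 1000

print(f"{n_cds / 1e6:.0f} million CDs, stack ~{stack_height_km:.0f} km high")
# -> roughly 29 million CDs and a stack a few tens of km high,
#    the same order of magnitude as the ~20 km quoted on the slide.
```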

Scientific Compute e-infrastructure Task parallelism (also known as function parallelism and control parallelism) is a form of parallelization of computer code across multiple processors in parallel computing environments. Task parallelism focuses on distributing execution processes (threads) across different parallel computing nodes. Data parallelism (also known as loop-level parallelism) is a form of parallelization of computing across multiple processors in parallel computing environments. Data parallelism focuses on distributing the data across different parallel computing nodes. From: Key characteristics of SARA and BiG Grid Compute services
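As a concrete illustration of the difference (a minimal sketch of my own, not code from the talk), the same distinction shows up even on a single multi-core machine with Python's standard multiprocessing module:

```python
# Data parallelism vs. task parallelism, illustrated with multiprocessing.
from multiprocessing import Pool

def reconstruct(event):
    """Stand-in for per-event processing: the SAME code applied to each data item."""
    return sum(event) / len(event)

def calibrate(detector):
    """Stand-in for one of several DIFFERENT tasks running concurrently."""
    return f"calibrated {detector}"

if __name__ == "__main__":
    events = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

    with Pool(processes=4) as pool:
        # Data parallelism: map one function over chunks of the data.
        results = pool.map(reconstruct, events)

        # Task parallelism: run different functions (tasks) on different workers.
        jobs = [pool.apply_async(calibrate, (d,)) for d in ("tracker", "calorimeter")]
        reports = [j.get() for j in jobs]

    print(results)   # [2.0, 5.0, 8.0]
    print(reports)   # ['calibrated tracker', 'calibrated calorimeter']
```

Grid workloads in high-energy physics are overwhelmingly of the data-parallel kind: the same program runs over many independent chunks of event data.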

What is BiG Grid? A collaborative effort of NBIC, NCF and Nikhef, which aims to set up a grid infrastructure for scientific research. This research infrastructure contains compute clusters and data storage, combined with specific middleware and software, to enable research that needs more than just raw computing power or data storage. We aim to assist scientists from all backgrounds in exploring and using the opportunities offered by the Dutch e-science grid. http://www.biggrid.nl

Nikhef (NDPF): 2500 processor cores, 2000 TByte disk, 160 Gbps network. SARA (GINA+LISA): 4800 processor cores, 1800 TByte disk, 2000 TByte tape, 160 Gbps network. RUG-CIT (Grid): 120 processor cores, 8 TByte disk, 10 Gbps network. Philips Research Ehv: 1600 processor cores, 100 GByte disk, 1 Gbps network.

Grid organisation: National Grid Initiatives & the European Grid Initiative. At the national level, a grid infrastructure is offered to national and international users by the NGIs; BiG Grid is (de facto) the Dutch NGI. The European Grid Initiative coordinates the efforts of the different NGIs and ensures interoperability. There are circa 40 European NGIs, with links to South America and Taiwan. The headquarters of EGI are at the Science Park in Amsterdam.

Cross-domain and global e-science grids. The communities that make up the grid are not under a single hierarchical control; they temporarily join forces to solve a particular problem at hand, each bringing to the collaboration a subset of their resources and sharing those at their discretion, under their own conditions.

Challenges: scaling up. Grid especially means scaling up: distributed computing on many different computers, distributed storage of data, large amounts of data (giga-, tera-, petabytes), and large numbers of files (millions). This gives rise to interesting problems: remote logins are not always possible on the grid, debugging a program is a challenge, regular filesystems tend to choke on millions of files, and storing data is one thing while searching and retrieving turn out to be even bigger challenges.

Challenges: security. Why is security so important for an e-science infrastructure? e-science communities are not under a single hierarchical control; as a grid site administrator you are allowing relatively unknown persons to run programs on your computers; and all of these computers are connected to the internet via an incredibly fast network. This makes the grid a potentially very dangerous service on the internet.

Lessons Learned: Data Management. Storing petabytes of data is possible, but... Retrieving data is harder than you would expect; organising such amounts of data is non-trivial; applications are much smaller than the data they need to process, so always bring your application to the data, if possible; and the data about the data (metadata) becomes crucial: location, experimental conditions, date and time. Storing the metadata in a database can be a life-saver.
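To make that last point concrete: a metadata catalogue does not have to be elaborate. The sketch below is my own minimal example (the schema and names are invented, not taken from BiG Grid or any grid middleware); it keeps per-file location and experimental conditions in SQLite so data can be found by its metadata instead of by crawling storage:

```python
# Minimal sketch of a file-metadata catalogue (illustrative only).
import sqlite3

conn = sqlite3.connect("metadata.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS files (
        lfn         TEXT PRIMARY KEY,   -- logical file name
        replica_url TEXT NOT NULL,      -- where the data actually lives
        run_number  INTEGER,
        conditions  TEXT,               -- experimental conditions
        taken_at    TEXT                -- date and time of data taking
    )
""")

conn.execute(
    "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?, ?)",
    ("/grid/demo/run042/events.root",
     "srm://storage.example.org/demo/run042/events.root",
     42, "magnet on, 7 TeV", "2011-03-21T10:00:00"),
)
conn.commit()

# Find data by its metadata instead of scanning millions of files:
for lfn, url in conn.execute(
        "SELECT lfn, replica_url FROM files WHERE run_number = ?", (42,)):
    print(lfn, "->", url)
```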

Lessons Learned: Job efficiency. A recurring complaint about grid computing is its low job efficiency (~94%). It is important to know that failed jobs almost always fail due to data access issues; if you remove data access issues, job efficiency jumps to ~99%, which is on par with cluster and cloud computing. Mitigation strategies: replicate files to multiple storage systems; pre-stage data to specific compute sites; program for failure (see the sketch below).
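"Program for failure" in practice often means not giving up on the first transfer error. The sketch below is my own illustration of combining the replication and retry strategies above; the URLs and the fetch function are placeholders, not real grid endpoints or middleware calls:

```python
# Try several replicas of the same file, with retries, before failing the job.
import time

REPLICAS = [
    "https://storage-a.example.org/data/run042/events.root",
    "https://storage-b.example.org/data/run042/events.root",
    "https://storage-c.example.org/data/run042/events.root",
]

def fetch(url: str) -> bytes:
    """Placeholder for the real transfer (e.g. via HTTP, GridFTP or xrootd)."""
    raise IOError(f"simulated transfer failure for {url}")

def fetch_with_failover(replicas, attempts_per_replica=2, backoff_s=0.1):
    """Try each replica a few times before declaring the job failed."""
    last_error = None
    for url in replicas:
        for attempt in range(attempts_per_replica):
            try:
                return fetch(url)
            except IOError as exc:
                last_error = exc
                time.sleep(backoff_s * (attempt + 1))   # simple linear backoff
    raise RuntimeError("all replicas failed") from last_error

if __name__ == "__main__":
    try:
        data = fetch_with_failover(REPLICAS)
    except RuntimeError as exc:
        print("job failed:", exc)
```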

Lessons Learned: Network bandwidth. All data taken by the LHC at CERN is replicated out to 11 Tier-1 centres around the world; BiG Grid serves as one of those Tier-1s. We always knew we had a good network, but: having a dedicated optical private network (OPN) from CERN to the Tier-1 data storage centres turned out to be crucial, and the network bandwidth between storage and compute clusters turns out to be equally important.

Questions? http://www.nikhef.nl