GridKa Site Report
51st Session of the GridKa TAB, 6.07.2006

Holger Marten
Forschungszentrum Karlsruhe GmbH
Institut für Wissenschaftliches Rechnen, IWR
Postfach 3640, D-76021 Karlsruhe

Document ID: IWR-Rep-GridKa0073-v1.0-TABSiteReport-060706.doc

Content
1. Input for discussion on manpower issues
2. Status of new hardware
3. Other services & follow-ups from last TAB
4. Problems since last TAB

1. Input for discussion on manpower issues

Don't interpret this table without reading the explanations below!

Name | Status | FTE on GridKa payroll | FTE for GridKa operations | Main tasks

GIS
Alef | (s, p) | 1,0 | 1,0 | front-ends, WNs, Linux, electricity, cooling, education of BA students
Epting | (t, p) | 1,0 | 0,05 | ISSeG, EGEE, D-Grid, CA, security
Ernst | (t, p) | 1,0 | 1,0 | batch, accounting, certificates
Gabriel | (s, t) | 1,0 | 1,0 | PPS & production grid services
Garcia Marti | (s, t) | ISSeG | 0,05 | ISSeG
Halstenberg | (s, t) | 1,0 | 1,0 | FTS, LFC, dcache, srm, tape, (experiment data bases)
Heiss | (s, t) | 1,0 | 1,0 | Tier-2, experiment and SC contact & coordination
Hermann | (s, t) | 1,0 | 0,05 | EGEE ROC management DECH, EGEE SA1 tasks DECH
Hoeft | (t, p) | 1,0 | 0,75 | ISSeG, LAN, WAN, FTS, security, education of students from foreign countries
Hohn | (t, p) | 1,0 | 1,0 | Linux, server/OS installations, development & implementation of fast recovery tools, mail, education of BA students
Jaeger | (t, p) | 1,0 | 1,0 | Linux, ROCKS packaging, Ganglia, Nagios, infrastructure installation
Koerdt | (s, t) | EGEE | 0,05 | EGEE SA1 tasks DECH, deputy ROC management DECH
Marten | (s, p) | 1,0 | 1,0 | GridKa management, financing
Meier | (t, t) | 1,0 | 1,0 | disk, file server operation
Motzke | (s, t), from 7/06 | (1,0) | (1,0 planned) | experiment data bases, Oracle, LFC
Ressmann | (s, p) | 1,0 | 1,0 | dcache, srm, tape
Schäffner | (t, p) | 1,0 | 0,5 | EGEE, D-Grid, VO management, certificates, web pages
Sharma | (t, t) | 1,0 | 1,0 | administrative support, wiki & cms documentation systems
Stannosek | (t, t) | 1,0 | 1,0 | hardware setup, repair, exchange
van Wezel | (s, p) | 1,0 | 1,0 | disk storage + almost all other technical issues
Verstege | (t, p) | 1,0 | 1,0 | Linux, ROCKS packaging, Ganglia, Nagios, infrastructure installation
NN1 | (s, t), from 11/06? | (1,0) | (1,0 planned) | PPS & production grid services (planned companion for Gabriel)
NN2 | (t, t), from 10/06? | (1,0) | (1,0 planned) | LAN, WAN, security (planned companion for Hoeft)

DASI
Antoni | (s, t) | 0,3 | 0,3 | GGUS development
Dres | (t, t), from 9/06 | (1,0) | (? tbd) | GGUS development, ticket handling
Glöer | (s, t) | 0,2 | 0,2 | tape system management
Grein | (t, t), from 9/06 | (1,0) | (? tbd) | GGUS development, ticket handling
Heathman | (t, p) | 0,3 | 0,3 | marketing, conference contributions
Wochele | (s, p) | 0,15 | 0,15 | Oracle, experiment DBs
company | | 1,0 | 1,0 | LAN technical support

Sum | | 19,95 + (5,0) | 17,40 + (3,0) | (figures in parentheses: planned positions not yet filled)

This is sensitive information. Please don't distribute it, and please don't use it to contact the respective persons directly in case of problems. Instead, please use the trouble ticket systems, to avoid problems in cases of vacation etc.

Status (s/t, p/t) means: (scientist / technician, permanent / temporary contract). No distinction is made between real technicians and engineers.

"FTE on GridKa payroll" means the fraction of an FTE per person that is financially accounted to the GridKa project.

"FTE for GridKa operations" means local operation in the widest sense, i.e. including management, planning, installation & operation of hardware and software, optimization, ticket solutions etc. in the GridKa environment. It does not include activities that formally run under the GridKa flag but do not directly contribute to local operations tasks (such as EGEE ROC management for DECH). However, it should be emphasized that there are definitely synergies among these different projects. Since it is difficult to quantify them, an estimate of 0,05 FTE is given in these cases.

What are the potential issues?

1. In the current phase of permanent upgrades, improvements, new requirements and new functionalities (of hardware, services, experiment requirements etc.), many of these tasks explicitly require expert knowledge. The experts who maintain the services are the same persons who are needed in meetings and workshops, and for reports, conference contributions, publications, funding proposals etc., and a deputy is not always guaranteed.

2. A significant fraction of the experts sit on temporary positions and do not have an adequate full deputy. There is a high risk of losing these people, and thus the know-how, if an expert decides to move to another institute / job.

3. The communication between GridKa staff and the experiments through boards and meetings (esp. the TAB) is not bad, but during some phases it is not intense enough, sometimes leading to misunderstandings of requirements, priorities and objections on both sides.

Possibilities for improvement

1. Wide(r) spread of expert knowledge among GridKa staff. This is already done in the following ways:
   - Central documentation of the GridKa setup and operations procedures through an internal wiki, inventory and other databases. Already existing and permanently extended and updated.
   - Well defined communication channels through several internal and partially internal mailing lists and ticket process workflows. Already existing.
   - Two permanent weekly meetings plus technical meetings at fixed daily times on demand (every admin can ask for a meeting on the following day). Already existing.
   - Identification of recurring (and not too complex) expert tasks that could be done by other people to unburden and deputize the experts. This process has just started, and we'll see how far we can get with it.

   These methods are quite obvious and strengthen the collaboration of people, but it is also clear that there are natural limitations, especially during phases with very dynamic changes of (external) software and requirements. Detailed expert knowledge will always be needed; information exchange cannot replace experience.

It is also obvious that this is a continuous process that needs permanent improvement and time (a learning curve).

2. Short-term contributions by / involvement of external experts would be very much appreciated in the following fields: srm/dcache; SAM/monitoring; configuration of storage subsystems; xrootd; experiment-specific SFTs, connectivity (to other centres), tests and ticket solutions. However, the emphasis here lies on "short-term" and on experts who have a deep insight into and knowledge of these systems and tasks and of the GridKa environment and requirements (I guess at least the latter could be gained quickly). In the current situation we do not think that we can cope with additional people who have to be trained by ourselves; this might be even more time-consuming than helpful.

3. Better information exchange with the experiments. We did not get the impression that huge workshops with extremely tight agendas and several dozens of attendees are always successful, and we do not want to suggest yet more regular meetings. However, ad hoc phone conferences with a few experts on each end of the line, focusing on one or two hot topics and without preparing high-gloss transparencies, are extremely effective and satisfactory for both sides.

4. Medium-term contributions by experiment people. Again, this kind of collaboration is helpful for the experiments as well as for the sites. The application for funding of extra people and for a virtual institute is surely a step in the right direction.

2. Status of new hardware

OPN 10 Gbps to CERN
The light path was delivered by DFN in June. Performance and error rate tests are ongoing, but there are still some routing problems (these are separate tests that do not influence the production environment).

CPU
The problem with the new temperature offset of the CPUs was reported at the last TAB. The first BIOS patch, delivered in June, did not improve the situation. However, the NEC storage servers that are identical in construction do not show this problem, so we are trying to get the same BIOS certified for the WNs as well. Delivery of the new WNs to the users is expected during July. Side note: this problem is now documented in the revision guide for AMD Opteron CPUs, Rev. 3.59: Errata 154, "Incorrect diode offset". Laugh or cry.

Disk
The first 20 TB of NEC storage have been handed over to BaBar. We will address the demands of the other experiments and the next BaBar storage asap. The storage is made available in chunks of 20 TB (access via xrootd, nfs, dcache). Not all newly delivered storage will be available soon. 17 TB of disk-only dcache storage (no tape connection) is being put online to fulfill the demands of LHCb and CMS for SC4; we plan to finish this week.

New front-ends / VO boxes
The new hardware has been received. The machines for ALICE and CMS are already configured and have been delivered to the experiments. Next is ATLAS within the next days (because of performance problems with the old machine during SC4), then LHCb, then Dzero.

3. Other services & follow-ups from last TAB

Squid for CMS
This was defined / followed up within the LCG 3D sub-project. The Squid is ready for testing by CMS.
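As an illustration, a minimal sketch of what such a functional test could look like, assuming a hypothetical Squid host/port and test URL (these are placeholders, not GridKa values). It requests the same object twice through the proxy and inspects Squid's X-Cache response header, which should report MISS on the first fetch and HIT on the second if the object is cacheable and caching works.

```python
# Hypothetical sketch: check that a Squid cache answers and actually caches.
# Proxy host/port and test URL are placeholders, not the GridKa configuration.
import urllib.request

PROXY = "http://squid.example.gridka.de:3128"               # hypothetical Squid frontend
TEST_URL = "http://frontier.example.cern.ch/test/payload"   # hypothetical cacheable object

opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": PROXY})
)

def fetch():
    """Fetch TEST_URL through the proxy and return the X-Cache header."""
    with opener.open(TEST_URL, timeout=30) as resp:
        return resp.headers.get("X-Cache", "no X-Cache header")

# First request should be a MISS (object fetched from the origin server),
# the second one a HIT if the Squid is caching correctly.
print("1st request:", fetch())
print("2nd request:", fetch())
```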

Oracle services
These were defined / followed up within the LCG 3D sub-project. There are delays at GridKa because of missing manpower. Oracle RACs for LHCb and ATLAS have been set up; Oracle Streams T0-T1 are currently being implemented.

gLite 3.0 in production
The migration was done ad hoc on June 20/21, and the experiments (esp. ATLAS) complained about not being informed well in advance. See also the mail exchange on the TAB mailing list. SAM monitoring is not working properly at GridKa; a respective ticket has been opened with the developers (info as of June 30).

Consolidation of grid mapfiles
The generation of grid mapfiles at GridKa has been centralised and consolidated because of too many error-prone inconsistencies in the internal environment. Unfortunately, the migration was not discussed with / announced in time to the experiments and affected Dzero production, because the VO server at FNAL requires ticket exchange between VO servers, which is not used within LCG/EGEE and thus exposed a bug in the middleware. A temporary workaround has been worked out with Dzero.

News via GGUS (from the last TAB)
1. There were valid complaints about the diversity of news posted via GGUS. We have implemented a workflow in which well defined people write the news.
2. The mail-forwarding problems of news to vo-softadmins with @CERN end addresses have been solved. See the separate mails on the TAB list.
3. Please note that the correct portal for news posted by centres is the regional support portal (DECH in our case) and not GGUS. We will move the news announcements by GridKa to the DECH portal in the near future.

Fair share to be published on GridKa web pages (from the last TAB)
Done. See www.gridka.de -> PBS -> akt. Statistik.

Policy to remove old user data
Based upon policies at other sites we have drafted the following and received general agreement from our data privacy commissioner:

----------------------------------------------------------------
Account and File Deletion: Local accounts that have not been used for 12 months will be deleted, and all data directly associated with them (home directory) will be lost. The account owner and his experiment's representatives will receive a warning one month before an account is to be deleted. However, GridKa is not liable for any failure to give notification before deletion. Data written by the user to disk space outside his home directory, e.g. into experiment-specific data areas, may be deleted, or the ownership of the data may be switched over to another user, at the request of the experiment's representatives. These public data areas must not contain any private data, e.g. mailbox or SSH key files.
----------------------------------------------------------------
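As an illustration, a minimal sketch of how the 12-month rule and the one-month warning could be automated. The home directory location and the "last used" heuristic (newest modification time under the home directory) are assumptions for the sketch, not the actual GridKa workflow, which still has to be worked out (see below).

```python
# Hypothetical sketch of the drafted account-cleanup policy: accounts unused for
# about 12 months are flagged for deletion, with a warning one month in advance.
# The path and the "last used" heuristic (newest mtime in the home) are assumptions.
import time
from pathlib import Path

HOME_ROOT = Path("/home")          # hypothetical location of local home directories
WARN_AFTER_DAYS = 11 * 30          # warn roughly one month before the 12-month limit
DELETE_AFTER_DAYS = 12 * 30

def days_since_last_use(home: Path) -> float:
    """Crude 'last used' estimate: age of the newest mtime below the home directory."""
    newest = max((p.lstat().st_mtime for p in home.rglob("*")), default=home.lstat().st_mtime)
    return (time.time() - newest) / 86400

for home in sorted(HOME_ROOT.iterdir()):
    if not home.is_dir():
        continue
    age = days_since_last_use(home)
    if age >= DELETE_AFTER_DAYS:
        print(f"{home.name}: unused for {age:.0f} days -> candidate for deletion")
    elif age >= WARN_AFTER_DAYS:
        print(f"{home.name}: unused for {age:.0f} days -> send warning to owner and experiment")
```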

The draft is open for discussion within the TAB; some details of the workflow implementation and of the formulation still have to be worked out.

4. Problems since last TAB

See the separate list of tickets. The problems concentrate around file server outages around Whitsun (Pfingsten), some inconsistencies after the migration to gLite, and an outage of a single server about a week ago.