GridKa Site Report
51st Session of the GridKa TAB, 06.07.2006

Holger Marten
Forschungszentrum Karlsruhe GmbH
Institut für Wissenschaftliches Rechnen, IWR
Postfach 3640, D-76021 Karlsruhe

Content
1. Input for discussion on manpower issues
2. Status of new hardware
3. Other services & follow-ups from last TAB
4. Problems since last TAB

Document ID: IWR-Rep-GridKa0073-v1.0-TABSiteReport-060706.doc
1. Input for discussion on manpower issues

Don't interpret this table without reading the explanations below!

Name | Status | FTE on GridKa payroll | FTE for GridKa operations | Main tasks

GIS:
Alef | (s, p) | 1,0 | 1,0 | front-ends, WNs, Linux, electricity, cooling, education of BA students
Epting | (t, p) | 1,0 | 0,05 | ISSeG, EGEE, D-Grid, CA, security
Ernst | (t, p) | 1,0 | 1,0 | batch, accounting, certificates
Gabriel | (s, t) | 1,0 | 1,0 | PPS & production grid services
Garcia Marti | (s, t) | ISSeG | 0,05 | ISSeG
Halstenberg | (s, t) | 1,0 | 1,0 | FTS, LFC, tape, dcache, srm, (experiment data bases)
Heiss | (s, t) | 1,0 | 1,0 | Tier-2, experiment and SC contact & coordination
Hermann | (s, t) | 1,0 | 0,05 | EGEE ROC management DECH, EGEE SA1 tasks DECH
Hoeft | (t, p) | 1,0 | 0,75 | ISSeG, LAN, WAN, FTS, security, education of students from foreign countries
Hohn | (t, p) | 1,0 | 1,0 | Linux, server/OS installations, development & implementation of fast recovery tools, mail, education of BA students
Jaeger | (t, p) | 1,0 | 1,0 | Linux, ROCKS packaging, Ganglia, Nagios, infrastructure installation
Koerdt | (s, t) | EGEE | 0,05 | EGEE SA1 tasks DECH, deputy ROC management DECH
Marten | (s, p) | 1,0 | 1,0 | GridKa management, financing
Meier | (t, t) | 1,0 | 1,0 | disk, file server operation
Motzke | (s, t), from 7/06 | (1,0) | (1,0 planned) | experiment data bases, Oracle, LFC
Ressmann | (s, p) | 1,0 | 1,0 | dcache, srm, tape
Schäffner | (t, p) | 1,0 | 0,5 | EGEE, D-Grid, VO management, certificates, web pages
Sharma | (t, t) | 1,0 | 1,0 | administrative support, wiki & CMS documentation systems
Stannosek | (t, t) | 1,0 | 1,0 | hardware setup, repair, exchange
van Wezel | (s, p) | 1,0 | 1,0 | disk storage + almost all other technical issues
Verstege | (t, p) | 1,0 | 1,0 | Linux, ROCKS packaging, Ganglia, Nagios, infrastructure installation
NN1 | (s, t), from 11/06? | (1,0) | (1,0 planned) | PPS & production grid services (planned companion for Gabriel)
NN2 | (t, t), from 10/06? | (1,0) | (1,0 planned) | LAN, WAN, security (planned companion for Hoeft)

DASI:
Antoni | (s, t) | 0,3 | 0,3 | GGUS development
Dres | (t, t), from 9/06 | (1,0) | (? tbd) | GGUS development, ticket handling
Glöer | (s, t) | 0,2 | 0,2 | tape system management
Grein | (t, t), from 9/06 | (1,0) | (? tbd) | GGUS development, ticket handling
Heathman | (t, p) | 0,3 | 0,3 | marketing, conference contributions
Wochele | (s, p) | 0,15 | 0,15 | Oracle, experiment DBs
company | | 1,0 | 1,0 | LAN technical support

Sum | | 19,95 + (5,0) | 17,40 + (3,0) |

This is sensitive information. Please don't distribute it, and don't use it to contact the respective persons directly in case of problems. Instead, please use the trouble ticket systems, to avoid problems in cases of vacation etc.

Status (s/t, p/t) means: (scientist / technician, permanent / temporary contract). No distinction is made between real technicians and engineers.

"FTE on GridKa payroll" means: the fraction of an FTE per person financially accounted to the GridKa project.

"FTE for GridKa operations" means: local operation in the widest sense, i.e. including management, planning, installation & operation of hardware and software, optimization, ticket solutions etc. in the GridKa environment. It does not include activities that formally run under the GridKa flag but don't directly contribute to local operations tasks (e.g. EGEE ROC management for DECH). However, it should be emphasized that there are definitely synergies among these different projects. Since it is difficult to quantify them, an estimate of 0,05 FTE is given in these cases.

What are the potential issues?

1. In the current phase of permanent upgrades, improvements, new requirements and new functionalities (of hardware, services, experiment requirements etc.), many tasks explicitly require expert knowledge. The experts who maintain the services are the same persons that are needed in meetings, workshops, for reports, conference contributions, publications, funding proposals etc., and a deputy is not always guaranteed.

2. A significant fraction of the experts sit on temporary positions without an adequate full deputy. There is a high risk of losing these people, and thus the know-how, if an expert decides to move to another institute / job.

3. The communication between GridKa staff and experiments through boards, meetings (esp. TAB) etc.
is not bad, but during some phases it is not intense enough, sometimes leading to misunderstandings of requirements, priorities and objectives on both sides.

Possibilities for improvement

1. Wider spread of expert knowledge among GridKa staff. This is already done in the following ways:
- Central documentation of the GridKa setup and operations procedures through an internal wiki, an inventory and other databases. Already existing, and permanently extended and updated.
- Well-defined communication channels through several internal and partially external mailing lists and ticket process workflows. Already existing.
- Two permanent weekly meetings plus technical meetings at fixed daily times on demand (every admin can ask for a meeting on the following day). Already existing.
- Identification of recurring (and not too complex) expert tasks that could be done by other people, to unburden and deputize for the experts. This process has just started, and we'll see how far we can get with it.

These methods are quite obvious and strengthen the collaboration of people, but it is also clear that there are natural limitations, especially during phases with very dynamic changes of (external) software and requirements. Detailed expert knowledge will always be needed;
information exchange cannot replace experience. It is also obvious that this is a continuous process that needs permanent improvement and time (a learning curve).

2. Short-term contributions by / involvement of external experts would be very much appreciated in the following fields: srm/dcache; SAM/monitoring; configuration of storage subsystems; xrootd; experiment-specific SFTs, connectivity (to other centres), tests and ticket solutions. However, the emphasis here lies on short-term and on experts that have deep insight into and knowledge of these systems and tasks and of the GridKa environment and requirements (at least the latter, we guess, could be gained fast). In the current situation we don't think that we can cope with additional people that first have to be trained by ourselves; this might be even more time-consuming than helpful.

3. Better information exchange with the experiments. We did not get the impression that huge workshops with extremely tight agendas and several dozens of attendees are always successful, and we don't want to suggest yet more regular meetings. However, ad hoc phone conferences with a few experts on each end of the line, focusing on one or two hot topics and without preparing high-gloss transparencies, are extremely effective and satisfactory for both sides.

4. Medium-term contributions by experiment people. Again, this kind of collaboration is helpful for experiments as well as for sites. The application for funding of extra people and for a virtual institute is for sure a good step in the right direction.

2. Status of new hardware

OPN
The 10 Gbps light path to CERN was delivered by DFN in June. Performance and error rate tests are ongoing, but there are still some routing problems (separate tests, not influencing the production environment).

CPU
Problems with the new temperature offset of the CPUs were reported in the last TAB. A first BIOS patch delivered in June didn't improve the situation.
However, NEC storage servers that are identical in construction do not show this problem. We are trying to get the same BIOS certified for the WNs as well. Delivery of the new WNs to users is expected during July. Side note: this problem is now documented in the revision guide for AMD Opteron CPUs, Rev. 3.59: Errata 154, "Incorrect diode offset". Laugh or cry.

Disk
The first 20 TB of NEC storage has been handed over to BaBar. We will address the demands of the other experiments and the next BaBar storage asap. Availability is in chunks of 20 TB (access via xrootd, NFS, dcache). Not all newly delivered storage will be available soon. 17 TB of disk-only dcache storage (no tape connection) is being put online to fulfill the demands of LHCb and CMS for SC4; we plan to finish this week.

New front-ends / VO boxes
The new hardware has been received. The machines for ALICE and CMS are already configured and delivered to the experiments. Next is ATLAS within the next days (because of performance problems with the old machine during SC4), then LHCb, then Dzero.

3. Other services & follow-ups from last TAB
Squid for CMS
Was defined / followed up within the LCG 3D sub-project. Squid is ready for testing by CMS.

Oracle services
Were defined / followed up within the LCG 3D sub-project. Delays at GridKa because of missing manpower. Oracle RACs for LHCb and ATLAS have been set up. Oracle streams T0-T1 are currently being implemented.

glite 3.0 in production
The migration was done ad hoc on June 20/21, and the experiments (esp. ATLAS) complained about not being informed well in advance. See also the mail exchange on the TAB mailing list. SAM monitoring is not working properly at GridKa; a respective ticket has been opened with the developers (info as of June 30).

Consolidation of grid mapfiles
The generation of grid mapfiles at GridKa has been centralised and consolidated because of too many error-prone inconsistencies in the internal environment. Obviously, the migration wasn't discussed with / announced in time to the experiments, and it affected Dzero production: the VO server at FNAL requires ticket exchange between VO servers, which is not used within LCG/EGEE and thus exposed a bug in the middleware. A temporary workaround has been elaborated with Dzero.

News via GGUS (from old TAB)
1. There were valid complaints about the diversity of news posted via GGUS. We have implemented a workflow through well-defined people who write the news.
2. Mail-forwarding problems of news to vo-softadmins with end address @CERN have been solved. See separate mails on the TAB list.
3. Please note that the correct portal for news posted by centres is the regional support portal (DECH in our case) and not GGUS. We'll move the news announcements by GridKa to the DECH portal in the near future.

Fair share to be published on GridKa web pages (from old TAB)
Done. See www.gridka.de -> PBS -> akt.
Statistik.

Policy to remove old user data
Based upon policies at other sites, we have drafted the following and received general agreement from our data privacy commissioner:

----------------------------------------------------------------
Account and File Deletion: Local accounts that have not been used for 12 months will be deleted, and all data directly associated with them (home directory) will be lost. The account owner and his experiment's representatives will receive a warning one month before an account is to be deleted. However, GridKa is not liable for any failure to give notification before deletion. Data written by the user to disk space outside his home directory, e.g. into experiment-specific data areas, may be deleted, or the ownership of the data may be switched over to another user, at the request of the experiment's representatives. These public data areas must not contain any private data, e.g. mailbox or SSH key files.
----------------------------------------------------------------

The draft is open for discussion within the TAB; some details of the workflow implementation and of the formulation still have to be figured out.

4. Problems since last TAB
See the separate list of tickets. Some problems accumulated around file server outages around Whitsun (Pfingsten), some inconsistencies after the migration to glite, and an outage of a single server about a week ago.
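The account-expiry rule drafted in section 3 (delete after 12 months of non-use, warn one month before) could be sketched as follows. This is a minimal illustrative sketch, not GridKa's actual workflow: the function name, the data layout and the exact day thresholds are assumptions.

```python
# Hypothetical sketch of the drafted account-expiry policy: accounts
# unused for 12 months are deleted, with a warning one month before.
# Thresholds in days are illustrative approximations of the policy text.
from datetime import date, timedelta

WARN_AFTER = timedelta(days=334)    # ~11 months idle: warn owner & experiment reps
DELETE_AFTER = timedelta(days=365)  # 12 months idle: delete account, home directory is lost

def classify_accounts(last_used_by_login, today):
    """Split accounts into (to_warn, to_delete) lists by idle time.

    `last_used_by_login` maps login name -> date of last use.
    """
    to_warn, to_delete = [], []
    for login, last_used in last_used_by_login.items():
        idle = today - last_used
        if idle >= DELETE_AFTER:
            to_delete.append(login)   # candidate for deletion
        elif idle >= WARN_AFTER:
            to_warn.append(login)     # one-month advance warning mail
    return sorted(to_warn), sorted(to_delete)
```

For example, seen from 2006-07-06, an account last used on 2005-06-01 would land in the delete list, while one last used on 2005-08-01 would only receive a warning.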