NAF & NUC reports
Y. Kemp for the NAF admin team
H. Stadie for the NUC
4th annual Alliance Workshop, Dresden, 2.12.2010
NAF: The Overview Picture and Current Resources
Yves Kemp NAF 2.12.2010 Page 2
ATLAS NAF CPU usage
> Current snapshot of ATLAS NAF batch usage
  Accounting interval: 1.8.-29.11.2010
  Total CPU time: 21362 days (~12% of NAF CPUs)
  Total wall clock time: 58228 days (~33% of NAF CPUs)
  Nominal ATLAS share: ~25%; ATLAS is working on the NAF CPU time fractions
> Analysis-type jobs are becoming more prominent within the group of power users
  The "other" category includes ~100 users
  As an example, September 2010: 71 users total, 28 from DESY/HUB
(slide provided by M. Barisonzi & W. Ehrenfeld)
NAF usage by CMS
> CMS setup:
  Install CMSSW on NAF AFS
  Adapt submission frameworks to the local batch
  Jobs access data on the Tier-2 dCache SE
  Interactive data analysis with PROOF and Lustre
> CMS: additional data sets (160 TB) at DESY
  All data very well used by the community, often many users per dataset
> Tasks performed:
  (Prompt) data analysis
  Special MC sample production
  Development of analysis tools
  Calibration, alignment, CMS Physics Analysis Summaries
Extract from CHEP 2010 presentation by Kai Leffhalm
NAF usage by LHCb & ILC/CALICE
> LHCb:
  E.g. study of CP violation in the B sector: requires complex maximum-likelihood fits
  Generate toy MC: very CPU intensive, fast turnaround, short jobs
  Most users perform ntuple production
  LHCb uses resources as expected; the NAF is an important pillar of their analysis infrastructure
> ILC:
  ILD LoI: studies of the impact of machine background on track reconstruction efficiency
  Fast turn-around time for efficient prototyping
  NAF: easy to manage jobs
> CALICE:
  GEANT4 validation with AHCAL data
  Custom MC generation
  NAF: work with scripts in a homogeneous environment while keeping efficient access to Grid storage
Extract from CHEP 2010 presentation by Kai Leffhalm
NAF resources well used
> Need an upgrade in 2011!
> Recommended limit: < 75%, peaks up to 90%
> NAF well used by German institutes; 21% used by DESY scientists
dCache storage & NAF
> Both ATLAS and CMS have substantially more space in dCache than their T2 MoU pledges
  NAF and user space, plus other contributions, e.g. UniHH-CMS
  E.g. ATLAS: T2 part 66% used, NAF part 303 TB of 441 TB used
  ATLAS (HH+ZN) T2 pledges: 740 TB; CMS T2 pledges: 400 TB
> After observing data taking for about one year now:
  1) Optimize dCache for speed
  2) Optimize dCache for safety and availability of custodial data
  3) Optimize dCache usage and data placement for non-T2 data
> ATLAS-HH dCache, 29.11/30.11: 1.5 GByte/s to Grid WNs, sustained over 6 hours
> dCache is THE workhorse for data storage
Hardware Status
> The NAF is three years old now; we have to start replacing the first hardware
  First replacement currently ongoing
  Newer hardware, more RAM per core, new network technology; at the end, more computing power
  10 Gbit infrastructure, and more to come in 2011
> New additions to dCache storage (quantity & quality)
> Clear commitment from DESY to support the NAF
> Future purchases planned together with the NUC, taking into account the findings of the GridCenter Review Task Force
Problems and Issues
> AFS:
  Problems started around mid July: the whole AFS instance unavailable for some minutes at a time
  Debugging difficult: consulting with AFS developers
  Main cause: SGE behaviour with the NAF job type when starting many jobs at the same time
  First countermeasures taken, more to come
  User training will start this afternoon
> Lustre:
  Many features still not working reliably (e.g. group quotas, ACLs, ...)
  Maintenance tools to make users' lives easier not yet available (deletion tools, ...)
  Overall stability improved, but some hiccups are still seen
  Performance reports unclear; no end-to-end performance investigation done
  Future of Lustre unclear in general (Oracle) and at DESY: looking for alternatives
  The need for such an easy-access large file store is indisputable
NAF User Committee and User Meeting
> Monthly meetings of the NAF User Committee. Members:
  ATLAS: Marcello Barisonzi & Wolfgang Ehrenfeld
  CMS: Andreas Nowack & Hartmut Stadie (Chair)
  LHCb: Johan Blouw & Alexey Zhelezov
  ILC: Steve Aplin & Shaojun Lu
  IT: Andreas Gellrich & Kai Leffhalm
> Status reports and discussions with the NAF technical coordinators
> NAF Users Meeting: see you there!
Random comments collected by the NUC
> The currently available resources, especially CPU in the batch system, could provide good working conditions when all systems are working properly
> Ongoing problems make effective and timely data analysis almost impossible
> dCache user directories are not reliable enough
> Congested workgroup servers
> Slow I/O with dCache (data placement); need more space
> Add more Lustre space
NUC: Some words on support
> Support approach: two different paths
  Problems with central NAF services: DESY helpdesk
  Problems with experiment infrastructure: experiment mailing list (plus a second-level support structure, available to experiment experts directly)
> Challenges:
  Dedicated manpower for central services?
  Dedicated manpower for experiment support? (FSPs)
  O(min) response time?
  Analysis with fast turn-around needs a very reliable system (better than the Tier-2 MoUs)
NAF introduction in one minute
> Access to experiment data on the Grid Storage Element
> CPU cycles for analysis: interactive and local batch
  Complement the Grid resources
  New techniques like PROOF
> Additional storage: Lustre parallel file system
> Home directories on AFS, accessible from everywhere
http://naf.desy.de/
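As a concrete illustration of the local-batch use case, a minimal job could be prepared and submitted as sketched below. This is only a sketch: the SGE resource flags and limits shown are assumptions for illustration, not NAF-specific values.

```shell
#!/bin/sh
# Minimal batch-job sketch for an SGE cluster like the NAF.
# The resource-request flags below are assumptions for illustration.

# Write a trivial job script:
cat > hello_job.sh <<'EOF'
#!/bin/sh
echo "Hello from host $(hostname)"
EOF
chmod +x hello_job.sh

# Submission would then look like this (not executed here,
# since qsub is only available on the NAF login nodes):
#   qsub -l h_cpu=0:30:00 -l h_vmem=1G hello_job.sh
echo "job script prepared"
```

The same script can also be run interactively on a workgroup server first, which is often the quickest way to debug it before submitting many copies to the batch.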
NAF well used
(plots: "NAF: running jobs (2010)", "NAF CMS tests 2010-11")
> Very peaky behaviour: try to keep overall utilization below 75% and peaks below 90%; will add hardware in 2011 (starting now)
> Availability and reliability: one of the most important aspects for users and admins
  Availability and reliability is ~97.5%, similar to the DESY Grid
  But this does not tell the whole story: the 2.5% failures affect you and your work much more than on the Grid!
  We want and need to get better! Have a look at the following slides.
Major problems in the past months
> Data on dCache not available, slow transfers
  Some problems with dCache file-server availability: under investigation / solved
  User code sometimes causes denial of service: e.g. not closing files after reading keeps them open for the duration of the job; only a certain number of files can be open at the same time, so other jobs cannot open files
  Slow data transfers can have many different causes: it is known that older ROOT files are written in a way that is bad for efficient reading, and sometimes a file server is simply overloaded
  ROOT versions to be changed by the experiments; improvements on the dCache side are constantly being made
> Lustre not working properly
  Lustre does not like small files: keep your code / SVN / output files outside of Lustre! We provide AFS scratch volumes for such purposes!
  Other users might perform harmful operations and affect your speed or even accessibility
  To increase stability, Lustre data in HH now goes via TCP/IP instead of InfiniBand
  In general, the future of Lustre is unclear (Oracle); DESY is looking into alternatives, but we recognize the need for an easy-access large file store
AFS problems in the past months
> AFS hangs: login impossible, shell frozen, jobs die, ...
> We had severe trouble with the NAF AFS cell in the past months
> Investigation very difficult and painful, even when asking the developers for help
> Patched AFS kernel module: solved some problems
> It turned out that the major problem is interference between SGE and AFS:
  Similar jobs (e.g. one user submitting many jobs): all STDOUT and STDERR end up in files in the same directory
  These files are created at job start; if the cluster is rather empty, this can be several hundred jobs: files are created and read simultaneously in the same directory
  The file server ensures consistency of the client caches through callbacks
  A storm of callbacks between the AFS server and the AFS clients basically paralyzes the file server and the clients when jobs read the directory with the .e/.o files
> We think we finally have solutions / workarounds!
Solutions to the AFS problem: what the NAF can/will do
> Limit the number of jobs per user: an ad-hoc and drastic measure
> Throttle the start of jobs: will be implemented soon
> Possible long-term solution:
  Change the STDOUT/STDERR files with prologue and epilogue methods
  Write into separate directories
> ... and we now have a simple recipe for you to help us by defusing your jobs: see next slide
Solutions to the AFS problem: what YOU can do
> Change the submission command like this:
  qsub -j y -o /dev/null <other requirements> <your jobscript>
> Have as the very first lines of your job script something like:
  exec > "$TMPDIR/std.out" 2> "$TMPDIR/std.err"
  (this stores the files locally on the WN)
  ($TMPDIR is unique during job execution; you can of course add $JOB_ID, $SGE_TASK_ID to the filenames)
> At the very end of your job script, copy these files over to some location on AFS, preferably into a subdirectory
> Any maintainers of CRAB / GANGA / ... here? Can you implement this for all users?
> We are preparing a web page and will inform all users soon
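Putting the recipe above together, a job script could look like the sketch below. The fallbacks for $TMPDIR and $JOB_ID and the $HOME/joblogs target directory are assumptions for illustration (on the NAF batch, SGE sets both variables, and you would pick your own AFS subdirectory):

```shell
#!/bin/sh
# Sketch of a job script following the recipe above.
# Submit with:  qsub -j y -o /dev/null <this script>
# Fallbacks so the sketch also runs outside SGE (an assumption;
# on the batch, SGE provides both variables):
TMPDIR=${TMPDIR:-$(mktemp -d)}
JOB_ID=${JOB_ID:-0}

# Very first lines: send STDOUT/STDERR to node-local scratch, not AFS:
exec > "$TMPDIR/std.out" 2> "$TMPDIR/std.err"

echo "actual analysis work would run here"

# Very last lines: copy the logs to a per-job location on AFS
# ($HOME/joblogs is a hypothetical directory, not a NAF convention):
OUTDIR="$HOME/joblogs"
mkdir -p "$OUTDIR"
cp "$TMPDIR/std.out" "$OUTDIR/std.$JOB_ID.out"
cp "$TMPDIR/std.err" "$OUTDIR/std.$JOB_ID.err"
```

Because the files are created on the worker node's local disk and only copied to AFS once, at the end, the burst of simultaneous creates and reads in a single AFS directory is avoided.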
Reminder of the NAF support channels
> Got a problem with your experiment setup? naf-[atlas,cms,ilc,lhcb]-support@desy.de
> Got a problem with the NAF fabric (or not sure where the problem resides)? naf-helpdesk@desy.de
  Experiment supporters: you know the different system experts and can use them directly
> If you think your job causes a problem: we need you to contact us and help us make the NAF better!