The CMS-HI Research Plan

Outline
- Major goals
- Assumptions about the Heavy Ion beam schedule
- CMS-HI Compute Model: guiding principles and actual implementation
- Computing
  - Compute power: the main focus in this talk is on the T0 and T1 reconstruction pass stages
  - Requirements for the analysis and simulation stages are covered in the afternoon session talks
- Wide area networking
- Disk and tape storage
- Proposed computing resources
- Off-line operations
CMS-HI Research Plan and HI Running

CMS-HI Research Program Goals (see Bolek's opening presentation)
- Focus is on the unique advantages of the CMS detector for detecting high-pT jets, Z0 bosons, photons, D and B mesons, and quarkonia
- First studies at low luminosity will concentrate on establishing the global properties of central heavy ion collisions at the LHC
- Later high luminosity runs with sophisticated high level triggering will allow in-depth rare-probe investigations of strongly interacting matter

Projected Heavy Ion Beam Schedule
- Only the first HI run is known for certain, at the end of the 2010 pp run
  - This run will be for 2 weeks at low luminosity
- In 2011 the LHC will shut down for an extended period
  - Conditioning work on the magnets is needed to achieve the design beam energy
- HI running is assumed to resume in 2012 with higher luminosity
  - This luminosity will require the use of the HLT to select events
- HI runs will be for one month in each year of LHC operations
Assumed HI Running Schedule

Projected HI Luminosity and Data Acquisition for the LHC, 2010-2013

CMS-HI Run | Ave. L (cm^-2 s^-1) | Uptime (s) | Events taken             | Raw data (TB)
2010       | ~5.0 x 10^25        | ~10^5      | ~2.0 x 10^7 (min. bias)  | ~50
2012       | 2.5 x 10^26         | 5 x 10^5   | 2.5 x 10^7 (HLT)         | 110
2013       | 5.0 x 10^26         | 10^6       | 5.0 x 10^7 (HLT)         | 225

Caveats
1) First year running may achieve greater luminosity and uptime, resulting in up to 50% more events taken than assumed here
2) First year running is in minimum bias mode, where the HLT will operate in a tag-and-pass mode
3) Third year running is the planned nominal-year case, when the CMS DAQ writes at 225 MB/s for the planned 10^6 s of HI running

The raw-data column follows directly from the event counts and per-event sizes (sketch below).
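A minimal cross-check of the raw-data volumes, assuming the per-event sizes quoted later in the disk-storage discussion (2.5 MB per minimum bias event, 4.4 MB per HLT event) and the nominal 225 MB/s DAQ rate; Python is used here purely for illustration:

```python
# Cross-check of the raw-data column from the talk's own assumptions.
runs = {
    # year: (events taken, MB per event)
    2010: (2.0e7, 2.5),   # minimum bias events
    2012: (2.5e7, 4.4),   # HLT-selected events
    2013: (5.0e7, 4.4),   # HLT-selected events
}
for year, (events, mb) in runs.items():
    print(f"{year}: ~{events * mb / 1e6:.0f} TB raw data")

# The table's 2013 entry (225 TB) follows the DAQ-rate estimate below;
# the per-event estimate (~220 TB) agrees to within a few percent.
print(f"2013 via DAQ rate: {225e6 * 1e6 / 1e12:.0f} TB")
```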
CMS-HI Compute Model

Guiding Principles
- CMS-HI computing will follow, as much as feasible, the CMS TDR-2005 report
- CMS-HI software development is entirely within the CMSSW release schedule
- The size of the CMS-HI community and national funding policies mandate that we adapt the CMS multi-tiered computing grid structure to be optimal for our purposes

Implementation
- Raw data are processed at the T0/CAF using much the same workflows and staff as in HEP
- Data quality monitoring is performed in on-line and off-line modes
- A first-pass reconstruction of the data is performed at the T0
- Raw data, AlCa output, and T0 reco output are transferred to a Tier1-like computer center to be built at Vanderbilt University's ACCRE facility
  - DOE policy will not allow US sites to be official Tier1 centers with guaranteed performance
  - The ACCRE center will archive all the CMS-HI data and production files to tape
  - It is possible that the Tier1 center in France will receive ~10% of the CMS-HI data
- Reco passes will be done at the Vanderbilt site, and possibly in France
- The Vanderbilt center will be the main analysis Tier2 for US groups (DOE policy)
- Reconstruction output will be transferred to Tier2 centers in Russia, Brazil, and Turkey
- Simulation production will be done at a newly funded CMS-HI center at the MIT Tier2, and at other participating CMS Tier2 centers (Russia, France, ...)
Computing Requirements: Annual Compute Power Budget for All Tasks

Assumptions for Tasks Requiring Available Compute Power
- Reconstruction passes: the CMS standard is 2; we will want more for the first data set
- Analysis passes are scaled to take 25% of the total annual compute power
- Simulation production/analyses are estimated to cost the equivalent of one reconstruction pass (see the sketch below)
  - Simulation event samples are 10% of the real event totals (RHIC/post-RHIC experience)
  - Simulation event generation/analysis takes 10x the processing of a real event

Constraint: Accomplish All Data Processing in One Year
- Offline processing must keep up with the annual data streams
- The goal is to process all the data within one year after acquisition, on average
- It is essential to have completed analyses to guide future running conditions

Total Compute Power: 12-Month Budget Allocation (after T0 reconstruction)
- A single, complete raw-data reconstruction pass is targeted to take 4 months
- Analysis passes of the two reconstruction passes will take 4 months
- Simulation production/analysis will be done in 4 months
- In this model the total compute power requirement is verified by establishing that one reconstruction pass after the T0 will take no more than 4 months
- To establish this verification we need to determine the reconstruction times
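A minimal sketch of the budget arithmetic above; the unit costs are the talk's scaling assumptions, not measurements:

```python
# Why simulation costs ~one reconstruction pass: a 10% MC sample whose
# events each cost 10x a real event works out to 0.1 * 10 = 1.0 pass.
sim_event_fraction = 0.10   # MC sample is 10% of the real event total
sim_cost_factor = 10.0      # one MC event costs ~10x one real event
print(sim_event_fraction * sim_cost_factor)  # -> 1.0 reco-pass equivalent

# The 12-month allocation after the T0 pass closes exactly:
months = {"reconstruction pass": 4, "analysis passes": 4, "simulation": 4}
assert sum(months.values()) == 12
```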
Computing Requirements: Determining Compute Power for One Reconstruction Pass

Simulated two sets of HI collision events using the HYDJET generator
- Minimum bias event set, meaning all impact parameters: the first-year running model
- Central collision event set, meaning 0-4 fm impact parameter: corresponds to the HLT events which will begin with the second year of running
- Simulated events are reconstructed in the CMSSW_3_1_2 release (August 2009)

Results of simulated-event reconstruction CPU times
- The major component of the reconstruction time is pixel proto-track formation
- This component depends on the value of the minimum-pT cut parameter
- Heavy ion event reconstruction will require a pT cut above the 75 MeV/c used in pp
  - There will not be enough compute power to track every charged particle in central collision events down to 75 MeV/c; this is not a major drawback for us
  - At RHIC the PHENIX charged-particle physics program cut is >= 0.2 GeV/c
  - Protons, kaons, and pions range out at ~0.5, ~0.3, and ~0.2 GeV/c, respectively
- The reconstruction CPU times for central events are plotted on the next page
- Minimum bias event reconstruction is approximately 3 times faster
Computing Requirements: Determining Compute Time for Central Event Reco

[Figure: distribution of reco CPU times for 200 central collision HI events on an 8.5 HS06 node, with tracking pT cut = 0.3 GeV/c]
[Figure: dependence of reco CPU times for central collision HI events on the tracking pT cut; error bars are the RMS widths of the distributions, not the centroid uncertainties]

Major caveat: the particle multiplicity for HI collisions at the LHC is uncertain to +/-50%, so all CPU time and event size predictions carry the same uncertainty.
Charged Particle Tracking in CMS
- The pp tracking algorithm builds proto-tracks in 3 pixel layers down to pT = 75 MeV/c
- The HI tracking algorithm for proto-tracks in the pixel layers restricts the search window to correspond to a higher minimum pT
Reconstruction Computing Requirements: Method of Verification for CMS-HI Compute Power

Use the event reconstruction times previously quoted
- Calculate the number of days of processing time at the T0
  - Since reconstruction processing depends on the tracking pT cut, alternative choices of the cut will be compared
  - The T0 is (conservatively) fixed at 62,000 HEPSPEC06 units for five years
- Calculate the number of days for the T1 reconstruction pass (see the sketch below)
  - The calculation is done for each year starting in 2010
  - Only one T1 center, at Vanderbilt, is assumed for now; a contribution from the T1 center in France could occur eventually
  - The T1 center at Vanderbilt will be constructed over a 5-year period
  - Equal installments of about 4,250 HEPSPEC06 units (~500 cores) are assumed
  - The same choices of the pT cut as used for the T0 will be compared
- Compare the number of days to the 4-month goal for each year
  - If the number of days is ~4 months, the compute model is self-consistent
  - If not, the choices are more compute power or less computing (a higher pT tracking cut)
- Recall: all the predicted compute times have a +/-50% uncertainty until we see data
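A minimal sketch of this verification, assuming a hypothetical per-event CPU cost (the measured values are on the preceding CPU-time plots, not reproduced here) and an assumed 80% scheduling efficiency:

```python
# Days for one full reconstruction pass at a center of given capacity.
def reco_pass_days(n_events, hs06_sec_per_event, capacity_hs06,
                   efficiency=0.8):
    cpu_seconds = n_events * hs06_sec_per_event
    return cpu_seconds / (capacity_hs06 * efficiency) / 86400.0

# Illustrative 2012 check: 2.5e7 HLT events, a hypothetical ~4,000 HS06*s
# per central event, Vanderbilt capacity after three annual installments.
capacity_2012 = 3 * 4250                 # HEPSPEC06 units
days = reco_pass_days(2.5e7, 4000.0, capacity_2012)
print(f"~{days:.0f} days")               # ~113 days vs the ~120-day goal
```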
Reconstruction Computing Requirements: 5-Year Annual Allocations of CMS-HI Compute Power

[Table of annual compute power allocations]
Computing Requirements: Wide Area Networking

Nominal Year Running Specifications for HI
- CMS DAQ is planned to write at 225 MB/s for 10^6 s = 225 TB
- Calibration and fast reconstruction output at the T0 = 75 TB
- Total nominal year data transfer from the T0 = 300 TB
- Note: the DAQ may be allowed to write faster eventually

Nominal Year Raw Data Transport Scenario
- The T0 holds raw data briefly (days) to do calibrations and preliminary reco
- Data are written to the T0 tape archive, which is not designed for re-reads
- The above mandates a continuous transfer of the data to a remote site (see the check below):
  300 TB x 8 bits/byte / (30 days x 24 hours/day x 3600 sec/hour) = 0.93 Gbps DC rate (no outages), or ~10 TB/day
- This is the same rate calculation as for pp data, except that pp runs for ~5 months
- A safety margin must be provided for outages, e.g. 4 Gbps burst capacity
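The sustained-rate arithmetic, checked explicitly:

```python
# 300 TB leaving the T0 over the one-month HI run, no outage allowance.
data_tb = 300.0               # raw data + T0 calibration/fast-reco output
seconds = 30 * 24 * 3600      # one month of continuous transfer

gbps = data_tb * 1e12 * 8 / seconds / 1e9
print(f"{gbps:.2f} Gbps sustained")   # 0.93 Gbps
print(f"{data_tb / 30:.0f} TB/day")   # 10 TB/day

# Burst capacity several times the DC rate (e.g. 4 Gbps) lets transfers
# catch up after outages.
```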
Computing Requirements: Wide Area Networking (continued)

Nominal Year Raw Data Transport Scenario
- Tier0 criteria mandate a continuous ~1 Gbps transfer of data to a remote site
- A safety margin must be provided for outages, e.g. 4 Gbps burst
- The plan is to transfer these raw data to ACCRE for archiving to tape and for second-pass reconstruction
- In August 2008, Vanderbilt spent $1M to provide a 10 Gbps link into ACCRE from Atlanta's Southern Crossroads hub, which provides service to STARLIGHT

Raw Data Transport Network Proposal for CMS-HI (in discussion with US DOE-NP)
- CMS-HEP and ATLAS will use USLHCNet to FNAL and BNL
- FNAL estimates that CMS-HI traffic will be ~2% of all pp use of USLHCNet
- We propose to use USLHCNet at a modest pro-rated cost to DOE-NP, with HI raw data being transferred to STARLIGHT during one month only
- We have investigated Internet2 alternatives to the use of USLHCNet
  - CMS management strongly discourages the use of a new data path for CMS-HI
  - A new data path from the T0 would have to be supported solely by the CMS-HI group
- Transfers from VU to overseas Tier2 sites would use Internet2 alternatives
- Waiting for the DOE review report from a presentation given on May 11
Computing Requirements: Local Disk Storage Considerations

Disk Storage Categories [need confirmation of numbers]
- Raw data buffer for transfer from the T0, staging to the tape archive
  - Currently 2.5 MBytes/minimum bias event; 4.4 MBytes/HLT event
- Reconstruction (RECO) output
  - Currently 1.93 MBytes/minimum bias event; 7.68 MBytes/central event
- Analysis Object Data (AOD) output (input for Physics Interest Groups)
  - The PAT output format may replace the AOD output format
  - We don't have current numbers for AOD or PAT output event sizes
  - Assume scale factors for AOD/PAT output and for PInG output (see the sketch below)
- Simulation production (MC) [need good numbers]
  - Currently ? MBytes/minimum bias event; ? MBytes/central event

Disk Storage Acquisition Timelines
- Model dependent according to luminosity growth
- Need to minimize tape re-reads: the most "popular" files are kept on disk
- Model dependent according to the pace of physics publication: deadlines set by which older data are removed (RHIC experience)
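A minimal sketch of how per-category disk volumes scale from the per-event sizes; the AOD/PAT and PInG scale factors below are placeholders, since the numbers above are noted as not yet available:

```python
# Disk volume per category = events x per-event size. The per-event
# sizes in MB are the currently measured values quoted above; the AOD
# and PInG fractions are hypothetical stand-ins for the missing numbers.
MB_PER_EVENT = {
    "raw_minbias": 2.5, "raw_hlt": 4.4,
    "reco_minbias": 1.93, "reco_central": 7.68,
}
AOD_FRACTION = 0.5     # hypothetical: AOD/PAT output ~ half of RECO
PING_FRACTION = 0.25   # hypothetical: PInG output ~ quarter of RECO

def disk_tb(n_events, mb_per_event):
    return n_events * mb_per_event / 1e6   # MB -> TB

# Illustrative: 2.5e7 HLT (central) events, as in the 2012 run model.
reco = disk_tb(2.5e7, MB_PER_EVENT["reco_central"])
print(f"RECO ~{reco:.0f} TB; AOD ~{reco * AOD_FRACTION:.0f} TB; "
      f"PInG ~{reco * PING_FRACTION:.0f} TB")
```

The allocation tables that follow also fold in retention of prior-year output and the annual funding constraint, so they will not match this single-year arithmetic exactly.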
Computing Requirements: Local Disk Storage Annual Allocations
(Table to be re-worked according to new event size information)

Annual Disk Storage Allocated to Real Data Reconstruction and Analysis (TB)

FY   | Events (Millions) | Raw Data Buffer | RECO | AOD | PInG | Total
2010 | 0                 | 40              | 20   | 10  | 5    | 75
2011 | 10                | 75              | 60   | 40  | 20   | 195
2012 | 25                | 75              | 150  | 60  | 30   | 315
2013 | 51                | 75              | 200  | 110 | 50   | 435
2014 | 51                | 75              | 200  | 150 | 85   | 510

Assumptions for the Allocation Growth of Real Data Disk Needs
1) In 2010-2011 two years will be devoted to processing the first year's ~50 TB of data
2) Event growth in 2012-2013 is as per the assumed HI running schedule shown earlier
3) Relative sizes of the RECO, AOD, and PInG categories follow present experience
4) The relative amounts of CPU/disk/tape purchased are constrained by the annual funding

A possible disk contribution from the Tier1 site in France (or non-US Tier2 sites) is not included.
Computing Requirements: Local Disk Storage Annual Allocations
(Table to be re-worked according to new event size information)

Annual Disk Storage Allocated to Simulation Production and Analysis (TB)

FY   | Events (Millions) | Evt. Size | Raw | RECO | AOD | PInG | Total
2010 | 0.5               | 15        | 6   | 3    | 1   | 1    | 25
2011 | 0.6               | 36        | 18  | 6    | 3   | 2    | 65
2012 | 1.25              | 75        | 18  | 6    | 4   | 3    | 105
2013 | 2.55              | 75        | 39  | 16   | 6   | 5    | 145
2014 | 2.55              | 75        | 39  | 32   | 12  | 12   | 170

Assumptions for the Allocation Growth of Simulation Disk Needs
1) The GEANT (MC) event size is taken to be five times the real event size
2) Out-year RECO/AOD/PInG totals include retention of prior-year results for ongoing analyses
3) Total real data + MC storage after 5 years = 680 TB in the US
4) These totals do not reflect pledges of disk space from Tier2 institutions outside the US
Computing Requirements: Tape Storage Allocations
(Table to be re-worked according to new event size information)

Tape Storage Categories and Annual Allocations
- The first priority is securely archiving the raw data for analysis
  - Tape must be on hand at the beginning of the run, i.e. as early as August 2010, so the tape must be (over-)bought in the prior months
- Tape storage largely follows the annual real data production statistics
- Some, but not all, of the MC/RECO/AOD/PInG annual outputs will be archived to tape
- Access to tape will be limited to the archiving and production teams
  - Guidance from experience with RHIC's HPSS: unfettered, unsynchronized read requests to tape drives lead to strangulation

FY   | Real Evts. (Millions) | Raw | RECO | AOD | PInG | Real | MC  | Real+MC Tot. | Delta (TB)
2010 | 0.0                   | 90  | 0    | 0   | 0    | 90   | 10  | 100          | 100
2011 | 20                    | 60  | 20   | 7   | 3    | 90   | 10  | 200          | 100
2012 | 25                    | 150 | 75   | 15  | 7.5  | 248  | 48  | 470          | 270
2013 | 51                    | 300 | 115  | 23  | 11.5 | 449  | 97  | 1010         | 540
2014 | 51                    | 300 | 115  | 26  | 13.0 | 454  | 102 | 1570         | 560

(All storage in TB; Real+MC Tot. is cumulative, Delta is the annual tape purchase.)
Summary of Proposed Compute Resources for CMS-HI

United States (DOE-NP funding proposal)
- A Tier1/Tier2-like center for raw data processing and analysis at Vanderbilt
  - This Tier1/Tier2-like center will have ~21,000 HEPSPEC06 after 5 years
  - Raw data transport to the Vanderbilt center will use USLHCNet during one month
  - All other file transfers to the Vanderbilt center will use non-USLHCNet resources (proposed to the DOE-NP, not yet final); the Vanderbilt center has a 10 Gbps gateway
  - There will be 680 TB of disk and 1.5 PB (?) of tape after 5 years at Vanderbilt
- A new HI simulation center will be added to the Tier2 center at MIT, ~6,000 HS06
  - This new simulation center has been approved in principle by the DOE-NP

Other countries
- France: possibly a 10% Tier1 contribution, with corresponding Tier2
- Russia: preliminary numbers have already been provided
- Brazil: preliminary numbers have already been provided
- Turkey: preliminary numbers have already been provided
- Korea: (?), no preliminary numbers have been provided
OFFLINE Operations at Vanderbilt T1/T2: Projected FTE Staffing by the Local Group

Proof of Principle: RHIC Run7 Project Done for PHENIX at Vanderbilt
- 30 TBytes of raw data were shipped by GridFTP from RHIC to ACCRE in 6 weeks, comparable to the ~30-60 TBytes we expect from the LHC in 2010
- Raw data were processed immediately on 200 CPUs as calibrations were received
- Reconstructed output of ~20 TBytes was shipped back to BNL/RACF for analysis by PHENIX users; there was no local tape archiving (data archived at BNL's HPSS)
- The entire process was automated and web-viewable using ~100 homemade Perl scripts
- PHENIX-specific local manpower consisted of one faculty member working at 0.8 FTE, one post-doc at 0.4 FTE, and three graduate students at 0.2 FTE each

Intended Mode of Operation for CMS-HI
- Will install already-developed CMS data processing software tools
  - Much of CMSSW is working at ACCRE; good cooperation with the VU CMS-HEP group
  - Extensive domestic/international network testing is already in progress
- Will dedicate a similar-sized local manpower effort as worked for the PHENIX Run7 project, plus overseas persons in CMS-HI for second/third shift monitoring tasks
- Will build a ~30 m^2 CMS center in our Physics building (~$20K equipment cost) to do off-line monitoring and DQM supervision near our own faculty offices
OFFLINE Operations: Performance Evaluations and Oversight

Existing Local Monitoring and Error Handling Tools
- ACCRE already has tools which provide a good statistical analysis of operational readiness and system utilization
- A 24/7 trouble-ticket report system was highly effective in the PHENIX Run7 project

Internal and External Operations Evaluations
- The OSG and CMS automated testing software are already reporting daily network reliability metrics to the ACCRE facility
- Although ACCRE will not be an official Tier1 site in CMS, we will still strive to make its network performance as good as that of any Tier1 in CMS
- CMS-HI will establish an internal oversight group to advise the local Vanderbilt group on the performance of the CMS-HI compute center
  - The oversight group will give annual guidance for new hardware purchases
- The progress and performance of the CMS-HI compute center will be reviewed at the quarterly CMS Compute Resources Board meetings, like all Tier1/Tier2 facilities
Backup: Various Cost Tables
Computing Requirements: Local Disk Storage Allocation Risk Analysis

Storage Allocation Comparisons and Risk Analysis
- CMS-HI proposes to have 680 TB after 5 years, when steady-state data production is 300 TB per year (or possibly even greater)
- CMS-HEP TDR-2005 proposed 15.6 PB of combined Tier1 + Tier2 disk
  - The CMS-HI ratio is 4.4% for disk storage, as compared with 10% for CPUs
- PHENIX had 800 TB of disk storage at RCF in 2007, when 600 TB of raw data was written, six years after RHIC had started
  - Allocating disk space among users and production was painful in PHENIX, with contention for the last few TB of disk space for analysis purposes
  - PHENIX has access to significant disk space at remote sites (CCJ, CCF, Vanderbilt, ...) for real data production and simulations

Mitigation of Disk Storage Allocation Risk
- May have access to other disk resources (REDDNet from NSF, 10s of TB)
- Could decide to change the balance of CPU/disk/tape purchases in the out years
- Disk storage at overseas CMS-HI sites has been pledged (100s of TB)
Computing Requirements: Tape Storage Decision: FNAL or Vanderbilt?

Quote for Tape Archive Services at FNAL (5-year cumulative costs)
- Tape media = $167,990
- Additional tape drives = $32,000
- Tape robot maintenance = $135,000
- Added FNAL staff (0.2 FTE) = $200,000
- Total = $534,990

Quote for Tape Archive Services at Vanderbilt (5-year costs)
- Will use the existing tape robot system (LTO-4 tapes) for the first two years, then move to a larger tape robot system (LTO-6 tapes) for the next three years
- Total hardware cost including license, media, and maintenance = $249,304
- ACCRE staff requirement of 0.5 FTE/year, subsidized at 58% by Vanderbilt
- Net staff cost to DOE = $134,613
- Total = $383,917 (totals checked in the sketch below)

Advantages to CMS-HI of the Tape Archive at Vanderbilt
- Simpler operations: no synchronization between Vanderbilt and FNAL
- The Vanderbilt subsidy of staff costs results in less overall expense to DOE
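A minimal sketch totaling the two quotes above:

```python
# Five-year tape-archive cost comparison, line items from the quotes.
fnal = {
    "tape media": 167_990,
    "additional tape drives": 32_000,
    "tape robot maintenance": 135_000,
    "added FNAL staff (0.2 FTE)": 200_000,
}
vanderbilt = {
    "hardware, license, media, maintenance": 249_304,
    "net staff cost to DOE (0.5 FTE, 58% VU-subsidized)": 134_613,
}
fnal_total = sum(fnal.values())               # $534,990
vu_total = sum(vanderbilt.values())           # $383,917
print(f"FNAL ${fnal_total:,} vs Vanderbilt ${vu_total:,}; "
      f"savings ${fnal_total - vu_total:,}")  # savings $151,073
```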
Capital Cost and Personnel Summary: Real Data Processing and Tape Archiving Center at Vanderbilt

Category          | FY10     | FY11     | FY12     | FY13     | FY14     | Total
CPU cores added   | 440      | 456      | 480      | 360      | 288      | 2024
Total CPU cores   | 440      | 896      | 1376     | 1736     | 2024     | 2024
Disk added (TB)   | 75       | 120      | 120      | 120      | 75       | 510
Total disk (TB)   | 75       | 195      | 315      | 435      | 510      | 510
Tape added (TB)   | 100      | 100      | 270      | 540      | 560      | 1570
Total tape (TB)   | 100      | 200      | 470      | 1010     | 1570     | 1570
CPU cost          | $151,800 | $125,400 | $120,000 | $81,000  | $64,800  | $543,000
Disk cost         | $26,250  | $36,000  | $30,000  | $24,000  | $15,000  | $131,250
Tape cost         | $71,150  | $22,535  | $58,449  | $48,960  | $48,210  | $249,304
Total hardware    | $249,200 | $183,935 | $208,449 | $153,960 | $128,010 | $923,554
Staff (DOE cost)  | $89,650  | $155,394 | $161,609 | $235,303 | $244,715 | $886,671
Total cost        | $338,850 | $339,329 | $370,059 | $389,263 | $372,725 | $1,810,226
Capital Cost and Personnel Summary: Simulation Production and Analysis Center at MIT

Category         | FY10    | FY11    | FY12    | FY13    | FY14    | Total
CPU cores added  | 152     | 152     | 160     | 120     | 96      | 680
Total CPU cores  | 152     | 304     | 464     | 584     | 680     | 680
Disk added (TB)  | 25      | 40      | 40      | 40      | 25      | 170
Total disk (TB)  | 25      | 65      | 105     | 145     | 170     | 170
CPU cost         | $52,440 | $41,800 | $40,000 | $27,000 | $21,600 | $182,000
Disk cost        | $8,750  | $12,000 | $10,000 | $8,000  | $5,000  | $43,750
Total hardware   | $59,350 | $53,800 | $50,000 | $35,000 | $26,600 | $224,750
Staff (0.25 FTE) | $37,040 | $37,280 | $39,400 | $40,600 | $41,800 | $197,120
Total cost       | $96,390 | $92,080 | $89,400 | $75,600 | $68,400 | $421,870
Capital Cost and Personnel Summary: Cost Model Assumptions for the Vanderbilt Site

Hardware unit costs
Category                 | FY10 | FY11 | FY12 | FY13 | FY14 | Total
CPU core (with 4 GB RAM) | $345 | $275 | $250 | $225 | $225 | -
Disk, per TB             | $350 | $300 | $250 | $200 | $200 | -

Staffing (FTE)
By DOE | 0.75 | 1.25 | 1.25 | 1.75 | 1.75 | 6.75
By VU  | 2.25 | 1.75 | 1.75 | 1.25 | 1.25 | 8.25

Staffing (cost)
By DOE | $89,650  | $155,394 | $161,609 | $235,303 | $244,715 | $886,671
By VU  | $268,950 | $217,551 | $226,253 | $168,074 | $174,797 | $1,055,625
Capital Cost and Personnel Summary: Cumulative Cost of the US Proposal

Category                       | FY10     | FY11     | FY12     | FY13     | FY14     | Total
Real data center at Vanderbilt | $338,850 | $339,329 | $370,059 | $389,263 | $372,725 | $1,810,226
Simulation center at MIT       | $96,390  | $92,080  | $89,400  | $75,600  | $68,400  | $421,870
Vanderbilt + MIT total         | $435,240 | $431,409 | $459,459 | $464,863 | $441,125 | $2,232,096