Distributed Monte Carlo Production for Joel Snow Langston University DOE Review March 2011
Outline
  Introduction
  FNAL SAM
  SAMGrid
  Interoperability with OSG and LCG
  Production System
  Production Results
  LUHEP Computing
  Summary
Introduction
  Covers my tenure as MC production coordinator
  Simulation data (MC) is crucial to physics analysis
  Tevatron luminosity, and hence raw data volume, are at record levels
    Challenge for analysts and production
  Personnel & computing resources are migrating to LHC experiments
  DZero strategy
    Increase automation
    Leverage resources and support
Evolution
  Mature experiment, but nimble
    History of adopting innovative technologies
    Distributed data handling - SAM
    Early adopter of the grid for production - SAMGrid
    Significant investment in these technologies
  Grid technology allows opportunistic usage
    DZero can mix traditional dedicated and opportunistic resources
  Grid interoperability
    Leverages resources and support, reduces personnel needs per CPU hour
Sequential data Access via Metadata (SAM)
  Fermilab system first used by DZero
  SAM distributed data handling system predates grid
  Set of servers working together to store and retrieve files and metadata
  Permanent storage and local disk caches
  Database tracks location, metadata of files, job processing history
  Delivers files to jobs (using GridFTP over WAN), provides job submission capabilities
SAMGrid
  Fermilab-developed grid, first used by DZero for global MC production in 2004
  SAMGrid = SAM + Job and Information Management (JIM) components
  Provides the user with transparent remote job submission, data processing and status monitoring
  VDT based (Globus + Condor)
  Logically consists of
    Multiple execution sites
    Resource selector
    Multiple job submission (scheduler) sites
    Multiple clients (user interface) to submission site
SAMGrid Interoperability
  As Open Science Grid (OSG) and LHC Computing Grid (LCG) became operational, it was desirable to leverage these resources for DZero
  FNAL and DZero developed and deployed SAMGrid interoperability with both LCG and OSG resources
  Execution site acts as a forwarding node
    Packages SAMGrid jobs for OSG/LCG job submission via Condor-G
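
The forwarding step can be pictured as rendering a Condor-G submit description for each packaged job. The sketch below is illustrative only: the slides do not show SAMGrid's actual submit files, and the gatekeeper host, job manager, wrapper script and job id are hypothetical. It uses only standard HTCondor grid-universe keywords (`universe = grid`, `grid_resource = gt2 ...`, `queue`).

```python
def condor_g_submit(gatekeeper: str, jobmanager: str, executable: str, job_id: str) -> str:
    """Render a minimal Condor-G submit description for one forwarded job.

    All argument values are supplied by the caller; nothing here is taken
    from the real SAMGrid forwarding node configuration.
    """
    lines = [
        "universe = grid",
        # gt2 = Globus GRAM2 gatekeeper, addressed as <host>/jobmanager-<batch system>
        f"grid_resource = gt2 {gatekeeper}/jobmanager-{jobmanager}",
        f"executable = {executable}",
        f"output = {job_id}.out",
        f"error = {job_id}.err",
        f"log = {job_id}.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical example values, for illustration only.
example = condor_g_submit("gatekeeper.example.org", "pbs", "mc_wrapper.sh", "mc_job_0001")
```

The resulting text would be handed to `condor_submit` on the forwarding node; the remote site then sees an ordinary grid job rather than a SAMGrid-specific one.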
Consolidation, Automation, Exploitation
  SAMGrid sites require operational manpower and expert support
  People power and FNAL support migrating to LHC experiments
  Increase automation - Automc
  Reduce number of SAMGrid sites, increase use of OSG and LCG
    Comes with support
    Provides opportunistic job slots
Production System
  MC production gets work from the SAM Request System
  Physics groups' MC requests are parametrized and prioritized as a Python object
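
A request that is "parametrized and prioritized as a Python object" might look roughly like the sketch below. The class, its fields and the priority convention (lower number served first) are assumptions made for illustration; the slides do not show the actual SAM Request System schema.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class MCRequest:
    """Hypothetical MC request; ordering uses (priority, request_id) only."""
    priority: int                                # assumed: lower value = served first
    request_id: int                              # tie-breaker: older requests first
    physics_group: str = field(compare=False)    # requesting group (not used for ordering)
    n_events: int = field(compare=False)         # number of events requested


def next_requests(requests, count):
    """Return the `count` requests that should be worked on next."""
    return heapq.nsmallest(count, requests)


# Hypothetical request list, for illustration only.
pending = [
    MCRequest(2, 101, "group_a", 500_000),
    MCRequest(1, 102, "group_b", 250_000),
    MCRequest(1, 100, "group_c", 100_000),
]
work = next_requests(pending, 2)   # requests 100 and 102, in that order
```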
Automatic Monte Carlo Request Processing
  Developed the Automc system, in use at FNAL
  Handles official DZero MC production at all but 2 sites
  From approved request to final data storage
  Easy to use
    Minimizes manpower needs
  Site independent
    Deploy for any grid site (SAMGrid, OSG, LCG)
    Capable of managing many sites
  Handles recovery of common failures
  Integrates with existing MC request priority protocol
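
The idea of carrying a request from approval to final storage while recovering from common failures can be sketched as a staged loop with bounded retries. This is not Automc's code: the stage names, the retry limit and the `run_stage` callback are all hypothetical.

```python
from typing import Callable, List

# Hypothetical stage names; the real system's steps are not listed in the slides.
STAGES = ["submit", "run", "merge", "store"]


def process_request(run_stage: Callable[[str], bool], max_retries: int = 3) -> List[str]:
    """Drive one request through every stage, retrying failed stages.

    `run_stage(stage)` returns True on success. Each stage is attempted at
    most `max_retries` times. Returns a log of "stage:ok", "stage:retry"
    and "stage:failed" entries; processing stops at the first stage that
    fails on every attempt.
    """
    log: List[str] = []
    for stage in STAGES:
        for attempt in range(1, max_retries + 1):
            if run_stage(stage):
                log.append(f"{stage}:ok")
                break
            log.append(f"{stage}:retry" if attempt < max_retries else f"{stage}:failed")
        else:
            # Every attempt failed: stop here and leave the request for an operator.
            return log
    return log
```

Keeping the retry policy in one place is what lets a single operator watch many sites: only requests whose log ends in `failed` need human attention.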
AutoMC Monitoring
  Running at FNAL & managing production at 39 sites
  http://www-d0.fnal.gov/computing/mcprod/dajd/dajd_status.html
Production System Resources
  MC production uses a variety of dedicated and opportunistic resources on 4 continents
  Non-grid site at ccin2p3 Lyon (FR): very productive, flexible
  Native SAMGrid sites: FZU (CZ), GridKa (DE), LUHEP (US), USTC (CN)
  LCG resources: CE's, SE's, and SAMGrid-LCG infrastructure in FR, UK, NL
  OSG resources: CE's, SE's, and SAMGrid-OSG infrastructure in US
MC Production Results
  Looking back at the last 30 days
  Averaging 5.8M events per day and totaling 172.8M events in 30 days
MC Production Results
  Looking back at the last year (2010/02/14-2011/02/14)
  [Plots: last year; cumulative since September 2005]
  Averaging 49M events per week and totaling 2.6B events in a year
MC Production Results
  Looking back at the last year by production segment
  52 week averages per week (2010/02/14-2011/02/14)
    Non-grid: 19.8M, OSG: 11.4M, SAMGrid: 12.6M, LCG: 4.9M
MC Production Results
  Looking back at the last year by production segment
  [Plots: cumulative since September 2005; pie chart "Production Last Year By Segment"]
  52 week totals (2010/02/14-2011/02/14)
    Non-grid: 1041M (40.8%), OSG: 596M (23.3%), SAMGrid: 658M (25.8%), LCG: 257M (10.1%)
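
As a cross-check, the segment shares can be recomputed from the 52 week totals quoted on this slide. The sketch below uses only those four rounded totals; because the inputs are rounded to the nearest million events, the recomputed shares agree with the quoted percentages to within about 0.1 percentage point rather than exactly.

```python
# 52 week totals in millions of events, as quoted on the slide.
totals_m = {"Non-grid": 1041, "OSG": 596, "SAMGrid": 658, "LCG": 257}


def shares(totals):
    """Return each segment's share of the grand total, in percent."""
    grand = sum(totals.values())
    return {name: 100.0 * value / grand for name, value in totals.items()}


segment_share = shares(totals_m)
grand_total_m = sum(totals_m.values())   # 2552 million events, about 2.6B
```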
MC Production Geographic Distribution
  Events last year (2010/02/14-2011/02/14)
    Europe: 1925M (75.4%)
    N. America: 574M (22.5%)
    Asia: 29M (1.1%)
    S. America: 24M (0.9%)
MC Production Results
  Looking back at the last 5.5 years (2005/09/05-2011/02/14)
  [Plot: cumulative since September 2005]
  Averaging 19.2M events per week and totaling 2.82B events
MC Production Results
  Looking back at the last 5.5 years by production segment
  5.5 year averages per week (2005/09/05-2011/02/14)
    Non-grid: 8.0M, OSG: 4.8M, SAMGrid: 5.3M, LCG: 1.1M
MC Production Results
  Looking back at the last 5.5 years by production segment
  [Plots: cumulative since September 2005; pie chart by segment]
  5.5 year totals (2005/09/05-2011/02/14)
    Non-grid: 2.26B (41.5%), OSG: 1.37B (25.2%), SAMGrid: 1.51B (27.7%), LCG: 306M (5.6%)
Production Results Last 7 Years
  DZero MC production in millions of events, per year ending 12/26
  [Chart: millions of events per year, 2004-2010, by segment (Non-Grid, SAMGrid, OSG, LCG)]

  Year    Total  Non-Grid  SAMGrid    OSG    LCG
  2010   2388.5    1011.2    614.8  539.2  223.3
  2009   1122.6     540.3    217.9  364.2    0.3
  2008    794.8     315.6    213.6  259.7    5.8
  2007    398.2     109.1    158.1   96.5   34.4
  2006    348.0     144.4    195.5    0.5    7.6
  2005     98.1      68.6     29.5    0.0    0.0
  2004     42.4      41.8      0.6    0.0    0.0
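
The year-over-year growth implied by the table can be read off with a few lines of arithmetic. The sketch below uses only the "Total" column of the events table (millions of events per year ending 12/26); the function name is an illustrative choice.

```python
# "Total" column of the events table, in millions of events.
yearly_total_m = {
    2004: 42.4,
    2005: 98.1,
    2006: 348.0,
    2007: 398.2,
    2008: 794.8,
    2009: 1122.6,
    2010: 2388.5,
}


def growth_factors(totals):
    """Return total(year) / total(year - 1) for every year with a predecessor."""
    return {
        year: totals[year] / totals[year - 1]
        for year in sorted(totals)
        if year - 1 in totals
    }


growth = growth_factors(yearly_total_m)
seven_year_sum_m = sum(yearly_total_m.values())   # all seven years combined
```

With these inputs production roughly doubled from 2009 to 2010 (a factor of about 2.1), and the seven yearly totals sum to about 5.2 billion events.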
Production Results Last 7 Years
  DZero MC production in terabytes of data, per year ending 12/26
  [Chart: terabytes of data per year, 2004-2010, by segment (Non-Grid, SAMGrid, OSG, LCG)]

  Year    Total  Non-Grid  SAMGrid    OSG    LCG
  2010    221.0      83.3     61.8   53.7   22.3
  2009     95.3      42.7     19.8   32.8    0.0
  2008     67.8      26.9     18.4   22.0    0.5
  2007     31.6       7.3     13.2    8.2    2.9
  2006     23.0       9.4     13.1    0.0    0.5
  2005      6.0       4.1      1.9    0.0    0.0
  2004      1.9       1.9      0.0    0.0    0.0
OU DZero MC Production
  2005/09/05-2011/02/14: OUHEP produced 306M events and 28.4 TB of data
  Last year: OUHEP produced 139M events and 14.0 TB of data
  [Plots: last year (2010/02/14-2011/02/14); cumulative since Sept. 2005]
LU DZero MC Production
  2005/09/05-2011/02/14: LUHEP produced 15.5M events and 1.36 TB of data
  Last year: LUHEP produced 4.6M events and 450 GB of data
  [Plots: last year (2010/02/14-2011/02/14); cumulative since Sept. 2005]
LUHEP Computing
  2 grid-enabled clusters, both producing DØ MC
    Old SAMGrid cluster - 12 job slots
    New OSG cluster - 12 job slots, with small associated SE used as DØ cache
Condor Q's at LUHEP
  [Plots: SAMGrid and OSG Condor queues, last year]
Summary
  DZero's early deployment of grid technology and automation has dramatically increased MC production
    First deployment of the SAM distributed data handling system
    Early SAMGrid deployment
    Use of OSG and LCG resources through interoperability with SAMGrid
    First opportunistic usage of OSG Storage Elements
    Automated MC production system
  Anticipate adequate MC through the last analysis