NCP Computing Infrastructure & T2_PK_NCP Site Update
Saqib Haleem
National Centre for Physics (NCP), Pakistan
Outline
NCP Overview
Computing Infrastructure at NCP
WLCG T2 Site Status
Network Status and Issues
Conclusion and Future Work
About NCP
The National Centre for Physics (NCP), Pakistan, was established to promote research in physics and applied disciplines in the country.
Collaborates with many international organizations, including CERN, SESAME, ICTP, and TWAS.
Major research programs: Experimental High Energy Physics, Theoretical and Plasma Physics, Nano Sciences and Catalysis, Laser Physics, Vacuum Science and Technology, Earthquake Studies.
Organizes/hosts international and national workshops and conferences every year: International Scientific School (ISS), NCP School on LHC Physics.
Computing Infrastructure Overview
Installed resources in the data centre:
No. of Physical Servers: 89
Processor Sockets: 140
CPU Cores: 876
Memory Capacity: 1.7 TB
No. of Disks: ~600
Storage Capacity: ~500 TB
Network Equipment (Switches, Routers, Firewalls): ~50
Computing Infrastructure Overview
The computing infrastructure of NCP mainly serves two projects:
The major portion of compute and storage resources is used by the WLCG T2 site and a local batch system serving the experimental high-energy physics community.
The rest is allocated to the High Performance Computing (HPC) cluster.
WLCG Site at NCP: Status
The NCP-LCG2 site is hosted at the National Centre for Physics (NCP) for the WLCG CMS experiment.
The site was established in 2005 with very limited resources; resources were upgraded in 2009.
It is the only CMS Tier-2 site in Pakistan (T2_PK_NCP).
WLCG Resource Summary:
Physical Hosts: 63
CPU Sockets: 126
No. of CPU Cores: 524
HEPSPEC06: 6365
Disk Storage Capacity: 330 TB
Network Connectivity: 1 Gbps (shared)
Compute Elements (CE): 03 (02 CREAM-CE, 01 HTCondor-CE)
DPM Storage Disk Nodes: 13
NCP-LCG2 WLCG Resource Pledges:
CPU (HEP-SPEC06): Installed 2018: 6365 | Pledged 2018: 6365 | Pledged 2019: 6365
Disk (TB): Installed 2018: 330 | Pledged 2018: 330 | Pledged 2019: 500*
WLCG Hardware Status
Computing Servers:
Sun Fire X4150 (Intel Xeon X5460 @ 3.16 GHz), 02 sockets / 04 cores each: Qty 28, 224 cores
Dell PowerEdge R610 (Intel Xeon X5670 @ 2.93 GHz), 02 sockets / 06 cores each: Qty 25, 300 cores
Total: 524 cores
Storage Servers:
Transtec Lynx 4300 (1 TB disks, 23 TB usable each): Qty 15, ~330 TB usable
Dell PowerEdge T620 (24 x 2 TB): Qty 2, 150 TB raw
The existing hardware is very old, and its replacement is in progress.
Recent purchases:
Compute servers: 3 x Dell PowerEdge R740 (2 x Intel Xeon Silver 4116) and 1 x Lenovo System x3650 (2 x Intel Xeon E5-2699 v4), 116 CPU cores in total
Storage: Dell PowerEdge R740xd (2 x Intel Xeon Silver 4116, 16 cores), 50 TB; Dell PowerEdge R430 (2 x Intel Xeon E5-2609 v4, 16 cores), 100 TB (PowerVault MD1200)
More procurement is in process.
NCP-LCG2 Site Availability & Reliability http://argo.egi.eu/lavoisier/site_ar?site=ncp-lcg2,%20&cr=1&start_date=2018-01-01&end_date=2018-10-31&granularity=monthly&report=critical
Usage Statistics
HPC Cluster
The HPC cluster is deployed on the Rocks Linux distribution and is available 24x7; 100+ CPU cores are available.
Applications installed: WIEN2k, Suspect, GEANT2, R/Rmpi, Quantum Espresso, Scotch, MCR.
Research areas: Material Science, Climate Modeling, Earthquake Studies, Condensed Matter Physics, Plasma Physics.
Figures: Cluster usage (last 2 years); cluster usage trend (last 3 years).
Computing Model
NCP compute server hardware has been partially shifted to an OpenStack-based private cloud for efficient utilization of idle resources.
A Ceph-based block storage model has also been adopted for flexibility.
Currently two computing projects run successfully on the cloud:
WLCG T2 site (compute nodes)
Local HPC cluster (compute nodes)
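To illustrate how such a cloud can be inspected programmatically, here is a minimal sketch using the Python openstacksdk library. The cloud name "ncp-private" is a hypothetical placeholder, not NCP's actual configuration.

```python
# Minimal sketch: inspect an OpenStack private cloud hosting WLCG/HPC compute nodes.
# Assumes a clouds.yaml entry named "ncp-private" (hypothetical) and that
# openstacksdk is installed (pip install openstacksdk).
import openstack

conn = openstack.connect(cloud="ncp-private")  # credentials come from clouds.yaml

# Hypervisors that back the WLCG T2 and HPC compute-node VMs.
for hv in conn.compute.hypervisors():
    print(f"hypervisor {hv.name}: state={hv.state}")

# Ceph-backed block volumes attached to cloud instances.
for vol in conn.block_storage.volumes():
    print(f"volume {vol.name}: {vol.size} GB, status={vol.status}")
```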
Network Status
Currently there is a 1 Gbps R&E link between the Pakistan Education and Research Network (PERN) and TEIN.
Pakistan's connectivity to TEIN is via the Singapore (SG) PoP.
The R&E link is shared among universities/projects in Pakistan (i.e., it is not dedicated to LHC traffic only).
T2_PK_NCP Network Status
NCP has 1 Gbps of network connectivity with PERN (AS45773), with the following traffic policies:
Commercial/commodity Internet: 50 Mbps.
Bandwidth for international R&E destinations (TEIN, GEANT, Internet2) is kept unrestricted over the 1 Gbps link.
Fully operational on both IPv4 and IPv6.
IPv6-enabled NCP services: Email, DNS, WWW, FTP, WLCG Compute Elements (CEs), Storage Element (SE) and pool nodes, perfSONAR node.
Address blocks: 2400:FC00:8540::/44, 111.68.99.128/27, 111.68.96.160/27.
Figure: NCP WLCG site connected through a Huawei NE40 gateway to PERN.
Network Traffic Statistics (1 Year)
~50% of NCP traffic is on IPv6 (mostly WLCG traffic).
The IPv6 traffic share is trending upward.
T2_PK_NCP Network Status
Low network throughput from T1 sites has been observed for the last year, which resulted in the decommissioning of T2_PK_NCP links.
To become a commissioned site for production data transfers, a site must pass a commissioning test [1]:
T1 to T2 average data transfer rate must be > 20 MB/s.
T2 to T1 average data transfer rate must be > 5 MB/s.
However, NCP is getting only ~5-10 MB/s average transfer rates from almost all T1 sites in multiple tests (see the sketch below).
[1] https://twiki.cern.ch/twiki/bin/view/cmspublic/compopstransferteamlinkcommissioning
Figures: Accumulated transfer rate of all T1 sites; accumulated transfer rate of the T1_IT_CNAF site.
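As an illustration of the commissioning criteria, the sketch below encodes the two thresholds from [1] and checks a set of sample rates. The values in measured_mb_s are made-up examples in the ~5-10 MB/s range reported above, not actual CMS measurements.

```python
# Hedged sketch: check CMS link-commissioning thresholds (see [1] above).
# T1 -> T2 must average > 20 MB/s; T2 -> T1 must average > 5 MB/s.
T1_TO_T2_MIN_MB_S = 20.0
T2_TO_T1_MIN_MB_S = 5.0

# Illustrative sample rates (MB/s); NOT real measurements.
measured_mb_s = {
    ("T1_IT_CNAF", "T2_PK_NCP"): 7.5,
    ("T1_UK_RAL", "T2_PK_NCP"): 9.8,
    ("T2_PK_NCP", "T1_IT_CNAF"): 5.6,
}

for (src, dst), rate in measured_mb_s.items():
    threshold = T1_TO_T2_MIN_MB_S if dst == "T2_PK_NCP" else T2_TO_T1_MIN_MB_S
    verdict = "PASS" if rate > threshold else "FAIL"
    print(f"{src} -> {dst}: {rate:.1f} MB/s (need > {threshold:.0f} MB/s): {verdict}")
```

On these sample numbers every T1 -> T2 direction fails the 20 MB/s threshold, which matches the decommissioning outcome described above.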
T2_PK_NCP Network Status
Only the T1_UK_RAL -> NCP link was commissioned (in Nov 2017), and it was later decommissioned in April 2018 due to the low data rate.
T2_PK_NCP: Site Readiness Report
The site status recently changed from Waiting Room (WR) to Morgue due to the network issues.
Ref: https://cms-site-readiness.web.cern.ch/cms-site-readiness/sitereadiness/html/sitereadinessreport.html
Network Issues
Multiple simultaneous network-related issues were identified, which increased the complexity of the problem:
T1_US_FNAL and T1_RU_JINR were initially not following the R&E (TEIN) path (resolved: April 2018).
Low-buffer-related errors were observed in the NCP gateway device; it was replaced with new hardware (resolved: March 2018).
PERN also identified an issue in their uplink route (resolved: March 2018).
PERN Network Complexity
NCP is located in the north region, while the TEIN link terminates at the south region HEC (KHI) PoP.
Multiple physical routes and devices exist on the path from NCP to TEIN.
Initially the PERN network suffered frequent congestion issues and fiber ring outages (the situation is now stable thanks to bifurcation of the fiber routes).
Ref: http://pern.edu.pk/wp-content/uploads/2018/10/rsz_pern_v2_updated.jpg
Network Issues
Steps taken so far to identify the low throughput from WLCG sites (a tuning sketch follows this list):
Involved the WLCG Network Throughput Working Group and the Global NOC (IU).
Tuned storage server system parameters in /etc/sysctl.conf.
Placed a storage node at the NCP edge network.
Deployed perfSONAR nodes at NCP and on the PERN network.
Installed GridFTP clients on the PERN edge to isolate the issue.
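As a hedged illustration of the sysctl tuning step, the sketch below sizes TCP buffers from the bandwidth-delay product of a 1 Gbps path. The 300 ms RTT is an assumed illustrative value for a long R&E route, and the printed keys/values follow common WAN tuning guides rather than NCP's actual configuration.

```python
# Hedged sketch: size TCP buffers from the bandwidth-delay product (BDP).
# On a long fat network, buffers must hold >= bandwidth * RTT of data in
# flight, or a single TCP flow can never fill the pipe.
# 1 Gbps matches the slide; the 300 ms RTT is an assumed illustrative value.

def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> int:
    """Bandwidth-delay product in bytes."""
    return int(bandwidth_bps * rtt_s / 8)

bdp = bdp_bytes(1e9, 0.300)
print(f"BDP = {bdp / 2**20:.1f} MiB")  # ~35.8 MiB

# Candidate /etc/sysctl.conf values (min / default / max), with max >= BDP.
limit = max(bdp, 16 * 2**20)
print(f"net.core.rmem_max = {limit}")
print(f"net.core.wmem_max = {limit}")
print(f"net.ipv4.tcp_rmem = 4096 87380 {limit}")
print(f"net.ipv4.tcp_wmem = 4096 65536 {limit}")
```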
Network Map for Troubleshooting
Network teams from the Global NOC and ESnet prepared a map based on the perfSONAR nodes in the path, initially investigating the low-throughput issue from T1_IT_CNAF (Italy) and T1_UK_RAL to NCP.
Case: https://docs.google.com/document/d/1hhhk9t4PpYPzZOfJUAhupRAodhPT6HNIZi6J8ljpBtw/edit#
Figure: perfSONAR nodes and latency between paths.
perfSONAR Mesh
Ref: http://data.ctc.transpac.org/maddash-webui/index.cgi?dashboard=gtpng%20mesh
Findings & Conclusion
The traffic path through the R&E network has high latency compared to the commercial network: a difference of 100+ ms.
The commercial route provides better WLCG transfer rates, but has issues: guaranteed bandwidth cannot be ensured at all times, either from PERN or from the far-end T1 site.
NCP is currently the major user of the 1 Gbps R&E link in Pakistan, but the ~10-30% utilization by other organizations further reduces the bandwidth available for LHC traffic.
Minor packet loss on a high-latency route can severely impact TCP throughput (see the sketch below).
Coordination among intermediate network operators must increase in order to find the source of the packet loss: possibly a low-buffered intermediate device or a source of intermittent congestion.
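The Mathis et al. model makes the loss/latency point concrete: a single TCP flow's steady-state throughput is bounded by roughly (MSS / RTT) * (C / sqrt(p)). The sketch below evaluates this bound for an assumed 1460-byte MSS at illustrative RTT and loss values; none of these are measured NCP figures.

```python
# Hedged sketch: Mathis model for steady-state TCP throughput,
#   throughput <= (MSS / RTT) * (C / sqrt(p)),  with C ~ sqrt(3/2).
# The MSS, RTTs, and loss rates below are illustrative assumptions.
import math

def mathis_throughput_mb_s(mss_bytes: int, rtt_s: float, loss: float) -> float:
    """Upper bound on a single TCP flow's throughput, in MB/s."""
    return (mss_bytes / rtt_s) * (math.sqrt(1.5) / math.sqrt(loss)) / 1e6

for rtt_ms in (30, 150, 300):      # short commercial path vs. long R&E path
    for loss in (1e-4, 1e-3):      # "minor" packet-loss rates
        rate = mathis_throughput_mb_s(1460, rtt_ms / 1000, loss)
        print(f"RTT {rtt_ms:>3} ms, loss {loss:.0e}: <= {rate:6.2f} MB/s")
```

At 300 ms RTT and 0.1% loss the bound is roughly 0.2 MB/s per flow, two orders of magnitude below the 20 MB/s commissioning threshold, which is why locating the loss matters more than raw link capacity.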
Upcoming Work
Planning to pledge at least 3x more compute and storage to meet WLCG experiment needs during HL-LHC (pledge arithmetic sketched below):
Compute pledges from 6K HEPSPEC06 -> 18K HEPSPEC06.
Storage pledges from 500 TB -> 2 PB.
International R&E bandwidth increase from 1 Gbps to 2.5 Gbps.
Upgrade of storage and ToR switches from 1 Gbps to 10 Gbps.
Upgrade of PERN2 -> PERN3:
Edge network upgrade to 10 Gbps.
Core network upgrade to support 40 Gbps.
NCP site connectivity with LHCONE.
Figures: HL-LHC expected storage requirement; HL-LHC expected computing requirement.
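A small sketch of the pledge arithmetic above, using the figures stated on this slide (growth factors computed from them):

```python
# Sketch of the planned resource scaling stated above (values from the slide).
pledges = {
    "compute (HEPSPEC06)":  (6_000, 18_000),
    "storage (TB)":         (500, 2_000),   # 500 TB -> 2 PB
    "R&E bandwidth (Gbps)": (1.0, 2.5),
}
for name, (current, planned) in pledges.items():
    print(f"{name}: {current} -> {planned} ({planned / current:.1f}x)")
```

Storage grows 4.0x and compute 3.0x, consistent with the "at least 3x" target.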
Q & A