NCP Computing Infrastructure & T2-PK-NCP Site Update
Saqib Haleem, National Centre for Physics (NCP), Pakistan

Outline
- NCP Overview
- Computing Infrastructure at NCP
- WLCG T2 Site Status
- Network Status and Issues
- Conclusion and Future Work

About NCP
- The National Centre for Physics (NCP), Pakistan, was established to promote research in physics and applied disciplines in the country.
- NCP collaborates with many international organizations, including CERN, SESAME, ICTP and TWAS.
- Major research programs: Experimental High Energy Physics, Theoretical and Plasma Physics, Nano Sciences and Catalysis, Laser Physics, Vacuum Sciences and Technology, and Earthquake Studies.
- NCP organizes and hosts international and national workshops and conferences every year, including the International Scientific School (ISS) and the NCP School on LHC Physics.

Computing Infrastructure Overview
Installed resources in the data centre:
- Physical servers: 89
- Processor sockets: 140
- CPU cores: 876
- Memory capacity: 1.7 TB
- Disks: ~600
- Storage capacity: ~500 TB
- Network equipment (switches, routers, firewalls): ~50

Computing Infrastructure Overview
The computing infrastructure at NCP mainly serves two projects:
- The major portion of the compute and storage resources is used by the WLCG Tier-2 site project and the local batch system, serving the experimental high-energy physics community.
- The rest is allocated to a High Performance Computing (HPC) cluster.

WLCG Site at NCP: Status
- The NCP-LCG2 site is hosted at the National Centre for Physics (NCP) for the WLCG CMS experiment.
- The site was established in 2005 with very limited resources; the resources were upgraded in 2009.
- It is the only CMS Tier-2 site in Pakistan (T2_PK_NCP).
WLCG resource summary:
- Physical hosts: 63
- CPU sockets: 126
- CPU cores: 524
- HEPSPEC06: 6365
- Disk storage capacity: 330 TB
- Network connectivity: 1 Gbps (shared)
- Compute Elements (CE): 03 (02 CREAM-CE, 01 HTCondor-CE)
- DPM storage disk nodes: 13
NCP-LCG2 WLCG resource pledges:
- CPU (HEP-SPEC06): installed 2018: 6365; pledged 2018: 6365; pledged 2019: 6365
- Disk (TB): installed 2018: 330; pledged 2018: 330; pledged 2019: 500*

WLCG Hardware Status
Computing servers:
- Sun Fire X4150 (Intel Xeon X5460 @ 3.16 GHz), 2 sockets / 4 cores each: quantity 28, 224 cores
- Dell PowerEdge R610 (Intel Xeon X5670 @ 2.93 GHz), 2 sockets / 6 cores each: quantity 25, 300 cores
- Total: 524 cores
Storage servers:
- Transtec Lynx 4300 (1 TB disks, 23 TB per server): quantity 15, 330 TB usable
- Dell PowerEdge T620 (24 x 2 TB disks): quantity 2, 150 TB raw
The existing hardware is very old, and its replacement is in progress. Recent purchases:
- Compute servers: 3 x Dell PowerEdge R740 (2 x Intel Xeon Silver 4116 each) and 1 x Lenovo System x3650 (2 x Intel Xeon E5-2699 v4), 116 CPU cores in total.
- Storage: Dell PowerEdge R740xd (2 x Intel Xeon Silver 4116 = 16 cores, 50 TB) and Dell PowerEdge R430 (2 x Intel Xeon E5-2609 v4 = 16 cores, 100 TB on a PowerVault MD1200).
More procurement is in process.

NCP-LCG2 Site Availability & Reliability
Ref: http://argo.egi.eu/lavoisier/site_ar?site=ncp-lcg2,%20&cr=1&start_date=2018-01-01&end_date=2018-10-31&granularity=monthly&report=critical

Usage Statistics

HPC Cluster
- The HPC cluster is deployed on the Rocks Linux distribution and is available 24x7.
- 100+ CPU cores are available.
- Applications installed: WIEN2k, SuSpect, GEANT2, R/Rmpi, Quantum ESPRESSO, Scotch, MCR, and others.
- Research areas: material science, climate modeling, earthquake studies, condensed matter physics and plasma physics.
[Figures: cluster usage over the last 2 years; cluster usage trend over the last 3 years]

Computing Model
- NCP compute server hardware has been partially shifted to an OpenStack-based private cloud for more efficient utilization of idle resources.
- A Ceph-based block storage model has also been adopted for flexibility (see the sketch below).
- Two computing projects currently run successfully on the cloud: the WLCG T2 site (compute nodes) and the local HPC cluster (compute nodes).
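As an illustration of the Ceph block-storage layer, here is a minimal sketch using the standard Ceph Python bindings (rados/rbd); the pool name and image name are hypothetical, not NCP's actual configuration. It creates an RBD image of the kind an OpenStack block-storage backend would attach to a cloud VM.

    import rados
    import rbd

    # Connect to the Ceph cluster with the default client configuration.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Open an I/O context on a (hypothetical) pool used for cloud volumes.
    ioctx = cluster.open_ioctx('volumes')

    # Create a 10 GiB RBD image; OpenStack would expose such an image
    # to a compute node as a virtual block device.
    rbd.RBD().create(ioctx, 'demo-volume-01', 10 * 1024**3)

    ioctx.close()
    cluster.shutdown()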

Network Status
- There is currently a 1 Gbps R&E link between the Pakistan Education and Research Network (PERN) and TEIN.
- Pakistan's connectivity to TEIN is via the Singapore (SG) PoP.
- The R&E link is shared among universities and projects in Pakistan (i.e., it is not dedicated to LHC traffic only).

T2_PK_NCP Network Status
NCP has 1 Gbps of network connectivity with PERN (AS45773), with the following traffic policies:
- Commercial/commodity Internet: 50 Mbps.
- R&E traffic (TEIN, GEANT, Internet2 and other R&E networks): 1 Gbps, unrestricted.
The site is fully operational on both IPv4 and IPv6. IPv6 is enabled for NCP services (email, DNS, WWW, FTP) as well as for the WLCG Compute Elements (CEs), the Storage Element (SE) and pool nodes, and the perfSONAR node. Address ranges: 2400:FC00:8540::/44 (IPv6), 111.68.99.128/27 and 111.68.96.160/27 (IPv4).
[Diagram: NCP WLCG site behind a Huawei NE40 edge device, connected to PERN (AS45773), with the 1 Gbps R&E path to TEIN/GEANT/Internet2 and the 50 Mbps commercial Internet path.]

Network Traffic Statistics (1 Year)
- ~50% of NCP traffic is over IPv6 (mostly WLCG traffic).
- The IPv6 traffic trend is increasing.

T2_PK_NCP Network Status
Low network throughput from T1 sites has been observed for the last year, which resulted in the decommissioning of the T2_PK_NCP links. To become a commissioned site for production data transfers, a site must pass a commissioning test [1]:
- T1 to T2: the average data transfer rate must be > 20 MB/s.
- T2 to T1: the average data transfer rate must be > 5 MB/s.
However, NCP is getting only ~5 to 10 MB/s average transfer rates from almost all T1 sites in multiple tests (see the check below).
[1] https://twiki.cern.ch/twiki/bin/view/cmspublic/compopstransferteamlinkcommissioning
[Figures: accumulated transfer rate of all T1 sites; accumulated transfer rate of the T1_IT_CNAF site]
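The following minimal sketch restates the commissioning thresholds quoted above as a simple check; the per-link rates are hypothetical sample values in MB/s, not actual NCP measurements.

    # CMS link-commissioning thresholds quoted above (MB/s).
    T1_TO_T2_MIN = 20.0
    T2_TO_T1_MIN = 5.0

    # Hypothetical average rates over a test window (MB/s).
    t1_to_t2 = {"T1_IT_CNAF": 7.5, "T1_UK_RAL": 9.8, "T1_US_FNAL": 6.1}
    t2_to_t1 = {"T1_IT_CNAF": 5.4, "T1_UK_RAL": 6.0}

    for site, rate in t1_to_t2.items():
        verdict = "PASS" if rate > T1_TO_T2_MIN else "FAIL"
        print(f"{site} -> T2_PK_NCP: {rate:.1f} MB/s [{verdict}]")

    for site, rate in t2_to_t1.items():
        verdict = "PASS" if rate > T2_TO_T1_MIN else "FAIL"
        print(f"T2_PK_NCP -> {site}: {rate:.1f} MB/s [{verdict}]")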

T2_PK_NCP Network Status
Only the T1_UK_RAL -> NCP link was commissioned, in November 2017, and it was later decommissioned in April 2018 due to a low data rate.

T2_PK_NCP: Site Readiness Report
The site status recently changed from Waiting Room (WR) to Morgue due to the network issues.
Ref: https://cms-site-readiness.web.cern.ch/cms-site-readiness/sitereadiness/html/sitereadinessreport.html

Network Issues
Multiple simultaneous network-related issues were identified, which increased the complexity of the problem:
- T1_US_FNAL and T1_RU_JINR traffic was initially not following the R&E (TEIN) path. (Resolved: April 2018)
- Low-buffer-related errors were observed in the NCP gateway device, which was replaced with new hardware. (Resolved: March 2018)
- PERN also identified an issue in their uplink route. (Resolved: March 2018)

PERN Network Complexity
- NCP is located in the north region, while the TEIN link terminates at the south region HEC (Karachi) PoP.
- Multiple physical routes and devices exist on the path from NCP to TEIN.
- Initially the PERN network suffered frequent congestion issues due to fiber ring outages. (The situation is now stable thanks to the bifurcation of fiber routes.)
Ref: http://pern.edu.pk/wp-content/uploads/2018/10/rsz_pern_v2_updated.jpg

Network Issues
Steps taken so far to identify the cause of the low throughput from WLCG sites:
- Involved the WLCG Network Throughput Working Group and the Global NOC (IU).
- Tuned storage server kernel parameters via the /etc/sysctl.conf file (see the sketch after this list).
- Placed a storage node at the NCP edge network.
- Deployed perfSONAR nodes in the NCP and PERN networks.
- Installed GridFTP clients at the PERN edge to isolate the issue.
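To illustrate the kind of kernel tuning involved, the sketch below compares the running TCP settings against values commonly recommended for high-latency, high-bandwidth WAN transfers. The target values are illustrative assumptions based on widely published tuning guides, not NCP's actual /etc/sysctl.conf.

    # Compare current Linux TCP settings against commonly recommended
    # WAN-tuning values (illustrative assumptions, not NCP's actual config).
    RECOMMENDED = {
        "net.core.rmem_max": "67108864",
        "net.core.wmem_max": "67108864",
        "net.ipv4.tcp_rmem": "4096 87380 33554432",
        "net.ipv4.tcp_wmem": "4096 65536 33554432",
        "net.ipv4.tcp_congestion_control": "htcp",
    }

    for key, want in RECOMMENDED.items():
        path = "/proc/sys/" + key.replace(".", "/")
        try:
            with open(path) as f:
                current = " ".join(f.read().split())
        except OSError:
            current = "<not available>"
        verdict = "OK" if current == want else "CHECK"
        print(f"{key}: current={current!r} recommended={want!r} [{verdict}]")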

Network Map for Troubleshooting
The network teams from the Global NOC and ESnet prepared a map based on the perfSONAR nodes along the path, initially investigating the low-throughput issue from T1_IT_CNAF (Italy) and T1_UK_RAL to NCP.
Case: https://docs.google.com/document/d/1hhhk9t4PpYPzZOfJUAhupRAodhPT6HNIZi6J8ljpBtw/edit#
[Figure: perfSONAR nodes and latency between paths]

perfSONAR Mesh
Ref: http://data.ctc.transpac.org/maddash-webui/index.cgi?dashboard=gtpng%20mesh

Findings & Conclusion
- The traffic path through the R&E network has much higher latency than the commercial network: a difference of 100+ ms.
- The commercial route currently provides better WLCG transfer rates, but it cannot guarantee bandwidth at all times, either from PERN or from the far-end T1 site.
- NCP is currently the major user of the 1 Gbps R&E link in Pakistan, but the ~10-30% utilization by other organizations reduces the bandwidth available for LHC traffic.
- Minor packet loss on a high-latency route can severely impact throughput (see the estimate below).
- Coordination among the intermediate network operators needs to increase in order to find the source of the packet loss, possibly a low-buffered intermediate device or a source of intermittent congestion.
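To illustrate why a small amount of packet loss hurts so much on the high-latency R&E path, the sketch below applies the well-known Mathis et al. bound on single-stream TCP throughput, rate ≈ (MSS / RTT) x (1.22 / sqrt(loss)); the RTT and loss figures are illustrative assumptions, not measured NCP values.

    from math import sqrt

    MSS = 1460   # bytes per segment
    C = 1.22     # Mathis constant

    def mathis_rate_mb_per_s(rtt_s, loss):
        """Approximate single-stream TCP throughput in MB/s."""
        return (MSS / rtt_s) * (C / sqrt(loss)) / 1e6

    # Illustrative paths (RTT in ms, packet-loss probability).
    for label, rtt_ms, loss in [
        ("commercial path", 200, 1e-4),
        ("R&E path via TEIN", 300, 1e-4),
        ("R&E path via TEIN, higher loss", 300, 1e-3),
    ]:
        rate = mathis_rate_mb_per_s(rtt_ms / 1000.0, loss)
        print(f"{label}: RTT {rtt_ms} ms, loss {loss:.0e} -> ~{rate:.2f} MB/s per stream")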

Upcoming Work
- Plan to pledge at least 3x more compute and storage to meet the needs of the WLCG experiment during HL-LHC:
  - Compute pledges from 6K HEP-SPEC06 to 18K HEP-SPEC06.
  - Storage pledges from 500 TB to 2 PB.
- International R&E bandwidth increase from 1 Gbps to 2.5 Gbps.
- Upgrade of the storage and ToR switches from 1 Gbps to 10 Gbps.
- Upgrade of PERN2 to PERN3:
  - Edge network upgrade to 10 Gbps.
  - Core network upgrade to support 40 Gbps.
- NCP site connectivity with LHCONE.
[Figures: expected HL-LHC storage requirement; expected HL-LHC computing requirement]

Q & A