Analysis of CPU Pinning and Storage Configuration in 100 Gbps Network Data Transfer


Analysis of CPU Pinning and Storage Configuration in 100 Gbps Network Data Transfer
International Center for Advanced Internet Research, Northwestern University
Se-Young Yu, Jim Chen, Joe Mambretti, Fei Yeh
INDIS Workshop, Co-Located With the ACM International Conference for High Performance Computing, Networking, Storage, and Analysis
Dallas, TX, November 11, 2018

Introduction to iCAIR: Accelerating Leading Edge Innovation and Enhanced Global Communications through Advanced Internet Technologies, in Partnership with the Global Community
Creation and Early Implementation of Advanced Networking Technologies: The Next Generation Internet, All Optical Networks, Terascale Networks, Networks for Petascale and Exascale Science
Advanced Applications, Middleware, Large-Scale Infrastructure, NG Optical Networks and Testbeds, Public Policy Studies and Forums Related to Optical Fiber and Next Generation Networks
Three Major Areas of Activity: a) Basic Research, b) Design and Implementation of Prototypes and Research Testbeds, c) Operations of Specialized Communication Facilities (e.g., StarLight, Specialized Science Networks)

StarLight: By Researchers, For Researchers
StarLight: Experimental Optical Infrastructure/Proving Ground For Next Gen Network Services Optimized for High Performance Data Intensive Science
Multiple 100 Gbps (57+ 100 G Paths), StarWave 100 G Exchange
Innovating First-of-a-Kind Services and Capabilities
View from StarLight, Abbott Hall, Northwestern University's Chicago Campus

iCAIR: Founding Partner of the Global Lambda Integrated Facility
Available Advanced Network Resources
Visualization courtesy of Bob Patterson, NCSA; data compilation by Maxine Brown, UIC. www.glif.is

StarLight SDX Context: Providing DTN-Based Services For High Performance Data Flows Across the Globe, Especially For Data Intensive Science, Integrated With an International SDX
National Science Foundation International Research Network Connections RXP: StarLight SDX, A Software Defined Networking Exchange for Global Science and Education (NSF ACI-1450871)

Data Transfer Nodes (DTNs)
Purpose-built systems (network appliances)
High-performance networks and I/O
Optimized for 10-100 Gbps transfers
Providing high-performance transfer tools to applications, processes, and users
Faster disks: SSD, RAID, SAN

100 Gbps Transfer Challenges
Modern x86 CPUs cannot transfer data at 100 Gbps with a single flow; multiple concurrent flows (4-8) are required
I/O devices create many more interrupts (NIC and storage devices, through MSI-X)
HDD RAIDs cannot read/write at 12.5 GB/s
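
As a rough illustration of the multi-flow requirement, the sketch below stripes one file across several parallel TCP streams using only Python's standard library. It is not the transfer tool used in the study; the receiver address, port, source path, and stream count are assumptions.

```python
# Illustrative sketch only: split one file across several parallel TCP
# streams, since a single flow cannot reach 100 Gbps on current x86 CPUs.
# HOST, PORT, SRC, and STREAMS are hypothetical values, not from the paper.
import os
import socket
from multiprocessing import Process

HOST, PORT = "198.51.100.10", 5001   # hypothetical receiver
SRC = "/data/big.file"               # hypothetical source file
STREAMS = 8                          # 4-8 concurrent flows per the slide
CHUNK = 4 * 1024 * 1024              # 4 MiB reads

def send_slice(offset, length):
    """Send one byte range of the file over its own TCP connection."""
    with socket.create_connection((HOST, PORT)) as sock, open(SRC, "rb") as f:
        f.seek(offset)
        remaining = length
        while remaining > 0:
            data = f.read(min(CHUNK, remaining))
            if not data:
                break
            sock.sendall(data)
            remaining -= len(data)

if __name__ == "__main__":
    size = os.path.getsize(SRC)
    per_stream = (size + STREAMS - 1) // STREAMS
    workers = [Process(target=send_slice,
                       args=(i * per_stream,
                             max(0, min(per_stream, size - i * per_stream))))
               for i in range(STREAMS)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```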

DTN Bottlenecks and Potential Solutions
Bottlenecks: slow disk speed, even with multiple NVMe drives; the multiple processors in DTNs start to become a bottleneck
Responses: need to optimize CPU and storage to improve performance in network applications; need to handle applications with an increasing number of cores without increasing processor speed

Non-Uniform Memory Access (NUMA)
Allows multiple processors to access memory simultaneously
Each processor and its local memory form a cluster-like logical processing unit called a NUMA node
The memory controller handles memory access between NUMA nodes
However, there is overhead when accessing memory in a different NUMA node
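
The local-versus-foreign access cost can be inspected directly on a Linux DTN: each NUMA node exports a distance vector in sysfs (10 means local; larger values mean costlier remote access). A small, purely illustrative sketch:

```python
# Print the NUMA distance matrix exposed by Linux sysfs; a node's distance
# to itself is 10, and larger values indicate costlier foreign-node access.
from pathlib import Path

for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    distances = (node / "distance").read_text().split()
    print(f"{node.name}: distances = {distances}")
```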

How to Use NUMA Better
Bind processes and the NIC to the same NUMA node. Processes can be bound:
To a specific core: each process is not allowed to be scheduled onto another core
Within a NUMA node: processes can be scheduled among the cores of the local NUMA node
Anywhere: processes can be scheduled anywhere, regardless of NUMA
Binding a single memory-to-memory transfer process to the local NUMA node improves throughput
Q: Does this technique also improve throughput in disk-to-disk transfers?
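
The three binding modes above can be expressed with Linux CPU affinity; the sketch below uses Python's os.sched_setaffinity as one possible way to apply them. The core numbers and NUMA-node layout are assumptions about a particular host, not values from the paper.

```python
# Minimal sketch of the three binding schemes compared in the slides,
# using Linux CPU affinity. The NUMA node 0 core list is a hypothetical
# layout; on a real DTN it comes from the node that owns the NIC.
import os

def pin_to_core(core):
    """Pinned to Core: the process may only run on one specific core."""
    os.sched_setaffinity(0, {core})

def pin_to_numa_node(node_cpus):
    """Pinned to NUMA: the process may move, but only among the cores
    of the local NUMA node."""
    os.sched_setaffinity(0, node_cpus)

def unpin():
    """System Balanced: the process may be scheduled on any core."""
    os.sched_setaffinity(0, set(range(os.cpu_count())))

if __name__ == "__main__":
    pin_to_numa_node(set(range(0, 14)))   # assume cores 0-13 are NUMA node 0
    print("Current affinity:", sorted(os.sched_getaffinity(0)))
```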

How to Use NUMA Better
Hypothesis: it is better to bind disk transfer processes to local NUMA nodes
Put the NIC and NVMe devices in the same NUMA node and limit transfer processes to that node -> less foreign NUMA memory access
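
One way to check this placement before pinning is to read each device's NUMA node from sysfs. The sketch below is our illustration (not code from the paper); the interface and controller names are assumptions.

```python
# Verify that the 100G NIC and the NVMe controllers share a NUMA node, so
# transfer processes can be limited to it. Device names are hypothetical.
from pathlib import Path

def pci_numa_node(pci_dev):
    """Read a PCI device's NUMA node from sysfs (-1 means unknown)."""
    return int((pci_dev / "numa_node").read_text())

def nic_numa_node(iface):
    return pci_numa_node(Path("/sys/class/net") / iface / "device")

def nvme_numa_node(ctrl):
    return pci_numa_node(Path("/sys/class/nvme") / ctrl / "device")

if __name__ == "__main__":
    nic = nic_numa_node("ens1f0")                          # hypothetical NIC
    disks = {c: nvme_numa_node(c) for c in ("nvme0", "nvme1")}
    if all(node == nic for node in disks.values()):
        print(f"NIC and NVMe devices share NUMA node {nic}; pin transfers there.")
    else:
        print(f"Devices span NUMA nodes (NIC={nic}, NVMe={disks}); expect foreign access.")
```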

Why Test Initially Within a Local Area Network?
Easier to establish a clean path, minimizing packet loss effects
100 Gbps DTNs with NVMe were not widely available until recently
NUMA access latency differences between nodes are in the 100s of nanoseconds, whereas TCP latency is in milliseconds; the larger TCP latency may disguise the affinity effect
Need to highlight the difference between affinity settings

How to Use NUMA Better (Individual Disk)
[Figure: CPU load over time under the System Balanced, Pinned to NUMA, and Pinned to Core configurations]
Throughput was limited by the NVMe devices under test to ~60 Gbps
Binding processes to a specific core reduced overhead

RAID vs. Non-RAID
RAID provides better management for multiple storage systems
Is RAID overhead significant in 100 Gbps transfer?
Are there any differences among different process binding schemes?

How to Use NUMA Better (RAID)
[Figure: throughput (Gbps) vs. elapsed time (s) and CPU load under the System Balanced, Pinned to NUMA, and Pinned to Core configurations]
Software RAID created more overhead, with CPU load at almost 100%
Under heavy loads, scheduling flexibility within the NUMA node helps

NVMe over Fabrics
Connects remote NVMe devices to the local machine
Provides access to the NVMe block device, so file copy tools can be used
Good for streaming instead of copy and paste
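
As a hedged sketch of how a remote namespace might be attached and then read with ordinary file I/O, the example below drives nvme-cli from Python; the target address, subsystem NQN, and resulting device name are assumptions, and the actual setup used in the study may differ.

```python
# Hedged sketch: attach a remote NVMe-oF namespace with nvme-cli, then
# stream it with ordinary file I/O ("file copy tools can be used").
# Target address, NQN, and the /dev/nvme1n1 name are assumptions.
import shutil
import subprocess

TARGET_ADDR = "192.0.2.20"                     # hypothetical NVMe-oF target
SUBSYS_NQN = "nqn.2018-11.example:dtn-nvme"    # hypothetical subsystem NQN

# Connect over RDMA; the remote namespace then appears as a local block device.
subprocess.run(
    ["nvme", "connect", "-t", "rdma", "-a", TARGET_ADDR, "-s", "4420",
     "-n", SUBSYS_NQN],
    check=True,
)

# Stream the (assumed) remote namespace into a local file in large chunks.
with open("/dev/nvme1n1", "rb") as src, open("/data/incoming.img", "wb") as dst:
    shutil.copyfileobj(src, dst, length=16 * 1024 * 1024)
```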

NVMe over Fabrics
[Figure: throughput (Gbps) vs. elapsed time (s) and per-core CPU load (local vs. foreign NUMA) under the System Balanced, Pinned to NUMA, and Pinned to Core configurations]
Faster peak throughput, but similar completion time; some processes have delayed completion times
Pinning to the NUMA node allows faster completion time

StarLight SDX DTN-as-a-Service Software Stack
Building loss-less DTNs in 100 Gbps networks
Developing a NUMA-aware testing framework to identify packet loss between DTNs
Systematic tuning and optimization of the DTNs for 100 Gbps disk-to-disk transfer
Passive monitoring system for bottleneck detection in high-speed network data transfer

X-NET: SCinet Data Transfer Node (DTN) Service
Team members: Jim Chen (NWU/StarLight), Gonzalo Rodrigo (Apple/LBL), Anna Giannakou (LBL), Eric Pouyoul (ESnet), Fei Yeh (NWU/StarLight), Se-Young Yu (NWU/StarLight), Xiao (Shawn) Wang (NWU/StarLight), David Wheeler (NCSA/UIUC)
Able to do:
1) Develop 100G network fiber/link/VLAN/route verification procedures with a portable tester to shorten setup time and improve readiness
2) Prototype user experiment environment isolation and management solutions: Docker/Kubernetes/Rancher/VM; also plan to evaluate other Docker integrations
3) Design an AI-enabled DTN use case and workflow prototype
Related and supported papers:
1) "Analysis of CPU Pinning and Storage Configuration in 100 Gbps Network Data Transfer," Se-Young Yu et al.
2) "BigData Express: Toward Schedulable, Predictable, and High-Performance Data Transfer," Wenji Wu et al.
3) "Flowzilla: A Methodology for Detecting Data Transfer Anomalies in Research Networks," Anna Giannakou et al.
4) Additional papers
Issues and recommendations: DTN use cases; prepare for 100G network data connectivity end-to-end tests; DTN performance tuning over the network

X-NET: SCinet Data Transfer Node (DTN) Service
[Diagram: Projects A-N connected through the SCinet Scalable DTN at SC18 Dallas (8 x 100G, with 7 x 100G and 12 x 100G links) and the StarLight Scalable DTN (10 x 100G), with A x 100G and B x 100G links to the project home sites]

SC18 X-NET Faucet and SCinet DTN Team Collaboration: Faucet Demo with 100G DTN Probe in DNOCs
[Diagram: 100 Gbps flows among DNOC 2644 (Allied Telesis SBx908 GEN2), DNOC 420 (Cisco C9500-32C), and the SCinet NOC running FAUCET]

SC18 X-NET Faucet and SCinet DTN Team Collaboration: Faucet Demo with 100G DTN Probe in DNOCs
Please visit Booth 2851 for the test configuration

Conclusions
Storage systems are still common bottlenecks in 100 Gbps disk-to-disk transfers
CPU affinity settings across NUMA nodes can reduce processor overhead (NB: the best affinity setting may differ per system)
Software RAID for NVMe may not work well at 100 Gbps for now
NVMe over Fabrics has less overhead in 100 Gbps network transfers

Future Work
Disk-to-disk transfers across WANs: 100 Gbps DTNs with NVMe are now available; RoCE v2 requires a very clean path; congestion control based on ECN (like DCTCP)
NVMe over Fabrics over TCP
Hardware NVMe RAID
Apply to the following projects: StarLight SDX DTNs, SCinet DTN, Chameleon Large Flow Appliance
Investigation of intelligence/automation techniques

www.startap.net/starlight
Thanks to the NSF, DOE, DARPA, Universities, National Labs, International Partners, and Other Supporters

Thank you. Any questions?