End-to-End Adaptive Packet Aggregation for High-Throughput I/O Bus Network Using Ethernet

Hot Interconnects 2014
End-to-End Adaptive Packet Aggregation for High-Throughput I/O Bus Network Using Ethernet
J. Suzuki, Y. Hayashi, M. Kan, S. Miyakawa, and T. Yoshikawa
Green Platform Research Laboratories, NEC, Japan

Background: ExpEther, a PCIe Interconnect Based on Ethernet
- Extension of PCI Express over Ethernet: PCIe packets (TLPs, Transaction Layer Packets) are encapsulated into Ethernet frames
- Pros: scalable connectivity, system reconfiguration, and separate implementation
[Figure: Hosts A-C and I/O devices A-C connected through ExpEther bridges over Ethernet (VLAN 1, VLAN 2); each TLP is carried inside an Ethernet frame (Ethernet header + TLP); the I/O devices sit in an I/O device box]
J. Suzuki et al., 14th IEEE Symposium on High-Performance Interconnects, pp. 45-51, 2006.

Overhead of TLP Encapsulation
- The overhead of packet-by-packet encapsulation reduces the PCIe throughput obtained over Ethernet connections
- Proposal: aggregate multiple TLPs into a single Ethernet frame
[Figure: Current system - each frame carries one 128B or 256B TLP plus 12B IFG, 8B preamble, 18B of Ethernet header + FCS, and a 4B control header. Proposal - one frame carries the preamble, Ethernet header, and control header once, followed by several TLPs and the FCS]
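As a rough sanity check on these overhead figures, the following Python sketch estimates what fraction of the Ethernet line rate carries TLP data with and without aggregation. The per-frame byte counts come from the slide; the exact ExpEther framing is not specified here, so treat this as an illustrative back-of-the-envelope calculation rather than the product's actual budget.

    # Back-of-the-envelope frame efficiency, using the per-frame overhead from
    # the slide: 12B IFG + 8B preamble + 18B Ethernet header/FCS + 4B control header.
    IFG, PREAMBLE, ETH_HDR_FCS, CTRL_HDR = 12, 8, 18, 4
    PER_FRAME_OVERHEAD = IFG + PREAMBLE + ETH_HDR_FCS + CTRL_HDR  # 42 bytes

    def efficiency(tlp_size: int, tlps_per_frame: int) -> float:
        """Fraction of the raw Ethernet bandwidth that carries TLP bytes."""
        payload = tlp_size * tlps_per_frame
        return payload / (payload + PER_FRAME_OVERHEAD)

    for n in (1, 2, 4, 8):
        print(f"{n} x 128B TLPs per frame: {efficiency(128, n):.1%}")
    # 1 TLP/frame  -> ~75% of the line rate is TLP data
    # 8 TLPs/frame -> ~96%, which is the headroom that aggregation recovers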

Challenges of Packet Aggregation in PCIe over Ethernet
- Low latency: avoid introducing additional delay
  - PCIe traffic is sensitive to delay
  - A ~1-us wait time (needed to fill a maximum-length Ethernet frame with TLPs) is large for a short-latency I/O device, e.g., a PCIe-connected PCM drive
- Must be end-to-end
  - It is difficult to modify commercial Ethernet switches to aggregate TLPs
- No modification of the host system stack
  - Avoid modifying the OS and device drivers

Related Work 1/2
- Previous work on packet aggregation has been done in wireless and optical networks; it falls into two groups (A. Jain et al., PhD Thesis, Univ. of Colorado, 2003)
- [Method 1] Introduce a wait time so that a packet can be aggregated with the next one
  - End-to-end operation is possible
  - Increases transmission delay
[Figure: the sender waits before transmitting into the network toward the receiver]

Related Work 2/2
- [Method 2] Adaptively aggregate packets when they accumulate in a queue (in the network node adjacent to the bottleneck link)
  - Low latency
  - Hop-by-hop only, and the switch needs to be modified
[Figure: adaptive aggregation performed at the node in front of the network bottleneck]

Proposed Method: Adaptive Aggregation Performed End-to-End
- Inside the PCIe-to-Ethernet bridge, perform adaptive aggregation behind the congestion control unit (our congestion control was proposed in earlier work: H. Shimonishi et al., CQR Workshop, 2008)
- TLPs are extracted from the aggregation queue at the rate of the bottleneck link and are aggregated whenever multiple packets have accumulated (see the sketch below)
[Figure: in the PCIe-to-Ethernet bridge at the host, TLPs enter an aggregation queue; adaptive aggregation happens when TLPs accumulate; extraction runs at the rate of the Ethernet bottleneck link, ahead of the congestion control unit]
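A minimal sketch of this behavior, assuming a pacing loop driven by the rate reported by congestion control. The class and function names, the ~1496B payload budget, and the 42B per-frame overhead are illustrative assumptions, not the bridge's actual interfaces.

    import time
    from collections import deque

    MAX_FRAME_PAYLOAD = 1500 - 4      # assumed: 4B control header leaves ~1496B for TLPs

    class AdaptiveAggregator:
        """Aggregation queue drained at the rate notified by congestion control.
        No extra wait time is added: whatever has accumulated by the next send
        opportunity is packed into one frame; a lone TLP simply goes out alone."""
        def __init__(self):
            self.queue = deque()

        def enqueue(self, tlp: bytes):
            self.queue.append(tlp)

        def next_frame(self):
            frame, size = [], 0
            while self.queue and size + len(self.queue[0]) <= MAX_FRAME_PAYLOAD:
                size += len(self.queue[0])
                frame.append(self.queue.popleft())
            return frame              # [] means nothing to send this round

    def send_loop(agg, rate_bps, link_send):
        """Pace extraction at the bottleneck rate reported by congestion control.
        When the bottleneck is slower than the PCIe side, TLPs pile up between
        send opportunities and are aggregated automatically."""
        while True:
            frame = agg.next_frame()
            if frame:
                link_send(frame)
                wire_bytes = sum(len(t) for t in frame) + 42  # per-frame overhead
                time.sleep(wire_bytes * 8 / rate_bps)
            else:
                time.sleep(1e-6)      # idle briefly when nothing is queued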

Features of the Method
- Low latency: no additional wait time is introduced for aggregation
- No manual parameter settings: the number of aggregated TLPs is decided automatically
- Works with off-the-shelf OSes, device drivers, I/O devices, and Ethernet
- Reduced hardware footprint: implementing the aggregation function before the congestion control unit keeps the internal bus narrower than implementing it inside that unit

Architectural Diagram of the PCIe-to-Ethernet Bridge
- Multiple aggregation queues, one per destination node
- TLPs are aggregated in a pipeline at the rate notified by the congestion control function to sustain high-throughput transmission
[Figure: PCIe-to-Ethernet bridge (ExpEther bridge). Send path: destination look-up via the address table, transmission queues sorted by destination, round-robin extraction with aggregation up to the maximum frame size, encapsulation, then the congestion control and retransmission unit (which notifies the aggregation pipeline of the rate) onto Ethernet. Receive path: decapsulation back to PCIe.]

Sending TLPs
1. TLPs received from PCIe are sorted and stored in queues according to their destination
2. TLPs are extracted from the queues in round-robin order; all TLPs in a queue, up to the limit imposed by the maximum Ethernet frame size, are extracted at one time (see the sketch below)
3. The aggregated TLPs are encapsulated and sent to Ethernet
- The rate of round-robin TLP extraction is set to the transmission rate notified by the congestion control function
[Figure: the same bridge diagram, annotated with the three steps - queuing by destination, aggregating TLPs in round robin, and encapsulation]
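A self-contained sketch of the per-destination round-robin extraction described in steps 1-3. The queue structure, names, and frame budget are assumptions for illustration only.

    from collections import deque

    MAX_FRAME_PAYLOAD = 1500 - 4                   # assumed per-frame TLP budget

    class RoundRobinSender:
        """One aggregation queue per destination; each send opportunity drains
        the next non-empty queue up to the maximum Ethernet frame payload."""
        def __init__(self, destinations):
            self.queues = {d: deque() for d in destinations}
            self.order = list(destinations)
            self.idx = 0

        def enqueue(self, dest, tlp: bytes):
            self.queues[dest].append(tlp)          # step 1: sort by destination

        def send_one(self, link_send):
            for _ in range(len(self.order)):       # step 2: round-robin over queues
                dest = self.order[self.idx]
                self.idx = (self.idx + 1) % len(self.order)
                q, frame, size = self.queues[dest], [], 0
                while q and size + len(q[0]) <= MAX_FRAME_PAYLOAD:
                    size += len(q[0])
                    frame.append(q.popleft())      # take everything that fits
                if frame:
                    link_send(dest, frame)         # step 3: encapsulate and send
                    return True
            return False                           # nothing queued for any destination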

Receiving TLPs
- Aggregated TLPs are decapsulated and forwarded to the PCIe bus
[Figure: receive path of the bridge - Ethernet, congestion control and retransmission, decapsulation, PCIe]
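On the receive side, decapsulation is just a walk over the aggregated payload. The slide does not give the on-wire framing, so the length-prefix format below is purely an assumption to make the idea concrete.

    import struct

    def decapsulate(payload: bytes):
        """Split an aggregated Ethernet payload back into individual TLPs.
        The 2-byte length prefix per TLP is an illustrative assumption; the
        real bridge determines TLP boundaries from the TLP headers themselves."""
        tlps, offset = [], 0
        while offset < len(payload):
            (length,) = struct.unpack_from(">H", payload, offset)
            offset += 2
            tlps.append(payload[offset:offset + length])
            offset += length
        return tlps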

Evaluation Using a Prototype
- The aggregation method was implemented in an FPGA-based ExpEther bridge
- I/O performance was evaluated with a 1:1 host-to-I/O-device connection
  - TLP payload size: 128B
  - RAID0 was configured from SATA SSDs housed in a JBOD
  - The congestion control unit implemented for the evaluation is a simple rate-limiting function used to emulate network congestion
[Figure: server with ExpEther bridge, 10GbE link to an I/O box containing an ExpEther bridge and a RAID controller on a PCIe Gen1 x8 bus (= 16 Gb/s), connected over SAS (Serial Attached SCSI) to SATA SSDs in a JBOD]

Performed Evaluations
1. Whether TLPs are adaptively aggregated depending on the performance of the connected I/O device
   - Even with the full Ethernet bandwidth available, the link is a bottleneck for some devices and not for others
   - Vary the number of SSDs configured in RAID0
2. Whether TLPs are adaptively aggregated depending on the bandwidth of the Ethernet bottleneck link
   - Using the rate-limiting function implemented in the FPGA
3. Whether TLP aggregation increases transmission delay

[Eval. 1] Varying the Number of SSDs in RAID0
- The performance of the I/O device increases with the number of SSDs
- Read and write throughput improved by up to 41% and 37%, respectively
  - TLPs started to be aggregated once the RAID0 set contained two SSDs
  - Throughput saturated at 982 MB/s (= 7.9 Gb/s); further improvement appeared difficult because of the remaining TLP and Ethernet header overhead
[Figures: read throughput (up to 41% improvement) and write throughput (up to 37% improvement) vs. number of SSDs]
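A rough calculation of the header-limited ceiling is consistent with this saturation point. The TLP header size and the per-frame overhead below are assumed values (the slide does not give them), so the result should be read only as an order-of-magnitude check.

    # Rough estimate of the header-limited ceiling with aggregation enabled.
    # Assumptions not given on the slide: a ~16B PCIe TLP header per 128B payload
    # and the ~42B per-frame Ethernet overhead amortized over a full frame.
    TLP_PAYLOAD, TLP_HDR, PER_FRAME_OVERHEAD = 128, 16, 42
    TLPS_PER_FRAME = (1500 - 4) // (TLP_PAYLOAD + TLP_HDR)          # ~10 TLPs

    payload_bytes = TLPS_PER_FRAME * TLP_PAYLOAD
    wire_bytes = TLPS_PER_FRAME * (TLP_PAYLOAD + TLP_HDR) + PER_FRAME_OVERHEAD
    print(f"ceiling ~{10.0 * payload_bytes / wire_bytes:.1f} Gb/s of payload on 10GbE")
    # -> roughly 8.6 Gb/s; the measured 982 MB/s (7.9 Gb/s) sits somewhat below
    # this, plausibly due to DMA request/completion traffic and control fields.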

Why Is the Improvement Larger for Reads?
- The bottleneck for read performance is Ethernet throughput
  - DMA requests are issued back-to-back by the I/O device
- The bottleneck for write performance is both Ethernet latency and throughput
  - The I/O device waits for the response to a DMA request before issuing the next one
[Figure: read from the I/O device - the device streams DMA data to the host, limited by link throughput; write to the I/O device - the device issues DMA requests and waits for the data from the host, limited by both latency and throughput]
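One illustrative way to see the latency sensitivity of the write path is a simple request-window model (the window size and round-trip times below are made-up parameters, not measurements): if the device keeps only a limited amount of DMA read data in flight, write throughput is capped by that window divided by the round-trip time, while read throughput is capped only by the link rate.

    def write_throughput_gbps(window_bytes: float, rtt_us: float, link_gbps: float = 10.0):
        """Write path model: the device waits for DMA read completions, so
        throughput is min(link rate, outstanding window / round-trip time)."""
        latency_bound = window_bytes * 8 / (rtt_us * 1e3)   # bits / ns = Gb/s
        return min(link_gbps, latency_bound)

    # Purely illustrative numbers: a 64KB window of outstanding DMA reads.
    print(write_throughput_gbps(64 * 1024, rtt_us=10))      # ~10 Gb/s, link-bound
    print(write_throughput_gbps(64 * 1024, rtt_us=100))     # ~5.2 Gb/s, latency-bound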

[Eval. 2] Varying the Ethernet Bandwidth
- No improvement when the Ethernet bandwidth was 10 Gb/s: Ethernet was not the bottleneck
- Limiting the Ethernet bandwidth to below 5 Gb/s caused TLPs to be aggregated
  - Maximum throughput improvement: 41% for read, 39% for write
[Figures: read and write throughput vs. Ethernet bandwidth; one SSD in RAID0]

[Eval. 3] Increase in TLP Transmission Delay
- Measured the degradation of I/O performance using the file I/O benchmark fio
- 4KB read and write (Ethernet throughput not the bottleneck): no degradation
- 64MB read and write (Ethernet throughput the bottleneck): latency improved because I/O throughput improved

  Configuration          Read [us]   Write [us]   Read (vs. w/o)   Write (vs. w/o)
  4KB  w/ aggregation    70.68       98.18        102%             100%
  4KB  w/o aggregation   69.5        98.21        (no degradation for short I/O)
  64MB w/ aggregation    65180       65038        71%              72%
  64MB w/o aggregation   91622       89993

Conclusion
- End-to-end adaptive I/O packet (TLP) aggregation
  - Aggregation behind congestion control inside the PCIe-to-Ethernet bridge
  - Low latency
  - Works with off-the-shelf OSes, device drivers, I/O devices, and Ethernet switches
- Evaluation of a prototype implemented on an FPGA
  - I/O performance improved by up to 41%
  - No degradation of application performance due to increased latency
- Future work
  - Full implementation of the congestion control function
  - Evaluation with multiple hosts and I/O devices