Gateware Defined Networking (GDN) for Ultra Low Latency Trading and Compliance

Similar documents
Implementing Ultra Low Latency Data Center Services with Programmable Logic

Implementing Ultra Low Latency Data Center Services with Programmable Logic

FPGA Augmented ASICs: The Time Has Come

HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS

The Myricom ARC Series with DBL

High Performance Packet Processing with FlexNIC

PVPP: A Programmable Vector Packet Processor. Sean Choi, Xiang Long, Muhammad Shahbaz, Skip Booth, Andy Keep, John Marshall, Changhoon Kim

Virtual Switch Acceleration with OVS-TC

ANIC Host CPU Offload Features Overview An Overview of Features and Functions Available with ANIC Adapters

RoCE vs. iwarp Competitive Analysis

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research

Speeding up Linux TCP/IP with a Fast Packet I/O Framework

Experience with the NetFPGA Program

INT-1010 TCP Offload Engine

A Low Latency Solution Stack for High Frequency Trading. High-Frequency Trading. Solution. White Paper

INT 1011 TCP Offload Engine (Full Offload)

Agilio CX 2x40GbE with OVS-TC

Solace Message Routers and Cisco Ethernet Switches: Unified Infrastructure for Financial Services Middleware

打造 Linux 下的高性能网络 北京酷锐达信息技术有限公司技术总监史应生.

Much Faster Networking

The Myricom ARC Series of Network Adapters with DBL

10G bit UDP Offload Engine (UOE) MAC+ PCIe SOC IP

6.9. Communicating to the Outside World: Cluster Networking

Accelerated Programmable Services. FPGA and GPU augmented infrastructure.

Netronome 25GbE SmartNICs with Open vswitch Hardware Offload Drive Unmatched Cloud and Data Center Infrastructure Performance

FlexNIC: Rethinking Network DMA

Hardware NVMe implementation on cache and storage systems

10Gb Ethernet: The Foundation for Low-Latency, Real-Time Financial Services Applications and Other, Latency-Sensitive Applications

Agilio OVS Software Architecture

Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet

Important new NVMe features for optimizing the data pipeline

Netronome NFP: Theory of Operation

Be Fast, Cheap and in Control with SwitchKV. Xiaozhou Li

INT G bit TCP Offload Engine SOC

Advanced Computer Networks. End Host Optimization

TLDK Overview. Transport Layer Development Kit Keith Wiles April Contributions from Ray Kinsella & Konstantin Ananyev

OpenStack Networking: Where to Next?

Network Virtualization in Multi-tenant Datacenters

Introduction to the OpenCAPI Interface

OpenOnload. Dave Parry VP of Engineering Steve Pope CTO Dave Riddoch Chief Software Architect

SmartNIC Programming Models

Flash vs. Disk Storage: Testing Workloads is Key

NVMe : Redefining the Hardware/Software Architecture

The Missing Piece of Virtualization. I/O Virtualization on 10 Gb Ethernet For Virtualized Data Centers

An NVMe-based FPGA Storage Workload Accelerator

GUARANTEED END-TO-END LATENCY THROUGH ETHERNET

WebSphere MQ Low Latency Messaging V2.1. High Throughput and Low Latency to Maximize Business Responsiveness IBM Corporation

10 G Bit TCP+UDP Offload Engine (TOE+UOE) Hardware IP Core

1G Bit TCP+UDP Offload Engine (TOE+UOE) Hardware IP Core

InfiniBand Networked Flash Storage

An Implementation of the Homa Transport Protocol in RAMCloud. Yilong Li, Behnam Montazeri, John Ousterhout

Enyx soft-hardware design services and development framework for FPGA & SoC

URDMA: RDMA VERBS OVER DPDK

Solarflare and OpenOnload Solarflare Communications, Inc.

Five ways to optimise exchange connectivity latency

The latency of user-to-user, kernel-to-kernel and interrupt-to-interrupt level communication

The Best Ethernet Storage Fabric

Using FPGAs to accelerate NVMe-oF based Storage Networks

Maximum Performance. How to get it and how to avoid pitfalls. Christoph Lameter, PhD

Jakub Cabal et al. CESNET

Programmable Software Switches. Lecture 11, Computer Networks (198:552)

SmartNIC Programming Models

Martin Dubois, ing. Contents

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

INSIGHTS. FPGA - Beyond Market Data. Financial Markets

Software Datapath Acceleration for Stateless Packet Processing

Persistent Memory. High Speed and Low Latency. White Paper M-WP006

Virtio/vhost status update

Accelerating Contrail vrouter

Ruler: High-Speed Packet Matching and Rewriting on Network Processors

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

An Introduction to the QorIQ Data Path Acceleration Architecture (DPAA) AN129

Low-Latency Datacenters. John Ousterhout Platform Lab Retreat May 29, 2015

1G bit TCP Offload Engine SOC IP

Spark Over RDMA: Accelerate Big Data SC Asia 2018 Ido Shamay Mellanox Technologies

On the cost of tunnel endpoint processing in overlay virtual networks

Kernel Bypass. Sujay Jayakar (dsj36) 11/17/2016

Using (Suricata over) PF_RING for NIC-Independent Acceleration

LOW LATENCY DATA DISTRIBUTION IN CAPITAL MARKETS: GETTING IT RIGHT

Netchannel 2: Optimizing Network Performance

Rapid Platform Deployment: Allows clients to concentrate their efforts on application software.

10 G bit TCP Offload Engine + PCIe/DMA SOC IP

Product Overview. Programmable Network Cards Network Appliances FPGA IP Cores

5051 & 5052 PCIe Card Overview

100% PACKET CAPTURE. Intelligent FPGA-based Host CPU Offload NIC s & Scalable Platforms. Up to 200Gbps

Towards a Software Defined Data Plane for Datacenters

Total Cost of Ownership Analysis for a Wireless Access Gateway

Baseband Device Drivers. Release rc1

Network Design Considerations for Grid Computing

Designing Next Generation FS for NVMe and NVMe-oF

Improving DPDK Performance

Building a Platform Optimized for the Network Edge

Chapter 4. Routers with Tiny Buffers: Experiments. 4.1 Testbed experiments Setup

CSE398: Network Systems Design

Programmable NICs. Lecture 14, Computer Networks (198:552)

Meltdown and Spectre Interconnect Performance Evaluation Jan Mellanox Technologies

Fusion Engine Next generation storage engine for Flash- SSD and 3D XPoint storage system

with Sniffer10G of Network Adapters The Myricom ARC Series DATASHEET

An Intelligent NIC Design Xin Song

A-GEAR 10Gigabit Ethernet Server Adapter X520 2xSFP+

Transcription:

Gateware Defined Networking (GDN) for Ultra Low Latency Trading and Compliance STAC Summit: Panel: FPGA for trading today: December 2015 John W. Lockwood, PhD, CEO Algo-Logic Systems, Inc. JWLockwd@algo-logic.com (408) 707-3746 2015 Algo-Logic Systems Inc., All rights reserved. STAC FPGA Panel http://algo-logic.com

GDN Powers Algo-Logic IP Cores, Pre-built FPGA Applications, and Systems GDN Gateware Defined Networking Accelerated Server IP CORE FPGA Gateware HDD SSD NIC+FPGA CPU Cores CPU 10G 40G 100 GE Low Latency MAC,TCP, Protocol Parsers Order Book cores Pre-Programmed apps in multiple FPGA vendor devices Pre-Programmed apps in multiple FPGA Cards Integrated Switch Solutions Integrated Server Systems Data Center Deployments to Co-Location Facility 2015 Algo-Logic Systems Inc., All rights reserved. STAC FPGA Panel 2

Algo-Logic s Family of Accelerated Finance Applications Tick-to-Trade System Low Latency Library Full Order Book Low Latency TCP 76 ns MAC to Application 10GE PHY/MAC 89 ns Round-trip latency Market Data Filter Protocol Parsers All major exchanges Accelerated Server FPGA HDD SSD CPU Cores CPU 10G 40G 100 GE Algorithms in Logic: All apps run in FPGA Not STAC Benchmarks 2015 Algo-Logic Systems Inc., All rights reserved. STAC FPGA Panel 3

Algo-Logic s Tick-to-Trade System: Full Offload to FPGA System Software Your control/gui interface(s) running on your server Your Unique Design Requirements In-House integration & API customizations GDN* Design Customization Areas Top of Order- Book Info Orders, Symbols, Trigger criteria etc. Heartbeats, echo info, stats & status Execution Reports & logs * Gateware Defined Networking (GDN) FPGA Card FPGA interfaces + API customizations 10GE multicast data Algo- Logic ULL PHY+ MAC IP Customizations Algo-Logic Market Data Filter Module UDP Parser IP Customizations Algo-Logic Protocol Parsing Libraries IP Customizations Algo-Logic Full Order- Book Processing Filtered Trade events & Top-of- Book data Your Trading Logic, Algorithms, Order Criteria & triggers Risk Checks module Inject market Orders IP Customizations Algo-Logic 76-nanosecond TCP/UDP Endpoint Algo- Logic ULL PHY+ MAC Orders to Exchange(s) Execution Reports from Exchange(s) (at your co-location site) Algo-Logic Systems GDN ULL Trading Solutions are sub-microsecond 2015 Algo-Logic Systems Inc., All rights reserved. STAC FPGA Panel 4 Traditional Legacy Software Trading Systems at 20 to 50 microseconds

Algo-Logic s Key Value Store (KVS) Examples: Directory Key Value Company Phone # Algo-Logic (408) 707-3740 Key/Value Store (KVS) Simplifies implementation of large-scale distributed computation algorithms Data Center Servers exchanges data over standard Ethernet Forwarding Tables Data Deduplication IP Address Interface : MAC Address 204.2.34.5 Eth6 : 02:33:29:F2:AB:CC Content Hash Storage Block ID XYZ 948830038411 Order ID Symbol, Side, Price Stock Trading ATY11217911101 AAPL, B, 126.75 Virtex Edge List Graph Search v140 v201, v206, v225 Challenges Operating System delays packets and limits throughput Per-core processing inefficient at high-speed packet processing Solutions Bypass kernel bypass with DPDK Offload of packet processing with FPGA 2015 Algo-Logic Systems Inc., All rights reserved. STAC FPGA Panel 5

Implementation of KVS with Socket I/O, Kernel Bypass, and GDN in FPGA Benchmark same application Key/Value Store (KVS) Running on the same PC Intel i7-4770k CPU, 82598 NIC, and Altera Stratix V A7 FPGA With three different implementations Socket I/O, Kernel bypass with DPDK, FPGA OCSM LEGEND Data Transfer = 10g Ethernet Traditional Socket I/O Intel 10G NIC Kernel Driver Message Process Kernel Bypass with DPDK Algo-Logic software on Intel 82598 10GE NIC and Core i7-4770k CPU Receive Queue Dequeue GDN in FPGA Enqueue OCSM LEGEND Control Handoff = 10g Ethernet Intel 82598 DPDK Supported NIC Store Dequeue Message Buffer Message Process Note: Message read once into CPU Cache Response Generation OCSM 10g Ethernet Parser Modifier REQUEST GENERATOR OCSM Header Identifier RESPONSE GENERATOR OCSM Header Reconstruct Key/Value Extractor Key/Value Search Response Decoder Exact Match Search Engine (EMSE) Data Transfer = Transmit Queue Enqueue Algo-Logic gateware Algo-Logic software on Nallatech P385 with on Intel 82598 10GE NIC Altera Stratix V A7 FPGA and Core i7-4770k CPU 2015 Algo-Logic Systems Inc., All rights reserved. STAC FPGA Panel 6

Measured Latency, Throughput, and Power Results FPGA PHY MAC GDN-Traffic Classifier Associative Rule-Match CAM Key Extractor Parser Flow or ACL Target Queues MACs PHYs All Datapaths Summary Latency (µseconds) Tested Throughput (CSMs/sec) Power (µjoules/csm) KVS in Software Sockets 41.54 4.0 11 KVS in DPDK KVS in FPGA Rack of Search Servers Additional KVS Servers Provision Controller UPS Power 10G 40G DPDK 6.434 16 6.6 RTL 0.467 15 0.52 All Datapaths Summary Latency (µseconds) Maximum Throughput (CSMs/sec) Power (µjoules/csm) GDN vs. Sockets 88x less 13x 21x less GDN vs. DPDK 14x less 3.2x 13x less Advance Results: http://algo-logic.com/kvs-whitepaper Not STAC Benchmarks 2015 Algo-Logic Systems Inc., All rights reserved. STAC FPGA Panel 7

Tighter Spread = Less Jitter Peercentage of Observed s Percentage of s Observed [%] KVS Latency in FPGA, DPDK, and Sockets 50.00% 45.00% 40.00% 35.00% Latency Comparison 100k packets, 1 OCSM per packet, 1k pps Altera Stratix V RTL Average: 0.467µs KVS in FPGA: Best Latency, No Jitter 0.70% 0.60% KVS in Software Worst Latency Worst Jitter Socket Implementation Latency Distribution with One OCSM/ Intel i7 Average: 41.54µs 30.00% 25.00% KVS in DPDK: Lowers Latency, Some Jitter 0.50% 0.40% 0.30% 0.20% Sockets RTL Sockets DPDK 20.00% 0.10% 15.00% 10.00% 5.00% DPDK Average: 6.29µs 0.00% 38.0 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 47.0 48.0 Latency Distribution [µs] Sockets Average: 41.40µs Advance Results: http://algologic.com/kvswhitepaper 0.00% 0 5.5 11 16.5 22 27.5 33 38.5 44 Latency Distribution [µs] Lowest Lower Latency = Faster Response Not STAC Benchmarks 2015 Algo-Logic Systems Inc., All rights reserved. STAC FPGA Panel 8

Key Results: Gateware Defined Networking Gateware Defined Networking (GDN) Lowers Latency 7x to 45x over optimized DPDK and traditional Linux networking software Increases Throughput 3x to 13x improvement in Throughput / Server Reduces Power 13x to 21x less Power / Server Advance Results: http://algo-logic.com/kvs-whitepaper Not STAC Benchmarks 2015 Algo-Logic Systems Inc., All rights reserved. STAC FPGA Panel 9

Algo-Logic is Partnered to Provide Solutions on All Major Platforms 2015 Algo-Logic Systems Inc., All rights reserved. STAC FPGA Panel 10

Thank You! Algo-Logic Systems, Inc. Phone: (408) 707-3740 Web: http://algo-logic.com Email: info@algo-logic.com Corporate Headquarters 2255-D Martin Ave Santa Clara, CA 95050 2015 Algo-Logic Systems Inc., All rights reserved. STAC FPGA Panel 11