Implementing Ultra Low Latency Data Center Services with Programmable Logic

Similar documents
Gateware Defined Networking (GDN) for Ultra Low Latency Trading and Compliance

Implementing Ultra Low Latency Data Center Services with Programmable Logic

Hardware NVMe implementation on cache and storage systems

High Performance Packet Processing with FlexNIC

INT 1011 TCP Offload Engine (Full Offload)

A Low Latency Solution Stack for High Frequency Trading. High-Frequency Trading. Solution. White Paper

Cisco HyperFlex HX220c M4 and HX220c M4 All Flash Nodes

Re-Architecting Cloud Storage with Intel 3D XPoint Technology and Intel 3D NAND SSDs

Accelerating Ceph with Flash and High Speed Networks

Experience with the NetFPGA Program

Netronome NFP: Theory of Operation

How to Network Flash Storage Efficiently at Hyperscale. Flash Memory Summit 2017 Santa Clara, CA 1

Agilio CX 2x40GbE with OVS-TC

INT G bit TCP Offload Engine SOC

InfiniBand Networked Flash Storage

Extremely Fast Distributed Storage for Cloud Service Providers

Pactron FPGA Accelerated Computing Solutions

Catapult: A Reconfigurable Fabric for Petaflop Computing in the Cloud

INT-1010 TCP Offload Engine

An Intelligent NIC Design Xin Song

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research

Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet

CS 856 Latency in Communication Systems

OCP Engineering Workshop - Telco

Getting Real Performance from a Virtualized CCAP

10G bit UDP Offload Engine (UOE) MAC+ PCIe SOC IP

Be Fast, Cheap and in Control with SwitchKV. Xiaozhou Li

Higher Level Programming Abstractions for FPGAs using OpenCL

A New Era of Hardware Microservices in the Cloud. Doug Burger Distinguished Engineer, Microsoft UW Cloud Workshop March 31, 2017

Persistent Memory. High Speed and Low Latency. White Paper M-WP006

NVMe : Redefining the Hardware/Software Architecture

Efficient Data Movement in Modern SoC Designs Why It Matters

ARISTA: Improving Application Performance While Reducing Complexity

6.9. Communicating to the Outside World: Cluster Networking

GPUs and Emerging Architectures

The Missing Piece of Virtualization. I/O Virtualization on 10 Gb Ethernet For Virtualized Data Centers

Cisco HyperFlex HX220c Edge M5

Programmable Server Adapters: Key Ingredients for Success

VM Migration Acceleration over 40GigE Meet SLA & Maximize ROI

Cisco HyperFlex HX220c M4 and HX220c M4 All Flash Nodes

NetFPGA Hardware Architecture

Cisco HyperFlex HX220c M4 Node

PVPP: A Programmable Vector Packet Processor. Sean Choi, Xiang Long, Muhammad Shahbaz, Skip Booth, Andy Keep, John Marshall, Changhoon Kim

Improving Ceph Performance while Reducing Costs

Evaluation Report: Improving SQL Server Database Performance with Dot Hill AssuredSAN 4824 Flash Upgrades

FMS18 Invited Session 101-B1 Hardware Acceleration Techniques for NVMe-over-Fabric

Using FPGAs to accelerate NVMe-oF based Storage Networks

High Performance Memory Opportunities in 2.5D Network Flow Processors

Next Generation Architecture for NVM Express SSD

Fusion Engine Next generation storage engine for Flash- SSD and 3D XPoint storage system

Accelerated Programmable Services. FPGA and GPU augmented infrastructure.

Learning with Purpose

Virtual Switch Acceleration with OVS-TC

HADP Talk BlueDBM: An appliance for Big Data Analytics

Flash vs. Disk Storage: Testing Workloads is Key

Ruler: High-Speed Packet Matching and Rewriting on Network Processors

Product Overview. Programmable Network Cards Network Appliances FPGA IP Cores

PUSHING THE LIMITS, A PERSPECTIVE ON ROUTER ARCHITECTURE CHALLENGES

Aerospike Scales with Google Cloud Platform

Accelerating Real-Time Big Data. Breaking the limitations of captive NVMe storage

Maximum Performance. How to get it and how to avoid pitfalls. Christoph Lameter, PhD

Flash Controller Solutions in Programmable Technology

Netronome 25GbE SmartNICs with Open vswitch Hardware Offload Drive Unmatched Cloud and Data Center Infrastructure Performance

P51: High Performance Networking

Dell PowerEdge R730xd Servers with Samsung SM1715 NVMe Drives Powers the Aerospike Fraud Prevention Benchmark

Toward Seamless Integration of RAID and Flash SSD

Introduction to Infiniband

Solid Access Technologies, LLC

NVMe Over Fabrics (NVMe-oF)

PacketShader: A GPU-Accelerated Software Router

Meltdown and Spectre Interconnect Performance Evaluation Jan Mellanox Technologies

FlexNIC: Rethinking Network DMA

FPGA Implementation of Erasure Codes in NVMe based JBOFs

Storage Protocol Offload for Virtualized Environments Session 301-F

Enyx soft-hardware design services and development framework for FPGA & SoC

N V M e o v e r F a b r i c s -

Data Path acceleration techniques in a NFV world

NFS/RDMA over 40Gbps iwarp Wael Noureddine Chelsio Communications

An Introduction to the QorIQ Data Path Acceleration Architecture (DPAA) AN129

The Road to ExaScale. Advances in High-Performance Interconnect Infrastructure. September 2011

Development of Communication LSI for 10G-EPON

Today s Data Centers. How can we improve efficiencies?

Total Cost of Ownership Analysis for a Wireless Access Gateway

Evaluation of the Chelsio T580-CR iscsi Offload adapter

Cloud Acceleration with FPGA s. Mike Strickland, Director, Computer & Storage BU, Altera

LegUp: Accelerating Memcached on Cloud FPGAs

Industry Collaboration and Innovation

Network Design Considerations for Grid Computing

Hardware Evolution in Data Centers

OpenStack Networking: Where to Next?

FPGAs and Networking

1G Bit TCP+UDP Offload Engine (TOE+UOE) Hardware IP Core

10 G Bit TCP+UDP Offload Engine (TOE+UOE) Hardware IP Core

HPE SimpliVity. The new powerhouse in hyperconvergence. Boštjan Dolinar HPE. Maribor Lancom

Speeding up Linux TCP/IP with a Fast Packet I/O Framework

CS 261 Fall Mike Lam, Professor. Memory

MIPI : Advanced Driver Assistance System

Designing Next Generation FS for NVMe and NVMe-oF

Five Key Steps to High-Speed NAND Flash Performance and Reliability

Flash Trends: Challenges and Future

Transcription:

Implementing Ultra Low Latency Data Center Services with Programmable Logic John W. Lockwood, CEO: Algo-Logic Systems, Inc. http://algo-logic.com Solutions@Algo-Logic.com (408) 707-3740 2255-D Martin Ave., Santa Clara, CA 95050

INCREASING SILICON DIE AREA Why the Move to Programmable Logic? There are large challenges in scaling the performance of software now. The question is: What s next? We took a bet on programmable hardware. - Doug Burger, Microsoft Driving Metrics in the Data Center Latency: Reduce delay Avoid jitter Throughput Processing packets at line rate Handle 10G, 25G, 40G, and 100G Power: Driving cost of OpEx TIME Micro 1971 Sequential Processing CPU 1985 Optimized Sequential CPU CPU 2006 Multicore GPU CPU GPU CPU CPU FPGA Field Programmable Gate Array (FPGA) logic moves into the CPU Microsoft accelerates BING search with FPGA Intel acquires Altera CPU 2013 Add Vector Processing = SILICONE DIE AREA 2016-2019 Add Logic Processing 2015 Algo-Logic Systems Inc., All rights reserved. 2

Servers Accelerated with FPGA Gateware FPGA Augments Existing Servers Can run on an expansion card (same size as a GPU) Or may be integrated into the CPU socket GDN Applications run on FPGA Implements low-latency, low-power, high-throughput data processing Accelerated Server RAM RAM RAM RAM HDD SSD RAM FPGA RAM CPU Cores CPU 10G 40G 100 GE Rack Server 2015 Algo-Logic Systems Inc., All rights reserved. 3

Example of Low Latency Service: Key/Value Store Examples: Directory Forwarding Tables Data Deduplication Key Company Phone # IP Address Content Hash Value Algo-Logic (408) 707-3740 Interface : MAC Address 204.2.34.5 Eth6 : 02:33:29:F2:AB:CC Storage Block ID XYZ 948830038411 Order ID Symbol, Side, Price Stock Trading ATY11217911101 AAPL, B, 126.75 Virtex Edge List Graph Search v140 v201, v206, v225 Key/Value Store (KVS) Simplifies implementation of large-scale distributed computation algorithms Data Center Servers exchanges data over standard Ethernet Challenges Operating System delays packets and limits throughput Per-core processing inefficient at high-speed packet processing Solutions Bypass kernel bypass with DPDK Offload of packet processing with FPGA 2015 Algo-Logic Systems Inc., All rights reserved. 4

Mobile Application Servers Need Fast Key/Value Stores Scalable backend services to share data Sensors { location, bio, movement,.. } Social { status, dating, updates, multi-player games.. } Media { video/security, audio/music,.. } Communication { network status, handoff, short messages.. } Database { users, providers, payments, travel, authentication,.. } Must be able to scale as the number of users grows Scale up to provide the best latency, throughput, and power Scale out to increase storage capacity, throughput, and redundancy Example: Mobile location sharing 2015 Algo-Logic Systems Inc., All rights reserved. 5

Case Study: Implementing Uber with KVS Uber in 2014 162,037 drivers in the US completed 4 or more trips New drivers doubled every 6 months for past 2 years Number of Uber Users = 8M Number of Cities = 290 Total trips = 140M Daily Trips = 1M Analysis and Assumptions Assuming 25% of the drivers are active 25% of 160k drivers = 40k active cars (<48k) Drivers update position once per second = 40k IOW Implementation Uber with on an Algo-Logic KVS card Washington Post, Dec. 2014 2015 Algo-Logic Systems Inc., All rights reserved. 6

Algo-Logic s KVS solution for Mobile Applications KVS on FPGA Application Server Scale UP for FASTER access to shared data 2015 Algo-Logic Systems Inc., All rights reserved. 7

Algo-Logic s KVS solution for Mobile Applications KVS on FPGA KVS on FPGA 2U KVS System on FPGA 2U KVS System on FPGA KVS on FPGA 2U KVS System on FPGA 2U System 2U System App Servers And SCALE-OUT quickly to increase storage capacity and throughput 2015 Algo-Logic Systems Inc., All rights reserved. 8

Provisioning and Measurements with GDN-Switch 2015 Algo-Logic Systems Inc., All rights reserved. 9

Linux Software Socket KVS OCSM 10g Ethernet Intel 10G NIC Kernel Driver Message Process LEGEND Data Transfer = Algo-Logic software on Intel 82598 10GE NIC and Core i7-4770k CPU 2015 Algo-Logic Systems Inc., All rights reserved. 10

Algo-Logic KVS with DPDK: Bypass the Kernel Receive Queue Dequeue Enqueue OCSM LEGEND Control Handoff = 10g Ethernet Intel 82598 DPDK Supported NIC Store Dequeue Message Buffer Message Process Note: Message read once into CPU Cache Response Generation Data Transfer = Transmit Queue Enqueue Algo-Logic software on Intel 82598 10GE NIC and Core i7-4770k CPU 2015 Algo-Logic Systems Inc., All rights reserved. 11

Algo-Logic GDN-Search: ALGO-LOGIC S KVS FPGA SOLUTION in FPGA FOR KEY/VALUE SEARCH OPTIMISED FOR LARGE NUMBER OF KEY/VALUE SEARCH ENTRIES OCSM 10g Ethernet FPGA MAC0 RX MAC0 TX Parser Reconstruct Handler REQUEST GENERATOR 32B OCSM Header Identifier RESPONSE GENERATOR 32B OCSM Header Reconstruct Key-Value Extractor Key-Value Response Decoder Key-Value Search Key: 96b, Value: 96b EMSE2 On-Chip Memory OCSM 10g Ethernet MAC1 RX MAC1 TX Parser Reconstruct Handler REQUEST GENERATOR 64B OCSM Header Identifier RESPONSE GENERATOR 64B OCSM Header Reconstruct Key-Value Extractor Key-Value Response Decoder Key-Value Search Key: 96b, Value: 352b EMSE2 Off-Chip Memory Algo-Logic gateware on Nallatech P385 with Altera Stratix V A7 FPGA 2015 Algo-Logic Systems Inc., All rights reserved. 12

Trends for Adding Storage around FPGAs Six banks of memory controllers QDR SRAM RLDRAM DDR3, DDR4 64 lanes of SERDES SATA disk and Flash Serial memories Hybrid Memory Cube (HMC) Mosys Bandwidth Engine (BE2, BE3) New Memories 3D Xpoint with DDR4 Interface Potential for Terabytes of Memory on each card 2015 Algo-Logic Systems Inc., All rights reserved. 13

Implementation of KVS with Socket I/O, DPDK, and FPGA Benchmark same application Key/Value Store (KVS) Running on the same PC Intel i7-4770k CPU, 82598 NIC, and Altera Stratix V A7 FPGA With three different implementations Socket I/O, DPDK, FPGA OCSM LEGEND Data Transfer = 10g Ethernet Socket I/O Intel 10G NIC Kernel Driver Message Process DPDK Algo-Logic software on Intel 82598 10GE NIC and Core i7-4770k CPU Receive Queue Dequeue FPGA Enqueue OCSM LEGEND Control Handoff = 10g Ethernet Intel 82598 DPDK Supported NIC Store Dequeue Message Buffer Message Process Note: Message read once into CPU Cache Response Generation OCSM 10g Ethernet Parser Modifier REQUEST GENERATOR OCSM Header Identifier RESPONSE GENERATOR OCSM Header Reconstruct Key/Value Extractor Key/Value Search Response Decoder Exact Match Search Engine (EMSE) Data Transfer = Transmit Queue Enqueue Algo-Logic gateware Algo-Logic software on Nallatech P385 with on Intel 82598 10GE NIC Altera Stratix V A7 FPGA and Core i7-4770k CPU 2015 Algo-Logic Systems Inc., All rights reserved. 14

KVS Hardware in Data Center Rack FPGA GDN-Classify Associative Rule-Match CAM PHY MAC Key Extractor Parser Flow or ACL Target Queues MAC s PHYs KVS in Software KVS in DPDK KVS in FPGA Rack of Search Servers 10G 40G Additional KVS Servers Provision Controller UPS Power 2015 Algo-Logic Systems Public 2015 Algo-Logic Systems Inc., All rights reserved. 15

Load Testing off the KVS Implementations GEN: Traffic Generator KVS Implementations Mellanox 10G NIC Intel 82598 10G NIC Kernel Driver Process Message Traffic Generator Intel 82599 10G NIC Intel 82598 10G NIC Intel DPDK EAL Process Message Mellanox Nallatech Process 10G NIC P385 10G Message 2015 Algo-Logic Systems Inc., All rights reserved. 16

Full-Rate Processing (100% of TX Load) Stratix V FPGA with EMSE-2 Drops no s DPDK Drops s Software Drops s 2015 Algo-Logic Systems Inc., All rights reserved. 17

Sockets Power Consumption Profile (10M s with 40 CSM Messages/) 2015 Algo-Logic Systems Inc., All rights reserved. 18

DPDK Power Consumption Profile (10M s with 40 CSM Messages/) 2015 Algo-Logic Systems Inc., All rights reserved. 19

Latency Measurement with GDN-Classify 10G Rx T_out- T_in 40G Tx KVS Server Loopback GDN Switch 10G Tx Time Stamp 40G Rx Search Latency (Different for Software, DPDK, and GDN-Search) + Switch Latency (constant for GDN-Switch) Round trip latency of GDN Switch is deterministic (constant) Round trip latency of KVS Sever = Total Round trip Time Round trip time with 10G loopback on GDN Switch 2015 Algo-Logic Systems Inc., All rights reserved. 20

Tighter Spread = Less Jitter Peercentage of Observed s Percentage of s Observed [%] KVS Latency in FPGA, DPDK, and Sockets 50.00% 45.00% 40.00% 35.00% Latency Comparison 100k packets, 1 OCSM per packet, 1k pps Altera Stratix V RTL Average: 0.467µs KVS in FPGA: Best Latency, No Jitter 0.70% 0.60% KVS in Software Worst Latency Worst Jitter Socket Implementation Latency Distribution with One OCSM/ Intel i7 Average: 41.54µs 30.00% 25.00% KVS in DPDK: Lowers Latency, Some Jitter 0.50% 0.40% 0.30% 0.20% Sockets RTL Sockets DPDK 20.00% 0.10% 15.00% DPDK Average: 6.29µs 0.00% 38.0 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 47.0 48.0 Latency Distribution [µs] 10.00% 5.00% Sockets Average: 41.40µs 0.00% 0 5.5 11 16.5 22 27.5 33 38.5 44 Latency Distribution [µs] Lowest Lower Latency = Faster Response 2015 Algo-Logic Systems Inc., All rights reserved. 21

Measured Latency, Throughput, and Power Results FPGA PHY MAC GDN-Classify Associative Rule-Match CAM Key Extractor Parser Flow or ACL Target Queues MACs PHYs All Datapaths Summary Latency (µseconds) Tested Throughput (CSMs/sec) Power (µjoules/csm) KVS in Software Sockets 41.54 4.0 11 KVS in DPDK KVS in FPGA Rack of Search Servers Additional KVS Servers Provision Controller 10G 40G DPDK 6.434 16 6.6 RTL 0.467 15 0.52 All Datapaths Summary Latency (µseconds) Maximum Throughput (CSMs/sec) Power (µjoules/csm) GDN vs. Sockets 88x less 13x 21x less GDN vs. DPDK 14x less 3.2x 13x less UPS Power 2015 Algo-Logic Systems Inc., All rights reserved. 22

Conclusions: Programmable Hardware in the Data Center Lowers Latency 88x faster than Linux networking sockets 14x faster than optimized DPDK (kernel bypass) Increases Throughput (IOPs) 3x to 13x improvement in throughput Lowers Capital Expenditures (CapEx) Reduces Power 13x to 21x reduction in power Reduces Operating Expenditures (OpEx) 2015 Algo-Logic Systems Inc., All rights reserved. 23