Implementing Ultra Low Latency Data Center Services with Programmable Logic

Implementing Ultra Low Latency Data Center Services with Programmable Logic John W. Lockwood, CEO: Algo-Logic Systems, Inc. http://algo-logic.com Solutions@Algo-Logic.com (408) 707-3740 2255-D Martin Ave., Santa Clara, CA 95050

INCREASING SILICON DIE AREA Why the Move to Programmable Logic? There are large challenges in scaling the performance of software now. The question is: What s next? We took a bet on programmable hardware. - Doug Burger, Microsoft Driving Metrics in the Data Center Latency: Reduce delay Avoid jitter Throughput Processing packets at line rate Handle 10G, 25G, 40G, and 100G Power: Driving cost of OpEx TIME Micro 1971 Sequential Processing CPU 1985 Optimized Sequential CPU CPU 2006 Multicore GPU CPU GPU CPU CPU FPGA Field Programmable Gate Array (FPGA) logic moves into the CPU Microsoft accelerates BING search with FPGA Intel acquires Altera CPU 2013 Add Vector Processing = SILICONE DIE AREA 2016-2019 Add Logic Processing 2015 Algo-Logic Systems Inc., All rights reserved. 2

Servers Accelerated with FPGA Gateware FPGA Augments Existing Servers Can run on an expansion card (same size as a GPU) Or may be integrated into the CPU socket GDN Applications run on FPGA Implements low-latency, low-power, high-throughput data processing Accelerated Server RAM RAM RAM RAM HDD SSD RAM FPGA RAM CPU Cores CPU 10G 40G 100 GE Rack Server 2015 Algo-Logic Systems Inc., All rights reserved. 3

Example of Low Latency Service: Key/Value Store Examples: Directory Forwarding Tables Data Deduplication Key Company Phone # IP Address Content Hash Value Algo-Logic (408) 707-3740 Interface : MAC Address 204.2.34.5 Eth6 : 02:33:29:F2:AB:CC Storage Block ID XYZ 948830038411 Order ID Symbol, Side, Price Stock Trading ATY11217911101 AAPL, B, 126.75 Virtex Edge List Graph Search v140 v201, v206, v225 Key/Value Store (KVS) Simplifies implementation of large-scale distributed computation algorithms Data Center Servers exchanges data over standard Ethernet Challenges Operating System delays packets and limits throughput Per-core processing inefficient at high-speed packet processing Solutions Bypass kernel bypass with DPDK Offload of packet processing with FPGA 2015 Algo-Logic Systems Inc., All rights reserved. 4

Mobile Application Servers Need Fast Key/Value Stores Scalable backend services to share data Sensors { location, bio, movement,.. } Social { status, dating, updates, multi-player games.. } Media { video/security, audio/music,.. } Communication { network status, handoff, short messages.. } Database { users, providers, payments, travel, authentication,.. } Must be able to scale as the number of users grows Scale up to provide the best latency, throughput, and power Scale out to increase storage capacity, throughput, and redundancy Example: Mobile location sharing 2015 Algo-Logic Systems Inc., All rights reserved. 5

Case Study: Implementing Uber with KVS Uber in 2014 162,037 drivers in the US completed 4 or more trips New drivers doubled every 6 months for past 2 years Number of Uber Users = 8M Number of Cities = 290 Total trips = 140M Daily Trips = 1M Analysis and Assumptions Assuming 25% of the drivers are active 25% of 160k drivers = 40k active cars (<48k) Drivers update position once per second = 40k IOW Implementation Uber with on an Algo-Logic KVS card Washington Post, Dec. 2014 2015 Algo-Logic Systems Inc., All rights reserved. 6

Algo-Logic s KVS solution for Mobile Applications KVS on FPGA KVS on FPGA 2U KVS System on FPGA 2U KVS System on FPGA KVS on FPGA 2U KVS System on FPGA 2U System 2U System App Servers And SCALE-OUT quickly to increase storage capacity and throughput 2015 Algo-Logic Systems Inc., All rights reserved. 8

Linux Software Socket KVS OCSM 10g Ethernet Intel 10G NIC Kernel Driver Message Process LEGEND Data Transfer = Algo-Logic software on Intel 82598 10GE NIC and Core i7-4770k CPU 2015 Algo-Logic Systems Inc., All rights reserved. 10

Algo-Logic KVS with DPDK: Bypass the Kernel Receive Queue Dequeue Enqueue OCSM LEGEND Control Handoff = 10g Ethernet Intel 82598 DPDK Supported NIC Store Dequeue Message Buffer Message Process Note: Message read once into CPU Cache Response Generation Data Transfer = Transmit Queue Enqueue Algo-Logic software on Intel 82598 10GE NIC and Core i7-4770k CPU 2015 Algo-Logic Systems Inc., All rights reserved. 11

Algo-Logic GDN-Search: ALGO-LOGIC S KVS FPGA SOLUTION in FPGA FOR KEY/VALUE SEARCH OPTIMISED FOR LARGE NUMBER OF KEY/VALUE SEARCH ENTRIES OCSM 10g Ethernet FPGA MAC0 RX MAC0 TX Parser Reconstruct Handler REQUEST GENERATOR 32B OCSM Header Identifier RESPONSE GENERATOR 32B OCSM Header Reconstruct Key-Value Extractor Key-Value Response Decoder Key-Value Search Key: 96b, Value: 96b EMSE2 On-Chip Memory OCSM 10g Ethernet MAC1 RX MAC1 TX Parser Reconstruct Handler REQUEST GENERATOR 64B OCSM Header Identifier RESPONSE GENERATOR 64B OCSM Header Reconstruct Key-Value Extractor Key-Value Response Decoder Key-Value Search Key: 96b, Value: 352b EMSE2 Off-Chip Memory Algo-Logic gateware on Nallatech P385 with Altera Stratix V A7 FPGA 2015 Algo-Logic Systems Inc., All rights reserved. 12

Trends for Adding Storage around FPGAs Six banks of memory controllers QDR SRAM RLDRAM DDR3, DDR4 64 lanes of SERDES SATA disk and Flash Serial memories Hybrid Memory Cube (HMC) Mosys Bandwidth Engine (BE2, BE3) New Memories 3D Xpoint with DDR4 Interface Potential for Terabytes of Memory on each card 2015 Algo-Logic Systems Inc., All rights reserved. 13

Implementation of KVS with Socket I/O, DPDK, and FPGA Benchmark same application Key/Value Store (KVS) Running on the same PC Intel i7-4770k CPU, 82598 NIC, and Altera Stratix V A7 FPGA With three different implementations Socket I/O, DPDK, FPGA OCSM LEGEND Data Transfer = 10g Ethernet Socket I/O Intel 10G NIC Kernel Driver Message Process DPDK Algo-Logic software on Intel 82598 10GE NIC and Core i7-4770k CPU Receive Queue Dequeue FPGA Enqueue OCSM LEGEND Control Handoff = 10g Ethernet Intel 82598 DPDK Supported NIC Store Dequeue Message Buffer Message Process Note: Message read once into CPU Cache Response Generation OCSM 10g Ethernet Parser Modifier REQUEST GENERATOR OCSM Header Identifier RESPONSE GENERATOR OCSM Header Reconstruct Key/Value Extractor Key/Value Search Response Decoder Exact Match Search Engine (EMSE) Data Transfer = Transmit Queue Enqueue Algo-Logic gateware Algo-Logic software on Nallatech P385 with on Intel 82598 10GE NIC Altera Stratix V A7 FPGA and Core i7-4770k CPU 2015 Algo-Logic Systems Inc., All rights reserved. 14

KVS Hardware in Data Center Rack FPGA GDN-Classify Associative Rule-Match CAM PHY MAC Key Extractor Parser Flow or ACL Target Queues MAC s PHYs KVS in Software KVS in DPDK KVS in FPGA Rack of Search Servers 10G 40G Additional KVS Servers Provision Controller UPS Power 2015 Algo-Logic Systems Public 2015 Algo-Logic Systems Inc., All rights reserved. 15

Load Testing off the KVS Implementations GEN: Traffic Generator KVS Implementations Mellanox 10G NIC Intel 82598 10G NIC Kernel Driver Process Message Traffic Generator Intel 82599 10G NIC Intel 82598 10G NIC Intel DPDK EAL Process Message Mellanox Nallatech Process 10G NIC P385 10G Message 2015 Algo-Logic Systems Inc., All rights reserved. 16

Latency Measurement with GDN-Classify 10G Rx T_out- T_in 40G Tx KVS Server Loopback GDN Switch 10G Tx Time Stamp 40G Rx Search Latency (Different for Software, DPDK, and GDN-Search) + Switch Latency (constant for GDN-Switch) Round trip latency of GDN Switch is deterministic (constant) Round trip latency of KVS Sever = Total Round trip Time Round trip time with 10G loopback on GDN Switch 2015 Algo-Logic Systems Inc., All rights reserved. 20

Tighter Spread = Less Jitter Peercentage of Observed s Percentage of s Observed [%] KVS Latency in FPGA, DPDK, and Sockets 50.00% 45.00% 40.00% 35.00% Latency Comparison 100k packets, 1 OCSM per packet, 1k pps Altera Stratix V RTL Average: 0.467µs KVS in FPGA: Best Latency, No Jitter 0.70% 0.60% KVS in Software Worst Latency Worst Jitter Socket Implementation Latency Distribution with One OCSM/ Intel i7 Average: 41.54µs 30.00% 25.00% KVS in DPDK: Lowers Latency, Some Jitter 0.50% 0.40% 0.30% 0.20% Sockets RTL Sockets DPDK 20.00% 0.10% 15.00% DPDK Average: 6.29µs 0.00% 38.0 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 47.0 48.0 Latency Distribution [µs] 10.00% 5.00% Sockets Average: 41.40µs 0.00% 0 5.5 11 16.5 22 27.5 33 38.5 44 Latency Distribution [µs] Lowest Lower Latency = Faster Response 2015 Algo-Logic Systems Inc., All rights reserved. 21

Measured Latency, Throughput, and Power Results FPGA PHY MAC GDN-Classify Associative Rule-Match CAM Key Extractor Parser Flow or ACL Target Queues MACs PHYs All Datapaths Summary Latency (µseconds) Tested Throughput (CSMs/sec) Power (µjoules/csm) KVS in Software Sockets 41.54 4.0 11 KVS in DPDK KVS in FPGA Rack of Search Servers Additional KVS Servers Provision Controller 10G 40G DPDK 6.434 16 6.6 RTL 0.467 15 0.52 All Datapaths Summary Latency (µseconds) Maximum Throughput (CSMs/sec) Power (µjoules/csm) GDN vs. Sockets 88x less 13x 21x less GDN vs. DPDK 14x less 3.2x 13x less UPS Power 2015 Algo-Logic Systems Inc., All rights reserved. 22

Conclusions: Programmable Hardware in the Data Center Lowers Latency 88x faster than Linux networking sockets 14x faster than optimized DPDK (kernel bypass) Increases Throughput (IOPs) 3x to 13x improvement in throughput Lowers Capital Expenditures (CapEx) Reduces Power 13x to 21x reduction in power Reduces Operating Expenditures (OpEx) 2015 Algo-Logic Systems Inc., All rights reserved. 23