Industry Collaboration and Innovation

1 Industry Collaboration and Innovation

2 Open Coherent Accelerator Processor Interface (OpenCAPI™): A New Standard for High Performance Memory, Acceleration and Networks. Jeff Stuecheli, April 10, 2017

3 What is OpenCAPI
- Device attach: memory, accelerators, network, storage, etc.
- Latency: 10s of ns interface overhead
- Bandwidth: 25G+ differential signaling
- Flexibility: one interface scaling from low-latency memory to sophisticated accelerators

4 Use Cases: a truly heterogeneous architecture built upon OpenCAPI

5 Asymmetric Design Philosophy: Motivation
Simplify the accelerator in order to:
1. Remain agnostic to the host ISA
2. Contain coherence complexity in host silicon
3. Achieve higher performance, since logic in host silicon performs better than logic in the accelerator
4. Contain the accelerator in a sandbox to enable fault tolerance and security

6 Importance of Latency
- Server memory latency is a critical TCO factor
- A differential solution must provide roughly the same effective latency as the DDR standards
- POWER8 DMI round-trip latency: 10 ns
- Typical PCIe round-trip latency: ~100s of ns
- Why is DMI latency so low? DMI was designed from the ground up for minimum latency because of load/store requirements
- OpenCAPI key concept: provide DMI-like latency, but with the enhanced command set of CAPI

7 Comparison of Acceleration Paradigms
[Diagrams: in each paradigm a processor chip exchanges data with an accelerator (Acc) over DLx/TLx]
- Memory Transform. Example: basic offload. Examples: Machine Learning, Deep Learning, potentially using OpenCAPI-attached memory
- Egress Transform. Examples: encryption, compression, erasure prior to network or storage
- Ingress Transform. Examples: video analytics, HFT, VPN/IPsec/SSL, Deep Packet Inspection (DPI), Data Plane Accelerator (DPA), video encoding (H.265), etc.
- Needle-in-a-Haystack Engine (the accelerator scans haystack data and returns the needles). Examples: database searches, joins, intersections, merges
- Bi-Directional Transform. Examples: NoSQL such as Neo4J with graph node traversals, etc.

8 Comparison of Memory Paradigms
[Diagrams: a processor chip attached to memory over DLx/TLx]
- Main Memory: DDR4/5. Example: basic DDR attach
- Emerging Memory: SCM
- Tiered Memory: DDR4/5 and SCM

9 Zero-Cycle DDR4/5 Buffer Chip Strawman
[Diagram labels: 25.6 GHz serdes; 8:1 serdes; 1.6 GHz; bypass (activate decode); bypass (data); DDR PHY; 3.2 GHz DDR]

10 Simplified High Performance Accelerator
- Explicit command templates enable extremely simple command decode
- CRC bypass for best latency
- A virtual-address-based cache enables simplified parallel accelerator caching structures built around data structure semantics (rather than the typical von Neumann bottleneck)

11 Host Scalability
- The asymmetric design isolates host coherence throughput from the accelerator (avoiding the scenario where plugging in a card or flashing an FPGA image slows the entire system)
- A ground-up, high-throughput design with explicit memory barriers enables an efficient host implementation (rather than overly strict PCIe ordering)
- OpenCAPI-optimized serdes have a significant power and area advantage over PCIe, which enables higher IO bandwidth

12 Virtual Addressing
- An OpenCAPI device operates in the virtual address spaces of the applications that it supports
  - Eliminates kernel and device driver software overhead
  - Improves accelerator performance
  - Allows the device to operate directly on application memory without kernel-level data copies or pinned pages
  - Simplifies the programming effort to integrate accelerators into applications
- The virtual-to-physical address translation occurs in the host CPU
  - Reduces the design complexity of OpenCAPI-attached devices
  - Makes it easier to ensure interoperability between an OpenCAPI device and multiple CPU architectures
  - Because the OpenCAPI device never has access to a physical address, a defective or malicious device cannot access memory locations belonging to the kernel or to other applications that it is not authorized to access
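The host-side translation described on this slide can be illustrated with a toy model. This is only a sketch under stated assumptions: the class `HostTranslator`, its methods, the 4 KiB page size, and the integer context IDs are all hypothetical and are not taken from the OpenCAPI specification. It shows the one property the slide emphasizes: the device only ever presents virtual addresses, and the host refuses any access that has no mapping in that application's context.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


class HostTranslator:
    """Toy host: owns the page tables and performs every translation."""

    def __init__(self):
        self._tables = {}  # context id -> {virtual page -> physical page}
        self._memory = {}  # physical address -> stored value

    def map_page(self, context_id, vpage, ppage):
        # Done by the host on behalf of an application, never by the device.
        self._tables.setdefault(context_id, {})[vpage] = ppage

    def _translate(self, context_id, vaddr):
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        table = self._tables.get(context_id, {})
        if vpage not in table:
            # No mapping in this context: the access is rejected.
            raise PermissionError(f"context {context_id}: no mapping for {vaddr:#x}")
        return table[vpage] * PAGE_SIZE + offset

    def device_write(self, context_id, vaddr, value):
        # The device supplies only a virtual address; it never sees the result
        # of the translation.
        self._memory[self._translate(context_id, vaddr)] = value

    def device_read(self, context_id, vaddr):
        return self._memory.get(self._translate(context_id, vaddr), 0)


if __name__ == "__main__":
    host = HostTranslator()
    host.map_page(context_id=1, vpage=0x10, ppage=0x99)
    addr = 0x10 * PAGE_SIZE + 5
    host.device_write(1, addr, 42)
    print(host.device_read(1, addr))
    try:
        host.device_read(2, addr)  # a different application context
    except PermissionError as exc:
        print("rejected:", exc)
```

In this model the device-facing calls never return or accept a physical address, which is the sandboxing argument the slide makes.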

13 OpenCAPI Protocol Stack
- The OpenCAPI transaction layer specifies the control and response packets between a host and an endpoint OpenCAPI device
  - The transaction layer on the host is referred to as the TL
  - The transaction layer on the endpoint OpenCAPI device is referred to as the TLx
- On the host side, the transaction layer converts:
  - Host-specific protocol requests into transaction-layer-defined commands
  - TLx commands into host-specific protocol requests. When the host protocol completes, it provides responses to the TLx commands (if required)
  - TLx responses into responses for host-initiated requests
- On the endpoint OpenCAPI device, the transaction layer converts:
  - AFU protocol requests into transaction layer commands
  - TL commands into AFU protocol requests. When the AFU protocol completes, it provides responses to the TL commands (if required)
  - TL responses into responses for AFU-initiated requests
- The full TL specification can be obtained by going to opencapi.org and registering under the technical -> specifications pull-down menu
- The OpenCAPI data link layer supports a 25 Gbps serial data rate per lane, connecting a processor to an FPGA or an ASIC that contains an endpoint accelerator or device
  - The basic configuration supports 8 lanes running at GHz for a 25 GB/s data rate
  - The data link layer implemented on the host is referred to as the DL
  - The data link layer implemented on the endpoint OpenCAPI device is referred to as the DLx
- The full DL specification can be obtained by going to opencapi.org and registering under the technical -> specifications pull-down menu
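The 25 GB/s figure for the basic configuration follows from the per-lane rate and the lane count. The arithmetic can be sketched as below; the function name is hypothetical, and the result is the raw rate in one direction, ignoring any encoding, CRC, or protocol overhead (none of which the slide quantifies).

```python
def raw_link_bandwidth_gbytes(lanes: int, gbits_per_lane: float) -> float:
    """Raw aggregate bandwidth in GB/s for one direction, before any overhead."""
    total_gbits = lanes * gbits_per_lane  # aggregate serial rate in Gbit/s
    return total_gbits / 8                # 8 bits per byte


if __name__ == "__main__":
    # 8 lanes at 25 Gbit/s each, as stated on the slide.
    print(raw_link_bandwidth_gbytes(8, 25.0))  # 25.0
```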

14 DL Flits and TL Frame Format (September IBM + Google Technical Meeting)
[Diagram: 64B control flits and 64B data flits laid out over time]
- Transmission order is from right to left, top to bottom
- The CRC in the DL content covers flits of the same color
- A control flit may be followed by another control flit or by 0 to 8 data flits
- The data descriptor in a TL command/response tells how many data flits follow
- Optimized for low latency (FPGA friendly)
  - Data packet alignment avoids any byte rotation at the receiver
  - CRC alignment enables lowest-latency control packet processing
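The framing rule on this slide (a 64-byte control flit followed by 0 to 8 data flits, with the count carried in the control information) can be sketched as a small parser. The byte layout used here is purely an assumption for illustration: the sketch stores the data-flit count in the first byte of the control flit, whereas the real field positions are defined by the DL and TL specifications and are not shown in this deck. CRC handling is omitted.

```python
FLIT_BYTES = 64      # flit size from the slide
MAX_DATA_FLITS = 8   # a control flit is followed by 0 to 8 data flits


def make_control_flit(data_flit_count: int) -> bytes:
    """Build a toy control flit. Hypothetical layout: byte 0 holds the count."""
    if not 0 <= data_flit_count <= MAX_DATA_FLITS:
        raise ValueError("data flit count must be between 0 and 8")
    return bytes([data_flit_count]) + bytes(FLIT_BYTES - 1)


def split_frames(stream: bytes):
    """Split a byte stream into (control_flit, [data_flits]) frames."""
    if len(stream) % FLIT_BYTES:
        raise ValueError("stream is not a whole number of flits")
    flits = [stream[i:i + FLIT_BYTES] for i in range(0, len(stream), FLIT_BYTES)]
    frames = []
    i = 0
    while i < len(flits):
        control = flits[i]
        count = control[0]  # hypothetical position of the data descriptor
        if count > MAX_DATA_FLITS:
            raise ValueError("control flit announces more than 8 data flits")
        data = flits[i + 1:i + 1 + count]
        if len(data) != count:
            raise ValueError("stream ends before the announced data flits")
        frames.append((control, data))
        i += 1 + count  # the next flit must again be a control flit
    return frames


if __name__ == "__main__":
    stream = (make_control_flit(2) + b"\xaa" * 64 + b"\xbb" * 64
              + make_control_flit(0))
    for control, data in split_frames(stream):
        print(control[0], len(data))
```

Because the receiver learns the data-flit count from the control flit, it never has to inspect a data flit to find the next frame boundary, which is consistent with the low-latency goal stated on the slide.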

15 25 Gbit PHY
- OpenCAPI is agnostic to processor architecture, and as such the electrical interface is not being defined by the OpenCAPI consortium or any of its workgroups
- However, if a partner wishes to connect with IBM's Power9 microprocessor, the electrical interface is defined as follows:
  - Definition is being driven by the 25G workgroup within the OpenPower Foundation
  - Based on the OIF CEI 28G SR specification
  - 25 Gbit/s signaling and protocol built to enable a very low latency interface on the CPU and the attached device
  - Allows for forward-looking media improvements such as 32 Gb/s and 56 Gb/s signaling

16 OpenCAPI Coherence Programming Model
- The OpenCAPI architecture offers advancements for the Host<->Accelerator programming model
- Threads can be local to the host or to the accelerator
- Threads with fast notification
- Atomics
[Diagram: a host process with host memory (shared host-accelerator memory local to the host) alongside an accelerated function with accelerator memory (shared host-accelerator memory local to the accelerator)]
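As a loose software analogy for this model, the sketch below runs a "host" thread and an "accelerator" thread against one shared counter. It is not OpenCAPI code: the function name is hypothetical, a `threading.Lock` stands in for a hardware atomic add, and a `threading.Event` stands in for the accelerator-to-host notification. It only illustrates the shape of the programming model, in which both sides update shared memory and one side signals the other when it is done.

```python
import threading


def run_shared_counter(host_increments: int, accel_increments: int) -> int:
    shared = {"counter": 0}    # stands in for shared host-accelerator memory
    lock = threading.Lock()    # stands in for an atomic add operation
    done = threading.Event()   # stands in for a notification to the host

    def atomic_add(times: int) -> None:
        for _ in range(times):
            with lock:
                shared["counter"] += 1

    def accelerator() -> None:
        atomic_add(accel_increments)
        done.set()             # notify the host that the work is finished

    accel_thread = threading.Thread(target=accelerator)
    accel_thread.start()
    atomic_add(host_increments)  # the host thread works concurrently
    done.wait()                  # the host blocks until it is notified
    accel_thread.join()
    return shared["counter"]


if __name__ == "__main__":
    print(run_shared_counter(1000, 1000))  # 2000
```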

17 FPGA versus ASIC/Structured Array
- The OpenCAPI architecture is truly agnostic to a specific vendor technology
- The TLx and DLx reference RTL is written for the Xilinx FPGA Vivado toolchain, and the statistics in this deck are provided using that flow
  - BRAMs and distributed RAMs are Xilinx-specific constructs
  - Porting to an ASIC vendor or to a structured array technology would be a very minimal exercise
- Discussions have been held about taking the reference RTL and hardening it into a structured array technology
  - Master definition is underway between IBM and Toshiba
- Economy of scale and NRE are considerations that partners need to weigh in deciding whether to take the FPGA, ASIC, or structured array route to market

18 Reference Card Design
- Definition of the FPGA reference card is being driven as part of the 25G workgroup within the OpenPower consortium
- Definition of the cable(s) is also being driven as part of the 25G workgroup within the OpenPower consortium
- Currently IBM and Xilinx are driving the initial definition of a PCIe-based form factor card
[Representative diagram: FPGA, 25G, sideband signals (low frequency)]

19 Thank you! Any questions?
