Introduction to the OpenCAPI Interface

1 Introduction to the OpenCAPI Interface
Brian Allison, STSM, OpenCAPI Technology and Enablement
Join the Conversation #OpenPOWERSummit

2 Industry Collaboration and Innovation

3 Introduction to the OpenCAPI Interface
Topics:
- OpenCAPI Protocol Stack
- OpenCAPI Reference Design Overview
- OpenCAPI TLx Reference Code
- OpenCAPI TLx-AFU Interface and Snippets
- OpenCAPI Reference Design Cards
- OpenCAPI Reference AFUs
- OpenCAPI Performance
- OpenCAPI Roadmap

4 OpenCAPI Protocol Stack
Diagram: the host processor stack (host bus interface, host bus protocol layer, TL, TL Frame/Parser, DL, PHY) connects over the serial link to the OpenCAPI device stack (PHYX, DLX, TLX Frame/Parser, TLX, AFU protocol layer, AFU protocol stack interface, AFU). OpenCAPI packets are carried in DL packets on the host side and DLX packets on the device side.
- The Transaction Layer (TL) specifies the control and response packets between a host and an endpoint OpenCAPI device.
- TL: On the host side, the transaction layer converts:
  - Host-specific protocol requests into transaction-layer-defined commands (downbound)
  - TLx commands into host-specific protocol requests (upbound)
  - Responses to endpoint-initiated commands
- TLx: On the endpoint OpenCAPI device, the transaction layer converts:
  - AFU protocol requests into transaction layer commands
  - TL commands into AFU protocol requests
  - Responses to host-initiated commands
- Data Link layer (DL and DLX): supports a Gbps serial data rate per lane connecting a processor to an accelerator device.

5 OpenCAPI Protocol Stack
Diagram: Host CPU -> TL -> DL -> PHY -> PHYX -> DLX -> TLX. From the TLX, config_write and config_read go to the CFG block; all other TL commands go to the AFU.
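The split shown on this slide can be sketched in a few lines. This is only an illustration of the routing rule, not the reference RTL: the function name and the string labels are hypothetical, and the only command names used are the two that appear on the slide.

```python
# Illustrative sketch of the slide-5 routing rule; not the reference design.
# CONFIG_COMMANDS holds the two command names shown on the slide.
CONFIG_COMMANDS = {"config_read", "config_write"}


def route_tl_command(command_name):
    """Return the block that receives a host-initiated TL command.

    Config accesses go to the CFG core; everything else goes to the AFU.
    The name `route_tl_command` is hypothetical.
    """
    if command_name in CONFIG_COMMANDS:
        return "CFG"
    return "AFU"


if __name__ == "__main__":
    for name in ("config_read", "config_write", "some_other_command"):
        print(name, "->", route_tl_command(name))
```

Keeping the rule in the TLx means an AFU never has to decode configuration traffic at all.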

6 OpenCAPI FPGA Reference Design Overview
- Xilinx-based Verilog design for UltraScale+ FPGAs
- Contains PHY configuration (PHYx), Transaction Layer (TLx), Data Link Layer (DLx) and Config Core (CFG)
- GT/s x8 link per PHYx/TLx/DLx/CFG using Xilinx GTY PHY
- Tightly integrated with the DLx logic
- 64B dataflow (P9 Nest runs 16B)
- Vivado toolchain flow with Tcl script project creation
- Currently using internal GitHub for project repository
- Made available to NDA customers
- Will be available to OpenCAPI Consortium members via consortium GitHub

7 OpenCAPI TLx Reference Code
- Config accesses are hidden from the AFU and sent directly to the CFG Core by the TLx
  - Don't want the AFU to have the complexity nor the ability to brick a card
- TLx interfaces to the AFU via a low-level Transaction Layer Protocol (think parallel interface(s))
  - Interface specification defined in the TLX 3.0 Reference Design Specification
- TLx Parser receives OpenCAPI host-initiated TL architected packets and decodes them
  - Separate Command & Response Interface for the separate Virtual Channels
  - Can send 1 command per cycle on each interface to the AFU
  - Separate Data interface
  - Commands are not presented to the AFU until data on the link is received for that command
- TLx Framer receives AFU commands and responses and packetizes them into efficient OpenCAPI TLx architected packets to send to the host
  - Separate Command & Response Interface for separate Virtual Channels
  - Can receive 1 command per cycle on each interface

8 OpenCAPI TLx AFU Interface
- Individual TL packet contents are driven or received by the AFU
  - TLx only parses and packs the contents from/to the link packets into interface fields
  - The AFU needs no knowledge of the location of fields within packets
  - The AFU needs no knowledge of template usage
- TLx has no intelligent logic for architected sequences of flows
  - AFU must perform the proper sequences and follow the architecture
- Credit-based interface to the AFU
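The slide only states that the interface is credit based; it does not give the credit signals or initial counts. As a generic sketch of how credit-based flow control works, assuming a sender that starts with a fixed number of credits, spends one per command, and gets credits back when the receiver frees a slot (the class and method names are hypothetical, not TLx signal names):

```python
# Generic credit-based flow control sketch. Names and the initial credit
# count are assumptions for illustration, not taken from the TLx spec.
class CreditedPort:
    def __init__(self, initial_credits):
        if initial_credits < 0:
            raise ValueError("initial_credits must be non-negative")
        self.credits = initial_credits

    def can_send(self):
        """A command may be driven only while at least one credit is held."""
        return self.credits > 0

    def send(self):
        """Spend one credit for one command."""
        if not self.can_send():
            raise RuntimeError("no credits available; sender must wait")
        self.credits -= 1

    def return_credit(self, count=1):
        """Receiver frees `count` slots and hands the credits back."""
        self.credits += count


if __name__ == "__main__":
    port = CreditedPort(initial_credits=2)
    port.send()
    port.send()
    print("can send:", port.can_send())   # sender is now stalled
    port.return_credit()
    print("can send:", port.can_send())   # one slot freed
```

The point of the scheme is that the sender never needs a back-pressure signal in the same cycle: it simply stops when its credit count reaches zero.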

9 Host to Accelerator Command Snippet

Signal Name | Bits | Source | Description
tlx_afu_cmd_valid | 1 | TLX | Command Valid. The remaining signals in this table are valid coincident with the assertion of tlx_afu_cmd_valid.
tlx_afu_cmd_opcode | 8 | TLX | Command Opcode. Note: Please see the OpenCAPI 3.0 TL Specification for valid opcodes.
tlx_afu_cmd_capptag | 16 | TLX | Unique handle specifying the host CAPP and command instance. Provided by the CAPP requesting command services of the TL.
tlx_afu_cmd_pa | 64 | TLX | Physical Address.
tlx_afu_cmd_dl | 2 | TLX | Command Data Length. Encodings: 2b00 is Reserved; each of the other three 2-bit encodings selects a size in bytes.
tlx_afu_cmd_pl | 3 | TLX | Partial Length. Encodings: 3-bit values selecting a size from 1 byte upward, plus a Reserved encoding.
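A small model can make the "valid qualifies everything else" rule concrete. The sketch below uses only the field widths listed on this slide; the function name, the dictionary layout, and the choice to return `None` when valid is low are assumptions for illustration.

```python
# Field widths in bits, copied from the slide's signal list.
CMD_FIELD_BITS = {"opcode": 8, "capptag": 16, "pa": 64, "dl": 2, "pl": 3}


def sample_command(valid, fields):
    """Model one cycle of the host-to-AFU command interface.

    Returns None when tlx_afu_cmd_valid is low (the other signals carry no
    meaning that cycle). Otherwise checks that each field fits its width
    and returns a copy. `sample_command` is a hypothetical helper name.
    """
    if not valid:
        return None
    checked = {}
    for name, width in CMD_FIELD_BITS.items():
        value = fields[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value:#x} does not fit in {width} bits")
        checked[name] = value
    return checked


if __name__ == "__main__":
    cmd = {"opcode": 0x12, "capptag": 0xBEEF, "pa": 0x1000, "dl": 0b01, "pl": 0}
    print(sample_command(True, cmd))
    print(sample_command(False, cmd))
```

The opcode and tag values in the usage lines are arbitrary numbers, not real OpenCAPI opcodes.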

10 Host to AFU Data Snippet

Signal Name | Bits | Source | Description
tlx_afu_cmd_data_valid | 1 | TLX | Command Data Valid. Valid data is present.
tlx_afu_cmd_data_bus | 512 | TLX | Command Data Bus.
tlx_afu_cmd_data_bdi | 1 | TLX | Bad Data Indicator. If asserted, indicates the data received during the same cycle has an error and cannot be trusted.
afu_tlx_cmd_rd_req | 1 | AFU | AFU requests host command data known to be available.
afu_tlx_cmd_rd_cnt | 3 | AFU | AFU specifies the number of data packets it will accept. Encodings: each of the eight 3-bit values selects a size in bytes.

Note: 001, 010, and 011 were set to match the data length encoding.
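To show how the bad data indicator interacts with the 512-bit bus, here is a minimal sketch of an AFU-side collector. It assumes one 64-byte beat per valid cycle and treats any beat with bdi set as fatal for the whole payload; that policy, the `DataBeat` type, and the function name are assumptions, since the slide does not say how an AFU should react to bad data.

```python
from dataclasses import dataclass

BUS_BYTES = 512 // 8  # tlx_afu_cmd_data_bus is 512 bits wide


@dataclass
class DataBeat:
    data: bytes   # contents of tlx_afu_cmd_data_bus for one valid cycle
    bdi: bool     # tlx_afu_cmd_data_bdi sampled in the same cycle


def collect_command_data(beats):
    """Concatenate valid data beats, rejecting the payload on any bad beat.

    Hypothetical helper: the reject-everything policy is an assumption.
    """
    payload = bytearray()
    for index, beat in enumerate(beats):
        if len(beat.data) != BUS_BYTES:
            raise ValueError(f"beat {index} is not {BUS_BYTES} bytes")
        if beat.bdi:
            raise IOError(f"beat {index} flagged by bad data indicator")
        payload += beat.data
    return bytes(payload)


if __name__ == "__main__":
    good = [DataBeat(bytes(BUS_BYTES), False), DataBeat(bytes(BUS_BYTES), False)]
    print(len(collect_command_data(good)), "bytes collected")
```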

11 AFU Initiated Command and Data Snippet

Signal Name | Bits | Source | Description
afu_tlx_cmd_valid | 1 | AFU | Indicates that a valid AP command has arrived from the AFU to the TLX. Any command field that pertains to the arriving opcode should contain valid information at this time. Other command fields are undefined and may contain garbage.
afu_tlx_cmd_opcode | 8 | AFU | AP Command Opcode. (see TL Specification)
afu_tlx_cmd_actag | 12 | AFU | Address Context tag. (see TL Specification)
afu_tlx_cmd_ea_or_obj | 68 | AFU | Effective Address/Object Handle. (see TL Specification)
afu_tlx_cmd_afutag | 16 | AFU | AFU Tag. (see TL Specification)
afu_tlx_cmd_be | 64 | AFU | Byte enable. (see TL Specification)
afu_tlx_cmd_bdf | 16 | AFU | Bus Device Function. (see TL Specification)
afu_tlx_cmd_pasid | 20 | AFU | User process ID. (see TL Specification)
afu_tlx_cdata_valid | 1 | AFU | AP Command Data Valid. Indicates that a valid packet of command immediate data has arrived from the AFU. The data bus and the bdi bit contain valid information.
afu_tlx_cdata_bus | 512 | AFU | AP Command Data Bus.
afu_tlx_cdata_bdi | 1 | AFU | Bad Data Indicator. Indicates that the AP command data packet is bad.

12 OpenCAPI Reference Design Cards
- Initial work done on Xilinx VU3P FPGA with Alpha Data 9V3 card
- Currently using Vivado 2018.2, but the floorplan snapshot is from 2017.
- Images also created and tested on KU15P FPGA (Mellanox Innova-2)
- Work is ongoing with Xilinx ZU19P FPGA
- Next generation images to be created on Nallatech 250SOC, Alpha Data 9H7 (VU37P) and 9H3 (VU33P)

VU3P Resources | CLB FlipFlops | LUT as Logic | LUT Memory | Block RAM Tile
DLx | 9392 (1.19%) | 19026 (4.82%) | 0 (0%) | 7.5/720 (1.0%)
TLx | 13806 (1.75%) | 8463 (2.14%) | 2156 (1.09%) | 0/720 (0%)

13 OpenCAPI 3.0 Reference AFUs
MemCopy
- The MemCopy example is a data mover from source address -> destination address using Virtual Addressing and includes these features:
  - Work queue for each context which can be configured to do copy commands, interrupts, translation touch, wake host thread (all command types for host validation)
  - Configuration and MMIO Register Space
  - actag Table used for Bus/Device/Function and Process ID identification
  - 512 processes/contexts and configurable up to 32 engines supporting up to 2K transfers using 64B, 128B, or 256B operations
Memory Home Agent (LPC)
- The Memory Home Agent example implements memory off the endpoint OpenCAPI accelerator to act as a coherent extension to the host processor memory
- The Memory Home Agent example includes these features:
  - Configuration and MMIO Register Space
  - Individual and pipelined operation for memory loads and stores
  - Interrupts, with error details reported to software through MMIO registers
  - Sparse Address Mapping feature to extend 1 MB of real space to 4 TB of address
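To illustrate how a data mover breaks one copy into fixed-size operations, the sketch below splits a transfer into 64B, 128B, or 256B pieces. It reads "up to 2K transfers" as a 2048-byte limit per transfer and requires the length to be a multiple of the operation size; both are assumptions, as are the function and constant names. It is not the MemCopy AFU's actual engine logic.

```python
ALLOWED_OP_BYTES = (64, 128, 256)   # operation sizes named on the slide
MAX_TRANSFER_BYTES = 2048           # assumption: "2K" read as 2048 bytes


def split_copy(src, dst, length, op_bytes):
    """Split one copy into (src_addr, dst_addr, size) operations.

    Hypothetical helper for illustration only.
    """
    if op_bytes not in ALLOWED_OP_BYTES:
        raise ValueError(f"operation size must be one of {ALLOWED_OP_BYTES}")
    if not 0 < length <= MAX_TRANSFER_BYTES:
        raise ValueError("transfer length out of range")
    if length % op_bytes:
        raise ValueError("length must be a multiple of the operation size")
    return [(src + off, dst + off, op_bytes) for off in range(0, length, op_bytes)]


if __name__ == "__main__":
    for op in split_copy(src=0x1000, dst=0x8000, length=512, op_bytes=128):
        print(tuple(hex(v) for v in op))
```

Larger operations mean fewer commands on the link for the same payload, which is why the performance slides compare 128B and 256B DMA separately.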

14 OpenCAPI 3.0 Reference AFUs
AFP
- Main performance AFU
- Single process programmed to do streaming reads, streaming writes or a mix
- Data is not checked; the AFU is purely for bandwidth and latency testing
- Interrupt and Wake Host Thread latency counters
- Ping-Pong latency test added (MMIO to AFP->DMA store to memory)

15 CAPI and OpenCAPI Performance

Operation | CAPI 1.0, PCIe Gen3 x8 (measured) | CAPI 2.0, PCIe Gen4 x8 (measured) | OpenCAPI, 25 Gb/s x8 (measured bandwidth @25Gb/s)
128B DMA Read | 3.81 GB/s | 12.57 GB/s | 22.1 GB/s
128B DMA Write | 4.16 GB/s | 11.85 GB/s | 21.6 GB/s
256B DMA Read | N/A | 13.94 GB/s | 22.1 GB/s
256B DMA Write | N/A | GB/s | 22.0 GB/s

Test platform: Power 8/9 CPU, Xilinx KU60/VU3P FPGA
- CAPI 1.0: First Introduction in
- CAPI 2.0: 2nd Generation
- OpenCAPI: Open Architecture with a Clean Slate Focused on Bandwidth and Latency
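One way to read these numbers is against the raw line rate of each link. The sketch below uses the measured 128B DMA read figures from this slide and the per-lane signalling rates (8 GT/s for PCIe Gen3, 16 GT/s for PCIe Gen4, 25 Gb/s for OpenCAPI). It deliberately ignores line encoding and protocol overhead, so the percentages are a rough comparison rather than a statement about achievable throughput; the function and variable names are illustrative.

```python
def raw_gbytes_per_s(gbit_per_lane, lanes=8):
    """Raw one-direction line rate in GB/s, ignoring encoding overhead."""
    return gbit_per_lane * lanes / 8


# (per-lane rate in Gb/s, measured 128B DMA read in GB/s from the slide)
LINKS = {
    "CAPI 1.0, PCIe Gen3 x8": (8, 3.81),
    "CAPI 2.0, PCIe Gen4 x8": (16, 12.57),
    "OpenCAPI, 25 Gb/s x8": (25, 22.1),
}


def utilisation(name):
    """Measured bandwidth as a fraction of the raw line rate."""
    rate, measured = LINKS[name]
    return measured / raw_gbytes_per_s(rate)


if __name__ == "__main__":
    for name in LINKS:
        print(f"{name}: {utilisation(name):.1%} of raw line rate")
```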

16 Latency Ping-Pong Test
- Simple workload created to simulate communication between system and attached FPGA
- Bus traffic recorded with protocol analyzer and PowerBus traces
- Response times and statistics calculated

OpenCAPI path: Host Code -> TL, DL, PHY -> OpenCAPI Link -> TLx, DLx, PHYx -> FPGA Code
PCIe path: Host Code -> PCIe Stack -> PCIe Link -> FPGA PCIe HIP* -> FPGA Code

Host Code (same on both paths):
1. Copy 512B from cache to FPGA
2. Poll on incoming 128B cache injection
3. Reset poll location
4. Repeat

FPGA Code (same on both paths):
1. Poll on 512B received from host
2. Reset poll location
3. DMA write 128B for cache injection
4. Repeat

* HIP refers to hardened IP
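The two loops on this slide can be mimicked in software to see the handshake structure. The sketch below stands in a thread for the FPGA and single-slot queues for the poll locations; it models only the ordering of the steps, not hardware timing, and every name in it is hypothetical.

```python
import queue
import threading

HOST_TO_FPGA_BYTES = 512   # host -> FPGA copy size from the slide
FPGA_TO_HOST_BYTES = 128   # FPGA -> host cache-injection size from the slide


def fpga_loop(to_fpga, to_host, iterations):
    """Stand-in for the FPGA code: wait for 512B, answer with 128B."""
    for _ in range(iterations):
        payload = to_fpga.get()                  # 1. poll on 512B from host
        assert len(payload) == HOST_TO_FPGA_BYTES
        # 2. reset poll location (taking the item empties the single slot)
        to_host.put(bytes(FPGA_TO_HOST_BYTES))   # 3. DMA write 128B
        # 4. repeat


def host_loop(iterations):
    """Stand-in for the host code; returns the number of round trips."""
    to_fpga = queue.Queue(maxsize=1)
    to_host = queue.Queue(maxsize=1)
    worker = threading.Thread(target=fpga_loop, args=(to_fpga, to_host, iterations))
    worker.start()
    completed = 0
    for _ in range(iterations):
        to_fpga.put(bytes(HOST_TO_FPGA_BYTES))   # 1. copy 512B to FPGA
        reply = to_host.get()                    # 2. poll on incoming 128B
        assert len(reply) == FPGA_TO_HOST_BYTES  # 3. slot is now reset
        completed += 1                           # 4. repeat
    worker.join()
    return completed


if __name__ == "__main__":
    print(host_loop(1000), "round trips completed")
```

Because each side blocks until the other has written, every iteration is one full round trip, which is what makes the workload suitable for latency measurement.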

17 Latency Test Results

Stage | P9 OpenCAPI | P9 PCIe Gen4 | P9 PCIe Gen3 | Kaby Lake PCIe Gen3*
Total latency | 378ns | est. <555ns | 737ns | 776ns
Host clocks | 3.9GHz Core, 2.4GHz Nest | 3.9GHz Core, 2.4GHz Nest | 3.9GHz Core, 2.4GHz Nest | see *
Host side | TL, DL, PHY: 298ns | PCIe Stack: est. <337ns | PCIe Stack: 337ns | PCIe Stack: 376ns
Link | OpenCAPI Link | PCIe G4 Link | PCIe G3 Link | PCIe G3 Link
FPGA side | TLx, DLx, PHYx: 80ns | Xilinx PCIe HIP: 218ns | Altera PCIe HIP: 400ns | Altera PCIe HIP: 400ns
FPGA | Xilinx FPGA VU3P | Xilinx FPGA VU3P | Altera FPGA Stratix V | Altera FPGA Stratix V

Jitter values listed on the slide: 2ns, 7ns, 31ns

* Intel Core i Quad-Core 3.6GHz (4.2GHz TurboBoost)

Notes on how the figures were obtained:
- Derived from round-trip time minus simulated FPGA app time
- Derived from round-trip time minus simulated FPGA app time and simulated FPGA TLx/DLx/PHYx time
- Derived from measured CPU turnaround time plus vendor provided HIP latency
- Derived from simulation
- Vendor provided latency statistic
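The totals on this slide are consistent with a simple sum of a host-side figure and an FPGA-side figure. The check below pairs the values in the order they appear on the slide, which is an assumption about the original layout; the PCIe Gen4 entries are the slide's estimated upper bounds.

```python
# (host-side ns, FPGA-side ns, total ns), paired in slide order.
# The pairing is an assumption; Gen4 values are "est. <" upper bounds.
LATENCY_NS = {
    "P9 OpenCAPI": (298, 80, 378),
    "P9 PCIe Gen4": (337, 218, 555),
    "P9 PCIe Gen3": (337, 400, 737),
    "Kaby Lake PCIe Gen3": (376, 400, 776),
}


def totals_consistent(table):
    """True when host + FPGA equals the quoted total for every column."""
    return all(host + fpga == total for host, fpga, total in table.values())


if __name__ == "__main__":
    for name, (host, fpga, total) in LATENCY_NS.items():
        print(f"{name}: {host} + {fpga} = {host + fpga} (slide total {total})")
    print("consistent:", totals_consistent(LATENCY_NS))
```

Read this way, most of the gap between OpenCAPI and the PCIe cases comes from the FPGA-side figure (80ns against 218ns or 400ns), with a smaller contribution from the host side.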

18 Roadmap
- OpenCAPI 4.0 (P9 /Axone)
  - Adds posted DMA Store operations with Address Translation Cache
  - New AFU validation/reference design needs
  - Address Translation Cache in MemCopy (in development)
- Storage Class Memory Development
  - Technology previews being developed on OpenCAPI

19 Table of Enablement Deliveries

Item | Delivery Name | Where to Obtain | Available When
OpenCAPI 3.0 TLx and DLx Reference Xilinx FPGA Designs (RTL and Specifications) | <snapshot>.tar.gz | Enablement WG | Today
Xilinx Vivado Project Build with Memcopy Exerciser | Vivado Project Flow | Enablement WG | Today
Device Discovery and Configuration Specification and RTL | OpenCAPI 3.0 Configuration Sub-System Reference Design Specification | Enablement WG Causeway | Today
AFU Interface Specification | TLX 3.0 Reference Design.pdf | Enablement WG Causeway | Today
25Gbps PHY Signal Specification | OC PHY 25G Specification | PHY Signalling WG Causeway | Today
25Gbps PHY Mechanical Specification | 25Gbps Interface Mechanical Spec | PHY Mechanical WG Causeway | Today
OpenCAPI Simulation Environment (OCSE) | ocse-<version>.tar.gz, OpenCAPIDemokit.pdf | Enablement WG | Today
Memcopy and Memory Home Agent Exercisers | MCP3 and LPC <snapshot>.tar.gz | Enablement WG | Today
Reference Driver Available | LIBOCXL | Ubuntu GitHub | Today
