OpenCAPI Technology
Myron Slota, OpenCAPI Consortium
Join the Conversation #OpenPOWERSummit
Industry Collaboration and Innovation
OpenCAPI Topics
- Computation and Data Access: industry background
- Where/how OpenCAPI technology is used
- Technology overview and advantages
- Demonstrations
- OpenCAPI Consortium: where it all happens

Key Messages Throughout
- Open I/O standard; not tied to Power, architecture agnostic
- High performance: low latency and high bandwidth with no OS/hypervisor/firmware overhead
- Very low accelerator design overhead; programming ease
- Ideal for accelerated computing and SCM; supports heterogeneous environments
- Use cases optimized for within a single system node
- Products exist today!
Industry Background that Defined OpenCAPI

Computation:
- Growing computational demand due to emerging workloads (e.g., AI, cognitive)
- Moore's Law no longer supported by traditional silicon scaling
- Driving increased dependence on hardware acceleration for performance

Data Access:
- Hyperscale datacenters and HPC need much higher network bandwidth: 100 Gb/s -> 200 Gb/s -> 400 Gb/s links are emerging
- Deep learning and HPC require more bandwidth between accelerators and memory
- Emerging memory/storage technologies are driving the need for bandwidth with low latency

Hardware accelerators are defining the attributes of a high-performance bus:
- Growing demand for network performance and network offload
- Introduction of device coherency requirements (IBM's introduction in 2013)
- Emergence of complex storage and memory solutions
- Various form factors, with no single one able to address everything (e.g., GPUs, FPGAs, ASICs)

All relevant to modern data centers.
Use Cases: A True Heterogeneous Architecture Built Upon OpenCAPI 3.0 and OpenCAPI 3.1

OpenCAPI specifications are downloadable from the website at www.opencapi.org (register, then download).
OpenCAPI Key Attributes

[Diagram: any OpenCAPI-enabled processor connects over the 25Gb/s TL/DL link to (a) accelerated OpenCAPI devices (FPGA, SoC, GPU, ASIC/FFSA) with TLx/DLx, caches, an accelerated function, and device memory for storage/compute/network applications, and (b) OpenCAPI memory buffers providing buffered standard system memory and advanced SCM solutions via load/store or block access.]

1. Architecture-agnostic bus: applicable to any system/microprocessor architecture
2. Optimized for high bandwidth and low latency
3. High-performance 25Gbps PHY design with zero overhead
4. Coherency: attached devices operate natively within an application's user space and coherently with the host microprocessor
5. Virtual addressing enables low overhead with no kernel, hypervisor, or firmware involvement
6. Wide range of use cases and access semantics
7. CPU-coherent device memory (Home Agent Memory)
8. Architected for both classic memory and emerging advanced storage class memory
9. Minimal OpenCAPI design overhead (less than 5% of an FPGA)
POWER9 IO Features: Leading the Industry

POWER9 silicon die; various packages (scale-out, scale-up). IO protocols: PCIe Gen4, CAPI 2.0 (Power), NVLink 2.0, OpenCAPI 3.0.

- 8 and 16 Gbps PHY. Protocols supported: PCIe Gen3 x16, PCIe Gen4 x8, and CAPI 2.0 on PCIe Gen4
- 25 Gbps PHY. Protocols supported: OpenCAPI 3.0 and NVLink 2.0
Virtual Addressing and Benefits

An OpenCAPI device operates in the virtual address spaces of the applications that it supports.
- Eliminates kernel and device driver software overhead
- Allows the device to operate on application memory without kernel-level data copies or pinned pages
- Simplifies the programming effort to integrate accelerators into applications
- Improves accelerator performance

The virtual-to-physical address translation occurs in the host CPU.
- Reduces the design complexity of OpenCAPI-attached devices
- Makes it easier to ensure interoperability between OpenCAPI devices and different CPU architectures
- Security: since the OpenCAPI device never has access to a physical address, a defective or malicious device cannot access memory locations belonging to the kernel or to other applications it is not authorized to access
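The security argument above can be made concrete with a toy model. This is not OpenCAPI code: the page size, table contents, and function name are invented for illustration. The point it demonstrates is that the device only ever presents virtual addresses, and the host-owned table walk rejects anything unmapped before physical memory is touched.

```c
#include <stdint.h>

#define PAGE_SHIFT 12                 /* toy 4 KiB pages */
#define NUM_PAGES  8                  /* toy address space: 8 virtual pages */

/* Host-owned table: virtual page -> physical page, or -1 if unmapped. */
static const int64_t page_table[NUM_PAGES] = { 3, 7, -1, 5, -1, -1, 2, 0 };

/* Translate a device-supplied virtual address on the host side.
   Returns the physical address, or -1 to fault the access. */
int64_t host_translate(uint64_t vaddr) {
    uint64_t vpn    = vaddr >> PAGE_SHIFT;
    uint64_t offset = vaddr & ((1ULL << PAGE_SHIFT) - 1);
    if (vpn >= NUM_PAGES || page_table[vpn] < 0)
        return -1;                    /* device never reaches this memory */
    return (page_table[vpn] << PAGE_SHIFT) | offset;
}
```

Because the table lives in the host, a buggy device can at worst request an address and be faulted; it cannot name kernel or foreign physical pages directly.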
Acceleration Paradigms with Great Performance

OpenCAPI is ideal for acceleration due to the bandwidth to/from accelerators, best-of-breed latency, and the flexibility of an open architecture.

- Memory Transform (basic work offload): processor chip and DLx/TLx-attached accelerator exchange data. Examples: machine or deep learning such as natural language processing, sentiment analysis, or other actionable intelligence using OpenCAPI-attached memory
- Egress and Ingress Transform: accelerator transforms data in flight. Examples: encryption, compression, erasure coding prior to delivering data to the network or storage
- Needle-in-a-Haystack Engine: the accelerator scans a large haystack of data and only the needles are sent to the processor. Examples: database searches, joins, intersections, merges
- Bi-Directional Transform: data flows both ways between processor and accelerator. Examples: NoSQL such as Neo4j with graph node traversals

Further examples: video analytics, network security, deep packet inspection, data plane acceleration, video encoding (H.265), high-frequency trading, etc.
Comparison of Memory Paradigms

- Common physical interface between non-memory and memory devices
- OpenCAPI protocol was architected to minimize latency; excellent for classic DRAM memory
- Extreme bandwidth beyond the classical DDR memory interface
- Agnostic interface will handle evolving memory technologies in the future (e.g., compute-in-memory)
- Ability to insert a memory buffer that decouples raw memory from the host interface to optimize power, cost, and performance

Main Memory (basic DDR4/5 attach, OpenCAPI 3.1 architecture): ultra-low-latency ASIC buffer chip adding only ~5ns on top of a native DDR direct connect!

Emerging Storage Class Memory: storage class memories have the potential to be the next disruptive technology. Examples include ReRAM, MRAM, and Z-NAND, all racing to become the de facto standard.

Tiered Memory: storage class memory tiered with traditional DDR memory, all built upon the OpenCAPI 3.1 and 3.0 architecture, while retaining the ability to use load/store semantics.
CAPI and OpenCAPI Performance

                   CAPI 1.0           CAPI 2.0           OpenCAPI 3.0
                   PCIe Gen3 x8       PCIe Gen4 x8       25 Gb/s x8
                   (measured @8Gb/s)  (measured @16Gb/s) (measured @25Gb/s)
 128B DMA Read     3.81 GB/s          12.57 GB/s         22.1 GB/s
 128B DMA Write    4.16 GB/s          11.85 GB/s         21.6 GB/s
 256B DMA Read     N/A                13.94 GB/s         22.1 GB/s
 256B DMA Write    N/A                14.04 GB/s         22.0 GB/s

Measured on POWER8 (CAPI 1.0) and POWER9 (CAPI 2.0 and OpenCAPI 3.0) with Xilinx KU60/VU3P FPGAs.
- POWER8: CAPI 1.0, introduced in 2013
- POWER9: second-generation CAPI 2.0, plus OpenCAPI 3.0, an open architecture designed from a clean slate and focused on bandwidth and latency
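As a sanity check on the table above: a raw 25 Gb/s x8 link moves at most 25 GB/s per direction, so the measured 22.1 GB/s is roughly 88% of the wire rate. A minimal sketch of that arithmetic (the helper names are invented; it assumes 8 bits per byte and ignores line-encoding overhead):

```c
/* Raw one-direction link bandwidth in GB/s: lanes * (Gb/s per lane) / 8. */
double raw_link_gbytes(int lanes, double gbps_per_lane) {
    return lanes * gbps_per_lane / 8.0;
}

/* Fraction of the raw wire rate achieved by a measured transfer rate. */
double link_efficiency(double measured_gbytes, int lanes, double gbps_per_lane) {
    return measured_gbytes / raw_link_gbytes(lanes, gbps_per_lane);
}
```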
Latency Test Results
Latency Test

A simple workload created to simulate communication between the system and an attached FPGA:
1. Copy 512B from the host send buffer to the FPGA
2. Host waits for a 128-byte cache injection from the FPGA, polling on the last 8 bytes
3. Reset the last 8 bytes
4. Repeat from step 1
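The four steps above can be sketched as a host-side ping-pong loop. This is a hypothetical stand-in, not the actual demo code: plain arrays replace the coherent/MMIO buffers, the completion pattern is invented, and fpga_inject() simulates the FPGA's 128-byte cache injection so the sketch runs without hardware.

```c
#include <stdint.h>
#include <string.h>
#include <stdalign.h>

#define SEND_BYTES 512
#define RECV_BYTES 128
#define DONE_FLAG  0x1122334455667788ULL   /* invented completion pattern */

static uint8_t send_buf[SEND_BYTES];
static alignas(8) uint8_t recv_buf[RECV_BYTES];

/* Stand-in for the device: a 128B cache injection whose last 8 bytes
   carry the completion flag the host polls for. */
static void fpga_inject(void) {
    memset(recv_buf, 0xAB, RECV_BYTES - 8);
    uint64_t flag = DONE_FLAG;
    memcpy(recv_buf + RECV_BYTES - 8, &flag, sizeof flag);
}

/* One iteration of the latency loop (steps 1-3). Returns the number of
   polls taken before the completion flag was observed. */
int latency_iteration(const uint8_t *payload) {
    memcpy(send_buf, payload, SEND_BYTES);        /* 1. 512B host -> FPGA  */
    fpga_inject();                                /*    simulated reply    */
    volatile uint64_t *last8 =
        (volatile uint64_t *)(recv_buf + RECV_BYTES - 8);
    int polls = 0;
    while (*last8 != DONE_FLAG)                   /* 2. poll last 8 bytes  */
        polls++;
    *last8 = 0;                                   /* 3. reset last 8 bytes */
    return polls;                                 /* 4. caller repeats     */
}
```

On real hardware the poll spins until the device's cache injection lands, so the elapsed time of one iteration approximates the round-trip latency being measured.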
OpenCAPI-Enabled FPGA Cards
- Mellanox Innova2 Accelerator Card
- Alpha Data 9v3 Accelerator Card

Typical eye diagram at 25Gb/s using these cards.
Barreleye G2 System Demo

Packet classifier demonstration using the Alpha Data 9v3 Accelerator Card on an actual Barreleye G2 demo system (early classifier bring-up at 20 Gb/s).
Barreleye G2 System Demo

Packet classifier demonstration using the Mellanox Innova2 Accelerator Card on an actual Barreleye G2 demo system (early classifier bring-up at 20 Gb/s).
OpenCAPI Consortium
- Incorporated September 13, 2016; announced October 14, 2016
- Open forum founded by AMD, Google, IBM, Mellanox, and Micron to manage the OpenCAPI specification, establish enablement, and grow the ecosystem
- Currently over 35 members

Consortium now established:
- Board of Directors: AMD, Google, IBM, Mellanox Technologies, Micron, NVIDIA, Western Digital, Xilinx
- Governing documents (bylaws, IPR policy, membership) with established membership levels
- Website: www.opencapi.org
- Technical Steering Committee with an established work group process
- Marketing/Communications Committee
- Work groups: TL Specification, DL Specification, PHY Signaling, PHY Mechanical, Compliance, and Enablement; additional work groups being created include Memory, Software, Accelerator, and more
- OpenCAPI specification available on the website; contributed to the consortium as a starting point for the work groups
- Design enablement available today (reference designs, documentation, simulation environment, exercisers, etc.)
OpenCAPI Design Enablement

 Item                                                                   Availability
 OpenCAPI 3.0 TLx and DLx reference Xilinx FPGA designs (RTL and specs) Today
 Xilinx Vivado project build with Memcopy exerciser                     Today
 Device Discovery and Configuration specification and RTL               Today
 AFU Interface Specification                                            Today
 Reference Card Design Enablement Specification                         2Q18
 25Gbps PHY Signal Specification                                        Today
 25Gbps PHY Mechanical Specification                                    Today
 OpenCAPI Simulation Environment (OCSE) tech preview                    Today
 Memcopy and Memory Home Agent exercisers                               Today
 Reference driver                                                       2Q18
Membership Entitlement Details

Strategic level ($25K):
- Draft and final specifications and enablement
- License for product development
- Work group participation and voting
- TSC participation
- Vote on new Board members
- Nominate and/or run for officer election
- Prominent listing in appropriate materials

Contributor level ($15K):
- Draft and final specifications and enablement
- License for product development
- Work group participation and voting
- TSC participation; submit proposals

Observing level ($5K):
- Final specifications and enablement
- License for product development

Academic and Non-Profit level (free):
- Final specifications and enablement
- Work group participation and voting
Current Members

Members at the Strategic, Contributor, Observing, and Academic membership levels.
Cross-Industry Collaboration and Innovation

Research and academic, SW deployment, systems and software, accelerator solutions, SoC, and OpenCAPI protocol products and services: welcoming new members in all areas of the ecosystem.
OpenCAPI Consortium Next Steps

JOIN TODAY! www.opencapi.org

Come see us in the Exhibit Hall at the OpenCAPI booth.