Intel FPGA IP Core Cache Interface (CCI-P)


Intel FPGA IP Core Cache Interface (CCI-P)
Interface Specification
September 2017
Revision 0.5
Document Number: External

Notice: This document contains information on products in the design phase of development. The information here is subject to change without notice. Do not finalize a design with this information.

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure. Intel does not assume any liability for lost or stolen data or systems or any damages resulting from such losses.

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance. Copies of documents which have an order number and are referenced in this document may be obtained from Intel.

Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. Copyright 2017, Intel Corporation. All Rights Reserved.

Contents

1 About this Document
    Intended Audience
    Conventions
    Related Documentation
    Glossary
2 Introduction
    Multi-chip and Discrete Package with Intel FPGA Block Diagram
    Development Model
    Memory Hierarchy
3 CCI-P Interface
    Features
    Signaling Information
    Read from/write to Main Memory
    UMsg
    MMIO Cycles to I/O Memory
    CCI-P Tx Signals
    Tx Header Format
    CCI-P Rx Signals
    Rx Header and Rx Data Format
    Multi-Cache Line Memory Requests
    Additional Control Signals
    Protocol Flow
        Upstream Requests
        Downstream Requests
    Ordering Rules
        Memory Requests
        MMIO Requests
    Timing Diagrams
    Clock Frequency
    CCI-P Guidance
4 AFU Requirements
    Mandatory AFU CSR Definitions
    AFU Discovery Flow
    AFU_ID
        How to Create an AFU_ID / GUID
        How to Use an AFU_ID
5 Basic Building Blocks
6 Device Feature List

Figures

Figure 2-1. High Level Block Diagram of MCP/DCP with Intel FPGA IP Logic
Figure 2-2. MCP/DCP with Intel FPGA IP System Memory Hierarchy, 1 Processor Topology
Figure 3-1. CCI-P Signals
Figure 3-2. UMsg Initialization and Usage Flow
Figure 3-3. Multi-CL Memory Request
Figure 3-4. Multi-CL Memory Write Responses
Figure 3-5. Multi-CL Memory Read Responses
Figure 3-6. Two Writes on Same VC, Only One Outstanding
Figure 3-7. Write Out of Order Commit
Figure 3-8. Use WrFence to Enforce Write Ordering
Figure 3-9. Read Re-Ordering to Same Address, Different VCs
Figure 3-10. Read Re-Ordering to Same Address, Same VC
Figure 3-11. Tx Channel 0 and 1 Almost Full Threshold
Figure 3-12. Write Fence Behavior
Figure 3-13. C0 Rx Channel Interleaved between MMIO Requests and Memory Responses
Figure 3-14. Rd Response Timeout
Figure 4-1. AFU Discovery Flow
Figure 6-1. Example Feature Hierarchy
Figure 6-2. Device Feature Conceptual View

Tables

Table 1-1. Related Documentation
Table 1-2. Acronyms and Definition Table
Table 2-1. CCI-P Features
Table 2-2. Comparison of Platform Capabilities
Table 2-3. AFU Memory Read Paths
Table 3-1. CCI-P Features Summary
Table 3-2. Tx Channel Description
Table 3-3. Tx Header Field Definitions
Table 3-4. Tx Header Field Definitions
Table 3-5. C0 Read Memory Request Header Format Structure: t_ccip_c0_reqmemhdr
Table 3-6. C1 Write Memory Request Header Format Structure: t_ccip_c1_reqmemhdr
Table 3-7. C1 Fence Header Format Structure: t_ccip_c1_reqfencehdr
Table 3-8. C2 MMIO Response Header Format
Table 3-9. Rx Channel Signal Description
Table 3-10. Rx Header Field Definitions
Table 3-11. AFU Rx Response Encodings and Channels Mapping
Table 3-12. C0 Memory Read Response Header Format Structure: t_ccip_c0_rspmemhdr
Table 3-13. MMIO Request Header Format

Table 3-14. C1 Memory Write Response Header Format Structure: t_ccip_c1_rspmemhdr
Table 3-15. UMsg Header Format
Table 3-16. WrFence Header Format Structure: t_ccip_c1_rspfencehdr
Table 3-17. Clock and Reset
Table 3-18. Protocol Flow for Upstream Request from AFU to FIU
Table 3-19. CCI-P VL0 Protocol Flows
Table 3-20. Protocol Flow for Downstream Requests from CPU to AFU
Table 3-21. Ordering Rules for Upstream Requests from AFU
Table 3-22. MMIO Ordering Rules
Table 3-23. Clock Frequency
Table 3-24. Recommended Choices for Memory Requests
Table 4-1. Register Attribute Definition
Table 4-2. Mandatory AFU CSRs
Table 4-3. Feature Header CSR Definition
Table 4-4. AFU_ID_L CSR Definition
Table 4-5. AFU_ID_H CSR Definition
Table 4-6. DFH_RSVD0 CSR Definition
Table 4-7. DFH_RSVD1 CSR Definition
Table 6-1. Device Feature Header CSR
Table 6-2. Next DFH Byte Offset Example
Table 6-3. Mandatory BBB DFH Register Map
Table 6-4. BBB_ID_L CSR Definition
Table 6-5. BBB_ID_H CSR Definition

Code

Code 3-1. ccip_std_afu Port Map
Code 3-2. Tx Interface Structure Inside ccip_if_pkg.sv
Code 3-3. Tx Channel Structure Inside ccip_if_pkg.sv
Code 3-4. Rx Interface Structure Inside ccip_if_pkg.sv
Code 3-5. Rx Channel Structure Inside ccip_if_pkg.sv
Code 4-1. Set the Mandatory AFU Registers in the AFU
Code 4-2. Software Reads the AFU ID

Document Revision History

Document Number: External
Revision 0.5 — Initial External Revision — September 2017

1 About this Document

This document describes the Core Cache Interface (CCI-P) specification, which is the interface between the Accelerated Function Unit (AFU) and a multi-chip package (MCP) or Discrete Chip Package (DCP) with Intel FPGA IP.

1.1 Intended Audience

The intended audience is system engineers, platform architects, and software developers. Users must design the hardware AFU to be compliant with the CCI-P specification.

1.2 Conventions

Conventions used in this document include the following:

#  Precedes a command that is to be entered as root.
$  Precedes a command that is to be entered as a user.
This font  Filenames, commands, and keywords are printed in this font. Long command lines are also printed in this font. Although some very long command lines may wrap to the next line, the return is not considered part of the command; do not enter it.
<variable_name>  Indicates that the placeholder text between the angle brackets is to be replaced with an appropriate value. Do not enter the angle brackets.

1.3 Related Documentation

Table 1-1. Related Documentation

Intel Arria 10 Avalon-ST Interface with SR-IOV PCIe Solutions User Guide — This document is the Intel Arria 10 PCIe* SR-IOV datasheet.

Intel Software Developer's Manual — This document contains all three volumes of the Intel 64 and IA-32 Architectures Software Developer's Manual: Basic Architecture; Instruction Set Reference A-Z; and System Programming Guide. Refer to all three volumes when evaluating your design needs. (manuals/64-ia-32-architectures-software-developer-manual pdf)

Intel Virtualization Technology for Directed I/O Architecture Specification — This document describes the Intel Virtualization Technology for Directed I/O (Intel VT for Directed I/O); specifically, it describes the components supporting I/O virtualization as it applies to platforms that use Intel processors and core logic chipsets complying with Intel platform specifications. (roduct-specifications/vt-directed-io-spec.pdf)

1.4 Glossary

Table 1-2. Acronyms and Definition Table

AFU — Accelerated Function Unit. Hardware accelerator implemented in FPGA logic that accelerates, or intends to accelerate, an application.
ALI — AFU Link Interface. The interface between software and CCI-P.
ASE — AFU Simulation Environment. A co-development and simulation tool suite available in the software SDK.
CA — Caching Agent. A Caching Agent (CA) makes read and write requests to the coherent memory in the system. It is also responsible for servicing snoops generated by other Intel QuickPath Interconnect (Intel QPI) agents in the system.
CCI-P — Core Cache Interface. The interface between the AFU and the FPGA Interface Unit (FIU).
CL — Cache Line. A 64-byte cache line.
DPI — Direct Programming Interface. A set of features in SystemVerilog that allows export/import of parameters to/from a C function.
FIU — FPGA Interface Unit. The Intel UPI and PCIe* IP on the FPGA together form the FIU sub-block.
FPGA — Field Programmable Gate Array.
PA — Physical Address. A physical address of the host machine.
IP — Intellectual Property. A reusable block of logic or design.
IPC — Inter-Process Communication. Refers to constructs in Linux* like shared memory (/dev/shm) and message queues (/dev/mqueue); these are leveraged for ASE core functionality.
KiB — 1024 bytes. The term KiB is for 1024 bytes and KB for 1000 bytes. When referring to memory, KB is often used and KiB is implied. When referring to clock frequency, kHz is used, and here K is 1000.
Mdata — Message Tag Data. A user-defined field, which is relayed from the Tx header to the Rx header. It may be used to tag requests with a transaction ID or channel ID.
Msg — Message. A control notification.
NLB — Native Loopback Adapter. A sample AFU that implements a loopback test.
PAR — Place and Route. The place-and-route stage of the FPGA compilation tool chain.
RdLine_I (1) — Read Line Invalid. Memory read request, with FPGA cache hint set to Invalid, i.e., do not cache it. The line will not be cached in the FPGA, but the request may cause FPGA cache pollution.
RdLine_S — Read Line Shared. Memory read request, with FPGA cache hint set to Shared. An attempt will be made to keep it in the FPGA cache in a shared state.
Rx — Receive. Receive or input, from the AFU's perspective.
Tx — Transmit. Transmit or output, from the AFU's perspective.
Upstream — Logical direction towards the CPU. For example, an upstream port is a port going to the CPU.
UMsg — Unordered Message from CPU to AFU. An unordered notification with a 64-byte payload.
UMsgH — Unordered Message Hint from CPU to AFU. A hint for a subsequent UMsg; it has no data payload.
Intel UPI — Intel Ultra Path Interconnect. Intel's proprietary coherent interconnect protocol between Intel cores or other IP.
WrLine_I — Write Line Invalid. Memory write request, with FPGA cache hint set to Invalid. FIU writes the data with no intention of keeping the data in the FPGA cache.
WrLine_M — Write Line Modified. Memory write request, with FPGA cache hint set to Modified. FIU writes the data and leaves it in the FPGA cache in the Modified state.
WrPush_I — Write Push Invalid. Memory write request, with FPGA cache hint set to Invalid. FIU writes the data into the processor's Last Level Cache (LLC) with no intention of keeping the data in the FPGA cache. The LLC it writes to is always the LLC associated with the processor where the DRAM address is homed.

(1) The cache tag is used to track the request status for all outstanding requests on UPI. Therefore, even though RdLine_I is marked Invalid upon completion, it temporarily consumes a cache tag to track the request status over UPI. This may result in the eviction of a cache line, resulting in cache pollution. The advantage of using RdLine_I is that it is not tracked by the CPU directory; thus it prevents snooping from the CPU.

2 Introduction

CCI-P is the hardware-side signaling interface between the Accelerated Function Unit (AFU) and the FPGA Interface Unit (FIU). This document defines the signaling interface; it specifies the access types, the request format, the memory model, and the mandatory AFU CSRs, and it provides timing diagrams and AFU design guidelines.

CCI-P provides an abstraction of the physical links between the FPGA and the CPU. An AFU sees a unified interface with four virtual channels and a unified address space. CCI-P uses data payloads of up to four cache lines (4 CL). Table 2-1 lists some key CCI-P features.

Table 2-1. CCI-P Features

Feature                                    CCI-P
Data Transfer Size                         64, 128, 256B
Addressing Mode                            Physical Addressing Mode
Addressing Width (CL aligned addresses)    42 bits
Caching Hints                              Yes
Virtual Channels                           VA, VL0, VH0, VH1
Response Ordering                          Out of order responses
MMIO Read and Write                        Supported
FPGA to CPU Interrupt                      Not Supported
Interface Clk frequency                    400 MHz

CCI-P introduces two architectural concepts: Device Feature Lists (DFLs) and Basic Building Blocks (BBBs). A DFL defines a structure for grouping like functionalities and enumerating them. A BBB defines an architecture for wrapping features into building blocks, which you can incorporate into your AFU. BBBs are source-visible reference designs; other than a few mandatory registers, there are no other requirements imposed on a BBB. For example, the Memory Properties Factory (MPF) is a BBB that translates virtual memory addresses to physical memory addresses for memory shared between the Intel Xeon processor and the FPGA. MPF also performs read response ordering and provides data hazard resolution. Section 5 provides more information on BBBs.

2.1 Multi-chip and Discrete Package with Intel FPGA Block Diagram

FPGA logic (as shown in Figure 2-1) is divided into two parts: the Intel-provided FPGA Interface Unit (FIU), represented by the blue box (called the blue bitstream), and the user-developed AFU, represented by the green box (called the green bitstream). The blue bitstream is the system/platform code; it is configured at boot time and remains resident to manage the system buses. The green bitstream is in a partial reconfiguration region and may be updated on a live system.

The FIU implements all the key features required for deployment and manageability of the FPGA with an Intel Xeon processor within the datacenter. The FIU implements the interface protocols for the links between the CPU and FPGA. The FIU also provides platform capabilities such as Intel Virtualization Technology (Intel VT) for Directed I/O (Intel VT-d), security, error monitoring, performance monitoring, power and thermal management, and partial reconfiguration of the AFU.

Note: There are three physical links: PCIe0, PCIe1, and UPI. These physical links are multiplexed as virtual channels on the CCI-P interface. Refer to Section 2.3 for more information about physical and virtual channels.

The System Management Bus (SMBus) interface running between the Intel Xeon processor and the MCP or DCP with Intel FPGA IP is SMBus-like; it does not follow published SMBus specifications. It is used for out-of-band temperature monitoring, configuration during the bootstrap process, and platform debug purposes.

Figure 2-1. High Level Block Diagram of MCP/DCP with Intel FPGA IP Logic
(Block diagram: the Intel Xeon processor connects to the FPGA Interface Unit (FIU) through an SMBus slave, two PCIe Gen3 x8 endpoints (EP0, EP1), and a coherent UPI 9.2G interface with a cache controller; BDX-only and SKX-only blocks are marked, and some blocks are optional/parameterized. The FIU contains the FPGA Management Engine (FME: thermal monitor, power monitor, performance monitor, partial reconfiguration, global errors), the fabric, the IOMMU and device TLB, and CCI-P Port0 (SignalTap, UMsg, port reset, port errors), which connects through the PR unit to AFU 0.)

Refer to Table 2-2 for a list of platform capabilities.

Unified address space — Even though the FIU has three physical links going to the CPU, the AFU maintains a single view of the system address space. A write to address X directed over the coherent interface or PCIe goes to the same cache line in system memory.

Intel VT-d support — MCP/DCP with Intel FPGA IP has hardware support for memory isolation.

Partial Reconfiguration (PR) of AFU

PR uses Altera FPGA technology to allow a user to reconfigure parts of the FPGA device dynamically, while the remainder of the FPGA continues to operate. MCP/DCP with Intel FPGA IP supports one AFU.

Remote Debug — MCP/DCP with Intel FPGA IP enables remote access to the SignalTap II Logic Analyzer for in-system debug. Remote access makes it possible to use the SignalTap II Logic Analyzer over the network when physical access is not available, the expected debug usage in a data center environment.

Table 2-2. Comparison of Platform Capabilities

Capability                    Intel Xeon Processor E5 v4 Family with FPGA IP    Current MCP/DCP with Intel FPGA IP
Unified Address space         Yes                                               Yes
Intel VT-d support for AFU    No                                                Yes
Partial Reconfiguration       Yes                                               Yes
Remote Debug                  Yes                                               Yes
FPGA Cache size               64 KiB direct mapped                              128 KiB direct mapped

2.2 Development Model

The two supported AFU development models are Hardware Description Language (HDL) design and OpenCL design.

1. HDL design — This is the traditional FPGA development flow: users design an AFU in an HDL such as Verilog, SystemVerilog, or VHDL, adhering to the CCI-P interface specification, then compile the RTL through the Intel Quartus tool chain to generate an AFU bitstream.
2. OpenCL design — The OpenCL SDK is a framework for writing programs at a higher, C-like level of abstraction. Users develop an AFU in OpenCL C and compile it along with the MCP/DCP with Intel FPGA IP Board Support Package (BSP) to generate an FPGA bitstream and a software executable. For best performance, the OpenCL code must be optimized for the MCP/DCP with Intel FPGA IP platform.

Applications can even simultaneously utilize multiple distinct implementations of the same service API.

2.3 Memory Hierarchy

This section explains the memory hierarchy in the MCP/DCP with Intel FPGA IP system; refer to Figure 2-2. The green dotted box shows the multi-processor coherence domain. The FIU on the FPGA extends the coherence domain from the processor to the FPGA, encompassing a cache implemented on the FPGA (called the FPGA cache). The FIU implements a cache controller and a UPI Caching Agent (CA). The CA makes read and write requests to coherent system memory and services snoop requests to the FIU cache.

Figure 2-2. MCP/DCP with Intel FPGA IP System Memory Hierarchy, 1 Processor Topology

The CCI-P interface abstracts the physical links to the processor and provides simple load/store semantics to the AFU for accessing system memory. The physical links are presented as virtual channels on the CCI-P interface, and each request can select its virtual channel. The virtual channels are called VL0, VH0, and VH1. There is a fourth, VA (for Virtual Auto), where the FIU maps requests onto the three physical buses, optimizing for bandwidth. Refer to Table 2-3. The response header identifies which VC was selected by the FIU.

For a single-processor system, the AFU sees a three-level memory hierarchy: (1) FIU Cache, (2) Processor Last Level Cache (LLC), (3) DRAM. The memory access latency increases from (1) to (3). Note that the AFU accesses the 2nd- and 3rd-level memory along three independent paths, each with a different latency. Table 2-3 lists the different possible AFU memory read operations in increasing order of latency. Each row shows the request path; the node that services the request is highlighted in GREEN.

Table 2-3. AFU Memory Read Paths

Request                FPGA Cache                   Processor LLC    DRAM
FPGA Cache Hit         Hit (only applies to VL0)    -                -
Processor Cache Hit    Miss                         Hit              -
All Cache Miss         Miss                         Miss             Read

If you are still developing experience with the CCI-P interface, choose the VA channel. This channel is optimized for maximum bandwidth and producer-consumer type data flows. Refer to Section 3.12 for ordering rules. When you choose VA, the FIU decides how to steer each request to a physical link based on the caching hint, the data payload size, and link utilization:

- Cacheable requests will be biased towards the UPI link.
- 64B requests will be biased towards the UPI link. (A cache line is 64 bytes.)
- A multi-cache-line read/write will NOT be split; it is guaranteed to be processed by a single physical link.
- VA will attempt to balance the load across the virtual channels.

The cache is along the VL0 data path, and the VC steering decision is made before the cache lookup. You could incur a high memory latency if the requested cache line is cached in the FPGA but the request is steered to VH*; in that case, the processor must snoop the FPGA cache in order to complete the VH request.
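To make the request formatting concrete, the following is a minimal sketch of populating a C0 read request header using the CCI-P structures. The enum literals (eVC_VA, eREQ_RDLINE_I, eCL_LEN_4) and field names follow ccip_if_pkg.sv conventions but should be verified against your release of the package; rd_addr and rd_tag are hypothetical AFU-side signals.

```systemverilog
// Sketch: filling a C0 read request header per the VA guidance above.
// Assumed names: eVC_VA, eREQ_RDLINE_I, eCL_LEN_4, and the field names of
// t_ccip_c0_reqmemhdr -- check ccip_if_pkg.sv in your release.
t_ccip_c0_reqmemhdr rd_hdr;

always_comb begin
  rd_hdr          = '0;              // drive RSVD bits to 0, as required
  rd_hdr.vc_sel   = eVC_VA;          // let the FIU steer: best steady-state BW
  rd_hdr.req_type = eREQ_RDLINE_I;   // no intention to cache in the FPGA
  rd_hdr.cl_len   = eCL_LEN_4;       // 4-CL (256B) payload biases VA to PCIe
  rd_hdr.address  = rd_addr;         // CL-aligned physical address (42 bits)
  rd_hdr.mdata    = rd_tag;          // user tag, returned with the response
end
```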

3 CCI-P Interface

CCI-P provides access to two types of memory: main memory and input/output (I/O) memory.

Main Memory — The memory attached to the processor and exposed to the operating system. Requests from the AFU to main memory are called upstream requests. Subsequent to this section, main memory is referred to simply as memory.

I/O Memory — I/O memory is implemented within the I/O device, which in our case is the AFU. How this memory is implemented and organized is up to the AFU; it may choose flip-flops, M20Ks, or MLABs. The CCI-P interface defines a request format for accessing I/O memory using Memory Mapped I/O (MMIO) requests. Requests from the processor to I/O memory are called downstream requests. The AFU's MMIO address space is 256 KB.

Figure 3-1 shows all CCI-P signals, grouped into three Tx channels, two Rx channels, and some additional control signals.

Tx/Rx Channels — A grouping of signals that together completely defines a request or response. The flow direction is from the AFU's point of view: Tx flows from AFU to FIU; Rx flows from FIU to AFU.

Figure 3-1 reflects the organization shown in the files ccip_std_afu.sv and ccip_if_pkg.sv.

Figure 3-1. CCI-P Signals

3.1 Features

Table 3-1 summarizes the features unique to the CCI-P interface for the AFU.

Table 3-1. CCI-P Features Summary

Virtual Channels — Physical links are presented to the AFU as virtual channels. The AFU can select the virtual channel for each memory request.
    VL0 — Low latency virtual channel (mapped to UPI).
    VH0 — High latency virtual channel (mapped to PCIe0). Protocol efficiency is better for larger data payloads.
    VH1 — High latency virtual channel (mapped to PCIe1). Protocol efficiency is better for larger data payloads.
    VA — Virtual Auto: the FIU auto-selects the link based on link utilization, request caching hint, and payload size. Latency: expect to see high variance. BW: expect to see high steady-state BW.

Memory Request — AFU read/write to memory.
    Addressing Mode — Physical address.
    Address Width — 42 bits (CL address).
    Data Lengths — 64B, 128B, 256B.
    Byte Addressing — Not supported.
    FPGA Caching Hint — The AFU can ask the FIU to cache the CL in a specific state. For requests directed to VL0, the FIU attempts to cache the data in the requested state, given as a hint. Except for WrPush_I, cache hints on requests to VH0/1 are ignored. Note: The caching hint is only a hint and provides no guarantee of final cache state. Ignoring a cache hint impacts performance but does not impact functionality.
        <request>_i — No intention to cache.
        <request>_s — Desire to cache in shared (S) state.
        <request>_m — Desire to cache in modified (M) state.

MMIO Request — CPU read/write to AFU I/O memory.
    MMIO Read payload — 4B, 8B.
    MMIO Write payload — 4B, 8B, 64B. MMIO writes could be combined by the x86 write-combining buffer.

UMsg — Unordered Message from CPU to AFU.
    UMsg data payload — 64B.
    # UMsgs supported — 8 per AFU.

3.2 Signaling Information

All CCI-P signals must be synchronous to pclk. All signals are active high, unless explicitly mentioned; active low signals use the suffix _n. Intel recommends using the CCI-P structures defined inside the ccip_if_pkg.sv file, which is included in the RTL package.

All AFU output signals must be registered. AFU output bits marked RSVD are reserved and must be driven to 0. AFU output bits marked RSVD-DNC are don't-care bits; the AFU can drive either 0 or 1. All AFU input signals must also be registered. AFU input bits marked RSVD must be treated as don't care (X) by the AFU.

Code 3-1 shows the port map for the ccip_std_afu module. The AFU must be instantiated under it. The subsequent sections explain the interface signals.

Code 3-1. ccip_std_afu Port Map

module ccip_std_afu(
  // CCI-P Clocks and Resets
  input logic pclk,                      // 400MHz - CCI-P clock domain. Primary interface clock
  input logic pclkdiv2,                  // 200MHz - CCI-P clock domain
  input logic pclkdiv4,                  // 100MHz - CCI-P clock domain
  input logic uclk_usr,                  // User clock domain
  input logic uclk_usrdiv2,              // User clock domain. Half the programmed frequency
  input logic pck_cp2af_softreset,       // CCI-P ACTIVE HIGH Soft Reset
  input logic [1:0] pck_cp2af_pwrstate,  // CCI-P AFU Power State
  input logic pck_cp2af_error,           // CCI-P Protocol Error Detected

  // Interface structures
  input  t_if_ccip_rx pck_cp2af_srx,     // CCI-P Rx Port
  output t_if_ccip_tx pck_af2cp_stx      // CCI-P Tx Port
);

3.3 Read from/write to Main Memory

The AFU makes a memory read request to the FIU over Channel 0 (C0), using Tx signals, and receives the response over C0, using Rx signals. The AFU drives the C0 valid signal to indicate that the C0 Hdr contains a request. The c0_reqmemhdr structure provides a convenient mapping from flat bit-vector to read request fields. The req_type signal provides a cache hint (RDLINE_I, Invalid, or RDLINE_S, Shared). The mdata field is a user-defined request ID.
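The read flow above can be sketched in SystemVerilog as follows. This is a minimal sketch, not a complete design: rd_pending, rd_hdr, rd_tag, and rd_buffer are hypothetical AFU-side signals, and the response encoding name eRSP_RDLINE and the rspvalid field name are assumptions modeled on ccip_if_pkg.sv conventions.

```systemverilog
// Sketch: issuing a read on the C0 Tx channel and matching its response on
// the C0 Rx channel. The almost-full check gates new requests (see Table 3-2).
always_ff @(posedge pclk) begin
  if (pck_cp2af_softreset) begin
    pck_af2cp_stx.c0.valid <= 1'b0;
  end else begin
    // Issue one read when a request is pending and channel 0 is not almost full
    pck_af2cp_stx.c0.valid <= rd_pending && !pck_cp2af_srx.c0txalmfull;
    pck_af2cp_stx.c0.hdr   <= rd_hdr;   // request header populated elsewhere

    // Consume the response: mdata identifies which request completed
    if (pck_cp2af_srx.c0.rspvalid &&
        pck_cp2af_srx.c0.hdr.resp_type == eRSP_RDLINE) begin
      rd_buffer[pck_cp2af_srx.c0.hdr.mdata] <= pck_cp2af_srx.c0.data;
    end
  end
end
```

Because responses may return out of order, indexing the buffer by mdata (rather than by issue order) is the simplest way to reassociate data with requests.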

Then, the FIU responds over C0. The resp_type signal in the c0_rspmemhdr structure indicates the response type (Memory Read or UMsg Received). The data field in C0 contains the data that were read. The mdata field in the c0_rspmemhdr structure contains the same value that went out with the request.

The AFU makes a memory write request to the FIU over Channel 1 (C1), using Tx signals, and receives the response over C1, using Rx signals. The AFU drives the C1 valid signal to indicate that the C1 Hdr contains a request. The c1_reqmemhdr structure provides a convenient mapping from flat bit-vector to write request fields. The req_type signal provides the request type and cache hint.

Then, the FIU responds over C1 using Rx signals. The resp_type field in the c1_rspmemhdr structure indicates whether the response is for a memory write. The mdata field in the c1_rspmemhdr structure contains the same value that went out with the write request. Write memory requests need explicit synchronization using WrFence.

3.4 UMsg

UMsg provides the same functionality as a spin loop from the AFU, without burning CCI-P read bandwidth. Think of it as a spin loop optimization: a monitoring agent inside the FPGA cache controller watches for snoops to cache lines allocated by the driver. When it sees a snoop to such a cache line, it reads the data back and sends a UMsg to the AFU. The UMsg flow makes use of the cache coherency protocol to implement a high speed unordered messaging path from CPU to AFU.

This process consists of two stages, as shown in Figure 3-2. The first stage is initialization, where software pins the UMsg Address Space (UMAS) and shares the UMAS start address with the FPGA cache controller. Once this is done, the FPGA cache controller reads each cache line in the UMAS and puts it in the shared state in the FPGA cache. The second stage is actual usage, where the CPU writes to the UMAS. A CPU write to the UMAS generates a snoop to the FPGA cache.
The FPGA responds to the snoop and marks the line invalid. The CPU write request completes, and the data become globally visible. A snoop in the UMAS address range triggers the Monitoring Agent (MA), which in turn sends a read request to the CPU for the cache line and optionally sends a UMsg Hint (UMsgH) to the AFU. When the read request completes, a UMsg with 64B data is sent to the AFU.

Figure 3-2. UMsg Initialization and Usage Flow
(Flow: Initialization — software sets up the UMAS in pinned memory and informs the FPGA of the UMAS location. Usage — the CPU writes to the UMAS, causing a snoop to the FPGA; for ultra-low latency the snoop itself is used as a UMsgH; the FPGA then gets the read data, and the snoop plus read data is sent to the AFU as a UMsg with 64B data.)

Functionally, UMsg is equivalent to a spin loop or a monitor/mwait instruction on an Intel Xeon processor. Some key characteristics of UMsgs:

1. Just as spin loops to different addresses in a multi-threaded application have no relative ordering guarantee, UMsgs to different addresses have no ordering guarantee between them.
2. Not every CPU write to a UMAS CL results in a corresponding UMsg. The AFU may miss an intermediate change in the value of a CL, but it is guaranteed to see the newest data in the CL. Again, it helps to think of this like a spin loop: if the producer thread updates the flag CL multiple times, the polling thread may miss an intermediate value, but it is guaranteed to see the newest value. Here is an example usage: software updates to a descriptor queue pointer may be mapped to a UMsg. The pointer is always expected to increment. The UMsg guarantees that the AFU sees the final value of the pointer; it may miss intermediate updates to the pointer, which is acceptable.
3. The UMsg uses the FPGA cache; as a result it can cause cache pollution, a situation in which a program unnecessarily loads data into the cache and causes other needed data to be evicted, thus degrading performance.
4. Because the CPU may exhibit false snooping, UMsgH should be treated as a hint. That is, you can start speculative execution or a pre-fetch based on UMsgH, but you should wait for the UMsg before committing the results.
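In RTL, the AFU distinguishes UMsg traffic from memory read responses by the response type on the C0 Rx channel. The following is a hedged sketch: umsg_valid, umsgh_valid, and umsg_data are hypothetical AFU signals, eRSP_UMSG and the rspvalid field name are assumptions modeled on ccip_if_pkg.sv, and the hint decode is a placeholder to fill in from your release of the package.

```systemverilog
// Hypothetical decode: the UMsg header carries an indicator distinguishing a
// hint (UMsgH) from a full UMsg; its exact name/position is release-specific.
function automatic logic umsg_is_hint(input t_ccip_c0_rspmemhdr h);
  return 1'b0;  // placeholder -- decode per your ccip_if_pkg.sv
endfunction

always_ff @(posedge pclk) begin
  umsg_valid  <= 1'b0;
  umsgh_valid <= 1'b0;
  if (pck_cp2af_srx.c0.rspvalid) begin
    case (pck_cp2af_srx.c0.hdr.resp_type)
      eRSP_UMSG: begin
        if (umsg_is_hint(pck_cp2af_srx.c0.hdr))
          umsgh_valid <= 1'b1;                    // hint only: may speculate
        else begin
          umsg_valid <= 1'b1;                     // full UMsg: 64B payload valid
          umsg_data  <= pck_cp2af_srx.c0.data;
        end
      end
      default: ;  // memory read responses are handled elsewhere
    endcase
  end
end
```

Per characteristic 4 above, umsgh_valid should only trigger speculative work; commit waits for umsg_valid.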

5. The UMsg provides the same latency as AFU read polling using RdLine_S, but it saves CCI-P channel bandwidth, which can instead be used for read traffic.

3.5 MMIO Cycles to I/O Memory

MMIO Write requests are posted — the AFU must not return a response. MMIO Read requests are non-posted — the AFU must return a response.

Key points:
- Read data lengths supported: 4B, 8B.
- Write data lengths supported: 4B, 8B.
- The AFU must support 8B MMIO accesses to I/O memory and the register file. 4B accesses are optional and can be avoided by coordinating with the software application developer.
- Maximum outstanding MMIO read requests: 64.
- MMIO read request timeout value: 512 pclk cycles.
- Maximum MMIO request rate: 1 request per 2 pclk cycles.
- MMIO reads to undefined AFU registers should still return a response.

The FIU makes an MMIO read request to the AFU over C0, using Rx signals. mmiordvalid indicates that the C0 Hdr contains an MMIO read request. The c0_reqmmiohdr structure provides a convenient mapping from flat bit-vector to MMIO read request fields {address, length, tid}. Then, the AFU drives a response over C2 using Tx signals. The C2 signal mmiordvalid indicates that the C2 Hdr and data fields contain the MMIO read response. The c2_rspmmiohdr.tid field must match the tid provided in c0_reqmmiohdr.tid; this is used to match the response against the request. It is illegal to split an 8B MMIO read request into two 4B MMIO read responses.

The FIU makes an MMIO write request to the AFU over C0, using Rx signals. mmiowrvalid indicates that the c0_reqmmiohdr structure is an MMIO write request and contains the I/O address to be written. The C0 data field contains the data to be written. To generate 64B MMIO writes to the AFU, use AVX-512 writes on MCP/DCP and later processors. It is not feasible to guarantee 64B MMIO writes from earlier processors.
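The MMIO read handshake might look as follows in an AFU. This is a sketch under assumptions: the header cast and lowercase type names mirror ccip_if_pkg.sv conventions but should be verified, and scratch_reg at offset 0x20 is a purely illustrative CSR. Note that undefined registers still return a response, as required above.

```systemverilog
// Sketch: servicing MMIO reads over C2. The tid from the incoming request
// header must be echoed in the response so the FIU can match them.
always_ff @(posedge pclk) begin
  // Local view of the C0 Rx header as an MMIO request header
  t_ccip_c0_reqmmiohdr mmiohdr;
  mmiohdr = t_ccip_c0_reqmmiohdr'(pck_cp2af_srx.c0.hdr);

  pck_af2cp_stx.c2.mmiordvalid <= 1'b0;
  if (pck_cp2af_srx.c0.mmiordvalid) begin
    pck_af2cp_stx.c2.hdr.tid     <= mmiohdr.tid;  // echo tid: matches rsp to req
    pck_af2cp_stx.c2.mmiordvalid <= 1'b1;
    case (mmiohdr.address)
      16'h0020: pck_af2cp_stx.c2.data <= scratch_reg;  // hypothetical example CSR
      default:  pck_af2cp_stx.c2.data <= 64'h0;        // undefined regs still respond
    endcase
  end
end
```

Responding in a single cycle like this easily meets the 512-pclk timeout; designs with deeper register pipelines must still bound their response latency below it.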

3.6 CCI-P Tx Signals

Code 3-2. Tx Interface Structure Inside ccip_if_pkg.sv

    typedef struct packed {
        t_if_ccip_c0_tx c0;
        t_if_ccip_c1_tx c1;
        t_if_ccip_c2_tx c2;
    } t_if_ccip_tx;

There are three Tx channels. The C0 and C1 Tx channels are used for memory requests and provide independent flow control: C0 carries memory read requests and C1 carries memory write requests. The C2 Tx channel returns MMIO read responses to the FIU. The CCI-P port guarantees to accept responses on C2; therefore, it has no flow control.

Code 3-3. Tx Channel Structure Inside ccip_if_pkg.sv

    // Channel 0 : Memory Reads
    // Corresponding AlmostFull inside t_if_ccip_rx.c0txalmfull
    typedef struct packed {
        t_ccip_c0_reqmemhdr hdr;   // Request Header
        logic               valid; // Request Valid
    } t_if_ccip_c0_tx;

    // Channel 1 : Memory Writes
    // Corresponding AlmostFull inside t_if_ccip_rx.c1txalmfull
    typedef struct packed {
        t_ccip_c1_reqmemhdr hdr;   // Request Header
        t_ccip_cldata       data;  // Request Data
        logic               valid; // Request Wr Valid
    } t_if_ccip_c1_tx;

    // Channel 2 : MMIO Read response
    typedef struct packed {
        t_ccip_c2_rspmmiohdr hdr;         // Response Header
        logic                mmiordvalid; // Response Read Valid
        t_ccip_mmiodata      data;        // Response Data
    } t_if_ccip_c2_tx;

Each Tx channel has a valid signal that qualifies the corresponding header and data signals within the structure. Table 3-2 describes the signals that make up the CCI-P Tx interface.

Table 3-2. Tx Channel Description

pck_af2cp_stx.c0.hdr (74b, Output): Channel 0 request header. Refer to Table 3-3, Tx Header Field Definitions.
pck_af2cp_stx.c0.valid (1b, Output): When set to 1, indicates the channel 0 request header is valid.
pck_cp2af_srx.c0txalmfull (1b, Input): When set to 1, Tx Channel 0 is almost full. After this signal is set, the AFU is allowed to send a maximum of 8 requests. When set to 0, the AFU can start sending requests immediately.
pck_af2cp_stx.c1.hdr (80b, Output): Channel 1 request header. Refer to Table 3-3, Tx Header Field Definitions.
pck_af2cp_stx.c1.data (512b, Output): Channel 1 data.
pck_af2cp_stx.c1.valid (1b, Output): When set to 1, indicates the channel 1 request header and data are valid.
pck_cp2af_srx.c1txalmfull (1b, Input): When set to 1, Tx Channel 1 is almost full. After this signal is set, the AFU is allowed to send a maximum of 8 requests or data. When set to 0, the AFU can start sending requests immediately.
pck_af2cp_stx.c2.hdr (9b, Output): Channel 2 response header. Refer to Table 3-3, Tx Header Field Definitions.
pck_af2cp_stx.c2.mmiordvalid (1b, Output): When set to 1, indicates the Channel 2 response header and data are valid.
pck_af2cp_stx.c2.data (64b, Output): Channel 2 data; MMIO read data that the AFU returns to the FIU. For 4B reads, data must be driven on bits [31:0]. For 8B reads, the AFU must drive one 8B data response; the response cannot be split into two 4B responses.
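The almost-full protocol above (up to 8 more requests after almfull asserts) can be modeled as a small credit tracker. This is a hypothetical C sketch of that rule; the names are illustrative and the real behavior is cycle-accurate RTL:

```c
#include <stdbool.h>

/* Hypothetical model of CCI-P Tx flow control: once c0txalmfull (or
 * c1txalmfull) asserts, the AFU may issue at most 8 more requests on
 * that channel; once it deasserts, the AFU may send freely again. */
struct tx_credit { bool almfull; int sent_since_almfull; };

static void credit_update(struct tx_credit *c, bool almfull_now) {
    if (almfull_now && !c->almfull)
        c->sent_since_almfull = 0;      /* almfull just asserted */
    c->almfull = almfull_now;
}

static bool credit_can_send(const struct tx_credit *c) {
    return !c->almfull || c->sent_since_almfull < 8;
}

static void credit_on_send(struct tx_credit *c) {
    if (c->almfull)
        c->sent_since_almfull++;        /* consume one of the 8 slots */
}
```

A typical AFU implements this as a counter gating its request-issue state machine.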

3.7 Tx Header Format

Table 3-3. Tx Header Field Definitions

mdata: Metadata: user-defined request ID that is returned unmodified from the request header to the response header. For multi-CL writes on C1 Tx, mdata is only valid for the header with sop=1.
tid: Transaction ID: the AFU must return the tid from the MMIO read request header in the response header. It is used to match the response against the request.
vc_sel: Virtual channel selected.
  2'h0 VA
  2'h1 VL0
  2'h2 VH0
  2'h3 VH1
  All CLs that form a multi-CL write request are routed over the same virtual channel (VC).
req_type: Request types listed in Table 3-4.
sop: Start of Packet, for multi-CL memory writes.
  1'b1 marks the first header. Must write in increasing address order.
  1'b0 marks subsequent headers.
cl_len: Length for memory requests.
  2'b00 64B
  2'b01 128B
  2'b11 256B
address: 64B-aligned physical address, that is, byte_address>>6. The address must be naturally aligned with regard to the cl_len field. For example, for cl_len=2'b01 the address must be divisible by 128B; similarly, for cl_len=2'b11 the address must be divisible by 256B.
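The cl_len encoding and natural-alignment rule above can be checked with a couple of helper functions. A minimal C sketch, assuming the CL address is the 64B-aligned address (byte_address >> 6); the helper names are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

/* cl_len encoding per Table 3-3: 2'b00 -> 1 CL, 2'b01 -> 2 CLs,
 * 2'b11 -> 4 CLs (2'b10 is not a legal encoding). */
static unsigned cl_len_to_cls(unsigned cl_len) {
    return cl_len + 1;   /* 0 -> 1, 1 -> 2, 3 -> 4 */
}

/* Natural alignment: a 2-CL request must start on a 2-CL boundary and
 * a 4-CL request on a 4-CL boundary, i.e. the CL address must be
 * divisible by the number of CLs in the payload. */
static bool addr_naturally_aligned(uint64_t cl_address, unsigned cl_len) {
    return (cl_address % cl_len_to_cls(cl_len)) == 0;
}
```

For instance, CL address 0x1040 is legal for any cl_len, while 0x1042 is legal for a 2-CL request but not a 4-CL request.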

Table 3-4. Tx Request Encodings and Header Formats

t_if_ccip_c0_tx: enum t_ccip_c0_req
  ereq_rdline_i (4'h0, no data): Memory read request with no intention to cache. C0 Memory Request Header; refer to Table 3-5.
  ereq_rdline_s (4'h1, no data): Memory read request with caching hint set to Shared. C0 Memory Request Header; refer to Table 3-5.

t_if_ccip_c1_tx: enum t_ccip_c1_req
  ereq_wrline_i (4'h0, data): Memory write request with no intention of keeping the data in the FPGA cache. C1 Memory Request Header; refer to Table 3-6.
  ereq_wrline_m (4'h1, data): Memory write request with caching hint set to Modified. C1 Memory Request Header; refer to Table 3-6.
  ereq_wrpush_i (4'h2, data): Memory write request with caching hint set to Invalid. The FIU writes the data into the processor's last level cache (LLC) with no intention of keeping the data in the FPGA cache. The LLC it writes to is always the LLC associated with the processor where the DRAM address is homed. C1 Memory Request Header; refer to Table 3-6.
  ereq_wrfence (4'h4, no data): Memory write fence. Fence Header; refer to Table 3-7.

t_if_ccip_c2_tx does not have a request type field.
  MMIO Rd (N.A., data): MMIO read response. MMIO Read Response Header; refer to Table 3-8.

All unused encodings are considered reserved.

Table 3-5. C0 Read Memory Request Header Format
Structure: t_ccip_c0_reqmemhdr

  Bit #    Bits  Field
  [73:72]  2     vc_sel
  [71:70]  2     RSVD
  [69:68]  2     cl_len
  [67:64]  4     req_type
  [63:58]  6     RSVD
  [57:16]  42    address
  [15:0]   16    mdata
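The bit layout of Table 3-5 can be exercised with a software packer. A hedged C sketch, assuming the GCC/Clang __int128 extension so the 74-bit header fits in one value (the function names are hypothetical; a real AFU builds this header as a SystemVerilog packed struct):

```c
#include <stdint.h>

/* Hypothetical packer for the 74-bit C0 read request header of
 * Table 3-5: vc_sel[73:72], cl_len[69:68], req_type[67:64],
 * address[57:16], mdata[15:0]. */
typedef unsigned __int128 u128;

static u128 pack_c0_reqmemhdr(unsigned vc_sel, unsigned cl_len,
                              unsigned req_type, uint64_t address,
                              unsigned mdata) {
    u128 h = 0;
    h |= (u128)(vc_sel   & 0x3)                << 72; /* [73:72] vc_sel   */
    h |= (u128)(cl_len   & 0x3)                << 68; /* [69:68] cl_len   */
    h |= (u128)(req_type & 0xF)                << 64; /* [67:64] req_type */
    h |= (u128)(address & ((1ULL << 42) - 1))  << 16; /* [57:16] address  */
    h |= (u128)(mdata    & 0xFFFF);                   /* [15:0]  mdata    */
    return h;
}

static unsigned unpack_mdata(u128 h)   { return (unsigned)(h & 0xFFFF); }
static uint64_t unpack_address(u128 h) { return (uint64_t)((h >> 16) & ((1ULL << 42) - 1)); }
```

A round trip through the packer is a quick check that a field map matches the table.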

Table 3-6. C1 Write Memory Request Header Format
Structure: t_ccip_c1_reqmemhdr

  Bit #    Bits  Field (SOP=1)  Field (SOP=0)
  [79:74]  6     RSVD           RSVD
  [73:72]  2     vc_sel         RSVD-DNC
  [71]     1     sop=1          sop=0
  [70]     1     RSVD           RSVD
  [69:68]  2     cl_len         RSVD-DNC
  [67:64]  4     req_type       req_type
  [63:58]  6     RSVD           RSVD
  [57:18]  40    address        RSVD-DNC
  [17:16]  2     address        address
  [15:0]   16    mdata          RSVD-DNC

Table 3-7. C1 Fence Header Format
Structure: t_ccip_c1_reqfencehdr

  Bit #    Bits  Field
  [79:74]  6     RSVD
  [73:72]  2     vc_sel
  [71:68]  4     RSVD
  [67:64]  4     req_type
  [63:16]  48    RSVD
  [15:0]   16    mdata

Table 3-8. C2 MMIO Response Header Format

  Bit #  Bits  Field
  [8:0]  9     tid

3.8 CCI-P Rx Signals

Code 3-4. Rx Interface Structure Inside ccip_if_pkg.sv

    typedef struct packed {
        logic c0txalmfull; // C0 Request Channel Almost Full
        logic c1txalmfull; // C1 Request Channel Almost Full
        t_if_ccip_c0_rx c0;
        t_if_ccip_c1_rx c1;
    } t_if_ccip_rx;

There are two Rx channels. Channel 0 interleaves memory responses, MMIO requests, and UMsgs. Channel 1 returns responses for AFU requests initiated on Tx Channel 1. The c0txalmfull and c1txalmfull signals are inputs to the AFU; although they are declared within the Rx signal structure, they logically belong to the Tx interface and were therefore described in the previous section.

Rx channels have no flow control. The AFU must accept responses for memory requests it generated, and must pre-allocate buffers before generating a memory request. The AFU must also accept MMIO requests.

Code 3-5. Rx Channel Structure Inside ccip_if_pkg.sv

    // Channel 0 : Memory Read responses, MMIO requests, UMsgs
    typedef struct packed {
        t_ccip_c0_rspmemhdr hdr;         // Response/Request Header
        t_ccip_cldata       data;        // Response Data
        logic               resp_valid;  // Memory Response Valid
        logic               mmiordvalid; // MMIO Read Request Valid
        logic               mmiowrvalid; // MMIO Write Request Valid
    } t_if_ccip_c0_rx;

    // Channel 1 : Memory Write responses
    typedef struct packed {
        t_ccip_c1_rspmemhdr hdr;       // Response Header
        logic               respvalid; // Response Valid
    } t_if_ccip_c1_rx;

Rx Channel 0 has separate valid signals for memory responses and MMIO requests; only one of these valid signals can be set in a cycle. MMIO requests have separate valid signals for MMIO read and MMIO write. When either mmiordvalid or mmiowrvalid is set, the message is an MMIO request and should be processed by casting t_if_ccip_c0_rx.hdr to t_ccip_c0_reqmmiohdr.

Table 3-9. Rx Channel Signal Description

pck_cp2af_srx.c0.hdr (28b, Input): Channel 0 response header or MMIO request header. Refer to Table 3-10, Rx Header Field Definitions.
pck_cp2af_srx.c0.data (512b, Input): Channel 0 data bus.
  Memory read response and UMsg: returns 64B data.
  MMIO write request: for a 4B write, data is driven on bits [31:0]; for an 8B write, data is driven on bits [63:0].
pck_cp2af_srx.c0.resp_valid (1b, Input): When set to 1, indicates the header and data on Channel 0 are valid. The header must be interpreted as a memory response; decode the resp_type field.
pck_cp2af_srx.c0.mmiordvalid (1b, Input): When set to 1, indicates an MMIO read request on Channel 0.
pck_cp2af_srx.c0.mmiowrvalid (1b, Input): When set to 1, indicates an MMIO write request on Channel 0.
pck_cp2af_srx.c1.hdr (28b, Input): Channel 1 response header. Refer to Table 3-10, Rx Header Field Definitions.
pck_cp2af_srx.c1.respvalid (1b, Input): When set to 1, indicates the header on Channel 1 is a valid response.
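The Rx Channel 0 decode rule (at most one valid per cycle, with the MMIO valids selecting the MMIO interpretation of the header) can be sketched as a small dispatcher. A hypothetical C model; the enum and function names are illustrative:

```c
#include <stdbool.h>

/* Hypothetical decoder for Rx Channel 0: the protocol guarantees at
 * most one of the three valid signals is set in a cycle. */
enum c0_rx_kind { C0_IDLE, C0_MEM_RSP, C0_MMIO_RD, C0_MMIO_WR };

static enum c0_rx_kind decode_c0_rx(bool resp_valid,
                                    bool mmiordvalid,
                                    bool mmiowrvalid) {
    if (resp_valid)  return C0_MEM_RSP;  /* decode the resp_type field    */
    if (mmiordvalid) return C0_MMIO_RD;  /* cast hdr to t_ccip_c0_reqmmiohdr */
    if (mmiowrvalid) return C0_MMIO_WR;  /* cast hdr to t_ccip_c0_reqmmiohdr */
    return C0_IDLE;
}
```

In RTL this is typically a priority-free mux, since the valids are mutually exclusive by contract.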

3.8.1 Rx Header and Rx Data Format

Table 3-10. Rx Header Field Definitions

mdata: Metadata: user-defined request ID, returned unmodified from the memory request header to the response header. For a multi-CL memory response, the same mdata is returned for each CL.
vc_used: Virtual channel used. When using VA, this field identifies the virtual channel selected for the request by the FIU. For other VCs it returns the request VC.
format: When using multi-CL memory write requests, the FIU may return a single response for the entire payload or a response per CL in the payload.
  1'b0 Unpacked write response: returns a response per CL. Look up the cl_num field to identify the cache line. NOTE: Unpacked write responses do not occur with MPF, as responses to the AFU are always packed.
  1'b1 Packed write response: returns a single response for the entire payload. The cl_num field gives the payload size, that is, 1 CL, 2 CLs, or 4 CLs.
cl_num:
  format=0: For a response with a >1 CL data payload, this field identifies the CL number.
    2'h0 1st CL (lowest address)
    2'h1 2nd CL
    2'h3 4th CL (highest address)
    Responses may be returned out of order.
  format=1: This field identifies the data payload size.
    2'h0 1 CL or 64B
    2'h1 2 CLs or 128B
    2'h3 4 CLs or 256B
hit_miss: Cache hit/miss status. The AFU can use this to generate fine-grained hit/miss statistics for various modules.
  1'h0 Cache Miss
  1'h1 Cache Hit
MMIO Length: Length for MMIO requests.
  2'h0 4B
  2'h1 8B
MMIO Address: Double word (DWORD) aligned MMIO address offset, that is, byte_address>>2.
UMsg ID: Identifies the CL corresponding to the UMsg.
UMsg Type: Two types of UMsg are supported.
  1'b1 UMsgH (hint) without data
  1'b0 UMsg with data
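The format/cl_num pairing above determines how many cache lines a single write response acknowledges. A hedged C sketch of that interpretation (helper names are hypothetical):

```c
#include <stdbool.h>

/* cl_num payload-size encoding when format=1, per Table 3-10:
 * 2'h0 -> 1 CL, 2'h1 -> 2 CLs, 2'h3 -> 4 CLs. */
static unsigned cl_num_to_cls(unsigned cl_num) {
    return cl_num + 1;
}

/* Number of cache lines acknowledged by one write response:
 * a packed response (format=1) covers the whole payload, while an
 * unpacked response (format=0) covers exactly one CL. */
static unsigned cls_acked(bool format, unsigned cl_num) {
    return format ? cl_num_to_cls(cl_num) : 1;
}
```

An AFU tracking outstanding writes would decrement its counter by cls_acked per response rather than by one.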

Table 3-11. AFU Rx Response Encodings and Channel Mappings

t_if_ccip_c0_rx: enum t_ccip_c0_rsp
  ersp_rdline (4'h0, data): Memory read response. Memory Response Header; refer to Table 3-12. Qualified with c0.resp_valid.
  MMIO Read (N.A., no data): MMIO Request Header; refer to Table 3-13. Qualified with c0.mmiordvalid.
  MMIO Write (N.A., data): MMIO Request Header; refer to Table 3-13. Qualified with c0.mmiowrvalid.
  ersp_umsg (4'h4, with or without data): UMsg Response Header; refer to Table 3-15. Qualified with c0.resp_valid.

t_if_ccip_c1_rx: enum t_ccip_c1_rsp
  ersp_wrline (4'h0, no data): Memory write response. Memory Response Header; refer to Table 3-14. Qualified with c1.respvalid.
  ersp_wrfence (4'h4, no data): Write fence response. WrFence Response Header; refer to Table 3-16. Qualified with c1.respvalid.

Table 3-12. C0 Memory Read Response Header Format
Structure: t_ccip_c0_rspmemhdr

  Bit #    Bits  Field
  [27:26]  2     vc_used
  [25]     1     RSVD
  [24]     1     hit_miss
  [23:22]  2     RSVD
  [21:20]  2     cl_num
  [19:16]  4     resp_type
  [15:0]   16    mdata

Table 3-13. MMIO Request Header Format

  Bit #    Bits  Field
  [27:12]  16    address
  [11:10]  2     length
  [9]      1     RSVD
  [8:0]    9     tid

Table 3-14. C1 Memory Write Response Header Format
Structure: t_ccip_c1_rspmemhdr

  Bit #    Bits  Field
  [27:26]  2     vc_used
  [25]     1     RSVD
  [24]     1     hit_miss
  [23]     1     format
  [22]     1     RSVD
  [21:20]  2     cl_num
  [19:16]  4     resp_type
  [15:0]   16    mdata

Table 3-15. UMsg Header Format

  Bit #    Bits  Field
  [27:20]  8     RSVD
  [19:16]  4     resp_type
  [15]     1     UMsg Type
  [14:3]   12    RSVD
  [2:0]    3     UMsg ID

Table 3-16. WrFence Header Format
Structure: t_ccip_c1_rspfencehdr

  Bit #    Bits  Field
  [27:20]  8     RSVD
  [19:16]  4     resp_type
  [15:0]   16    mdata

3.9 Multi-Cache Line Memory Requests

To achieve the highest link efficiency, pack memory requests into large transfer sizes using multi-CL requests. Multi-CL memory requests have the following characteristics:
- The highest memory bandwidth is achieved with a data payload of 4 CLs.
- A multi-CL memory write request must always begin with the lowest address. SOP=1 in c1_reqmemhdr marks the first CL, and all subsequent headers in the multi-CL request must drive the corresponding CL address.
- An N-CL memory write request takes N cycles on Channel 1. It is legal to have bubbles between the cycles that form a multi-CL request, but one request cannot be interleaved with another request. It is illegal to start a new request without completing the entire data payload of a multi-CL write request.
- The FIU guarantees to complete multi-CL VA requests on a single VC.

The memory request address must be naturally aligned. A 2-CL request must start on a 2-CL boundary; its CL address must be divisible by 2. A 4-CL request must be aligned on a 4-CL boundary; its CL address must be divisible by 4.

Figure 3-3 is an example of a multi-CL memory write request.

Figure 3-3. Multi-CL Memory Request
[Waveform: pck_af2cp_stx.c1.valid qualifies data beats D0 through D8 over successive pclk cycles, carrying WrLine_I and WrLine_M requests across the VA, VL0, VH0, and VH1 virtual channels. Within each request, sop is 1 on the first beat only, cl_len and vc_sel are held, addr[41:2] identifies the request (h1040, h1041, h1043, h1044), addr[1:0] increments per beat, and mdata (h10 through h13) is driven with the sop=1 header.]

Figure 3-4 is an example of memory write response cycles. For unpacked responses, the individual CLs may return out of order.

Figure 3-4. Multi-CL Memory Write Responses
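The beat-by-beat rules above (sop on the first header only, incrementing CL addresses, mdata meaningful only with sop=1) can be sketched as a header generator. A hypothetical C model; the struct and function names are illustrative, not from the spec:

```c
#include <stdint.h>

/* Hypothetical model of the per-beat headers of a multi-CL write.
 * cl_len uses the CCI-P encoding (0 -> 1 CL, 1 -> 2 CLs, 3 -> 4 CLs). */
struct c1_beat {
    int      sop;        /* 1 on the first beat only          */
    uint64_t cl_address; /* increments by one CL per beat     */
    unsigned cl_len;     /* driven on every beat              */
    unsigned mdata;      /* meaningful only when sop=1        */
};

static unsigned make_multi_cl_write(struct c1_beat *beats,
                                    uint64_t base_cl_address,
                                    unsigned cl_len, unsigned mdata) {
    unsigned n = cl_len + 1;   /* number of beats in the request */
    for (unsigned i = 0; i < n; i++) {
        beats[i].sop        = (i == 0);
        beats[i].cl_address = base_cl_address + i;
        beats[i].cl_len     = cl_len;
        beats[i].mdata      = mdata;
    }
    return n;
}
```

The base address passed in must already satisfy the natural-alignment rule for the chosen cl_len.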

Figure 3-5 is an example of a memory read response cycle. A read response can be reordered within itself; that is, there is no guaranteed ordering between the individual CLs of a multi-CL read. All CLs within a multi-CL response have the same mdata and the same vc_used; the individual CLs of a multi-CL read are identified using the cl_num field.

Figure 3-5. Multi-CL Memory Read Responses

3.10 Additional Control Signals

Unless otherwise mentioned, all signals are active high.

Table. Clock and Reset

pck_cp2af_softreset (1b, Input): Synchronous, ACTIVE HIGH soft reset. When set to 1, the AFU must reset all logic. The minimum reset pulse width is 256 pclk cycles. All outstanding CCI-P requests are flushed before soft reset is de-asserted. A soft reset does not reset the FIU.
pclk (1b, Input): Primary interface clock. All CCI-P interface signals are synchronous to this clock. The clock frequency is listed in Section
pclkdiv2 (1b, Input): Synchronous and in phase with pclk; 0.5x the clock frequency.
pclkdiv4 (1b, Input): Synchronous and in phase with pclk; 0.25x the clock frequency.

uclk_usr (1b, Input): The user-defined clock, not synchronous with pclk. The AFU must synchronize signals to the pclk domain before driving the CCI-P interface. Default frequency is MHz. The Intel Quartus partial reconfiguration flow does not allow PLLs to be instantiated in the reconfigurable region (that is, the AFU). The AFU load utility programs the user-defined clock frequency before de-asserting pck_cp2af_softreset.
uclk_usrdiv2 (1b, Input): Synchronous with uclk_usr and 0.5x its frequency.
pck_cp2af_pwrstate (2b, Input): Indicates the current AFU power state request. In response, the AFU must attempt to reduce its power consumption. If sufficient power reduction is not achieved, the AFU may be reset.
  2'h0 AP0: Normal operation mode
  2'h1 AP1: Request for 50% power reduction
  2'h2 Reserved, illegal
  2'h3 AP2: Request for 90% power reduction
  When pck_cp2af_pwrstate is set to AP1, the FIU starts throttling the memory request path to achieve a 50% throughput reduction. The AFU is likewise expected to reduce its power utilization to 50% by throttling back accesses to FPGA internal memory resources and its compute engines. Similarly, upon transition to AP2, the FIU throttles the memory request paths to achieve a 90% throughput reduction relative to the normal state, and the AFU in turn is expected to reduce its power utilization to 90%.
pck_cp2af_error (1b, Input): A CCI-P protocol error has been detected and logged in the PORT Error register. This register is visible to the AFU and can be used as a trigger for signal taps. When such an error is detected, the CCI-P interface stops accepting new requests and sets AlmFull to 1. There is no expectation that outstanding requests complete. The AFU is not reset.


AN 690: PCI Express DMA Reference Design for Stratix V Devices

AN 690: PCI Express DMA Reference Design for Stratix V Devices AN 690: PCI Express DMA Reference Design for Stratix V Devices an690-1.0 Subscribe The PCI Express Avalon Memory-Mapped (Avalon-MM) DMA Reference Design highlights the performance of the Avalon-MM 256-Bit

More information

Hardware-Assisted Mediated Pass-Through with VFIO. Kevin Tian Principal Engineer, Intel

Hardware-Assisted Mediated Pass-Through with VFIO. Kevin Tian Principal Engineer, Intel Hardware-Assisted Mediated Pass-Through with VFIO Kevin Tian Principal Engineer, Intel 1 Legal Disclaimer No license (express or implied, by estoppel or otherwise) to any intellectual property rights is

More information

PCI Express Multi-Channel DMA Interface

PCI Express Multi-Channel DMA Interface 2014.12.15 UG-01160 Subscribe The PCI Express DMA Multi-Channel Controller Example Design provides multi-channel support for the Stratix V Avalon Memory-Mapped (Avalon-MM) DMA for PCI Express IP Core.

More information

Intel Compute Card Slot Design Overview

Intel Compute Card Slot Design Overview + Intel Compute Card Slot Design Overview Revision Number 1.1 May 14, 2018 Disclaimer You may not use or facilitate the use of this document in connection with any infringement or other legal analysis

More information

PCI Express*: Migrating to Intel Stratix 10 Devices for the Avalon Streaming Interface

PCI Express*: Migrating to Intel Stratix 10 Devices for the Avalon Streaming Interface PCI Express*: Migrating to Intel Stratix 10 Devices for the Avalon Streaming Interface AN791 2017.05.08 Last updated for Intel Quartus Prime Design Suite: Quartus Prime Pro v17.1 Stratix 10 Editions Subscribe

More information

6th Generation Intel Core Processor Series

6th Generation Intel Core Processor Series 6th Generation Intel Core Processor Series Application Power Guidelines Addendum Supporting the 6th Generation Intel Core Processor Series Based on the S-Processor Lines August 2015 Document Number: 332854-001US

More information

Intel Speed Select Technology Base Frequency - Enhancing Performance

Intel Speed Select Technology Base Frequency - Enhancing Performance Intel Speed Select Technology Base Frequency - Enhancing Performance Application Note April 2019 Document Number: 338928-001 You may not use or facilitate the use of this document in connection with any

More information

SerialLite III Streaming IP Core Design Example User Guide for Intel Arria 10 Devices

SerialLite III Streaming IP Core Design Example User Guide for Intel Arria 10 Devices IP Core Design Example User Guide for Intel Arria 10 Devices Updated for Intel Quartus Prime Design Suite: 17.1 Subscribe Send Feedback Latest document on the web: PDF HTML Contents Contents 1 Quick Start

More information

Enhanced Serial Peripheral Interface (espi)

Enhanced Serial Peripheral Interface (espi) Enhanced Serial Peripheral Interface (espi) Addendum for Server Platforms December 2013 Revision 0.7 329957 0BIntroduction Intel hereby grants you a fully-paid, non-exclusive, non-transferable, worldwide,

More information

NVDIMM DSM Interface Example

NVDIMM DSM Interface Example Revision 1.3 December 2016 See the change bars associated with the following changes to this document: 1) Common _DSMs supported by all NVDIMMs have been removed from this document. 2) Changes to SMART

More information

Applying the Benefits of Network on a Chip Architecture to FPGA System Design

Applying the Benefits of Network on a Chip Architecture to FPGA System Design white paper Intel FPGA Applying the Benefits of on a Chip Architecture to FPGA System Design Authors Kent Orthner Senior Manager, Software and IP Intel Corporation Table of Contents Abstract...1 Introduction...1

More information

Creating PCI Express Links in Intel FPGAs

Creating PCI Express Links in Intel FPGAs Creating PCI Express Links in Intel FPGAs Course Description This course provides all necessary theoretical and practical know how to create PCI Express links in Intel FPGAs. The course goes into great

More information

Intel Visual Compute Accelerator Product Family

Intel Visual Compute Accelerator Product Family Intel Visual Compute Accelerator Product Family Release Notes for 2.2 release Rev 1.0 July 2018 Intel Server Products and Solutions Intel Visual Compute Accelerator Release Notes Document

More information

Intel Omni-Path Fabric Manager GUI Software

Intel Omni-Path Fabric Manager GUI Software Intel Omni-Path Fabric Manager GUI Software Release Notes for V10.7 Rev. 1.0 April 2018 Order No.: J95968-1.0 You may not use or facilitate the use of this document in connection with any infringement

More information

SerialLite III Streaming IP Core Design Example User Guide for Intel Stratix 10 Devices

SerialLite III Streaming IP Core Design Example User Guide for Intel Stratix 10 Devices SerialLite III Streaming IP Core Design Example User Guide for Intel Stratix 10 Devices Updated for Intel Quartus Prime Design Suite: 17.1 Stratix 10 ES Editions Subscribe Send Feedback Latest document

More information

Intel Stratix 10 Low Latency 40G Ethernet Design Example User Guide

Intel Stratix 10 Low Latency 40G Ethernet Design Example User Guide Intel Stratix 10 Low Latency 40G Ethernet Design Example User Guide Updated for Intel Quartus Prime Design Suite: 18.1 Subscribe Latest document on the web: PDF HTML Contents Contents 1. Quick Start Guide...

More information

Intel True Scale Fabric Switches Series

Intel True Scale Fabric Switches Series Intel True Scale Fabric Switches 12000 Series Doc. Number: H70235 Revision: 001US No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming Cache design review Let s say your code executes int x = 1; (Assume for simplicity x corresponds to the address 0x12345604

More information

Intel Storage System JBOD 2000S3 Product Family

Intel Storage System JBOD 2000S3 Product Family Intel Storage System JBOD 2000S3 Product Family SCSI Enclosure Services Programming Guide SES Version 3.0, Revision 1.8 Apr 2017 Intel Server Boards and Systems Headline

More information

Low Latency 100G Ethernet Intel Stratix 10 FPGA IP Design Example User Guide

Low Latency 100G Ethernet Intel Stratix 10 FPGA IP Design Example User Guide Low Latency 100G Ethernet Intel Stratix 10 FPGA IP Design Example User Guide Updated for Intel Quartus Prime Design Suite: 18.0 Subscribe Send Feedback Latest document on the web: PDF HTML Contents Contents

More information

Intel Unite Solution Intel Unite Plugin for WebEx*

Intel Unite Solution Intel Unite Plugin for WebEx* Intel Unite Solution Intel Unite Plugin for WebEx* Version 1.0 Legal Notices and Disclaimers All information provided here is subject to change without notice. Contact your Intel representative to obtain

More information

RapidIO TM Interconnect Specification Part 7: System and Device Inter-operability Specification

RapidIO TM Interconnect Specification Part 7: System and Device Inter-operability Specification RapidIO TM Interconnect Specification Part 7: System and Device Inter-operability Specification Rev. 1.3, 06/2005 Copyright RapidIO Trade Association RapidIO Trade Association Revision History Revision

More information

I/O virtualization. Jiang, Yunhong Yang, Xiaowei Software and Service Group 2009 虚拟化技术全国高校师资研讨班

I/O virtualization. Jiang, Yunhong Yang, Xiaowei Software and Service Group 2009 虚拟化技术全国高校师资研讨班 I/O virtualization Jiang, Yunhong Yang, Xiaowei 1 Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE,

More information

Crusoe Processor Model TM5800

Crusoe Processor Model TM5800 Model TM5800 Crusoe TM Processor Model TM5800 Features VLIW processor and x86 Code Morphing TM software provide x86-compatible mobile platform solution Processors fabricated in latest 0.13µ process technology

More information

PCI-X Protocol Addendum to the PCI Local Bus Specification Revision 2.0a

PCI-X Protocol Addendum to the PCI Local Bus Specification Revision 2.0a PCI-X Protocol Addendum to the PCI Local Bus Specification Revision 2.0a July 29, 2002July 22, 2003 REVISION REVISION HISTORY DATE 1.0 Initial release. 9/22/99 1.0a Clarifications and typographical corrections.

More information

Stanislav Bratanov; Roman Belenov; Ludmila Pakhomova 4/27/2015

Stanislav Bratanov; Roman Belenov; Ludmila Pakhomova 4/27/2015 Stanislav Bratanov; Roman Belenov; Ludmila Pakhomova 4/27/2015 What is Intel Processor Trace? Intel Processor Trace (Intel PT) provides hardware a means to trace branching, transaction, and timing information

More information

Intel Unite Solution. Plugin Guide for Protected Guest Access

Intel Unite Solution. Plugin Guide for Protected Guest Access Intel Unite Solution Plugin Guide for Protected Guest Access June 2016 Legal Disclaimers & Copyrights All information provided here is subject to change without notice. Contact your Intel representative

More information

OpenCL* and Microsoft DirectX* Video Acceleration Surface Sharing

OpenCL* and Microsoft DirectX* Video Acceleration Surface Sharing OpenCL* and Microsoft DirectX* Video Acceleration Surface Sharing Intel SDK for OpenCL* Applications Sample Documentation Copyright 2010 2012 Intel Corporation All Rights Reserved Document Number: 327281-001US

More information

ENVISION TECHNOLOGY CONFERENCE. Functional intel (ia) BLA PARTHAS, INTEL PLATFORM ARCHITECT

ENVISION TECHNOLOGY CONFERENCE. Functional intel (ia) BLA PARTHAS, INTEL PLATFORM ARCHITECT ENVISION TECHNOLOGY CONFERENCE Functional Safety @ intel (ia) BLA PARTHAS, INTEL PLATFORM ARCHITECT Legal Notices & Disclaimers This document contains information on products, services and/or processes

More information

OpenCL* Device Fission for CPU Performance

OpenCL* Device Fission for CPU Performance OpenCL* Device Fission for CPU Performance Summary Device fission is an addition to the OpenCL* specification that gives more power and control to OpenCL programmers over managing which computational units

More information

PCI-SIG ENGINEERING CHANGE NOTICE

PCI-SIG ENGINEERING CHANGE NOTICE PCI-SIG ENGINEERING CHANGE NOTICE TITLE: Lightweight Notification (LN) Protocol DATE: Introduced: Jan 27, 2009; Last Updated Oct 2, 2011 Protocol Workgroup Final Approval: October 6, 2011 AFFECTED DOCUMENT:

More information

Intel MAX 10 User Flash Memory User Guide

Intel MAX 10 User Flash Memory User Guide Intel MAX 10 User Flash Memory User Guide Updated for Intel Quartus Prime Design Suite: 18.0 Subscribe Send Feedback Latest document on the web: PDF HTML Contents Contents 1. Intel MAX 10 User Flash Memory

More information

Intel X48 Express Chipset Memory Controller Hub (MCH)

Intel X48 Express Chipset Memory Controller Hub (MCH) Intel X48 Express Chipset Memory Controller Hub (MCH) Specification Update March 2008 Document Number: 319123-001 Legal Lines and Disclaimers INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH

More information

Intel Visual Compute Accelerator Product Family

Intel Visual Compute Accelerator Product Family Intel Visual Compute Accelerator Product Family Release Notes for 2.1 release Rev 1.0 May 2018 Intel Server Products and Solutions Document Revision History Date Revision Changes May 2018

More information

LogiCORE IP AXI DMA v6.01.a

LogiCORE IP AXI DMA v6.01.a LogiCORE IP AXI DMA v6.01.a Product Guide Table of Contents SECTION I: SUMMARY IP Facts Chapter 1: Overview Typical System Interconnect......................................................... 8 Operating

More information

Intel s Architecture for NFV

Intel s Architecture for NFV Intel s Architecture for NFV Evolution from specialized technology to mainstream programming Net Futures 2015 Network applications Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION

More information

Jomar Silva Technical Evangelist

Jomar Silva Technical Evangelist Jomar Silva Technical Evangelist Agenda Introduction Intel Graphics Performance Analyzers: what is it, where do I get it, and how do I use it? Intel GPA with VR What devices can I use Intel GPA with and

More information

Intel Firmware Support Package (Intel FSP) for Intel Xeon Processor D Product Family (formerly Broadwell-DE), Gold 001

Intel Firmware Support Package (Intel FSP) for Intel Xeon Processor D Product Family (formerly Broadwell-DE), Gold 001 Intel Firmware Support Package (Intel FSP) for Intel Xeon Processor D Product Family (formerly Broadwell-DE), Gold 001 Release Notes February 2016 You may not use or facilitate the use of this document

More information

Interlaken Look-Aside Protocol Definition

Interlaken Look-Aside Protocol Definition Interlaken Look-Aside Protocol Definition Contents Terms and Conditions This document has been developed with input from a variety of companies, including members of the Interlaken Alliance, all of which

More information

i960 VH Embedded-PCI Processor

i960 VH Embedded-PCI Processor i960 VH Embedded-PCI Processor Specification Update November 1998 Notice: The 80960VH may contain design defects or errors known as errata. Characterized errata that may cause 80960VH s behavior to deviate

More information

High Bandwidth Memory (HBM2) Interface Intel FPGA IP Design Example User Guide

High Bandwidth Memory (HBM2) Interface Intel FPGA IP Design Example User Guide High Bandwidth Memory (HBM2) Interface Intel FPGA IP Design Example Updated for Intel Quartus Prime Design Suite: 18.1.1 Subscribe Latest document on the web: PDF HTML Contents Contents 1. High Bandwidth

More information

Intel Unite Plugin for Logitech GROUP* and Logitech CONNECT* Devices INSTALLATION AND USER GUIDE

Intel Unite Plugin for Logitech GROUP* and Logitech CONNECT* Devices INSTALLATION AND USER GUIDE Intel Unite Plugin for Logitech GROUP* and Logitech CONNECT* Devices INSTALLATION AND USER GUIDE November 2017 You may not use or facilitate the use of this document in connection with any infringement

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Marble House The Knife (Silent Shout) Before starting The Knife, we were working

More information

Intel Omni-Path Fabric Manager GUI Software

Intel Omni-Path Fabric Manager GUI Software Intel Omni-Path Fabric Manager GUI Software Release Notes for V10.9.0 Rev. 1.0 December 2018 Doc. No.: K38339, Rev.: 1.0 You may not use or facilitate the use of this document in connection with any infringement

More information

Intel Atom Processor Based Platform Technologies. Intelligent Systems Group Intel Corporation

Intel Atom Processor Based Platform Technologies. Intelligent Systems Group Intel Corporation Intel Atom Processor Based Platform Technologies Intelligent Systems Group Intel Corporation Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS

More information

Movidius Neural Compute Stick

Movidius Neural Compute Stick Movidius Neural Compute Stick You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to

More information

LOW PIN COUNT (LPC) INTERFACE SPECIFICATION

LOW PIN COUNT (LPC) INTERFACE SPECIFICATION LOW PIN COUNT (LPC) INTERFACE SPECIFICATION Revision 1.0 September 29, 1997 Intel may have patents and/or patent applications related to the various Low Pin Count interfaces described in the Low Pin Count

More information

Michael Kinsner, Dirk Seynhaeve IWOCL 2018

Michael Kinsner, Dirk Seynhaeve IWOCL 2018 Michael Kinsner, Dirk Seynhaeve IWOCL 2018 Topics 1. FPGA overview 2. Motivating application classes 3. Host pipes 4. Some data 2 FPGA: Fine-grained Massive Parallelism Intel Stratix 10 FPGA: Over 5 Million

More information

DRAM and Storage-Class Memory (SCM) Overview

DRAM and Storage-Class Memory (SCM) Overview Page 1 of 7 DRAM and Storage-Class Memory (SCM) Overview Introduction/Motivation Looking forward, volatile and non-volatile memory will play a much greater role in future infrastructure solutions. Figure

More information