Michael Adler 2017/05


FPGA Design Philosophy

Standard platform code should:
- Provide base services and semantics needed by all applications
- Consume as little FPGA area as possible

FPGAs overcome a huge frequency disadvantage relative to CPUs through application-specific, spatial solutions. Application-specific memory semantics are part of this solution:
- Instantiate exactly the memory semantics required
- Avoid wasting FPGA resources on unnecessary memory management

CCI: Core Cache Interface

CCI provides a base platform memory interface: just reads and writes.
- Simple request/response interface
- Physical addresses
- No order guarantees: writes may bypass other writes, even to the same address
- Few guarantees of consistency between FPGA-generated reads and writes

These minimal requirements satisfy major classes of algorithms, e.g.:
- Double-buffered kernels that read from and write to different buffers
- Streaming kernels that read from one memory-mapped FIFO and write to another

Higher Level Memory Services and Semantics

Some applications need one or more of:
- Virtually addressed memory, for large contiguous buffers or for sharing pointers with software
- Ordered read responses
- Write/read memory consistency guarantees

Applications may not need all of these attributes.

MPF: Memory Properties Factory

MPF provides a common collection of memory semantic extensions to CCI. Applications instantiate only the semantics they require. Each MPF block is implemented as a CCI-to-CCI shim:
- Consume CCI requests
- Implement some feature (e.g. translate virtual addresses to physical)
- Produce transformed CCI requests

Application-specific memory hierarchies are formed by composing MPF shims.

Abstract Architecture

- AFU: Accelerated Function Unit (user IP), also known as the Green Bitstream
- FIU: FPGA Interface Unit (Intel-provided), the Blue Bitstream
- The FPGA connects to system memory via one or more physical channels
- CCI exposes the physical channels as a single, multiplexed read/write memory interface
- The AFU may instantiate MPF as a CCI-to-CCI bridge, maintaining the same interface while adding new semantics

[Diagrams: AFU <-> CCI <-> FIU <-> physical channels <-> system memory, shown first without and then with MPF inserted between the AFU and the FIU]

Base CCI

- Various clocks, reset, power control
- One request struct: pck_af2cp_sTx (type t_if_ccip_Tx)
- One response struct: pck_cp2af_sRx (type t_if_ccip_Rx)

Request and response structures contain:
- One channel for memory reads (c0)
- One channel for memory writes (c1)

Requests may target specific system physical channels or command the FIU to choose the least busy channel. A minimal request sketch follows.
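As a rough illustration, a single-beat read on c0 might be driven as follows. This is a sketch assuming the standard ccip_if_pkg types named above; start_read and buf_paddr are hypothetical AFU signals.

    // Minimal sketch: drive a single-beat read on the c0 request channel.
    t_ccip_c0_ReqMemHdr rd_hdr;

    always_comb
    begin
        rd_hdr = t_ccip_c0_ReqMemHdr'(0);
        rd_hdr.req_type = eREQ_RDLINE_I;   // read without FPGA cache hint
        rd_hdr.address  = buf_paddr;       // physical line address (hypothetical)
        rd_hdr.vc_sel   = eVC_VA;          // let the FIU pick a channel
        rd_hdr.cl_len   = eCL_LEN_1;       // single beat
    end

    always_ff @(posedge pClk)
    begin
        pck_af2cp_sTx.c0.hdr   <= rd_hdr;
        pck_af2cp_sTx.c0.valid <= start_read && !pck_cp2af_sRx.c0TxAlmFull;
    end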

Physical Channels

- Multiplexed as a single bus in CCI-P
- Addressable independently using the vc_sel field in the request header
- Ganged together by the blue bitstream as a single high-bandwidth logical channel with the eVC_VA tag

CCI-P channels have deliberate races: there is no guarantee that a read on one channel will return the result of a write on a different channel, even when the write has already returned an ACK! This is consistent with the design philosophy: the base platform supports only universal requirements, and races are no problem when streaming or double buffering. A write fence on the eVC_VA channel synchronizes all channels but is too slow for frequent use, as sketched below.
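A hedged sketch of emitting that write fence on c1, assuming the ccip_if_pkg names used above; fence_req is a hypothetical trigger signal.

    // Sketch: synchronize all physical channels with a write fence.
    t_ccip_c1_ReqMemHdr fence_hdr;

    always_comb
    begin
        fence_hdr = t_ccip_c1_ReqMemHdr'(0);
        fence_hdr.req_type = eREQ_WRFENCE;
        fence_hdr.vc_sel   = eVC_VA;       // fence across all channels
    end

    always_ff @(posedge pClk)
    begin
        pck_af2cp_sTx.c1.hdr   <= fence_hdr;
        pck_af2cp_sTx.c1.valid <= fence_req && !pck_cp2af_sRx.c1TxAlmFull;
    end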

CCI Memory Spaces Without IOMMU

- AAL allocates shared memory in process virtual space
- Physical memory is pinned to I/O space
- The FIU performs no address translation: the AFU requests host physical addresses and the FIU emits host physical addresses

Virtual addresses in the AFU therefore require translation. MPF VTP (Virtual to Physical) acts as a TLB:
- Accepts process virtual addresses from the AFU
- Translates them to host physical addresses

[Diagram: on the CPU, AAL in process virtual space over host physical memory; on the FPGA, the AFU issues process virtual addresses through MPF VTP to the FIU, which emits host physical addresses]

CCI Memory Spaces With IOMMU

- The host kernel defines a guest virtual machine; the guest physical address space protects host memory
- The guest can write only to AAL-managed memory
- The FIU translates guest physical to host physical by querying the IOMMU

Ideally, guest physical == process virtual, which would require no AFU translation. We can't have it yet: kernels don't support it, and hypervisors are designed for guest kernels that manage disjoint guest virtual spaces in virtual machines. Instead, MPF VTP translates process virtual to guest physical.

[Diagram: on the CPU, AAL in process virtual space, with the IOMMU between guest physical and host physical; on the FPGA, the AFU issues process virtual addresses through MPF VTP to the FIU, which emits guest physical addresses]

MPF Composable Shims

All MPF shims may be enabled or disabled independently:
- VTP: Virtual to physical address translation
- ROB: Reorder buffer to sort read responses and return them in request order
- WRO: Intra-line write/read ordering
- VC Map: Map requests to system memory channels explicitly
- PWRITE: Partial (masked) write emulation using read-modify-write

Note: Some shims depend on other shims, e.g. WRO depends on VC Map and PWRITE depends on WRO.

VTP: Virtual to Physical

- Resembles a traditional TLB
- Separate translation tables for 4KB and 2MB pages
- Level 1: 512-entry direct-mapped TLB, one per request channel
- Level 2: 512-entry four-way set-associative TLB, shared across all channels
- Hardware, caching page table walker
- Size and associativity of each table is configurable
- No prefetcher: we have not found a need
- L2 misses are rare with 2MB pages: MPF's caching page table walker generates only one memory read per 16MB of memory with stride-one streams and 2MB pages
- Planning to add support for 1GB pages for SKX
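For intuition about the two page sizes, here is an illustrative breakdown (not MPF source) of how a line-granularity address splits for translation. CCI-P addresses name 64-byte lines, so a 4KB page holds 2^6 lines and a 2MB page holds 2^15 lines.

    // Illustrative only: virtual page number vs. line offset per page size.
    logic [41:0] vaddr;                   // line-granularity virtual address

    wire [35:0] vpn_4kb = vaddr[41:6];    // 4KB virtual page number
    wire [5:0]  off_4kb = vaddr[5:0];     // line offset within a 4KB page

    wire [26:0] vpn_2mb = vaddr[41:15];   // 2MB virtual page number
    wire [14:0] off_2mb = vaddr[14:0];    // line offset within a 2MB page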

VTP Software Component

VTP maintains its own page table, shared with the VTP FPGA shim. Applications allocate and deallocate memory with the VTP software service:
- mpfVtpBufferAllocate()
- mpfVtpBufferFree()

The VTP page table is updated as a side effect of allocation. Allocation and deallocation may occur at any time during a run.

See test/test-mpf/base/sw/cci_test.h and:
- fpga_svc_wrapper.cpp (new driver version)
- aal_svc_wrapper.cpp (AAL version)

ROB: Reorder Buffer

CCI returns read responses unordered: the AFU tags read requests with a unique number and the FIU returns the tag with the response (see the sketch below). The ROB sorts read responses:
- Eliminates the need for AFU tagging
- CCI reads behave more like FPGA-attached DRAM
- The ROB is sized to enable maximum bandwidth
- The ROB adds latency, especially when physical channels have different latencies
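A sketch of the bookkeeping CCI requires without a ROB: the AFU places an index in mdata and matches it when the response arrives in arbitrary order. resp_buf and resp_valid are hypothetical AFU-side structures; only the low mdata bits are used as the index.

    // Match unordered read responses back to their requests via mdata.
    logic [511:0] resp_buf   [0:255];
    logic         resp_valid [0:255];

    always_ff @(posedge pClk)
    begin
        if (pck_cp2af_sRx.c0.rspValid &&
            (pck_cp2af_sRx.c0.hdr.resp_type == eRSP_RDLINE))
        begin
            // Responses arrive in any order; mdata says which request this is.
            resp_buf[pck_cp2af_sRx.c0.hdr.mdata[7:0]]   <= pck_cp2af_sRx.c0.data;
            resp_valid[pck_cp2af_sRx.c0.hdr.mdata[7:0]] <= 1'b1;
        end
    end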

WRO: Write/Read Ordering

CCI provides no intra- or inter-line order guarantees; even conflicting writes are unordered. CCI leaves synchronization to the AFU: track write ACKs or use fences. Fences are slow! That is no problem for kernels that:
- Maintain discrete read and write spaces
- Write each address only once
- Emit fences infrequently

WRO avoids fences when they would otherwise be required frequently.

WRO: Write/Read Ordering (continued)

WRO guarantees that requests within a line complete in order:
- Writes to the same line retire in order
- Reads always return the most recently written value
- Reads have priority when arriving in the same cycle as a conflicting write
- Still no guarantees about inter-line consistency!

Write/read hazard detection is implemented as a collection of filters:
- CAMs would be too expensive to support the number of requests in flight
- Filters are sized to balance FPGA area against the rate of false conflicts
- Multiple reads to the same location are permitted
- Filter sizes are configurable

VC Map: Address-Based Host Channel Mapping

AFUs that enable WRO almost always require VC Map. CCI channels have deliberate races (see the Physical Channels slide); VC Map avoids inter-channel races:
- The AFU passes requests to MPF using eVC_VA, the same mechanism as CCI mapping
- VC Map selects explicit physical channels before routing requests to CCI
- Channel mapping is a function of a request's address: a given address is always mapped to the same channel (see the toy sketch below)
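The toy mapping below illustrates the key property, that channel selection is a pure function of the address; MPF's actual hash differs.

    // Illustrative only: the same line address always yields the same
    // channel index, so conflicting requests never race across channels.
    function automatic logic [1:0] toy_channel_map(input logic [41:0] addr);
        return addr[1:0] ^ addr[3:2] ^ addr[5:4];
    endfunction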

VC Map Optimizations

Channel mapping is surprisingly complicated:
- Optimal throughput is achieved only when requests are balanced across physical channels
- The optimal request-rate balance varies with the types and sizes of requests

VC Map dynamically responds to AFU-generated traffic, picking the optimal request rates to each physical channel. VC Map may choose to rebalance traffic as request patterns vary. To do so it must:
- Stop all AFU traffic by asserting Almost Full
- Wait for all current traffic to retire
- Emit a write fence

PWRITE: Partial Write Emulation

CCI currently provides no mechanism for masked writes. PWRITE emulates masked writes by:
- Reading the requested line
- Updating the masked bytes
- Writing the merged data

MPF extends the write request header with byte-mask bits, as sketched below. PWRITE does not lock the line; it is not an atomic operation! Conflicting CPU stores in the middle of a PWRITE sequence may be lost. WRO and VC Map may be used to guarantee order within the FPGA.
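A hedged sketch of requesting a masked write, assuming the pwrite fields in cci_mpf_if_pkg (field names may differ across MPF versions); wr_addr is a hypothetical virtual line address.

    // Sketch: ask PWRITE to update only the low 8 bytes of a line.
    t_cci_mpf_c1_ReqMemHdr wr_hdr;

    always_comb
    begin
        wr_hdr = cci_mpf_c1_genReqHdr(eREQ_WRLINE_I, wr_addr,
                                      t_cci_mdata'(0),
                                      cci_mpf_defaultReqHdrParams());

        // MPF performs the read-modify-write. Not atomic with respect
        // to concurrent CPU stores!
        wr_hdr.pwrite.isPartialWrite = 1'b1;
        wr_hdr.pwrite.mask = 64'h00000000000000FF;
    end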

Instantiating MPF in an AFU

- See test/test-mpf/test_mem_perf in the MPF distribution for a relatively simple streaming access example
- MPF uses SystemVerilog interfaces to represent CCI-P wires: a 1:1 mapping from CCI-P structs to cci_mpf_if buses
- MPF shims have multiple CCI-P buses: one toward the AFU, one toward the FIU
- Interfaces simplify arguments to MPF shims
- The MPF module ccip_wires_to_mpf() converts CCI-P wires to a cci_mpf_if

CCI-P Wires to MPF Interface

    //
    // Expose FIU as an MPF interface
    //
    cci_mpf_if fiu(.clk(pClk));

    // The CCI wires to MPF mapping connections have identical naming to
    // the standard AFU. The module exports an interface named "fiu".
    ccip_wires_to_mpf
      #(
        // All inputs and outputs in PR region (AFU) must be registered!
        .REGISTER_INPUTS(1),
        .REGISTER_OUTPUTS(1)
        )
      map_ifc
       (
        // All CCI-P wire names are passed in along with fiu
        .*
        );

Instantiate MPF

hw/rtl/cci_mpf.sv has extensive comments.

    cci_mpf_if afu(.clk(pClk));

    cci_mpf
      #(
        .SORT_READ_RESPONSES(1),
        .PRESERVE_WRITE_MDATA(0),
        .ENABLE_VTP(1),
        .ENABLE_VC_MAP(0),
        .ENABLE_DYNAMIC_VC_MAPPING(1),
        .ENFORCE_WR_ORDER(0),
        .ENABLE_PARTIAL_WRITES(0),
        .DFH_MMIO_BASE_ADDR(MPF_DFH_MMIO_ADDR)
        )
      mpf
       (
        .clk(pClk),
        .fiu,
        .afu
        );

AFU Option #1: Expose MPF as CCI

See the use of mpf2af_sRxPort and af2mpf_sTxPort in sample/afu/ccip_mpf_nlb.sv. Note the MPF extension header bits that must be set, e.g.:

    // Treat all addresses as virtual.
    afu.c0Tx.hdr.ext.addrIsVirtual = 1'b1;

    // Enable eVC_VA to physical channel mapping. This will only
    // be triggered when ENABLE_VC_MAP is enabled.
    afu.c0Tx.hdr.ext.mapVAtoPhysChannel = 1'b1;

    // Enforce load/store and store/store ordering within lines.
    // This will only be triggered when ENFORCE_WR_ORDER is enabled.
    afu.c0Tx.hdr.ext.checkLoadStoreOrder = 1'b1;

AFU Option #2: Use the MPF Interface Directly

Examples: test/test-mpf/base/hw/rtl/cci_test_afu.sv and test/test-mpf/test_random/hw/rtl/test_random.sv. The MPF interface is defined in hw/rtl/cci-mpf-if/cci_mpf_if.vh.

AFU Option #2 Example

    t_cci_mpf_ReqMemHdrParams rd_params;
    t_cci_mpf_c0_ReqMemHdr rd_hdr;

    always_comb
    begin
        // Construct a request header
        rd_params = cci_mpf_defaultReqHdrParams();
        rd_params.checkLoadStoreOrder = enable_wro;
        rd_params.vc_sel = eVC_VA;
        rd_params.mapVAtoPhysChannel = 1'b1;

        rd_hdr = cci_mpf_c0_genReqHdr((rdline_mode_s ? eREQ_RDLINE_S : eREQ_RDLINE_I),
                                      rd_rand_addr,
                                      t_cci_mdata'(0),
                                      rd_params);
        rd_hdr.base.cl_len = rd_addr_num_beats;
    end

    always_ff @(posedge clk)
    begin
        // Write the request header (2nd argument sets the valid bit)
        fiu.c0Tx <= cci_mpf_genC0TxReadReq(rd_hdr,
                                           (state == STATE_RUN) && !c0TxAlmFull);

        if (reset)
        begin
            fiu.c0Tx.valid <= 1'b0;
        end
    end

Building the MPF Software Libraries

- Both the new FPGA driver and AAL are currently supported
- The CMake script builds each one only when header files from the FPGA driver or AAL are found
- Put FPGA/AAL header paths in the C/C++ include environment variables
- Put compiled FPGA/AAL library paths in LD_LIBRARY_PATH and LIBRARY_PATH

Use CMake (the build directory can be anywhere; just adjust ../ to match):

    cd <MPF Path>/sw
    mkdir build; cd build
    cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_INSTALL_PREFIX=<target dir> ../
    make; make install

MPF Libraries in Software

- Add MPF's installed include directory to C_INCLUDE_PATH and CPLUS_INCLUDE_PATH
- Add MPF's installed lib directory to LD_LIBRARY_PATH and LIBRARY_PATH
- The MPF install target may be the same tree as the base FPGA library or AAL

Building with MPF and the new FPGA library (libmpf.so):
- Link your application with libmpf.so (both hardware and ASE are supported)
- See test/test-mpf/base/sw/fpga_svc_wrapper.cpp for an example

Building with MPF and AAL (libmpf_aal.so):
- Load libmpf_aal dynamically using the AAL service discovery mechanism
- See test/test-mpf/base/sw/aal_svc_wrapper.cpp for an example

MPF Internal Configuration Options

Some applications, especially those with large memory footprints, may benefit from resizing MPF structures. hw/rtl/cci_mpf_config.vh defines and documents many options:
- Configure VTP TLB size and associativity
- Configure WRO hashing parameters

These options may be set without modifying source, e.g.:

    set_global_assignment -name VERILOG_MACRO "VTP_N_C0_TLB_2MB_SETS=1024"

MPF Primary Pipeline

[Diagram: CCI-P at 400 MHz on both sides. From the FIU (blue) through the MPF edge, VTP (with its TLB), WRO, and ROB to the AFU-side MPF edge (green), with write data bypassing the pipeline]

MPF Primary Pipeline (continued)

- Edge modules are always instantiated
- Write data bypasses the MPF pipeline to save area; Quartus automatically deletes unconsumed data wires in internal MPF interface objects
- The AFU edge validates header settings
- Multi-beat write requests are converted to a single representative request: the AFU edge forwards all write data beats to the FIU edge module, and the FIU edge reconstructs all beats from the representative request
- The FIU edge (EOP module) guarantees that all write responses are packed, so AFUs using MPF may ignore the c1Rx format header field
- MPF shims are instantiated as needed, based on configuration parameters

VTP Micro-Architecture

- L1 hits flow around L1 misses
- Separate direct-mapped L1 caches for reads and writes, 2MB and 4KB pages
- Shared set-associative L2 cache for reads and writes, with separate 2MB and 4KB tables
- The page table walker is not pipelined, but does cache recent page table lines
- Default sizes are optimized for M20K minimum depth and bandwidth
- Programs with some page locality should require no VTP tuning
- Programs with completely random access patterns over very large footprints may benefit from larger caches; see hw/rtl/cci_mpf_config.vh

Debugging Address Translation Failures

When VTP encounters an untranslatable address it halts. The VTP statistics software interface indicates the failure:
- numFailedTranslations will be non-zero
- ptWalkLastVAddr will hold the untranslatable virtual address

WRO Micro-Architecture: Ingress Buffer

Computes read/write conflict epochs using hashed filters:
- Ingress filters are small, tracking lifetimes only inside the request channels
- The WRO primary pipeline has separate, large filters, described on the next slide

Non-conflicting read and write pipelines flow independently. Independent flow is required to avoid starving multi-beat write requests:
- Multi-beat writes require multiple AFU request cycles
- Multi-beat reads require only one AFU request cycle
- Lock-step 4-line read/write requests would emit one write for every 4 reads

WRO Micro-Architecture: Primary Pipeline

- Hashed counting filter for reads, single-bit filter for writes (see the toy sketch below)
- Non-conflicting requests flow around blocked requests
- Conflicting requests loop back to the head of the primary pipeline
- New traffic is blocked only when the pipeline fills with conflicting requests
- Default depth (and latency) is 4 cycles; growing the pipeline adds latency but increases the capacity for flowing around blocked requests
- Hashing is configurable; see hw/rtl/cci_mpf_config.vh
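A toy sketch of the filter idea; MPF's real filters and hash functions differ, and line_addr is a hypothetical request address.

    // Hashed counting filter for reads, single-bit filter for writes.
    logic [3:0] rd_count [0:255];           // in-flight reads per hash bucket
    logic       wr_busy  [0:255];           // write outstanding per bucket

    // Hypothetical hash of the line address.
    wire [7:0] h = line_addr[7:0] ^ line_addr[15:8];

    // A new request conflicts if the opposite type may still be in flight.
    wire wr_conflict = (rd_count[h] != 4'd0) || wr_busy[h];
    wire rd_conflict = wr_busy[h];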

MPF Shim Latencies for READ Requests (Cycles)

Writes are similar but less latency sensitive.

    Shim                      | Request Cycles                       | Response Cycles
    --------------------------+--------------------------------------+----------------
    AFU Edge (mandatory)      | 0 when WRO disabled, 1 when enabled  | 0
    ROB (response order)      | 0                                    | 4 (minimum)
    VC Map                    | 1                                    | 0
    WRO (write/read order)    | 10 (minimum)                         | 0
    VTP (virtual to physical) | 5 (L1 hit, typical case)             | 0
    EOP (mandatory)           | 0                                    | 1
    FIU Edge (mandatory)      | 0                                    | 0