Michael Adler 2017/05


FPGA Design Philosophy

Standard platform code should:
- Provide base services and semantics needed by all applications
- Consume as little FPGA area as possible

FPGAs overcome a huge frequency disadvantage relative to CPUs through application-specific, spatial solutions. Application-specific memory semantics are part of this solution:
- Instantiate exactly the memory semantics required
- Avoid wasting FPGA resources on unnecessary memory management

CCI: Core Cache Interface

CCI provides a base platform memory interface: just reads and writes.
- Simple request/response interface
- Physical addresses
- No order guarantees: writes may bypass other writes, even to the same address
- Few guarantees of consistency between FPGA-generated reads and writes

These minimal requirements satisfy major classes of algorithms, e.g.:
- Double-buffered kernels that read from and write to different buffers
- Streaming kernels that read from one memory-mapped FIFO and write to another

Higher Level Memory Services and Semantics

Some applications need one or more of:
- Virtually addressed memory, for large contiguous buffers or for sharing pointers with software
- Ordered read responses
- Write/read memory consistency guarantees

Applications may not need all of these attributes.

MPF: Memory Properties Factory

MPF provides a common collection of memory semantic extensions to CCI. Applications instantiate only the semantics they require. Each MPF block is implemented as a CCI-to-CCI shim:
- Consume CCI requests
- Implement some feature (e.g. translate virtual addresses to physical)
- Produce transformed CCI requests

Application-specific memory hierarchies are formed by composing MPF shims.

Abstract Architecture

- AFU: Accelerated Function Unit (user IP), also known as the Green Bitstream
- FIU: FPGA Interface Unit (Intel-provided), the Blue Bitstream
- The FPGA connects to system memory via one or more physical channels
- CCI exposes the physical channels as a single, multiplexed read/write memory interface
- The AFU may instantiate MPF as a CCI-to-CCI bridge, maintaining the same interface while adding new semantics

[Diagrams: AFU <-> CCI <-> FIU <-> physical channels <-> system memory, shown first without and then with MPF inserted between the AFU and the FIU]

Base CCI

- Various clocks, reset, power control
- One request struct: pck_af2cp_sTx (type t_if_ccip_Tx)
- One response struct: pck_cp2af_sRx (type t_if_ccip_Rx)

Request and response structures contain:
- One channel for memory reads (c0)
- One channel for memory writes (c1)

Requests may target specific system physical channels or command the FIU to choose the least busy channel. A minimal request sketch follows.
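As a rough illustration, a single-beat read on c0 might be driven as follows. This is a sketch assuming the standard ccip_if_pkg types named above; start_read and buf_paddr are hypothetical AFU signals.

    // Minimal sketch: drive a single-beat read on the c0 request channel.
    t_ccip_c0_ReqMemHdr rd_hdr;

    always_comb
    begin
        rd_hdr = t_ccip_c0_ReqMemHdr'(0);
        rd_hdr.req_type = eREQ_RDLINE_I;   // read without FPGA cache hint
        rd_hdr.address  = buf_paddr;       // physical line address (hypothetical)
        rd_hdr.vc_sel   = eVC_VA;          // let the FIU pick a channel
        rd_hdr.cl_len   = eCL_LEN_1;       // single beat
    end

    always_ff @(posedge pClk)
    begin
        pck_af2cp_sTx.c0.hdr   <= rd_hdr;
        pck_af2cp_sTx.c0.valid <= start_read && !pck_cp2af_sRx.c0TxAlmFull;
    end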

Physical Channels

- Multiplexed as a single bus in CCI-P
- Addressable independently using the vc_sel field in the request header
- Ganged together by the blue bitstream as a single high-bandwidth logical channel with the eVC_VA tag

CCI-P channels have deliberate races: there is no guarantee that a read on one channel will return the result of a write on a different channel, even when the write has already returned an ACK! This is consistent with the design philosophy: the base platform supports only universal requirements, and races are no problem when streaming or double buffering. A write fence on the eVC_VA channel synchronizes all channels but is too slow for frequent use, as sketched below.
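A hedged sketch of emitting that write fence on c1, assuming the ccip_if_pkg names used above; fence_req is a hypothetical trigger signal.

    // Sketch: synchronize all physical channels with a write fence.
    t_ccip_c1_ReqMemHdr fence_hdr;

    always_comb
    begin
        fence_hdr = t_ccip_c1_ReqMemHdr'(0);
        fence_hdr.req_type = eREQ_WRFENCE;
        fence_hdr.vc_sel   = eVC_VA;       // fence across all channels
    end

    always_ff @(posedge pClk)
    begin
        pck_af2cp_sTx.c1.hdr   <= fence_hdr;
        pck_af2cp_sTx.c1.valid <= fence_req && !pck_cp2af_sRx.c1TxAlmFull;
    end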

CCI Memory Spaces Without IOMMU

- AAL allocates shared memory in process virtual space
- Physical memory is pinned to I/O space
- The FIU performs no address translation: the AFU requests host physical addresses and the FIU emits host physical addresses

Virtual addresses in the AFU therefore require translation. MPF VTP (Virtual to Physical) acts as a TLB:
- Accepts process virtual addresses from the AFU
- Translates them to host physical addresses

[Diagram: on the CPU, AAL in process virtual space over host physical memory; on the FPGA, the AFU issues process virtual addresses through MPF VTP to the FIU, which emits host physical addresses]

CCI Memory Spaces With IOMMU

- The host kernel defines a guest virtual machine; the guest physical address space protects host memory
- The guest can write only to AAL-managed memory
- The FIU translates guest physical to host physical by querying the IOMMU

Ideally, guest physical == process virtual, which would require no AFU translation. We can't have it yet: kernels don't support it, and hypervisors are designed for guest kernels that manage disjoint guest virtual spaces in virtual machines. Instead, MPF VTP translates process virtual to guest physical.

[Diagram: on the CPU, AAL in process virtual space, with the IOMMU between guest physical and host physical; on the FPGA, the AFU issues process virtual addresses through MPF VTP to the FIU, which emits guest physical addresses]

MPF Composable Shims

All MPF shims may be enabled or disabled independently:
- VTP: Virtual to physical address translation
- ROB: Reorder buffer to sort read responses and return them in request order
- WRO: Intra-line write/read ordering
- VC Map: Map requests to system memory channels explicitly
- PWRITE: Partial (masked) write emulation using read-modify-write

Note: Some shims depend on other shims, e.g. WRO depends on VC Map and PWRITE depends on WRO.

VTP: Virtual to Physical

- Resembles a traditional TLB
- Separate translation tables for 4KB and 2MB pages
- Level 1: 512-entry direct-mapped TLB, one per request channel
- Level 2: 512-entry four-way set-associative TLB, shared across all channels
- Hardware, caching page table walker
- Size and associativity of each table is configurable
- No prefetcher: we have not found a need
- L2 misses are rare with 2MB pages: MPF's caching page table walker generates only one memory read per 16MB of memory with stride-one streams and 2MB pages
- Planning to add support for 1GB pages for SKX
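For intuition about the two page sizes, here is an illustrative breakdown (not MPF source) of how a line-granularity address splits for translation. CCI-P addresses name 64-byte lines, so a 4KB page holds 2^6 lines and a 2MB page holds 2^15 lines.

    // Illustrative only: virtual page number vs. line offset per page size.
    logic [41:0] vaddr;                   // line-granularity virtual address

    wire [35:0] vpn_4kb = vaddr[41:6];    // 4KB virtual page number
    wire [5:0]  off_4kb = vaddr[5:0];     // line offset within a 4KB page

    wire [26:0] vpn_2mb = vaddr[41:15];   // 2MB virtual page number
    wire [14:0] off_2mb = vaddr[14:0];    // line offset within a 2MB page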

VTP Software Component

VTP maintains its own page table, shared with the VTP FPGA shim. Applications allocate and deallocate memory with the VTP software service:
- mpfVtpBufferAllocate()
- mpfVtpBufferFree()

The VTP page table is updated as a side effect of allocation. Allocation and deallocation may occur at any time during a run.

See test/test-mpf/base/sw/cci_test.h and:
- fpga_svc_wrapper.cpp (new driver version)
- aal_svc_wrapper.cpp (AAL version)

ROB: Reorder Buffer

CCI returns read responses unordered: the AFU tags read requests with a unique number and the FIU returns the tag with the response (see the sketch below). The ROB sorts read responses:
- Eliminates the need for AFU tagging
- CCI reads behave more like FPGA-attached DRAM
- The ROB is sized to enable maximum bandwidth
- The ROB adds latency, especially when physical channels have different latencies
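A sketch of the bookkeeping CCI requires without a ROB: the AFU places an index in mdata and matches it when the response arrives in arbitrary order. resp_buf and resp_valid are hypothetical AFU-side structures; only the low mdata bits are used as the index.

    // Match unordered read responses back to their requests via mdata.
    logic [511:0] resp_buf   [0:255];
    logic         resp_valid [0:255];

    always_ff @(posedge pClk)
    begin
        if (pck_cp2af_sRx.c0.rspValid &&
            (pck_cp2af_sRx.c0.hdr.resp_type == eRSP_RDLINE))
        begin
            // Responses arrive in any order; mdata says which request this is.
            resp_buf[pck_cp2af_sRx.c0.hdr.mdata[7:0]]   <= pck_cp2af_sRx.c0.data;
            resp_valid[pck_cp2af_sRx.c0.hdr.mdata[7:0]] <= 1'b1;
        end
    end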

WRO: Write/Read Ordering

CCI provides no intra- or inter-line order guarantees; even conflicting writes are unordered. CCI leaves synchronization to the AFU: track write ACKs or use fences. Fences are slow! That is no problem for kernels that:
- Maintain discrete read and write spaces
- Write each address only once
- Emit fences infrequently

WRO avoids fences when they would otherwise be required frequently.

WRO: Write/Read Ordering (continued)

WRO guarantees that requests within a line complete in order:
- Writes to the same line retire in order
- Reads always return the most recently written value
- Reads have priority when arriving in the same cycle as a conflicting write
- Still no guarantees about inter-line consistency!

Write/read hazard detection is implemented as a collection of filters:
- CAMs would be too expensive to support the number of requests in flight
- Filters are sized to balance FPGA area against the rate of false conflicts
- Multiple reads to the same location are permitted
- Filter sizes are configurable

VC Map: Address-Based Host Channel Mapping

AFUs that enable WRO almost always require VC Map. CCI channels have deliberate races (see the Physical Channels slide); VC Map avoids inter-channel races:
- The AFU passes requests to MPF using eVC_VA, the same mechanism as CCI mapping
- VC Map selects explicit physical channels before routing requests to CCI
- Channel mapping is a function of a request's address: a given address is always mapped to the same channel (see the toy sketch below)
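The toy mapping below illustrates the key property, that channel selection is a pure function of the address; MPF's actual hash differs.

    // Illustrative only: the same line address always yields the same
    // channel index, so conflicting requests never race across channels.
    function automatic logic [1:0] toy_channel_map(input logic [41:0] addr);
        return addr[1:0] ^ addr[3:2] ^ addr[5:4];
    endfunction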

VC Map Optimizations

Channel mapping is surprisingly complicated:
- Optimal throughput is achieved only when requests are balanced across physical channels
- The optimal request-rate balance varies with the types and sizes of requests

VC Map dynamically responds to AFU-generated traffic, picking the optimal request rates to each physical channel. VC Map may choose to rebalance traffic as request patterns vary. To do so it must:
- Stop all AFU traffic by asserting Almost Full
- Wait for all current traffic to retire
- Emit a write fence

PWRITE: Partial Write Emulation

CCI currently provides no mechanism for masked writes. PWRITE emulates masked writes by:
- Reading the requested line
- Updating the masked bytes
- Writing the merged data

MPF extends the write request header with byte-mask bits, as sketched below. PWRITE does not lock the line; it is not an atomic operation! Conflicting CPU stores in the middle of a PWRITE sequence may be lost. WRO and VC Map may be used to guarantee order within the FPGA.
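A hedged sketch of requesting a masked write, assuming the pwrite fields in cci_mpf_if_pkg (field names may differ across MPF versions); wr_addr is a hypothetical virtual line address.

    // Sketch: ask PWRITE to update only the low 8 bytes of a line.
    t_cci_mpf_c1_ReqMemHdr wr_hdr;

    always_comb
    begin
        wr_hdr = cci_mpf_c1_genReqHdr(eREQ_WRLINE_I, wr_addr,
                                      t_cci_mdata'(0),
                                      cci_mpf_defaultReqHdrParams());

        // MPF performs the read-modify-write. Not atomic with respect
        // to concurrent CPU stores!
        wr_hdr.pwrite.isPartialWrite = 1'b1;
        wr_hdr.pwrite.mask = 64'h00000000000000FF;
    end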

Instantiating MPF in an AFU

- See test/test-mpf/test_mem_perf in the MPF distribution for a relatively simple streaming access example
- MPF uses SystemVerilog interfaces to represent CCI-P wires: a 1:1 mapping from CCI-P structs to cci_mpf_if buses
- MPF shims have multiple CCI-P buses: one toward the AFU, one toward the FIU
- Interfaces simplify arguments to MPF shims
- The MPF module ccip_wires_to_mpf() converts CCI-P wires to a cci_mpf_if

CCI-P Wires to MPF Interface

    //
    // Expose FIU as an MPF interface
    //
    cci_mpf_if fiu(.clk(pClk));

    // The CCI wires to MPF mapping connections have identical naming to
    // the standard AFU. The module exports an interface named "fiu".
    ccip_wires_to_mpf
      #(
        // All inputs and outputs in PR region (AFU) must be registered!
        .REGISTER_INPUTS(1),
        .REGISTER_OUTPUTS(1)
        )
      map_ifc
       (
        // All CCI-P wire names are passed in along with fiu
        .*
        );

Instantiate MPF

hw/rtl/cci_mpf.sv has extensive comments.

    cci_mpf_if afu(.clk(pClk));

    cci_mpf
      #(
        .SORT_READ_RESPONSES(1),
        .PRESERVE_WRITE_MDATA(0),
        .ENABLE_VTP(1),
        .ENABLE_VC_MAP(0),
        .ENABLE_DYNAMIC_VC_MAPPING(1),
        .ENFORCE_WR_ORDER(0),
        .ENABLE_PARTIAL_WRITES(0),
        .DFH_MMIO_BASE_ADDR(MPF_DFH_MMIO_ADDR)
        )
      mpf
       (
        .clk(pClk),
        .fiu,
        .afu
        );

AFU Option #1: Expose MPF as CCI

See the use of mpf2af_sRxPort and af2mpf_sTxPort in sample/afu/ccip_mpf_nlb.sv. Note the MPF extension header bits that must be set, e.g.:

    // Treat all addresses as virtual.
    afu.c0Tx.hdr.ext.addrIsVirtual = 1'b1;

    // Enable eVC_VA to physical channel mapping. This will only
    // be triggered when ENABLE_VC_MAP is enabled.
    afu.c0Tx.hdr.ext.mapVAtoPhysChannel = 1'b1;

    // Enforce load/store and store/store ordering within lines.
    // This will only be triggered when ENFORCE_WR_ORDER is enabled.
    afu.c0Tx.hdr.ext.checkLoadStoreOrder = 1'b1;

AFU Option #2: Use the MPF Interface Directly

Examples: test/test-mpf/base/hw/rtl/cci_test_afu.sv and test/test-mpf/test_random/hw/rtl/test_random.sv. The MPF interface is defined in hw/rtl/cci-mpf-if/cci_mpf_if.vh.

AFU Option #2 Example

    t_cci_mpf_ReqMemHdrParams rd_params;
    t_cci_mpf_c0_ReqMemHdr rd_hdr;

    always_comb
    begin
        // Construct a request header
        rd_params = cci_mpf_defaultReqHdrParams();
        rd_params.checkLoadStoreOrder = enable_wro;
        rd_params.vc_sel = eVC_VA;
        rd_params.mapVAtoPhysChannel = 1'b1;

        rd_hdr = cci_mpf_c0_genReqHdr((rdline_mode_s ? eREQ_RDLINE_S : eREQ_RDLINE_I),
                                      rd_rand_addr,
                                      t_cci_mdata'(0),
                                      rd_params);
        rd_hdr.base.cl_len = rd_addr_num_beats;
    end

    always_ff @(posedge clk)
    begin
        // Write the request header (2nd argument sets the valid bit)
        fiu.c0Tx <= cci_mpf_genC0TxReadReq(rd_hdr,
                                           (state == STATE_RUN) && !c0TxAlmFull);

        if (reset)
        begin
            fiu.c0Tx.valid <= 1'b0;
        end
    end

Building the MPF Software Libraries

- Both the new FPGA driver and AAL are currently supported
- The CMake script builds each one only when header files from the FPGA driver or AAL are found
- Put FPGA/AAL header paths in the C/C++ include environment variables
- Put compiled FPGA/AAL library paths in LD_LIBRARY_PATH and LIBRARY_PATH

Use CMake (the build directory can be anywhere; just adjust ../ to match):

    cd <MPF Path>/sw
    mkdir build; cd build
    cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_INSTALL_PREFIX=<target dir> ../
    make; make install

MPF Libraries in Software

- Add MPF's installed include directory to C_INCLUDE_PATH and CPLUS_INCLUDE_PATH
- Add MPF's installed lib directory to LD_LIBRARY_PATH and LIBRARY_PATH
- The MPF install target may be the same tree as the base FPGA library or AAL

Building with MPF and the new FPGA library (libmpf.so):
- Link your application with libmpf.so (both hardware and ASE are supported)
- See test/test-mpf/base/sw/fpga_svc_wrapper.cpp for an example

Building with MPF and AAL (libmpf_aal.so):
- Load libmpf_aal dynamically using the AAL service discovery mechanism
- See test/test-mpf/base/sw/aal_svc_wrapper.cpp for an example

MPF Internal Configuration Options

Some applications, especially those with large memory footprints, may benefit from resizing MPF structures. hw/rtl/cci_mpf_config.vh defines and documents many options:
- Configure VTP TLB size and associativity
- Configure WRO hashing parameters

These options may be set without modifying source, e.g.:

    set_global_assignment -name VERILOG_MACRO "VTP_N_C0_TLB_2MB_SETS=1024"

MPF Primary Pipeline

[Diagram: CCI-P at 400 MHz on both sides. From the FIU (blue) through the MPF edge, VTP (with its TLB), WRO, and ROB to the AFU-side MPF edge (green), with write data bypassing the pipeline]

MPF Primary Pipeline (continued)

- Edge modules are always instantiated
- Write data bypasses the MPF pipeline to save area; Quartus automatically deletes unconsumed data wires in internal MPF interface objects
- The AFU edge validates header settings
- Multi-beat write requests are converted to a single representative request: the AFU edge forwards all write data beats to the FIU edge module, and the FIU edge reconstructs all beats from the representative request
- The FIU edge (EOP module) guarantees that all write responses are packed, so AFUs using MPF may ignore the c1Rx format header field
- MPF shims are instantiated as needed, based on configuration parameters

VTP Micro-Architecture

- L1 hits flow around L1 misses
- Separate direct-mapped L1 caches for reads and writes, 2MB and 4KB pages
- Shared set-associative L2 cache for reads and writes, with separate 2MB and 4KB tables
- The page table walker is not pipelined, but does cache recent page table lines
- Default sizes are optimized for M20K minimum depth and bandwidth
- Programs with some page locality should require no VTP tuning
- Programs with completely random access patterns over very large footprints may benefit from larger caches; see hw/rtl/cci_mpf_config.vh

Debugging Address Translation Failures

When VTP encounters an untranslatable address it halts. The VTP statistics software interface indicates the failure:
- numFailedTranslations will be non-zero
- ptWalkLastVAddr will hold the untranslatable virtual address

WRO Micro-Architecture: Ingress Buffer

Computes read/write conflict epochs using hashed filters:
- Ingress filters are small, tracking lifetimes only inside the request channels
- The WRO primary pipeline has separate, large filters, described on the next slide

Non-conflicting read and write pipelines flow independently. Independent flow is required to avoid starving multi-beat write requests:
- Multi-beat writes require multiple AFU request cycles
- Multi-beat reads require only one AFU request cycle
- Lock-step 4-line read/write requests would emit one write for every 4 reads

WRO Micro-Architecture: Primary Pipeline

- Hashed counting filter for reads, single-bit filter for writes (see the toy sketch below)
- Non-conflicting requests flow around blocked requests
- Conflicting requests loop back to the head of the primary pipeline
- New traffic is blocked only when the pipeline fills with conflicting requests
- Default depth (and latency) is 4 cycles; growing the pipeline adds latency but increases the capacity for flowing around blocked requests
- Hashing is configurable; see hw/rtl/cci_mpf_config.vh
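A toy sketch of the filter idea; MPF's real filters and hash functions differ, and line_addr is a hypothetical request address.

    // Hashed counting filter for reads, single-bit filter for writes.
    logic [3:0] rd_count [0:255];           // in-flight reads per hash bucket
    logic       wr_busy  [0:255];           // write outstanding per bucket

    // Hypothetical hash of the line address.
    wire [7:0] h = line_addr[7:0] ^ line_addr[15:8];

    // A new request conflicts if the opposite type may still be in flight.
    wire wr_conflict = (rd_count[h] != 4'd0) || wr_busy[h];
    wire rd_conflict = wr_busy[h];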

MPF Shim Latencies for READ Requests (Cycles)

Writes are similar but less latency sensitive.

    Shim                      | Request Cycles                       | Response Cycles
    --------------------------+--------------------------------------+----------------
    AFU Edge (mandatory)      | 0 when WRO disabled, 1 when enabled  | 0
    ROB (response order)      | 0                                    | 4 (minimum)
    VC Map                    | 1                                    | 0
    WRO (write/read order)    | 10 (minimum)                         | 0
    VTP (virtual to physical) | 5 (L1 hit, typical case)             | 0
    EOP (mandatory)           | 0                                    | 1
    FIU Edge (mandatory)      | 0                                    | 0