AN 690: PCI Express DMA Reference Design for Stratix V Devices


an690-1.0

The PCI Express Avalon Memory-Mapped (Avalon-MM) DMA Reference Design highlights the performance of the Avalon-MM 256-Bit Hard IP for PCI Express IP Core. The design includes a high-performance DMA with an Avalon-MM interface that connects to the PCI Express Hard IP. It transfers data between the Stratix V internal memory and system memory. The reference design includes a Linux-based software driver that sets up the DMA transfer. The software driver also measures and displays the performance achieved for the transfers. This reference design allows you to evaluate the performance of the PCI Express protocol using the Avalon-MM 256-bit interface.

Related Information
- Avalon-MM 256-Bit Hard IP for PCI Express User Guide
- Stratix V Hard IP for PCI Express User Guide
- PCI Express Base Specification Revision 3.0

Understanding PCI Express Throughput

The throughput in a PCI Express system depends on the following factors:
- Protocol overhead
- Payload size
- Completion latency
- Flow control update latency
- Devices forming the link

© 2013 Altera Corporation. All rights reserved.

Protocol Overhead

Protocol overhead includes the following three components:
- 128b/130b Encoding and Decoding: Gen3 links use 128b/130b encoding. This encoding adds two synchronization (sync) bits to each 128-bit data transfer, so the encoding and decoding overhead is very small, at 1.56%. The effective data rate of a Gen3 x8 link is about 8 GB/s.
- Data Link Layer Packets (DLLPs) and Physical Layer Packets (PLPs): An active link also transmits DLLPs and PLPs. The PLPs consist of SKP ordered sets, which are 16-24 bytes. The DLLPs are two dwords and implement flow control and the ACK/NAK protocol.
- TLP Packet Overhead: The overhead associated with a single TLP ranges from 5-7 dwords if the optional ECRC is not included. A TLP comprises the following fields:
  - The Start and End framing symbols
  - The Sequence ID
  - A 3- or 4-dword TLP header
  - The Link Cyclic Redundancy Check (LCRC)
  - 0-1024 dwords of data payload

The following figure illustrates the TLP packet format.

Figure 1: Protocol Overhead
Start (1 byte) | Sequence ID (2 bytes) | TLP Header (3-4 DW) | Data Payload (0-1024 DW) | ECRC (1 DW) | LCRC (1 DW) | End (1 byte)

Throughput for Posted Writes

The theoretical maximum throughput is calculated using the following formula:

Throughput = payload size / (payload size + overhead) * link data rate

The following graph shows the maximum throughput with different TLP header and payload sizes. The DLLPs and PLPs are excluded from this calculation. For a 256-byte maximum payload size and a 3-dword header, the overhead is five dwords. Because the interface is 256 bits wide, the five dwords of overhead require a single bus cycle. The 256-byte payload requires 8 bus cycles. The following equation shows the maximum theoretical throughput:

Maximum throughput = 8 cycles / 9 cycles = 88.88% * 8 GB/s = 7.1 GB/s
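The cycle arithmetic above can be generalized to the other payload sizes plotted in Figure 2. The short C sketch below is illustrative only (it is not part of the reference design deliverables); it assumes the 32-byte (256-bit) interface, one overhead cycle per TLP, and an effective link rate of about 8 GB/s for Gen3 x8.

```c
#include <stdio.h>

/*
 * Theoretical maximum write throughput on the 256-bit (32-byte) Avalon-MM
 * interface, following the cycle-count argument in the text: each TLP
 * spends one bus cycle on header/framing overhead and ceil(payload/32)
 * cycles on payload data.  The link rate is taken as ~8 GB/s (Gen3 x8).
 */
int main(void)
{
    const double link_rate_gbytes = 8.0;  /* assumed effective Gen3 x8 rate */
    const int bus_width_bytes = 32;       /* 256-bit interface              */

    for (int payload = 16; payload <= 4096; payload *= 2) {
        int data_cycles  = (payload + bus_width_bytes - 1) / bus_width_bytes;
        int total_cycles = data_cycles + 1;            /* +1 overhead cycle */
        double efficiency = (double)data_cycles / total_cycles;
        printf("payload %4d bytes: %3d/%3d cycles = %6.2f%% -> %.2f GB/s\n",
               payload, data_cycles, total_cycles,
               100.0 * efficiency, efficiency * link_rate_gbytes);
    }
    return 0;
}
```

For a 256-byte payload the sketch reproduces the 8/9-cycle (88.88%, about 7.1 GB/s) figure worked out above.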

Figure 2: Maximum Throughput for Memory Writes
The graph "Theoretical Maximum Throughput for Memory Writes (x1)" plots throughput (%) against payload size (16 bytes to 4096 bytes) for three cases: 3 DW header, 4 DW header, and 4 DW header + ECRC.

Specifying the Maximum Payload Size

The Device Control register, bits [7:5], specifies the maximum TLP payload size of the current system. The Maximum Payload Size field of the Device Capabilities register, bits [2:0], specifies the maximum permissible value for the payload of the Stratix V Hard IP for PCI Express IP Core. You specify this read-only parameter, called Maximum Payload Size, in the Avalon-MM 256-Bit Hard IP for PCI Express GUI. After determining the maximum TLP payload for the current system, software records that value in the Device Control register. This value must not exceed the maximum payload specified in the Maximum Payload Size field of the Device Capabilities register.

Understanding Flow Control for PCI Express

Flow control guarantees that a TLP is not transmitted unless the receiver has enough buffer space to accept it. There are separate credits for headers and payload data. A device needs sufficient header and payload credits before sending a TLP. When the Application Layer in the completer accepts the TLP, it frees up the RX buffer space in the completer's Transaction Layer. The completer then sends a flow control update packet (FC Update DLLP) to replenish the consumed credits in the initiator. If a device has used all of its credits, the throughput is limited by the rate at which header and payload credits are replenished by FC Update DLLPs. The flow control updates depend on the maximum payload size and the latencies of the two connected devices.

Throughput for Reads

PCI Express uses a split transaction for reads. The read transaction includes the following steps:
1. The requester sends a Memory Read Request.
2. The completer sends out the ACK DLLP to acknowledge the Memory Read Request.
3. The completer returns a Completion with Data. The completer can split the Completion into multiple completion packets.

Read throughput is typically lower than write throughput because reads require two transactions instead of a single write for the same amount of data. The read throughput also depends on the round-trip delay between the time when the Application Layer issues a Memory Read Request and the time when the requested data returns. To maximize the throughput, the application must issue enough outstanding read requests to cover this delay.

The figures below show the timing for Memory Read Requests (MRd) and Completions with Data (CplD). The first figure shows the requester waiting for the completion before issuing the subsequent requests, which results in lower throughput. The second figure shows the requester making multiple outstanding read requests to eliminate the delay after the first data returns, which results in higher throughput.

Figure 3: Read Request Timing
In the low performance case, the requester issues one read at a time (Rd1 Tag1, Rd2 Tag2, Rd3 Tag3) and waits through the round-trip delay for the matching completion (CplD1 Tag1, CplD2 Tag2) before issuing the next request. In the high performance case, the requester issues multiple outstanding reads (Rd1-Rd4 with Tags 1-4, then Rd5-Rd9 reusing the tags) so that completions return back-to-back with no idle time.

To maintain maximum throughput for the completion data packets, the requester must optimize the following settings:
- The number of completions in the RX buffer
- The rate at which the Application Layer issues read requests and processes the completion data

Read Request Size

Another factor that affects throughput is the read request size. If a requester requires 4 KB of data, it can issue four 1 KB read requests or a single 4 KB read request. The single 4 KB request results in higher throughput than the four 1 KB reads. The read request size is limited by the Maximum Read Request Size value in the Device Control register, bits [14:12].

Outstanding Read Requests

A final factor that can affect the throughput is the number of outstanding read requests. If the requester sends multiple read requests to improve throughput, the number of outstanding read requests is limited by the number of header tags available. The maximum number of header tags is specified by the RX Buffer credit allocation - performance for received requests parameter in the Hard IP for PCI Express IP Core GUI. A rough sizing calculation for the number of outstanding reads appears in the sketch below.
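As a rough guide to how many outstanding reads (and therefore tags) are needed, you can multiply the link bandwidth by the round-trip completion latency and divide by the read request size. The sketch below uses assumed numbers (8 GB/s link, 1 us round trip, 256-byte read requests); none of these values come from the reference design.

```c
#include <stdio.h>
#include <math.h>

/*
 * Bandwidth-delay sizing for outstanding read requests: to keep the link
 * busy, the data requested but not yet returned must cover
 * link_rate * round_trip_latency.  All numbers below are illustrative.
 */
int main(void)
{
    const double link_rate_bytes_per_s = 8.0e9;   /* ~8 GB/s Gen3 x8       */
    const double round_trip_s          = 1.0e-6;  /* assumed 1 us latency  */
    const double read_request_bytes    = 256.0;   /* assumed request size  */

    double bytes_in_flight = link_rate_bytes_per_s * round_trip_s;
    int    outstanding     = (int)ceil(bytes_in_flight / read_request_bytes);

    printf("bytes in flight needed : %.0f\n", bytes_in_flight);
    printf("outstanding read requests (tags) needed: %d\n", outstanding);
    return 0;
}
```

With these assumed values, about 8 KB must be in flight, which corresponds to roughly 32 outstanding 256-byte read requests.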

Deliverables Included with the Reference Design

The reference design includes the following components:
- Linux application and driver
- FPGA programming files for the Stratix V GX FPGA Development Kit for Gen3 x8
- Quartus II Archive Files (.qar) for the development boards, including SRAM Object Files (.sof) and SignalTap II Files (.stp)

Reference Design Functional Description

The reference design includes the following components:
- An Application Layer, which is an Avalon-MM DMA example design generated in Qsys
- The Stratix V Hard IP for PCI Express IP Core
- A Linux software application and driver configured specifically for this reference design

Project Hierarchy

The reference design uses the following directory structure:
- top: the project directory. The top-level directory is top_hw.
- pcie_lib: includes all design files. If you modify the design, you must copy the modified files into this folder before recompiling the design.

Stratix V Hard IP for PCI Express Parameter Settings

The Hard IP for PCI Express variant used in this reference design supports a 256-byte maximum payload size. The following tables list the values for all parameters.

Table 1: System Settings
- Number of lanes: x8
- Lane rate: Gen3 (8.0 Gbps)
- RX buffer credit allocation - performance for received requests: Low
- Reference clock frequency: 100 MHz
- Enable configuration via the PCIe link: Disable
- Use ATX PLL: Disable

Table 2: Base Address Register (BAR) Settings
- BAR0 Type: 64-bit prefetchable memory
- BAR0 Size: 64 KBytes (16 bits)
- BAR1-BAR5: Disable

Table 3: Device Identification Register Settings
- Vendor ID: 0x00001172
- Device ID: 0x0000E003
- Revision ID: 0x00000001
- Class Code: 0x00000000
- Subsystem Vendor ID: 0x00000000
- Subsystem Device ID: 0x00002861

Table 4: PCI Express/PCI Capabilities
- Maximum payload size: 256 Bytes
- Completion timeout range: ABCD
- Implement Completion Timeout Disable: Enable

Table 5: Error Reporting Settings
- Advanced error reporting (AER): Disable
- ECRC checking: Disable
- ECRC generation: Disable

Table 6: Link Settings
- Link port number: 1
- Slot clock configuration: Enable

Table 7: MSI and MSI-X Settings
- Number of MSI messages requested: 1
- Implement MSI-X: Disable
- Table size: 0
- Table offset: 0x0000000000000000
- Table BAR indicator: 0
- Pending bit array (PBA) offset: 0x0000000000000000
- PBA BAR indicator: 0

Table 8: Power Management
- Endpoint L0s acceptable latency: Maximum of 64 ns
- Endpoint L1 acceptable latency: Maximum of 1 us

Table 9: PCIe Address Space Setting
- Address width of accessible PCIe memory space: 32

Quartus II Settings

The .qar file in the reference design package has the recommended synthesis, fitter, and timing analysis settings for the parameters specified in this reference design.
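The 256-byte Maximum payload size in Table 4 corresponds to the encoding that system software eventually writes to the Device Control register bits [7:5], as described in Specifying the Maximum Payload Size. The helper below decodes that standard PCIe encoding; it is a generic illustration, not code from the reference design driver, and reading the register itself is platform specific.

```c
#include <stdio.h>
#include <stdint.h>

/*
 * The Max_Payload_Size field of the PCIe Device Control register
 * (bits [7:5]) encodes the payload size as 128 bytes << field value
 * (000b = 128 B, 001b = 256 B, ...).  Reading the register requires a
 * platform-specific PCI configuration space access, which is not shown.
 */
static unsigned max_payload_bytes(uint16_t device_control)
{
    unsigned field = (device_control >> 5) & 0x7;   /* bits [7:5] */
    return 128u << field;
}

int main(void)
{
    /* Example: a Device Control value whose bits [7:5] are 001b. */
    uint16_t devctl = 0x0020;
    printf("Max payload size: %u bytes\n", max_payload_bytes(devctl));
    return 0;
}
```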

Avalon-MM DMA

The following figure illustrates the major interfaces and modules of the reference design.

Figure 4: Block Diagram for the Avalon-MM 256-Bit Hard IP for PCI Express IP Core and DMA Engine (Gen3 x8 PCI Express Design Example)
The block diagram shows the Descriptor Controller connected through Avalon-ST source/sink ports and Avalon-MM ports to the Avalon-MM 256-Bit Hard IP for PCI Express endpoint. Inside the endpoint, the PCI Express Avalon-MM Bridge interconnect links the Read DMA module, the Write DMA module, the data and descriptor memory, the TX Slave, and the RX Master to the Hard IP for PCI Express and PHY IP Core for PCI Express (PIPE), which drives the RX and TX PCIe link. Legend: S = Avalon Memory-Mapped (Avalon-MM) Slave, M = Avalon-MM Master, ST SRC = Avalon Streaming (Avalon-ST) Source, ST Sink = Avalon Streaming (Avalon-ST) Sink.

The modules in this reference design implement the following functions:

DMA Read
The DMA Read module sends memory read TLPs upstream. After the completion is received, it writes the received data to the on-chip memory.

DMA Write
The DMA Write module reads data from on-chip memory and sends it upstream using memory write TLPs on the PCI Express link.

Descriptor Controller
The Descriptor Controller module manages the DMA read and write operations. This module is external to the main DMA engine, which facilitates customization of descriptor handling. The host software programs the Descriptor Controller internal registers with the location and size of the descriptor table residing in PCI Express main memory. Based on this information, the descriptor control logic directs the DMA Read block to copy the entire table and store it in local memory. It then fetches the table entries and directs the DMA to transfer the data between the Avalon-MM and PCIe domains one descriptor at a time. It also sends DMA status upstream via the TX Slave port.

TX Slave
The TX Slave module propagates 32-bit Avalon-MM reads and writes upstream. External Avalon-MM masters, including the DMA control master, can access PCI Express memory space through the TX Slave. The DMA uses this path to update DMA status upstream, including Message Signaled Interrupt (MSI) status.

RX Master
The RX Master module propagates single-dword read and write TLPs from the Root Port to the Avalon-MM domain via a 32-bit Avalon-MM master port. The host software uses this path to send control, status, and descriptor information to Avalon-MM slaves, including the DMA control slave.

DMA Operation Flow

Software completes the following steps to specify and initiate a DMA operation (a driver-style sketch of this sequence appears after the DMA Engine Operation steps below):
1. Software allocates free memory space in the system memory to hold the descriptor table.
2. Software writes all descriptors into the descriptor space in the system memory.
3. Software programs the DMA registers, specifying the base address of the descriptor space and the number of descriptors.
   a. To move data from the system memory to the local memory in the FPGA, software programs the read DMA registers.
   b. To move data from the local memory in the FPGA to the system memory, software programs the write DMA registers.

DMA Engine Operation

The DMA engine completes the following steps to implement each transfer specified in the descriptor table:
1. The DMA engine issues memory read TLPs to read the descriptors from the system memory through the PCI Express link. It copies the descriptors to the local descriptor memory in the FPGA and processes them sequentially.
2. For DMA write operations, the DMA engine reads the data from the source address in the local memory in the FPGA. Then, it writes the data to the destination address in the system memory through the PCI Express link.
3. For DMA read operations, the DMA engine reads the data from the source address in the system memory. It copies the data to the destination address in the local memory in the FPGA.
4. After the transfer is complete, the DMA engine writes the descriptor ID to the EPLAST field of the descriptor header in the system memory.
5. The DMA engine repeats these steps for each descriptor in the table.
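The following C sketch walks through the host-side sequence described in DMA Operation Flow for a read DMA, using the register offsets of Table 12 and the control bits of Table 13 (both defined later in this document). The helper functions, the polling loop, and the exact values written are assumptions made for illustration; consult the reference design driver for the authoritative sequence.

```c
#include <stdint.h>

/* Read DMA registers (Table 12). */
#define RD_DMA_CTRL      0x0100
#define RD_RC_DESC_LO    0x0108   /* descriptor table address in system memory */
#define RD_RC_DESC_HI    0x010C
#define RD_RC_LAST       0x0110   /* last descriptor ID to process             */
#define RD_EP_TABLE_LO   0x0114   /* descriptor storage in FPGA local memory   */
#define RD_EP_TABLE_HI   0x0118

/* Control register bits (Table 13). */
#define DMA_CTRL_START       (1u << 8)
#define DMA_CTRL_EPLAST_ENA  (1u << 10)

/* Placeholder register write: stands in for an MMIO write to the
 * Descriptor Controller registers behind BAR0 (hardware access not shown). */
static void dc_write(uint32_t offset, uint32_t value)
{
    (void)offset; (void)value;
}

/* Placeholder read of the EPLAST field at offset 0x00 of the descriptor
 * table header in system memory (Table 11); hardware access not shown. */
static uint32_t read_eplast(void)
{
    return 0;
}

/* Program and start a read DMA (system memory -> FPGA local memory),
 * then poll EPLAST for completion.  Illustrative flow only. */
void start_read_dma(uint64_t rc_table_addr, uint32_t ep_table_addr,
                    uint32_t num_descriptors)
{
    dc_write(RD_RC_DESC_LO, (uint32_t)(rc_table_addr & 0xFFFFFFFFu));
    dc_write(RD_RC_DESC_HI, (uint32_t)(rc_table_addr >> 32));
    dc_write(RD_EP_TABLE_LO, ep_table_addr);
    dc_write(RD_EP_TABLE_HI, 0);
    dc_write(RD_RC_LAST, num_descriptors - 1);

    /* Start the DMA and enable descriptor-ID write-back to EPLAST. */
    dc_write(RD_DMA_CTRL, (num_descriptors & 0xFFu) |
                          DMA_CTRL_START | DMA_CTRL_EPLAST_ENA);

    /* Step 4 of DMA Engine Operation: the engine writes each completed
     * descriptor ID back to EPLAST, so software polls for the last ID. */
    while (read_eplast() != num_descriptors - 1)
        ;                        /* a real driver adds a timeout here */
}
```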

DMA Descriptor Registers

The following table describes the fields of the 160-bit descriptors. The descriptor table can store a maximum of 127 descriptors. The RC Write and RC Read Descriptor Base registers of the DMA Control module specify the descriptor table base address. In this reference design, a host processor programs the separately instantiated Descriptor Controller IP Core with the location and size of the descriptor table. The Descriptor Controller directs the DMA Read block to copy the entire table and store it in local memory. It then fetches the table entries and directs the DMA to implement the transfers specified. You can substitute your own descriptor control module for the Altera-provided Descriptor Controller IP Core. If you implement your own descriptor controller, it must use the descriptor format specified in the following table.

Table 10: Descriptor Format
- [31:0] Source Low Address: Lower 32 bits of the source address. The source address must align to a 32-bit boundary (the 2 LSBs must be 2'b00).
- [63:32] Source High Address: Upper 32 bits of the source address.
- [95:64] Destination Low Address: Lower 32 bits of the destination address. The destination address must align to a 32-bit boundary (the 2 LSBs must be 2'b00).
- [127:96] Destination High Address: Upper 32 bits of the Avalon-MM destination address.
- [145:128] DMA Length: Transfer data size in dwords. The maximum size is 1 MB.
- [153:146] DMA Descriptor ID: ID of this descriptor.
- [159:154] Reserved: -

Table 11: Descriptor Format in System Memory
- 0x00 (Header): EPLAST. When enabled by the EPLAST_ENA bit in the control register, this location records the number of the last descriptor completed by the chaining DMA module. Software can poll this location to determine when the last descriptor has finished.
- Remaining header locations (below 0x20): Reserved
- 0x20 (Descriptor 0): Source Address, lower 32 bits
- 0x24 (Descriptor 0): Source Address, upper 32 bits
- 0x28 (Descriptor 0): Destination Address, lower 32 bits
- 0x2C (Descriptor 0): Destination Address, upper 32 bits
- 0x30 (Descriptor 0): DMA transfer data length in dwords and descriptor ID
- 0x40, 0x44, 0x48, 0x4C, 0x50 (Descriptor 1): Same layout as Descriptor 0
- 0x20 * (n+1), ... (Descriptor n): Same layout as Descriptor 0
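For reference, the 160-bit descriptor of Table 10 can be modeled as five 32-bit words, matching the per-descriptor layout at offsets 0x20-0x30 in Table 11. The structure and helper below are an illustrative model, not a header taken from the reference design sources.

```c
#include <stdint.h>

/*
 * One 160-bit DMA descriptor (Table 10), as five little-endian dwords.
 * The last dword packs the transfer length (18 bits, in dwords), the
 * descriptor ID (8 bits), and 6 reserved bits.
 */
struct dma_descriptor {
    uint32_t src_addr_lo;   /* [31:0]    source address, lower 32 bits     */
    uint32_t src_addr_hi;   /* [63:32]   source address, upper 32 bits     */
    uint32_t dst_addr_lo;   /* [95:64]   destination address, lower 32     */
    uint32_t dst_addr_hi;   /* [127:96]  destination address, upper 32     */
    uint32_t len_and_id;    /* [145:128] length in dwords,
                               [153:146] descriptor ID, [159:154] reserved */
};

/* Build the final dword from a length (in dwords) and a descriptor ID. */
static inline uint32_t dma_len_and_id(uint32_t len_dwords, uint32_t id)
{
    return (len_dwords & 0x3FFFFu) | ((id & 0xFFu) << 18);
}
```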

Timing Diagrams

Figure 5: Descriptor Table Avalon-MM 256-Bit Master Interface
The following timing diagram illustrates the timing for the descriptor table master interface. It shows the Descriptor Controller reading the descriptor table from on-chip memory. The waveform captures the Clk_i, DTMRead_o, DTMAddress_o[63:0], DTMReadData_i[255:0], DTMWaitRequest_i, and DTMReadDataValid_i signals.

Figure 6: Avalon-ST 160-Bit TLP Read Control Interface
The following timing diagram illustrates the timing for the Avalon Streaming (Avalon-ST) read descriptor TX and RX interfaces. It shows the Descriptor Controller driving a read descriptor to the DMA Read block. It also shows the DMA Read block writing the status update to the Descriptor Controller. The waveform captures the RdDdmaTxData_o[159:0], RdDdmaTxValid_o, RdDdmaTxReady_i, RdDdmaRxData_i[31:0], and RdDdmaRxValid_i signals.

Figure 7: Avalon-ST 160-Bit TLP Write Control Interface
The following timing diagram illustrates the timing for the Avalon Streaming (Avalon-ST) write descriptor TX and RX interfaces. It shows the Descriptor Controller driving a write descriptor to the DMA Write block. It also shows the DMA Write block writing the status to the Descriptor Controller. The waveform captures the WrDdmaTxData_o[159:0], WrDdmaTxValid_o, WrDdmaTxReady_i, WrDdmaRxData_i[31:0], and WrDdmaRxValid_i signals.

Figure 8: Descriptor Controller Avalon-MM 32-Bit Master Interface
The following timing diagram illustrates the timing for the Avalon-MM Master interface. It shows the Descriptor Controller Avalon-MM Master interface driving single-dword transactions to the Avalon-MM 256-Bit Hard IP for PCI Express IP Core. The waveform captures the DCMRead_o, DCMWrite_o, DCMAddress_o[63:0], DCMWriteData_o[31:0], DCMByteEnable_o[3:0], DCMWaitRequest_i, DCMReadData_i[31:0], and DCMReadDataValid_i signals.

Figure 9: Descriptor Controller Avalon-MM 32-Bit Slave Interface
The following timing diagram illustrates the timing for the Avalon-MM TX Slave interface. It shows an external host using an Avalon-MM Master to perform a series of single-dword writes that program the Descriptor Controller. The waveform captures the DCSChipSelect_i, DCSRead_i, DCSWrite_i, DCSAddress_i[11:0], DCSWriteData_i[31:0], DCSByteEnable_i[3:0], DCSWaitRequest_o, and DCSReadData_o[31:0] signals.

DMA Control and Status Registers

Table 12: DMA Control and Status Registers

DMA Write Registers
- 0x0000, Write DMA Control (R/W): Contains the write control information and the number of descriptors.
- 0x0004, Write DMA Status (R): Write DMA status.
- 0x0008, RC Write Descriptor Base (Low) (R/W): Lower 32-bit base address of the write descriptor table in the system memory.
- 0x000C, RC Write Descriptor Base (High) (R/W): Upper 32-bit base address of the write descriptor table in the system memory.
- 0x0010, Last Write Descriptor Index (R/W): Last descriptor ID to be processed.
- 0x0014, EP Descriptor Table Base (Low) (RW): Lower 32-bit base address of the local memory in the FPGA where the descriptors are stored.
- 0x0018, EP Descriptor Table Base (High) (RW): Upper 32-bit base address of the local memory in the FPGA where the descriptors are stored.

DMA Read Registers
- 0x0100, Read DMA Control (R/W): Contains the read control information and the number of descriptors.
- 0x0104, Read DMA Status (R): Read DMA status.
- 0x0108, RC Read Descriptor Base (Low) (R/W): Lower 32-bit base address of the read descriptor table in the RC memory.
- 0x010C, RC Read Descriptor Base (High) (R/W): Upper 32-bit base address of the read descriptor table in the RC memory.
- 0x0110, RC_LAST (R/W): Last descriptor number to be processed.
- 0x0114, EP Descriptor Table Base (Low) (RW): Lower 32-bit base address of the local memory in the FPGA where the descriptors are stored.
- 0x0118, EP Descriptor Table Base (High) (RW): Upper 32-bit base address of the local memory in the FPGA where the descriptors are stored.

Table 13: Write and Read DMA Control Register Bit Mapping
- [7:0] Number of Descriptors: The number of descriptors stored in the system memory. Used to fetch the correct amount of data from the table.
- [8] START: Set this bit to start the DMA.
- [9] MSI_ENA: Enables MSI messages for the DMA. When set to 1, the MSI is sent once all descriptors are completed.
- [10] EP_LAST_ENA: Enables the Endpoint DMA module to write the descriptor ID of each completed descriptor back to the EPLAST field in the descriptor table.
- [15:11] Reserved: -

Table 14: Status Register Bit Mapping
- [7:0] Descriptor ID: Descriptor ID (0-126).
- [8] Completed: The descriptor completed successfully.
- [9] Busy: The DMA is running.
- [15:10] Reserved: -
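The status register bit mapping in Table 14 translates directly into a small decoding helper. The sketch below is illustrative; the field widths follow Table 14, and the struct and function names are not taken from the reference design driver.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Decoded view of the DMA status register (Table 14). */
struct dma_status {
    uint8_t descriptor_id;   /* [7:0] last descriptor ID (0-126)        */
    bool    completed;       /* [8]   descriptor completed successfully */
    bool    busy;            /* [9]   DMA is running                    */
};

static struct dma_status decode_dma_status(uint32_t raw)
{
    struct dma_status s = {
        .descriptor_id = (uint8_t)(raw & 0xFFu),
        .completed     = (raw >> 8) & 1u,
        .busy          = (raw >> 9) & 1u,
    };
    return s;
}

int main(void)
{
    /* Example raw value: descriptor 2 completed, DMA idle. */
    struct dma_status s = decode_dma_status(0x0102u);
    printf("id=%u completed=%d busy=%d\n", s.descriptor_id, s.completed, s.busy);
    return 0;
}
```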

Running the Reference Design

The reference design requires the following hardware:
- The Stratix V FPGA Development Kit.
- A computer running 32- or 64-bit Linux with a PCI Express slot. This computer is referred to as computer number 1.
- A second computer with the Quartus II software installed. This computer downloads the FPGA programming file (.pof) to the FPGA on the development board. This computer is referred to as computer number 2.
- A USB cable or other Altera download cable.

Software Requirements

To run the reference design, you must install the following software:
- The reference design software on computer number 1.
- The Quartus II software, version 12.1 or later, on computer number 2.

Software Installation

Complete the following steps to install the software:
1. Create a directory and unzip all files to that directory.
2. In the directory that you created, log in as root by typing su.
3. Type your root password.
4. Type ./install to install the Linux driver.

Hardware Installation

Complete the following steps to install the hardware:
1. Power down computer number 1.
2. Plug the Stratix V FPGA Development Board into a PCI Express slot. The development board has an integrated USB-Blaster for FPGA programming.
3. Connect a USB cable from computer number 2 to the Stratix V FPGA Development Board.
4. On computer number 2, bring up the Quartus II Programmer and program the FPGA through the Altera USB-Blaster cable.
5. To force the system enumeration to discover the PCI Express IP Core, reboot computer number 1.

Running the Software

Complete the following steps to run the DMA software:
1. In a terminal window, change to the directory in which you installed the Linux driver.
2. Type su.
3. Type your super user password.
4. To run the DMA driver, type ./run.

The driver prints out the commands available to specify the DMA traffic you want to run. By default, the software enables DMA reads, DMA writes, and simultaneous DMA reads and writes. In the figure below, the 1 next to the question mark (? 1) indicates the currently enabled commands. You can toggle between the enabled and disabled states by typing the number of the command at the command prompt. For example, typing 2 disables the read DMA if the read DMA is currently enabled.

Figure 10: Output from the 256-Bit DMA Driver

The following table lists the available commands.

Table 15: Command Operation
- 1: Start the DMA.
- 2: Enable or disable read DMA.
- 3: Enable or disable write DMA.
- 4: Enable or disable simultaneous read and write DMAs.
- 5: Set the number of dwords per DMA. The allowed range is 256-4096 dwords.
- 6: Set the number of descriptors. The allowed range is 1-127 descriptors.
- 7: Set the Root Complex source address offset. This address must be a multiple of 4.
- 8: Run a loop DMA.
- 10: Exit.

Understanding Throughput Measurement

To measure throughput, the software driver takes two timestamps. Software takes the first timestamp shortly after you type the Start DMA command. Software takes the second timestamp after the DMA completes and returns the required completion status, EPLAST. If read DMA, write DMA, and simultaneous read and write DMAs are all enabled, the driver takes six timestamps to make the three time measurements. A sketch of this style of measurement appears at the end of this section.

This DMA performance was achieved using the following PCIe Gen3 x8 system:
- Motherboard: Asus P8Z77-V
- Memory: Corsair 8 GB [2 DIMM x 4 GB (2x2 GB)] @ 1600 MHz
- CPU: Intel(R) Core(TM) i5-3450 CPU @ 3.10 GHz (GenuineIntel, family 6, model 58, stepping 9, microcode 0xc), 3101.000 MHz, 6144 KB cache

Peak Throughput

SignalTap calculates the following throughputs for the Gen3 x8, 256-bit DMA running at 250 MHz for 142 cycles:
- 7.1 GB/s for TX Memory Writes only, 256-byte payload, with the cycles required for transmitting read requests ignored in the calculation
- 6.8 GB/s for TX Memory Writes, 256-byte payload, with the cycles required for transmitting read requests included in the calculation
- 7.0 GB/s for RX Read Completions

In the SignalTap run, the TxStValid signal never deasserts. The Avalon-MM application drives TLPs at the maximum rate with no idle cycles. In addition, the Hard IP for PCI Express continuously accepts the TX data. The Hard IP for PCI Express IP Core rarely deasserts the RxStValid signal. The deassertions of the RxStValid signal probably occur because the host system does not return completion data quite fast enough. However, the Hard IP for PCI Express IP Core and the Avalon-MM application logic are able to handle the completions continuously, as evidenced by the long stretches of back-to-back TLPs.

The average throughput for 2 MB transfers, including the descriptor overhead, is 6.4 GB/s in each direction, for a total bandwidth of 12.8 GB/s. The reported DMA throughput reflects typical applications. It includes the following overhead:
- The time required to fetch DMA descriptors at the beginning of the operation
- Periodic sub-optimal TLP address boundary allocation by the host PC
- Storing DMA status back to host memory
- Credit control updates from the host
- Infrequent Avalon-MM core fabric stalls when the waitrequest signal is asserted
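The sketch below illustrates the timestamp-based measurement described under Understanding Throughput Measurement: one timestamp when the DMA starts, a second after EPLAST reports completion, and throughput computed as bytes moved divided by elapsed time. The polling stub and the 2 MB example size are placeholders, not code from the reference design driver.

```c
#include <stdio.h>
#include <time.h>

/* Placeholder: in the reference design driver this polls the EPLAST
 * field of the descriptor table header until the last descriptor ID
 * appears.  Hardware access is not shown here. */
static void wait_for_eplast(unsigned last_descriptor_id)
{
    (void)last_descriptor_id;
}

/* Measure DMA throughput the way the driver does: two timestamps around
 * the transfer, throughput = bytes moved / elapsed seconds. */
static double measure_throughput_gbytes(unsigned last_id, double bytes_moved)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);   /* just after "Start DMA"      */
    wait_for_eplast(last_id);              /* DMA runs; wait for EPLAST   */
    clock_gettime(CLOCK_MONOTONIC, &t1);   /* after completion status     */

    double seconds = (t1.tv_sec - t0.tv_sec) +
                     (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return bytes_moved / seconds / 1e9;    /* GB/s */
}

int main(void)
{
    /* Example: a 2 MB transfer using the table maximum of 127 descriptors. */
    double gbytes_per_s = measure_throughput_gbytes(126, 2.0 * 1024 * 1024);
    printf("Measured throughput: %.2f GB/s\n", gbytes_per_s);
    return 0;
}
```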

TX Theoretical Bandwidth Calculation

Because the TX interface is 256 bits wide, transferring either a read or a write TLP header requires one cycle. Transferring 256 bytes of payload data requires 8 cycles (256/32), for a total of 9 cycles per TLP. The following equation calculates the theoretical maximum TX bandwidth:

8 data cycles / 9 cycles x 8 GB/s = 7.111 GB/s

Consequently, the header cycle reduces the maximum theoretical bandwidth by approximately 11%.

SignalTap Observation

A SignalTap capture of 142 cycles at 250 MHz was taken during a 2 MB transfer. Within the capture, 15 cycles carry Memory Write TLP headers and 7 cycles carry Memory Read Request TLPs. Disregarding the time required for the Read Requests in the overhead calculation results in the following equation:

(142 - 7 - 15) / (142 - 7) x 8 GB/s = 7.111 GB/s

Including the time required for the Read Requests in the overhead calculation results in the following equation:

(142 - 7 - 15) / 142 x 8 GB/s = 6.76 GB/s

RX Theoretical Bandwidth Calculation

For the RX interface, during the capture of the 2 MB transfer, the RxStValid signal is not asserted for 3 cycles, and the captured traffic includes 15 Read Completion headers for 256-byte completions. For each Read Completion, the maximum RX bandwidth is represented by the following equation:

8 data cycles / 9 cycles x 8 GB/s = 7.111 GB/s

SignalTap Observation

Counting the Completion headers and the cycles when the RxStValid signal is not asserted as overhead, the observed bandwidth is represented by the following equation:

(142 - 15 - 3) / 142 x 8 GB/s = 6.986 GB/s

Document Revision History

Date: August 2013 | Version: 1.0 | Changes: Initial release.