PowerPC on NetFPGA CSE 237B. Erik Rubow


NetFPGA
- PCI card + FPGA + 4 GbE ports
- FPGA (Virtex-II Pro) has 2 PowerPC hard cores
- Untapped resource within the NetFPGA community

Goals
- Evaluate performance of the on-chip embedded processor for packet processing
- Compare performance with the host CPU and the NetThreads soft processor project
- Evaluate the costs
- Provide easy access to this resource to the NetFPGA community

Related Work: NetThreads
- Implements 2 custom multi-threaded soft-core processors with shared memory and mutexes

My Design

Quick Comparison
NetThreads:
- 4x2 threads
- Separate input/output buffers
- Off-chip memory
- No register interface
- Only bitfile distributed (geared toward SW-only applications)
My Design:
- Single thread
- Single input/output buffer (zero copy)
- On-chip memory only (currently)
- Register master and slave
- Easy integration with NetFPGA HW modules

Easy Integration of PowerPC

Register Bus
- Masters insert requests with a source ID
- Then listen for and remove responses to their own requests
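The request/response flow on the register bus can be sketched as a small ring model. This is a minimal sketch under assumptions, not the NetFPGA implementation: the class names, the `done` flag, and the address ranges are all hypothetical and chosen only to show a master removing the responses that carry its own source ID.

```python
from dataclasses import dataclass


@dataclass
class Request:
    src_id: int          # which master inserted the request
    addr: int
    is_write: bool
    data: int = 0
    done: bool = False   # hypothetical flag set by the slave that serviced it


class Slave:
    """Hypothetical register block owning addresses [base, base + size)."""

    def __init__(self, base, size):
        self.base = base
        self.regs = [0] * size

    def handle(self, req):
        offset = req.addr - self.base
        if not req.done and 0 <= offset < len(self.regs):
            if req.is_write:
                self.regs[offset] = req.data
            else:
                req.data = self.regs[offset]
            req.done = True
        return req       # slaves always pass the request along


class Master:
    def __init__(self, src_id):
        self.src_id = src_id
        self.responses = []

    def handle(self, req):
        if req.src_id == self.src_id:
            self.responses.append(req)
            return None  # remove our own request from the ring
        return req       # someone else's request: pass it along


def issue(ring, master, req):
    """Insert req at the master's position and walk it once around the ring."""
    start = ring.index(master)
    for step in range(1, len(ring) + 1):
        req = ring[(start + step) % len(ring)].handle(req)
        if req is None:
            return master.responses.pop()
    raise RuntimeError("request was never removed from the ring")


host, ppc = Master(0), Master(1)
ring = [host, Slave(0x100, 4), ppc, Slave(0x200, 4)]
issue(ring, ppc, Request(ppc.src_id, 0x201, True, 0xABCD))
reply = issue(ring, host, Request(host.src_id, 0x201, False))
```

In this model a request that no slave claims still returns to its master, with `done` left unset.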

PPC Register Access
- Built into the instruction set: mtdcr, mfdcr
- But DCR address width is 10 bits; need 23 for the NetFPGA register bus
- Solution: perform two register accesses
  1) Write address to a special register
  2) Read/write to another special register
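The two-step access can be illustrated with a small software model. This is a sketch under assumptions: the DCR numbers `DCR_REG_ADDR` and `DCR_REG_DATA` are hypothetical, the `mtdcr`/`mfdcr` methods only imitate the PowerPC instructions of the same name, and a dictionary stands in for the NetFPGA register bus.

```python
DCR_ADDR_BITS = 10       # DCR address width, from the slide
REG_ADDR_BITS = 23       # NetFPGA register bus address width, from the slide
DCR_REG_ADDR = 0x000     # hypothetical DCR that latches the 23-bit address
DCR_REG_DATA = 0x001     # hypothetical DCR that triggers the bus access


class RegisterBridge:
    """Model of the hardware behind the two special DCRs."""

    def __init__(self):
        self.latched_addr = 0
        self.bus = {}    # stands in for the NetFPGA register bus

    def _check(self, dcrn):
        if not 0 <= dcrn < (1 << DCR_ADDR_BITS):
            raise ValueError("DCR number does not fit in 10 bits")

    def mtdcr(self, dcrn, value):
        self._check(dcrn)
        if dcrn == DCR_REG_ADDR:
            self.latched_addr = value & ((1 << REG_ADDR_BITS) - 1)
        elif dcrn == DCR_REG_DATA:
            self.bus[self.latched_addr] = value & 0xFFFFFFFF

    def mfdcr(self, dcrn):
        self._check(dcrn)
        if dcrn == DCR_REG_DATA:
            return self.bus.get(self.latched_addr, 0)
        if dcrn == DCR_REG_ADDR:
            return self.latched_addr
        return 0


def write_reg(bridge, addr, value):
    bridge.mtdcr(DCR_REG_ADDR, addr)     # step 1: write the address
    bridge.mtdcr(DCR_REG_DATA, value)    # step 2: write the data


def read_reg(bridge, addr):
    bridge.mtdcr(DCR_REG_ADDR, addr)     # step 1: write the address
    return bridge.mfdcr(DCR_REG_DATA)    # step 2: read the data


bridge = RegisterBridge()
write_reg(bridge, 0x400123, 0xDEADBEEF)
value = read_reg(bridge, 0x400123)
```

The address 0x400123 needs 23 bits, so it could not be reached with a single 10-bit DCR number.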

Packet Buffer Management
- 16KB of data memory is reserved for the buffer
- Memory is a dual-port block RAM
  - 32-bit interface to CPU
  - 128-bit interface to hardware
- Divided into fixed-size chunks to fit any packet
- At any time, only one of three parties has permission to access a particular packet buffer
  - CPU
  - Copy-in logic
  - Copy-out logic

Packet Buffer Management
- Indices to buffers are passed amongst them using 3 hardware FIFOs: in, out, free
- CPU has DCR interface to FIFOs
- Copy-in and copy-out logic have access to the buffer on alternate clock cycles
  - 64-bit datapath
  - 128-bit memory port
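The hand-off of buffer indices through the in, out, and free FIFOs can be sketched in software. This is a minimal model under assumptions: the 2048-byte chunk size and the side table of frame lengths are hypothetical, since the slides do not say how the chunk size or packet length is conveyed.

```python
from collections import deque

BUFFER_BYTES = 16 * 1024                  # from the slide
CHUNK_BYTES = 2048                        # assumption: fits a max-sized frame
NUM_CHUNKS = BUFFER_BYTES // CHUNK_BYTES  # 8 chunks under that assumption


class PacketBuffers:
    def __init__(self):
        self.mem = bytearray(BUFFER_BYTES)
        self.lengths = [0] * NUM_CHUNKS   # hypothetical length bookkeeping
        self.free = deque(range(NUM_CHUNKS))
        self.inq = deque()                # copy-in logic -> CPU
        self.outq = deque()               # CPU -> copy-out logic

    def copy_in(self, frame):
        """Copy-in logic: claim a free chunk, fill it, hand it to the CPU."""
        if not self.free or len(frame) > CHUNK_BYTES:
            return False
        idx = self.free.popleft()
        base = idx * CHUNK_BYTES
        self.mem[base:base + len(frame)] = frame
        self.lengths[idx] = len(frame)
        self.inq.append(idx)
        return True

    def copy_out(self):
        """Copy-out logic: drain one chunk and return it to the free FIFO."""
        if not self.outq:
            return None
        idx = self.outq.popleft()
        base = idx * CHUNK_BYTES
        frame = bytes(self.mem[base:base + self.lengths[idx]])
        self.free.append(idx)
        return frame


pool = PacketBuffers()
pool.copy_in(b"\x01\x02\x03")
idx = pool.inq.popleft()                  # CPU takes ownership of the chunk
pool.mem[idx * CHUNK_BYTES] = 0xFF        # CPU edits the packet in place
pool.outq.append(idx)                     # CPU hands the chunk to copy-out
sent = pool.copy_out()
```

Because only the index moves between FIFOs, the packet data is never copied while the CPU owns it, which is the zero-copy property mentioned in the comparison slide.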

Development Environment
- Did not use Xilinx EDK
  - No more Virtex-II Pro support
  - Hides too many details
  - Not free
- Simple boot code to initialize stack
- GNU toolchain
- Custom linker script to map instructions and data to specific memory regions
- .bmm file to track block RAM primitives during synthesis

Memory Challenges
- Data2mem not compatible with Coregen-generated RAMs
- Had to juggle signals to deal with
  - Byte writes
  - Endianness
- Block RAM initialization process is different for simulation and synthesis
- Information about memory layout is needed in multiple places (HW, SW, toolchains) and needs to be consistent

API
- read_reg(): reads from a register on the NetFPGA register bus
- write_reg(): writes to a register on the NetFPGA register bus
- get_pkt(): polls the packet-in FIFO via the DCR interface until an index is received
- send_pkt(): pushes the index onto the packet-out FIFO
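A software loop built on get_pkt() and send_pkt() might look like the sketch below. It is only an illustration: the `FifoDcr` class is a hypothetical stand-in for the DCR-mapped FIFOs, the `EMPTY` value is an assumption, and the poll count is bounded so that the sketch terminates, whereas the slide describes polling until an index arrives.

```python
from collections import deque

EMPTY = -1   # hypothetical value returned when the packet-in FIFO is empty


class FifoDcr:
    """Stand-in for the DCR-mapped packet-in and packet-out FIFOs."""

    def __init__(self):
        self.pkt_in = deque()
        self.pkt_out = deque()

    def read_in(self):
        return self.pkt_in.popleft() if self.pkt_in else EMPTY

    def write_out(self, index):
        self.pkt_out.append(index)


def get_pkt(dcr, max_polls=1000):
    """Poll the packet-in FIFO; give up after max_polls (sketch only)."""
    for _ in range(max_polls):
        index = dcr.read_in()
        if index != EMPTY:
            return index
    return None


def send_pkt(dcr, index):
    """Push a buffer index onto the packet-out FIFO."""
    dcr.write_out(index)


def forward(dcr, count):
    """Forward up to count packets unmodified; return how many were sent."""
    forwarded = 0
    while forwarded < count:
        index = get_pkt(dcr)
        if index is None:
            break
        send_pkt(dcr, index)
        forwarded += 1
    return forwarded


dcr = FifoDcr()
dcr.pkt_in.extend([3, 5])
sent_count = forward(dcr, 5)
```

Any per-packet processing would go between get_pkt() and send_pkt(), operating on the buffer that the returned index names.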

Host CPU Performance
- PCI limits bandwidth for packet transfers
  - iperf measures 186 Mbps with NetFPGA as NIC
- Packet transfer latency (to SW and back)
  - ~60 us kernel
  - ~120 us userspace
  - This is SW routing latency minus HW routing latency
  - Test performed under light load, minimum-sized frame
  - Userspace routing implemented by Click

NetThreads Performance
- Data from their paper on NetThreads
- Raw input/output buffer copy performance
  - Over 58K pkts/s, over 0.7 Gbps
  - But the test they used did not push the limits
- Application performance
  - UDHCP: ~600 pkts/s
  - Regex Classifier: ~1800 pkts/s
  - NAT: ~4500 pkts/s
  - Performance degradation due to synchronization issues

My Design Performance
- Best-case throughput (when CPU does nothing to the packet)
  - 3.6M min-sized pkts/s at 125MHz
  - 5.18M min-sized pkts/s at 250MHz
  - 325K max-sized pkts/s at 125MHz, with low CPU utilization
  - Tested in simulation
- What is line rate?
  - 4/((64+20)*8*10^-9) = 5.95M pkts/s
  - 4/((1500+20)*8*10^-9) = 329K pkts/s
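The line-rate figures can be checked with a few lines of arithmetic. The sketch assumes the 4 in the formula is the number of 1 Gbps ports and reads the extra 20 bytes as the Ethernet preamble (8 bytes) plus the inter-frame gap (12 bytes); the function and constant names are illustrative.

```python
PORTS = 4              # GbE ports on the NetFPGA card
LINK_BPS = 10**9       # 1 Gbps per port, i.e. 1 ns per bit
OVERHEAD_BYTES = 20    # assumed: 8-byte preamble + 12-byte inter-frame gap


def line_rate_pps(frame_bytes, ports=PORTS):
    """Aggregate packets per second when every port runs at line rate."""
    bits_on_wire = (frame_bytes + OVERHEAD_BYTES) * 8
    return ports * LINK_BPS / bits_on_wire


min_rate = line_rate_pps(64)      # about 5.95 million packets per second
max_rate = line_rate_pps(1500)    # about 329 thousand packets per second
```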

My Design Performance
- Best-case packet latency
  - 192ns at 250MHz
  - Ideal conditions: min-sized packet, empty queue
  - Latency measured from entrance to exit of wrapper module
- NetFPGA register access delay
  - 256ns at 250MHz
  - But this will depend on the length of the register bus chain

My Design Performance
- Impact of software on throughput
  - 250MHz CPU maintains line rate for max-sized frames while spending ~760 cycles on each packet
  - But only ~32 cycles for min-sized frames
- Unfortunately I don't have much in the way of software yet
  - I'm trying to get LwIP to work
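The per-packet cycle budget follows from dividing the CPU clock by the line-rate packet rate. The sketch below assumes four 1 Gbps ports and 20 bytes of per-frame wire overhead, and it ignores any fixed per-packet cost in hardware, so it gives only an upper bound on the cycles software can spend.

```python
def cycles_per_packet(clock_hz, frame_bytes, ports=4,
                      link_bps=10**9, overhead_bytes=20):
    """CPU cycles available per packet if all ports run at line rate."""
    packets_per_second = ports * link_bps / ((frame_bytes + overhead_bytes) * 8)
    return clock_hz / packets_per_second


budget_max = cycles_per_packet(250e6, 1500)   # max-sized frames
```

For max-sized frames this reproduces the ~760 cycles on the slide. For min-sized frames the same division gives a somewhat larger number than the slide's ~32, which is consistent with it being an upper bound rather than a measurement.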

FPGA Resources
             NIC    PPC    NetThreads
Block RAMs   7%     22%    71%
Slices       37%    44%    66%

Questions?