PowerPC on NetFPGA CSE 237B. Erik Rubow

NetFPGA
- PCI card + FPGA + 4 GbE ports
- FPGA (Virtex II Pro) has 2 PowerPC hard cores
- Untapped resource within NetFPGA community

Goals
- Evaluate performance of the on-chip embedded processor for packet processing
- Compare performance with the host CPU and the NetThreads soft processor project
- Evaluate the costs
- Provide easy access to this resource to the NetFPGA community

Related Work: NetThreads
- Implements 2 custom multi-threaded soft-core processors with shared memory and mutexes

My Design

Quick Comparison
- NetThreads
  - 4x2 threads
  - Separate input/output buffers
  - Off-chip memory
  - No register interface
  - Only bitfile distributed (geared toward SW-only applications)
- My Design
  - Single thread
  - Single input/output buffer (zero copy)
  - On-chip memory only (currently)
  - Register master and slave
  - Easy integration with NetFPGA HW modules

Easy Integration of PowerPC

PPC Register Access
- Built into instruction set
  - mtdcr
  - mfdcr
- But DCR address width is 10 bits
  - Need 23 for the NetFPGA register bus
- Solution: perform two register accesses
  1) Write address to special register
  2) Read/write to another special register
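
The two-step access can be sketched in C roughly as follows. This is only an illustration, not the actual implementation: the DCR numbers (DCR_NF_ADDR, DCR_NF_DATA) and the dcr_write()/dcr_read() helpers are hypothetical names, and the hardware is replaced by a small host-side simulation so the sketch runs anywhere. On the PowerPC the two helpers would presumably wrap the mtdcr and mfdcr instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical DCR numbers for the two "special registers"; the real
 * values are not given in the slides. Both fit in the 10-bit DCR space. */
#define DCR_NF_ADDR  0x100u
#define DCR_NF_DATA  0x101u
#define NF_ADDR_MASK 0x7FFFFFu   /* 23-bit NetFPGA register address */

/* Host-side stand-in for the hardware (assumption for illustration). */
static uint32_t sim_latched_addr;
static uint32_t sim_bus[256];    /* simulated registers, keyed by low address bits */

static void dcr_write(uint32_t dcrn, uint32_t val)
{
    assert(dcrn < 1024u);        /* DCR addresses are only 10 bits wide */
    if (dcrn == DCR_NF_ADDR)
        sim_latched_addr = val & NF_ADDR_MASK;   /* step 1: latch the address */
    else if (dcrn == DCR_NF_DATA)
        sim_bus[sim_latched_addr & 0xFFu] = val; /* step 2: write the data */
}

static uint32_t dcr_read(uint32_t dcrn)
{
    assert(dcrn < 1024u);
    if (dcrn == DCR_NF_DATA)
        return sim_bus[sim_latched_addr & 0xFFu]; /* step 2: read the data */
    return 0;
}

/* Read a register on the (simulated) NetFPGA register bus. */
uint32_t read_reg(uint32_t addr)
{
    dcr_write(DCR_NF_ADDR, addr);
    return dcr_read(DCR_NF_DATA);
}

/* Write a register on the (simulated) NetFPGA register bus. */
void write_reg(uint32_t addr, uint32_t val)
{
    dcr_write(DCR_NF_ADDR, addr);
    dcr_write(DCR_NF_DATA, val);
}
```

Each register access costs two DCR operations, which is the price of reaching a 23-bit address space through a 10-bit one.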

Packet Buffer Management
- 16KB of data memory is reserved for buffer
- Memory is a dual port block RAM
  - 32 bit interface to CPU
  - 128 bit interface to hardware
- Divided into fixed size chunks to fit any packet
- At any time, only one of three parties has permission to access a particular packet buffer
  - CPU
  - Copy in logic
  - Copy out logic

Packet Buffer Management
- Indices to buffers are passed amongst them using 3 hardware FIFOs
  - In, out, free
- CPU has DCR interface to FIFOs
- Copy in and copy out logic have access to buffer on alternate clock cycles
  - 64 bit datapath
  - 128 bit memory port
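
A small software model may help show how buffer ownership moves through the three FIFOs. It is a sketch under assumptions: the 2KB chunk size (giving 8 buffers in the 16KB region) is a guess, the rotation order free -> in -> out -> free is inferred from the FIFO names, and all function names are hypothetical. The real FIFOs are hardware, not C structures.

```c
#include <assert.h>
#include <stdint.h>

#define BUF_BYTES   (16 * 1024)   /* from the slides */
#define CHUNK_BYTES 2048          /* assumed chunk size, large enough for any frame */
#define NUM_BUFS    (BUF_BYTES / CHUNK_BYTES)

/* A tiny ring FIFO of buffer indices. */
typedef struct {
    uint8_t  slot[NUM_BUFS];
    unsigned head, count;
} idx_fifo;

static int fifo_push(idx_fifo *f, uint8_t idx)
{
    if (f->count == NUM_BUFS) return -1;           /* full */
    f->slot[(f->head + f->count) % NUM_BUFS] = idx;
    f->count++;
    return 0;
}

static int fifo_pop(idx_fifo *f)
{
    if (f->count == 0) return -1;                  /* empty */
    int idx = f->slot[f->head];
    f->head = (f->head + 1) % NUM_BUFS;
    f->count--;
    return idx;
}

static idx_fifo free_q, in_q, out_q;

/* At reset every buffer is unowned, so every index sits in the free FIFO. */
void buffers_init(void)
{
    free_q = in_q = out_q = (idx_fifo){ {0}, 0, 0 };
    for (int i = 0; i < NUM_BUFS; i++) fifo_push(&free_q, (uint8_t)i);
}

/* Copy-in logic: claim a free buffer, fill it, hand it to the CPU. */
int copy_in_step(void)
{
    int idx = fifo_pop(&free_q);
    if (idx >= 0) fifo_push(&in_q, (uint8_t)idx);
    return idx;
}

/* CPU: take a received buffer, process it in place, hand it to copy-out. */
int cpu_step(void)
{
    int idx = fifo_pop(&in_q);
    if (idx >= 0) fifo_push(&out_q, (uint8_t)idx);
    return idx;
}

/* Copy-out logic: transmit the buffer, then return it to the free pool. */
int copy_out_step(void)
{
    int idx = fifo_pop(&out_q);
    if (idx >= 0) fifo_push(&free_q, (uint8_t)idx);
    return idx;
}

/* Byte offset of a buffer inside the 16KB region. */
uint32_t buf_offset(int idx) { return (uint32_t)idx * CHUNK_BYTES; }
```

Because an index lives in exactly one FIFO (or with exactly one party) at a time, no locking is needed and the packet data itself is never copied between parties.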

Development Environment
- Did not use Xilinx EDK
  - No more Virtex II Pro support
  - Hides too many details
  - Not free
- Simple boot code to initialize stack
- GNU toolchain
- Custom linker script to map instructions and data to specific memory regions
- .bmm file to track block RAM primitives during synthesis

Memory Challenges
- Data2mem not compatible with Coregen generated RAMs
- Had to juggle signals to deal with
  - Byte writes
  - Endianness
- Block RAM initialization process different for simulation and synthesis
- Information about memory layout is needed in multiple places (HW, SW, toolchains) and needs to be consistent

API
- read_reg(): Reads from register on NetFPGA register bus
- write_reg(): Writes to register on NetFPGA register bus
- get_pkt(): Polls packet in FIFO via DCR interface until index is received
- send_pkt(): Pushes the index on the packet out FIFO
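
A forwarding loop built on get_pkt() and send_pkt() might look roughly like this. The slides do not give the function signatures, so the ones below are assumptions, as are the DCR numbers, the FIFO_EMPTY sentinel, and the forward_one() helper. The DCR interface is simulated on the host: the packet in FIFO reports empty for a few polls and then yields buffer index 5.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical DCR numbers and "nothing available" sentinel. */
#define DCR_PKT_IN  0x110u
#define DCR_PKT_OUT 0x111u
#define FIFO_EMPTY  0xFFFFFFFFu

/* Host-side simulation state (assumption for illustration). */
static unsigned sim_polls_until_ready = 3;
static uint32_t sim_next_index = 5;
static uint32_t sim_last_sent  = FIFO_EMPTY;

static uint32_t dcr_read(uint32_t dcrn)
{
    assert(dcrn == DCR_PKT_IN);
    if (sim_polls_until_ready > 0) {       /* FIFO still empty */
        sim_polls_until_ready--;
        return FIFO_EMPTY;
    }
    return sim_next_index;                 /* a packet has arrived */
}

static void dcr_write(uint32_t dcrn, uint32_t val)
{
    assert(dcrn == DCR_PKT_OUT);
    sim_last_sent = val;                   /* record what was queued for output */
}

/* Poll the packet in FIFO until a buffer index is received. */
uint32_t get_pkt(void)
{
    uint32_t idx;
    do {
        idx = dcr_read(DCR_PKT_IN);
    } while (idx == FIFO_EMPTY);
    return idx;
}

/* Push a buffer index onto the packet out FIFO. */
void send_pkt(uint32_t idx)
{
    dcr_write(DCR_PKT_OUT, idx);
}

/* Forward one packet unmodified: the zero-copy best case. */
void forward_one(void)
{
    uint32_t idx = get_pkt();
    /* Packet processing would go here, operating on the buffer in place. */
    send_pkt(idx);
}
```

Only the index crosses the DCR interface; the packet stays in the shared block RAM the whole time.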

Host CPU Performance
- PCI limits bandwidth for packet transfers
  - iperf measures 186 Mbps with NetFPGA as NIC
- Packet transfer latency (to SW and back)
  - ~60 us kernel
  - ~120 us userspace
  - This is SW routing latency minus HW routing latency
  - Test performed under light load, minimum sized frame
  - Userspace routing implemented by Click

NetThreads Performance
- Data from their paper on NetThreads
- Raw input/output buffer copy performance
  - Over 58K pkts/s, over 0.7 Gbps
  - But the test they used did not push the limits
- Application performance
  - UDHCP: ~600 pkts/s
  - Regex Classifier: ~1800 pkts/s
  - NAT: ~4500 pkts/s
  - Performance degradation due to synchronization issues

My Design Performance
- Best case throughput
  - When CPU does nothing to packet
  - 3.6M min sized pkts/s at 125MHz
  - 5.18M min sized pkts/s at 250MHz
  - 325K max sized pkts/s at 125MHz
    - With low CPU utilization
  - Tested in simulation
- What is line rate?
  - 4 * 10^9 / ((64+20) * 8) = 5.95M pkts/s
  - 4 * 10^9 / ((1500+20) * 8) = 329K pkts/s
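
The line rate arithmetic can be checked with a few lines of C. The function and macro names are made up for illustration. The 20 extra bytes per frame are the Ethernet preamble and start-of-frame delimiter (8 bytes) plus the minimum inter-frame gap (12 bytes), and the total is taken over all 4 ports at 1 Gb/s each.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PORTS           4
#define PORT_BPS            1000000000ULL  /* 1 Gb/s per port */
#define WIRE_OVERHEAD_BYTES 20             /* preamble + SFD (8) + inter-frame gap (12) */

/* Aggregate line rate in packets per second for a given frame size. */
uint64_t line_rate_pps(uint32_t frame_bytes)
{
    uint64_t bits_per_pkt = (uint64_t)(frame_bytes + WIRE_OVERHEAD_BYTES) * 8;
    return (NUM_PORTS * PORT_BPS) / bits_per_pkt;
}
```

With integer division, line_rate_pps(64) gives 5952380 and line_rate_pps(1500) gives 328947, matching the rounded figures on the slide.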

My Design Performance
- Best case packet latency
  - 192ns at 250MHz
  - Ideal conditions: min sized packet, empty queue
  - Latency measured from entrance to exit of wrapper module
- NetFPGA register access delay
  - 256ns at 250MHz
  - But this will depend on length of register bus chain

My Design Performance
- Impact of software on throughput
  - 250MHz CPU maintains line rate for max sized frames while spending ~760 cycles on each packet
  - But only ~32 cycles for min sized frames
- Unfortunately I don't have much in the way of software yet
  - I'm trying to get lwIP to work

FPGA Resources
             NIC   PPC   NetThreads
Block RAMs    7%   22%   71%
Slices       37%   44%   66%

Questions?