PowerPC on NetFPGA CSE 237B Erik Rubow
NetFPGA
- PCI card + FPGA + 4 GbE ports
- FPGA (Virtex II Pro) has 2 PowerPC hard cores
- Untapped resource within NetFPGA community
Goals
- Evaluate performance of on-chip embedded processor for packet processing
- Compare performance with host CPU and NetThreads soft processor project
- Evaluate the costs
- Provide easy access to this resource to NetFPGA community
Related Work: NetThreads
- Implements 2 custom multithreaded soft-core processors with shared memory and mutexes
My Design
Quick Comparison
NetThreads | My Design
4x2 threads | Single thread
Separate input/output buffers | Single input/output buffer (zero copy)
Off-chip memory | On-chip memory only (currently)
No register interface | Register master and slave
Only bitfile distributed (geared toward SW-only applications) | Easy integration with NetFPGA HW modules
Easy Integration of PowerPC
Register Bus
- Masters insert requests with a source id
- Then listen and remove responses to their own requests
PPC Register Access
- Built into instruction set: mtdcr, mfdcr
- But DCR address width is 10 bits; the NetFPGA register bus needs 23
- Solution: perform two register accesses
  1) Write address to special register
  2) Read/write to another special register
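
The two-step access can be sketched in C as below. This is a minimal host-side model, not the actual implementation: the DCR numbers `DCR_REG_ADDR` and `DCR_REG_DATA`, the helpers `dcr_write()` and `dcr_read()`, and the tiny `reg_space` array are all assumptions made for illustration. On the real PowerPC each helper would be a single `mtdcr` or `mfdcr` instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical DCR numbers (must fit the 10-bit DCR address space). */
enum { DCR_REG_ADDR = 0x100, DCR_REG_DATA = 0x101 };

#define REG_ADDR_BITS   23u  /* address width of the NetFPGA register bus */
#define REG_SPACE_WORDS 16u  /* tiny stand-in for the real register space */

/* Host-side stand-in for the hardware behind the two special registers. */
static uint32_t latched_addr;
static uint32_t reg_space[REG_SPACE_WORDS];

/* On the PowerPC this would be one mtdcr instruction. */
static void dcr_write(unsigned dcrn, uint32_t value)
{
    if (dcrn == DCR_REG_ADDR)
        latched_addr = value & ((1u << REG_ADDR_BITS) - 1u);
    else if (dcrn == DCR_REG_DATA)
        reg_space[latched_addr % REG_SPACE_WORDS] = value;
}

/* On the PowerPC this would be one mfdcr instruction. */
static uint32_t dcr_read(unsigned dcrn)
{
    if (dcrn == DCR_REG_DATA)
        return reg_space[latched_addr % REG_SPACE_WORDS];
    return latched_addr;
}

/* Step 1: latch the 23-bit address. Step 2: write the data register. */
void write_reg(uint32_t addr, uint32_t value)
{
    assert(addr < (1u << REG_ADDR_BITS));
    dcr_write(DCR_REG_ADDR, addr);
    dcr_write(DCR_REG_DATA, value);
}

/* Step 1: latch the 23-bit address. Step 2: read the data register. */
uint32_t read_reg(uint32_t addr)
{
    assert(addr < (1u << REG_ADDR_BITS));
    dcr_write(DCR_REG_ADDR, addr);
    return dcr_read(DCR_REG_DATA);
}
```

In this model, `write_reg(3, 0xDEADBEEF)` followed by `read_reg(3)` returns the value written. Each register bus access costs two DCR operations, which is the price of squeezing a 23-bit address through a 10-bit DCR space.
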
Packet Buffer Management
- 16KB of data memory is reserved for buffer
- Memory is a dual port block RAM
  - 32 bit interface to CPU
  - 128 bit interface to hardware
- Divided into fixed size chunks to fit any packet
- At any time, only one of three parties has permission to access a particular packet buffer
  - CPU
  - Copy in logic
  - Copy out logic
Packet Buffer Management
- Indices to buffers are passed amongst them using 3 hardware FIFOs
  - In, out, free
- CPU has DCR interface to FIFOs
- Copy in and copy out logic have access to buffer on alternate clock cycles
  - 64 bit datapath
  - 128 bit memory port
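
The hand-off of buffer indices through the three FIFOs can be modelled in a few lines of C. This is only a behavioural sketch: the buffer count `NUM_BUFS` (which assumes 2 KB chunks in the 16KB region), the `idx_fifo` type, and the `*_step()` functions are illustrative names, not part of the design. The point it shows is that an index lives in exactly one FIFO at a time, so exactly one party owns each buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_BUFS 8u  /* assumption: 16 KB split into 2 KB chunks */

/* Small ring FIFO of buffer indices, one instance per hardware FIFO. */
typedef struct {
    uint8_t  slot[NUM_BUFS];
    unsigned head;
    unsigned count;
} idx_fifo;

static idx_fifo fifo_free, fifo_in, fifo_out;

static bool fifo_push(idx_fifo *f, uint8_t idx)
{
    assert(idx < NUM_BUFS);
    if (f->count == NUM_BUFS)
        return false;
    f->slot[(f->head + f->count) % NUM_BUFS] = idx;
    f->count++;
    return true;
}

static bool fifo_pop(idx_fifo *f, uint8_t *idx)
{
    if (f->count == 0)
        return false;
    *idx = f->slot[f->head];
    f->head = (f->head + 1) % NUM_BUFS;
    f->count--;
    return true;
}

/* Move one index from src to dst; buffer ownership moves with it. */
static bool hand_over(idx_fifo *src, idx_fifo *dst)
{
    uint8_t idx;
    if (dst->count == NUM_BUFS || !fifo_pop(src, &idx))
        return false;
    return fifo_push(dst, idx);
}

/* At reset every buffer is free. */
void buffers_init(void)
{
    fifo_free = (idx_fifo){0};
    fifo_in   = (idx_fifo){0};
    fifo_out  = (idx_fifo){0};
    for (uint8_t i = 0; i < NUM_BUFS; i++)
        fifo_push(&fifo_free, i);
}

/* Copy in logic: claim a free buffer, fill it, pass it to the CPU. */
bool copy_in_step(void)  { return hand_over(&fifo_free, &fifo_in); }

/* CPU: take a received buffer, process it in place, queue it for output. */
bool cpu_step(void)      { return hand_over(&fifo_in, &fifo_out); }

/* Copy out logic: transmit the buffer, then return it to the free pool. */
bool copy_out_step(void) { return hand_over(&fifo_out, &fifo_free); }
```

Calling `buffers_init()` and then `copy_in_step()`, `cpu_step()`, `copy_out_step()` in order walks one buffer through the full cycle and back into the free FIFO.
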
Development Environment
- Did not use Xilinx EDK
  - No more Virtex II Pro support
  - Hides too many details
  - Not free
- Simple boot code to initialize stack
- GNU toolchain
- Custom linker script to map instructions and data to specific memory regions
- .bmm file to track block RAM primitives during synthesis
Memory Challenges
- Data2mem not compatible with Coregen generated RAMs
- Had to juggle signals to deal with
  - Byte writes
  - Endianness
- Block RAM initialization process different for simulation and synthesis
- Information about memory layout is needed in multiple places (HW, SW, toolchains) and needs to be consistent
API
- read_reg(): Reads from register on NetFPGA register bus
- write_reg(): Writes to register on NetFPGA register bus
- get_pkt(): Polls packet-in FIFO via DCR interface until index is received
- send_pkt(): Pushes the index on the packet-out FIFO
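
A sketch of how the packet half of this API might be used for zero-copy processing. Only the names `get_pkt()` and `send_pkt()` come from the slide. Everything else is assumed for illustration: the 2 KB chunk size, the `FIFO_EMPTY` value returned when no index is waiting, the helpers `pkt_data()`, `sim_receive()` and `bounce_one_packet()`, and the idea that the Ethernet header sits at offset 0 of the buffer. The FIFOs are simulated with arrays so the code runs on a host.

```c
#include <assert.h>
#include <stdint.h>

#define PKT_BUF_BYTES 2048u        /* assumed chunk size */
#define NUM_BUFS      8u           /* 16 KB / 2 KB under that assumption */
#define FIFO_EMPTY    0xFFFFFFFFu  /* assumed "no index waiting" value */

/* Stand-ins for the 16 KB block RAM and the DCR-mapped FIFOs. */
static uint8_t  pkt_mem[NUM_BUFS * PKT_BUF_BYTES];
static uint32_t in_q[NUM_BUFS], out_q[NUM_BUFS];
static unsigned in_head, in_count, out_count;

/* Simulation hook: plays the role of the copy in logic. */
void sim_receive(uint32_t idx)
{
    if (in_count < NUM_BUFS) {
        in_q[(in_head + in_count) % NUM_BUFS] = idx;
        in_count++;
    }
}

/* One DCR read of the packet-in FIFO (mfdcr on the real CPU). */
static uint32_t dcr_read_pkt_in(void)
{
    if (in_count == 0)
        return FIFO_EMPTY;
    uint32_t idx = in_q[in_head];
    in_head = (in_head + 1) % NUM_BUFS;
    in_count--;
    return idx;
}

/* Poll the packet-in FIFO until an index is received. */
uint32_t get_pkt(void)
{
    uint32_t idx;
    do {
        idx = dcr_read_pkt_in();
    } while (idx == FIFO_EMPTY);
    return idx;
}

/* Push the index on the packet-out FIFO (mtdcr on the real CPU). */
void send_pkt(uint32_t idx)
{
    if (out_count < NUM_BUFS)
        out_q[out_count++] = idx;
}

/* Map a buffer index to its chunk in packet memory. */
uint8_t *pkt_data(uint32_t idx)
{
    assert(idx < NUM_BUFS);
    return &pkt_mem[idx * PKT_BUF_BYTES];
}

/* Swap destination and source MAC in place, then send the same buffer. */
void bounce_one_packet(void)
{
    uint32_t idx = get_pkt();
    uint8_t *p = pkt_data(idx);
    for (unsigned i = 0; i < 6; i++) {
        uint8_t tmp = p[i];
        p[i] = p[6 + i];
        p[6 + i] = tmp;
    }
    send_pkt(idx);
}
```

The packet is never copied by software: the CPU edits it in the shared buffer and passes only the index along. Note that `get_pkt()` spins until an index arrives, so in this host model `sim_receive()` must be called first.
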
Host CPU Performance
- PCI limits bandwidth for packet transfers
  - iperf measures 186 Mbps with NetFPGA as NIC
- Packet transfer latency (to SW and back)
  - ~60 us kernel
  - ~120 us userspace
  - This is SW routing latency minus HW routing latency
  - Test performed under light load, minimum sized frame
  - Userspace routing implemented by Click
NetThreads Performance
- Data from their paper on NetThreads
- Raw input/output buffer copy performance
  - Over 58K pkts/s, over 0.7 Gbps
  - But the test they used did not push the limits
- Application performance
  - UDHCP: ~600 pkts/s
  - Regex Classifier: ~1800 pkts/s
  - NAT: ~4500 pkts/s
  - Performance degradation due to synchronization issues
My Design Performance
- Best case throughput (when CPU does nothing to packet)
  - 3.6M min sized pkts/s at 125MHz
  - 5.18M min sized pkts/s at 250MHz
  - 325K max sized pkts/s at 125MHz
  - With low CPU utilization
  - Tested in simulation
- What is line rate?
  - 4 / ((64+20) * 8 * 10^-9) = 5.9M pkts/s
  - 4 / ((1500+20) * 8 * 10^-9) = 329K pkts/s
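
The line rate arithmetic can be checked with a small C helper. The function name `line_rate_pps()` and the constants are illustrative. The 20 bytes added to each frame are taken here to be the usual Ethernet per-frame wire overhead (8 bytes of preamble and start delimiter plus a 12 byte inter-frame gap), and 10^-9 s is the bit time of a 1 Gb/s port.

```c
#include <assert.h>

#define GBE_BIT_TIME_S      1e-9  /* one bit time at 1 Gb/s */
#define WIRE_OVERHEAD_BYTES 20u   /* preamble + SFD (8) + inter-frame gap (12) */

/* Aggregate packets per second when `ports` GbE ports all run at line rate. */
double line_rate_pps(unsigned ports, unsigned frame_bytes)
{
    assert(ports > 0 && frame_bytes > 0);
    return ports / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8u * GBE_BIT_TIME_S);
}
```

`line_rate_pps(4, 64)` evaluates to about 5.95M pkts/s and `line_rate_pps(4, 1500)` to about 329K pkts/s, matching the two figures on the slide.
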
My Design Performance
- Best case packet latency: 192ns at 250MHz
  - Ideal conditions: min sized packet, empty queue
  - Latency measured from entrance to exit of wrapper module
- NetFPGA register access delay: 256ns at 250MHz
  - But this will depend on length of register bus chain
My Design Performance
- Impact of software on throughput
  - 250MHz CPU maintains line rate for max sized frames while spending ~760 cycles on each packet
  - But only ~32 cycles for min sized frames
- Unfortunately I don't have much in the way of software yet
  - I'm trying to get LwIP to work
FPGA Resources
           | NIC | PPC | NetThreads
Block RAMs | 7%  | 22% | 71%
Slices     | 37% | 44% | 66%
Questions?