IBM Network Processor, Development Environment and LHCb Software
LHCb Readout Unit Internal Review, July 24th 2001
Niko Neufeld, CERN 1
Outline
- IBM NP4GS3 architecture
- A Readout Unit based on the NP4GS3
- Dataflow in the NP-based RU
- Sub-event building algorithms
- Software development environment
- Performance of sub-event building
- Calibration of simulation results with measurements on reference kit hardware 2
IBM NP4GS3
- 4 full-duplex Gigabit Ethernet MACs
- 16 processors with 2 hardware threads each @ 133 MHz: 2128 MIPS to handle up to 4.1 x 10^6 packets/s
- 128 kb on-chip input buffer, up to 128 MB DDR RAM output buffer
- 2 x switch interface (DASL) @ 4 Gb/s
- Embedded PPC 405
- In production since the beginning of this year (R1.1) 3
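As a rough consistency check (not from the slides themselves, and assuming one instruction per cycle per picoprocessor), the instruction budget per packet follows directly from these numbers:

```latex
% Back-of-the-envelope arithmetic from the figures quoted above
% (16 picoprocessors at 133 MHz, packet rate up to 4.1 x 10^6 /s):
\[
16 \times 133\,\mathrm{MHz} \times 1\,\tfrac{\mathrm{instr.}}{\mathrm{cycle}} = 2128\,\mathrm{MIPS},
\qquad
\frac{2128 \times 10^{6}\,\mathrm{instr./s}}{4.1\times10^{6}\,\mathrm{packets/s}} \approx 520\,\tfrac{\mathrm{instructions}}{\mathrm{packet}}.
\]
```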
Readout Unit based on NP4GS3
- 1 or 2 mezzanine cards, each containing: 1 Network Processor, all memory needed for the NP, connections to the external world (PCI bus, DASL switch bus, connections to the physical network layer), JTAG, power and clock
- PHY connectors
- L0 trigger-throttle output
- Power and clock generation
- LHCb standard ECS interface (CC-PC) with separate Ethernet connection
(Board block diagram) 4
Data flow in the NP4GS3 (block diagram: ingress path with access to frame data for ingress event building, egress path with access to frame data for egress event building, DASL switch interfaces, wrap path used when only a single NP is present) 5
Sub-event merging software
- Sub-event building is the main task of the software running on the NP4GS3 when used in a Readout Unit
- The two locations where frames can be manipulated (ingress and egress) suggest two different algorithms with different advantages and possible fields of application (the basic merging step is sketched below):
- Ingress event building for high frequencies up to 1.5 MHz and small frames (~64 bytes)
- Egress event building for large frames (up to 9000 bytes) or fragments spanning multiple frames 6
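The merging step itself is conceptually simple. The following minimal C sketch (hypothetical structures and names, not the actual NP4GS3 picocode) illustrates concatenating fragments with matching event numbers into one output frame:

```c
#include <stdint.h>
#include <string.h>

#define N_SOURCES 4        /* hypothetical: number of input links feeding this RU */
#define MAX_EVENT 9000     /* largest merged frame allowed (jumbo-frame sized)    */

/* Hypothetical fragment descriptor: one per input link, same event number. */
struct fragment {
    uint32_t event_id;     /* event number carried in the fragment header */
    uint16_t length;       /* payload length in bytes                     */
    const uint8_t *data;   /* pointer to the fragment payload             */
};

/* Merge N_SOURCES fragments of the same event into one output frame.
 * Returns the merged length, or 0 if the event numbers do not match or
 * the merged event would not fit. */
static size_t merge_subevent(const struct fragment frag[N_SOURCES],
                             uint8_t out[MAX_EVENT])
{
    size_t offset = sizeof(uint32_t);              /* reserve room for a minimal header */

    for (int i = 0; i < N_SOURCES; ++i) {
        if (frag[i].event_id != frag[0].event_id)  /* consistency check */
            return 0;
        if (offset + frag[i].length > MAX_EVENT)   /* overflow check    */
            return 0;
        memcpy(out + offset, frag[i].data, frag[i].length);
        offset += frag[i].length;
    }
    memcpy(out, &frag[0].event_id, sizeof(uint32_t)); /* write the header */
    return offset;
}
```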
Ingress Event Building
- On-chip memory (no wait cycles!)
- Memory buffers organised via descriptors
- Only 128 kb of memory
- Multiple frames are more difficult, because they have to be sent in order over the DASL
- Code is more streamlined, because copying is done on linear memory 7
Egress Event Building
- 64 MB of external buffer space (access to memory over a 128-bit wide bus); two copies of 64 MB each for increased throughput
- Only contested resource is the memory bus
- Multiple frames are handled easily
- For larger fragments only part of the data (ideally 1/8) needs to be read
- Memory is external (wait cycles!)
- Buffer data structure is awkward (chunks of 2 x 58 bytes; see the sketch below) 8
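To give a feel for why this buffer layout is awkward, here is a minimal C sketch (hypothetical structures, not the real reference-kit buffer descriptors) of copying a window of bytes out of a chain of twin 58-byte chunks into linear memory:

```c
#include <stdint.h>
#include <string.h>

#define CHUNK_HALF 58   /* each egress buffer element holds two 58-byte halves */

/* Hypothetical twin buffer element as described on the slide: 2 x 58 bytes
 * of data plus a link to the next element of the frame. */
struct twin_buffer {
    uint8_t data[2][CHUNK_HALF];
    struct twin_buffer *next;
};

/* Copy 'len' bytes starting at frame offset 'start' from a chain of twin
 * buffers into linear memory.  In the RU only the transport header and a
 * small part of the payload (ideally 1/8) has to be read this way. */
static void copy_from_chain(const struct twin_buffer *tb, size_t start,
                            uint8_t *dst, size_t len)
{
    size_t pos = 0;                          /* byte offset of current half    */
    while (tb != NULL && len > 0) {
        for (int half = 0; half < 2 && len > 0; ++half) {
            if (start < pos + CHUNK_HALF) {  /* this half overlaps the window  */
                size_t off = (start > pos) ? start - pos : 0;
                size_t n   = CHUNK_HALF - off;
                if (n > len) n = len;
                memcpy(dst, &tb->data[half][off], n);
                dst   += n;
                len   -= n;
                start += n;
            }
            pos += CHUNK_HALF;
        }
        tb = tb->next;
    }
}
```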
Software Development Environment
- Development software consists of assembler, debugger, simulator and profiler (the debugger can either run on the simulator or connect to real hardware via an Ethernet-to-JTAG interface (RISC Watch) attached to the NP4GS3)
- Rich set of documentation
- Many examples, such as complete routing software, are available 9
User interface of the simulator and remote debugger (screenshot: thread control, code view with breakpoints and single stepping, coprocessor registers, view of memory buffers, indication of simulator vs. real-hardware debugging and of the processor version) 10
Performance for 4:1 Egress Event-Building
(Plot: maximum allowed fragment rate [kHz] vs. average fragment size [bytes], showing the limit imposed by NP processing power, the limit imposed by a single Gb output link, and the range of possible Level-1 trigger rates.)
- Only 16 out of 32 threads simulated, hence the NP-power curve is highly pessimistic; it should increase by at least 50% for 32 threads
- At any frame size the performance of the RU is limited only by the output link speed of 125 MB/s 11
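The output-link curve follows from simple arithmetic; a hedged sketch of the relation is given below (the per-event header overhead h is an assumption, not a number from the slides):

```latex
% Output-link limit for n:1 event building, with n = 4 sources, average input
% fragment size s, per-event header overhead h (assumed small), and output
% bandwidth B_out = 125 MB/s (Gigabit Ethernet wire speed):
\[
R_{\mathrm{max}} \approx \frac{B_{\mathrm{out}}}{n\,s + h},
\qquad
\text{e.g. } s = 100\ \mathrm{B},\ h \ll n\,s:\quad
R_{\mathrm{max}} \approx \frac{125\ \mathrm{MB/s}}{400\ \mathrm{B}} \approx 310\ \mathrm{kHz}.
\]
```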
Performance for 3:1 Ingress Event-Building
(Plot: maximum acceptable L0 trigger rate [MHz] vs. average input fragment payload [bytes], showing the limit imposed by NP performance, the limit imposed by a single Gb output link, and the maximum Level-0 trigger rate.)
- Only 16 out of 32 threads simulated
- Overall performance limited by the single output link speed only
- NP performance sufficient for any fragment size 12
Detailed Profiler Information
(Plots for thread 10: cycles per dispatch, stall cycles and coprocessor stalls, in units of 7.5 ns cycles, broken down by cause (exec, coprocessor, data, instruction, bus, GPR) and by coprocessor (CLP, DS, TSE0, TSE1, CAB, ENQ, CHK, STR, POL, CNTR, SEM).)
- Details of stall (enforced NOP) cycles: they are assigned to the coprocessor that controls access to the resource in question (and is therefore causing the stall) 13
Reference Kit Hardware: our reference kit has 4 x 2-port Gigabit Ethernet I/O cards 14
Test Set-up
Tigon2 NIC features:
- Gigabit Ethernet network interface card (optical/copper)
- Fully programmable
- Baseline for final-stage event building ("Smart NICs")
- Up to 620 kHz fragment rate
- 1 µs resolution timer
Set-up: 4 x Tigon2 NIC 1000SX connected to the 4 x 1000SX full-duplex ports of the IBM NP4GS3 R1.1 reference kit; RISC Watch (JTAG via Ethernet) attached to the embedded PPC 405.
Measurement procedure (see the sketch below):
- Download code into the NP4GS3 via RISC Watch (JTAG)
- Send a special frame to the NP4GS3 to trigger a synchronization frame being sent to the Tigons
- The Tigons start sending fragments and add their internal time-stamp to each frame (1 µs resolution)
- The NP4GS3 processes the fragments and adds its internal timestamp (1 ms resolution)
- The Tigons receive the frames and calculate the elapsed time 15
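A minimal sketch of the timing arithmetic done on the Tigon side is shown below; variable names, the example numbers and the overhead constant are assumptions for illustration, not the actual Tigon firmware:

```c
#include <stdint.h>
#include <stdio.h>

#define TIGON_OVERHEAD_US 1.7   /* assumed intrinsic Tigon round-trip overhead per fragment */

/* Average time per fragment from the Tigon's 1 µs resolution timer: elapsed
 * time over the whole run divided by the number of fragments.  Because the
 * system is pipelined this measures throughput, not latency. */
static double us_per_fragment(uint32_t first_ts_us, uint32_t last_ts_us,
                              uint32_t n_fragments)
{
    return (double)(last_ts_us - first_ts_us) / (double)n_fragments;
}

int main(void)
{
    /* example: 1,000,000 fragments spanning 4.5 s of Tigon timer ticks */
    double t = us_per_fragment(0u, 4500000u, 1000000u);
    printf("%.1f us/fragment (%.1f us above the Tigon overhead)\n",
           t, t - TIGON_OVERHEAD_US);
    return 0;
}
```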
Calibrating the simulation by measurements
- Event building with a single thread enabled and 4 sources
- Event building with a single source and multiple threads (16) enabled
- With release 1.1 of the NP4GS3, multiple sources on multiple threads unfortunately cannot easily be run (no semaphore coprocessor) 16
Results from Measurements (Ingress Event-Building)
- Round-trip time (Tigon-out to Tigon-in): 1.7 µs per fragment (frame of 60 bytes)
- Simulation says that 600 ns/fragment are used in the NP4GS3; this is not really measurable with our setup

                       Measurement [µs/fragment]   Simulation [µs/fragment]
1 source, 1 thread     6.6 (4.9)                   4.9
4 sources, 1 thread    4.5 (2.8)                   3.2
1 source, 16 threads   1.7 (0.0)                   0.5

(Values in parentheses: measurement with the 1.7 µs intrinsic Tigon overhead subtracted.)
NOTE: Since the system is pipelined, times smaller than the intrinsic overhead of the Tigon (i.e. 1.7 µs) cannot be measured accurately!
We can trust the timing results obtained with the simulator. 17
Conclusion
- The software development environment makes the full power of this complex chip available to the software developer
- Two sub-event merging codes, for large fragments at rates of a few 100 kHz and for small fragments at rates of 1 MHz, have been developed and benchmarked using simulation
- The results for the more demanding of the two (ingress) have been verified using the IBM NP4GS3 reference kit hardware
- The simulation results show that the performance requirements on a readout unit are fully met by an NP4GS3-based implementation 18
Future Work
- With the upgrade of the NP4GS3 reference kit to version 2.0 of the processor, verify (again) the code for egress and ingress event building
- With more Tigon NICs and faster fragment-generation code, also test e.g. 7-to-1 multiplexing
- Develop and test a layer-2 switching application (much simpler than event building due to static routing tables; see the sketch below)
- Develop code for the communication between the Linux-operated embedded PPC 405 and the CC-PC 19
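The simplification brought by static routing tables can be illustrated with a minimal C sketch: no learning or aging is needed, so forwarding reduces to one lookup per frame. The table contents and sizes below are invented for illustration, not taken from the planned application.

```c
#include <stdint.h>
#include <string.h>

#define N_ENTRIES 4

/* Static layer-2 forwarding entry: destination MAC -> output port. */
struct l2_entry {
    uint8_t mac[6];   /* destination MAC address          */
    uint8_t port;     /* output port (0-3 on the NP4GS3)  */
};

static const struct l2_entry table[N_ENTRIES] = {
    { {0x00,0x10,0xA4,0x00,0x00,0x01}, 0 },
    { {0x00,0x10,0xA4,0x00,0x00,0x02}, 1 },
    { {0x00,0x10,0xA4,0x00,0x00,0x03}, 2 },
    { {0x00,0x10,0xA4,0x00,0x00,0x04}, 3 },
};

/* Return the output port for a destination MAC, or -1 to drop the frame. */
static int lookup_port(const uint8_t dst_mac[6])
{
    for (int i = 0; i < N_ENTRIES; ++i)
        if (memcmp(table[i].mac, dst_mac, 6) == 0)
            return table[i].port;
    return -1;
}
```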
NP4GS3 Architecture (block diagram: 128 kb on-chip memory, up to 64 MB external memory on a 128-bit wide bus, 4 Gb/s switch interface, accessible from PCI) 20
IBM NP4GS3 Embedded Processor Complex (EPC) block diagram 21
Performance for 4:1 Egress Event-Building
(Plot: fragment rate [MHz], normalised to 16 threads, and output bandwidth [MB/s] vs. number of active threads, for average payloads of 100, 200 and 500 bytes; the maximum wire speed at the output is indicated.)
- Throughput as a function of the number of active threads and the payload; the throughput is normalised to the one achieved with 16 threads for comparability
- Contention for resources such as locks and memory prevents linear scaling
- Limit of Gigabit Ethernet: 125 MB/s 22
Performance for 4:1 Ingress Event-Building
(Plot: fragment rate [MHz], normalised to 16 threads, and output bandwidth [MB/s] vs. number of active threads, for average payloads of 1, 30 and 100 bytes; a 36-byte transport header is always included; the maximum wire speed at the output is indicated.)
- Fragment rate and output bandwidth as a function of active threads for various amounts of payload
- 64 bytes = 1.9 MHz
- For all practical working points beyond wire speed 23