Workload Characterization and Performance for a Network Processor
1 Workload Characterization and Performance for a Network Processor. Mitsuhiro Miyazaki, Princeton Architecture Laboratory for Multimedia and Security (PALMS), May
2 Objectives: To evaluate an NP from the computer architect's point of view, rather than from the network-infrastructure point of view. To understand the effect of hardware multithreading on NPs. To guide the architectural design of future NPs.
3 Outline: Router Processing Characterization; Workload Characterization; Intel's IXP1200 Architecture; Simulation Setup; IXP1200 Evaluation (instruction mix; latency; executing, aborted, stalled and idle ratio; CPI; throughput); Other NPs; Conclusion and Future Work.
4 Router Processing Characterization. [Block diagram: Input Port -> Input Scheduler (IS) -> RFIFO -> Classifier & Filter (CF/FL) -> Forwarder (FW), which consults the RPB and FIB -> Queuing Assignment (QA) -> TQB -> Output Scheduler & Load Balancing (OS/LB) -> TFIFO -> Output Port, with a Packet Discard path along the way.]
5 Frequently occurring packets in the real Internet

Packet Size   | Description                                                                 | Packets Distribution | Internet Traffic
1) 40 Bytes   | TCP packets with an IP header but no payload (i.e., only a 20-byte IP header plus a 20-byte TCP header), typically sent at the start of a new TCP session. | 35%   | 3.5%
2) 576 Bytes  | The default IP Maximum Datagram Size (MDS) packets without fragmentation, including the default TCP Maximum Segment Size (MSS) 536-byte packets. | 11.5% | 16.5%
3) 1500 Bytes | Packets corresponding to the Maximum Transmission Unit (MTU) size of an Ethernet connection. | 10%   | 37%

Note: Based on data collected by the National Laboratory for Applied Network Research (NLANR) project located at the San Diego Supercomputer Center.
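The packet and traffic columns above are mutually consistent: a class's share of the bytes on the wire is its share of the packets weighted by its size. A quick sanity check (illustrative Python, not from the slides; anchoring the scale to the 1500-byte class is an assumption, since the three listed classes do not cover all traffic):

```python
# Sanity check: traffic share ~ packet share x packet size.
sizes = [40, 576, 1500]          # bytes
pkt_share = [0.35, 0.115, 0.10]  # fraction of all packets (other sizes omitted)
byte_weight = [s * p for s, p in zip(sizes, pkt_share)]

# Scale so the 1500-byte class carries 37% of total traffic, as stated.
scale = 0.37 / byte_weight[2]
traffic_share = [round(w * scale, 3) for w in byte_weight]
print(traffic_share)  # [0.035, 0.163, 0.37], matching the 3.5% / 16.5% / 37% column
```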
6 Workloads of fixed-size packets
1) 64 Bytes: the minimum-size Ethernet packet, consisting of a 14-byte Ethernet header, 20-byte IP header, 26-byte payload, and 4-byte Ethernet trailer (FCS); expected to be used for the TCP handshake.
2) 594 Bytes: an Ethernet packet including a 14-byte Ethernet header, 20-byte IP header, 556-byte payload (assuming a 20-byte TCP header plus the 536-byte MSS), and 4-byte Ethernet trailer (FCS).
3) 1518 Bytes: the maximum-size Ethernet packet, consisting of a 14-byte Ethernet header, 20-byte IP header, 1480-byte payload, and 4-byte Ethernet trailer (FCS).
Note: The workloads use Ethernet packets because the simulation assumes a router with 16 x 100 Mbps Ethernet ports.
7 Workload of mixture packets

Packet Size (Bytes) | Proportion of Total Traffic | Load
64   | 50% (6 parts)    | 7.881%
594  | 41.67% (5 parts) | 60.96%
1518 | 8.33% (1 part)   | 31.16%

Note: The average size of packets is 406 bytes.
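The mixture's average size and per-class load shares follow directly from the 6:5:1 part counts; a short check (illustrative Python, not part of the original deck):

```python
# The 6:5:1 mixture: average size and per-class share of the byte load.
parts = {64: 6, 594: 5, 1518: 1}             # packet size (bytes) -> parts per 12 packets
total_pkts = sum(parts.values())
total_bytes = sum(size * n for size, n in parts.items())

avg_size = total_bytes / total_pkts           # 406.0 bytes, as the note states
load_share = {size: size * n / total_bytes for size, n in parts.items()}
print(avg_size, {s: round(v, 4) for s, v in load_share.items()})
```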
8 IXP1200 Architecture
- Intel StrongARM SA-1 core: 16 KB I-cache, 8 KB D-cache, 512-byte mini-Dcache, write buffer, read buffer
- JTAG; PCI unit (32-bit bus); UART, 4 timers, GPIO, RTC
- SDRAM unit (64-bit bus); SRAM unit (32-bit bus)
- FBI unit: scratchpad memory (4 KB), hash unit, IX bus interface (64-bit bus)
- Six Microengines (Microengine 1 through Microengine 6)
Notes: 32-bit data bus; 32-bit ARM system bus.
9 Microengine Pipelining. Note: Context switching is enabled by 4 PCs, 128 GPRs, 64 SDRAM transfer registers, 64 SRAM transfer registers, and other CSRs.
10 Hardware Multithreading. Multithreading keeps the Microengine execution pipeline active without numerous stalled cycles. [Timeline: Thread 0 runs until it stalls, then Thread 1 runs until it stalls, then Thread 2, then Thread 3, each in turn.] Note: Thread stalls are caused by memory accesses.
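The latency-hiding effect can be sketched with a toy utilization model (an illustration with assumed cycle counts, not IXP1200-measured values): each thread computes for C cycles and then blocks on memory for M cycles, so N interleaved threads keep the pipeline busy for at most N*C out of every C+M cycles.

```python
# Toy model: pipeline utilization with N threads, each alternating
# `compute_cycles` of work with `mem_latency` cycles of memory stall.
def utilization(n_threads, compute_cycles, mem_latency):
    return min(1.0, n_threads * compute_cycles / (compute_cycles + mem_latency))

# One thread hides little of a 60-cycle memory access; four threads hide it all.
print(utilization(1, 20, 60))  # 0.25
print(utilization(4, 20, 60))  # 1.0
```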
11 Memory Access Flow
12 Branch and Context Switch Instructions
Class 3: br_bclr and br_bset; br=byte and br!=byte; jump; rtn; br_!signal; br_inp_state
Class 2: br=0; br!=0; br>0; br>=0; br<0; br<=0; br=cout; br!=cout
Class 1: br; br=ctx; br!=ctx; ctx_arb; csr; r_fifo_rd; t_fifo_wr; scratch; sdram; sram; hash1_48; hash2_48; hash3_48; hash1_64; hash2_64; hash3_64
Note: Blue-colored instructions (ctx_arb and the reference instructions sdram, sram, csr, scratch, r_fifo_rd, t_fifo_wr, and the hash instructions) indicate context switch instructions.
13 Branch pipeline example with Class 3 Instruction
14 Branch pipeline example with Class 2 Instruction Case 1 Case 2
15 Branch/Context switch pipeline example with Class 1 Instruction
16 Solutions for branch penalties: deferred branch instructions; guess branch instructions; setting the condition code earlier.
17 Deferred branch Instruction
18 Guess Branch Instruction
19 Combination of Guess and Deferred Branch
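The benefit of deferred branches can be illustrated with simple cycle accounting (the penalty and slot counts below are assumptions for illustration, not IXP1200 specifications): a taken branch aborts the already-fetched instructions behind it, and each deferred slot filled with useful work removes one of those aborted cycles.

```python
# Illustrative accounting of aborted cycles due to taken branches.
def aborted_cycles(branches_taken, penalty, deferred_slots_filled):
    # Each filled deferred slot converts one aborted cycle into useful work.
    per_branch = max(0, penalty - deferred_slots_filled)
    return branches_taken * per_branch

print(aborted_cycles(1000, 2, 0))  # 2000 aborted cycles
print(aborted_cycles(1000, 2, 1))  # 1000: one defer slot filled per branch
print(aborted_cycles(1000, 2, 2))  # 0: penalty fully hidden
```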
20 Simulation Setup
- Workbench: GUI interface to all Microengine tools (microcode assembler, microcode linker)
- Transactor: debug and simulation engine with the IXP1200 architectural model and memory
- A Verilog model of an IX bus device (i.e., a MAC device)
- Reference program (l2l3fwd16)
21 Simulation Image. [Diagram: the IXP1200 (six Microengines, FBI unit, SRAM unit, SDRAM unit) with SRAM and SDRAM attached, connected over 32-bit IX bus links to two IXF440 MAC devices of 8 ports each, giving 16 ports of 100 Mbps full duplex.]
22 Thread Assignment & Simulation Conditions
- Receive threads are assigned to Microengines 0-3
- Transmit threads are assigned to Microengines 4-5; one thread per Microengine in 4-5 works as an output scheduler
- Operating frequencies: the Microengines run at 232 MHz, the IX bus transfers packets at 104 MHz, and the SRAM and SDRAM buses transfer data at 116 MHz
- Each simulation forwards 3000 packets
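Under these conditions the worst-case offered load can be estimated as follows (a back-of-the-envelope sketch, not from the slides; the 20 bytes of preamble/SFD and inter-frame gap per frame are standard Ethernet assumptions):

```python
# Worst-case offered load: minimum-size frames on all 16 ports.
ports, port_rate = 16, 100e6                 # 16 x 100 Mbps
wire_bytes = 64 + 8 + 12                     # frame + preamble/SFD + inter-frame gap
pps_per_port = port_rate / (8 * wire_bytes)
total_mpps = ports * pps_per_port / 1e6
print(round(pps_per_port), round(total_mpps, 2))  # 148810 pps/port, 2.38 Mpps total
```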
23 Instruction Mix for Receive Processing

Packet Type | Arithmetic/Rotate/Shift | Branch/Jump | Reference | Local Register | Miscellaneous
Mixture | 31.9% | 39.8% | 7.3%  | 15.2% | 5.8%
1518B   | 30.3% | 40.8% | 7.2%  | 16.4% | 5.3%
594B    | 32.5% | 37.8% | 10.0% | 14.2% | 5.6%
64B     | 40.8% | 28.0% | 7.6%  | 16.6% | 6.9%
24 Instruction Mix for Transmit Processing

Packet Type | Arithmetic/Rotate/Shift | Branch/Jump | Reference | Local Register | Miscellaneous
Mixture | 50.7% | 31.0% | 8.5%  | 8.7% | 1.1%
1518B   | 50.9% | 31.1% | 8.5%  | 8.6% | 0.9%
594B    | 51.3% | 30.7% | 8.2%  | 8.6% | 1.2%
64B     | 48.2% | 30.7% | 10.6% | 8.2% | 2.4%
25 Instruction Mix for Overall Processing

Packet Type | Arithmetic/Rotate/Shift | Branch/Jump | Reference | Local Register | Miscellaneous
Mixture | 39.2% | 36.4% | 6.9% | 12.7% | 4.9%
1518B   | 38.4% | 37.0% | 6.5% | 13.4% | 4.7%
594B    | 39.8% | 35.1% | 6.6% | 12.0% | 6.6%
64B     | 43.4% | 29.0% | 8.6% | 11.7% | 7.4%
26 SDRAM Latency. [CDF plot: cumulative percentage (up to 100%) of SDRAM access latency in cycles, one curve per Microengine 0-3.]
27 SRAM Latency (unlocked). [CDF plot: cumulative percentage of unlocked SRAM access latency in cycles, one curve per Microengine 0-5.]
28 Execution, Aborted, Stalled and Idle Ratio on 64-byte packets. [Stacked bar chart: executing, aborted, stalled, and idle cycle ratios (0-100%) for Microengines 0-5.]
29 Execution, Aborted, Stalled and Idle Ratio on 594-byte packets. [Stacked bar chart: executing, aborted, stalled, and idle cycle ratios (0-100%) for Microengines 0-5.]
30 Execution, Aborted, Stalled and Idle Ratio on 1518-byte packets. [Stacked bar chart: executing, aborted, stalled, and idle cycle ratios (0-100%) for Microengines 0-5.]
31 Execution, Aborted, Stalled and Idle Ratio on Mixture packets. [Stacked bar chart: executing, aborted, stalled, and idle cycle ratios (0-100%) for Microengines 0-5.]
32 Cycles per Instruction (CPI). [Bar chart: CPI for each Microengine (uEngine 0-5) under the 64B, 594B, 1518B, and Mixture packet workloads.]
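CPI on a multithreaded Microengine can be read off the cycle breakdown of the preceding slides: only executing cycles retire instructions, so aborted, stalled, and idle cycles push CPI above 1. A minimal sketch with assumed (not measured) cycle counts:

```python
# CPI from a cycle breakdown, assuming one instruction retires per executing cycle.
def cpi(executing, aborted, stalled, idle):
    total = executing + aborted + stalled + idle
    return total / executing

# Example: 60% executing, 10% aborted, 20% stalled, 10% idle -> CPI ~ 1.67
print(round(cpi(executing=60, aborted=10, stalled=20, idle=10), 2))
```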
33 Throughput (bounded). [Bar chart: simulated forwarding rate (Mpps) versus the ideal rate and the OC-24 (CRC-16) rate for the 64-byte, 594-byte, 1518-byte, and Mixture workloads.]
Note: OC-24 is higher than the simulated rate because of the difference in protocol overhead:
- Ethernet protocol overhead: 38 bytes per packet (82.6% overhead for a 46-byte IP packet); protocol header and trailer (18 bytes) + IFG (12 bytes) + preamble/SFD (8 bytes) = 38 bytes
- OC-24 POS overhead: 7 bytes per packet (15.2% overhead for a 46-byte IP packet)
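The overhead percentages in the note can be reproduced directly (a small check using only the figures given above):

```python
# Per-packet overhead as a percentage of a 46-byte IP packet.
ip_bytes = 46
eth_overhead = 18 + 12 + 8   # header/trailer + IFG + preamble/SFD = 38 bytes
pos_overhead = 7             # OC-24 POS, per the note

print(round(eth_overhead / ip_bytes * 100, 1))  # 82.6
print(round(pos_overhead / ip_bytes * 100, 1))  # 15.2
```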
34 Throughput (unbounded). [Bar chart: simulated rate (Mpps) for the Mixture, 1518-byte, 594-byte, and 64-byte workloads against 1.244 Gb/s Ethernet (OC-24 class) and 2.488 Gb/s Ethernet (OC-48 class) line rates.]
Note: These throughputs don't include the 12-byte IFG overhead.
35 Features of Other NPs
Lexra's NetVortex
- 32-bit MIPS-I instruction set plus 18 extended instructions for context control and bit-field operations
- Supports up to 8 contexts per processor; each context includes 32 GPRs, its own PC, and a status register
- Uses the delay slot of a memory reference for context switching (e.g., LW.CSW reg, addr)
- Performs in a similar way to the IXP1200
Motorola's C-5
- A subset of the MIPS-I instruction set (excluding multiply, divide, floating point, and Coprocessor Zero (CP0))
- Provides its own special-purpose CP0 instructions for context switching (e.g., MTC0 $1, $3)
- 16 Channel Processor RISC Cores (CPRCs), each supporting up to 4 contexts and 32 GPRs
IBM's PowerNP
- 16 picoprocessors executing operation codes, each supporting 2 contexts; 4 threads perform context switching within a cluster
- 4 opcode categories: 1) ALU opcodes, 2) control opcodes, 3) data-movement opcodes, 4) coprocessor-execution opcodes (supporting context switching)
- Context switching occurs when the picoprocessor is waiting for a shared resource (e.g., waiting for one of the coprocessors to complete an operation, a memory access, etc.)
36 Conclusion and Future Work
- Hardware multithreading can hide large latencies effectively, but another issue emerges: the aborted cycles caused by branches and context switches are not small
- Some form of dynamic hardware branch prediction or speculation may be necessary to reduce these penalties in future NPs, but the cost must be considered
- The IXP1200 achieves OC-24-class router processing, but this is not enough for OC-48-class router processing
37 Backup Slide
38 Instruction Categories
Arithmetic, Rotate, and Shift Instructions
- alu: perform an ALU operation
- alu_shf: perform an ALU and shift operation
- dbl_shf: concatenate two longwords, shift the result, and save a longword
Branch and Jump Instructions
- br, br=0, br!=0, br>0, br>=0, br<0, br<=0, br=cout, br!=cout: branch on condition code
- br_bset, br_bclr: branch on bit set or bit clear
- br=byte, br!=byte: branch on byte equal or not equal
- br=ctx, br!=ctx: branch on current context
- br_inp_state: branch on event state (e.g., sram done)
- br_!signal: branch if signal deasserted
- jump: jump to label
- rtn: return from a branch or a jump
Reference Instructions
- csr: CSR reference
- fast_wr: write immediate data to thd_done CSRs
- local_csr_rd, local_csr_wr: read and write CSRs
- r_fifo_rd: read the receive FIFO
- pci_dma: issue a request to the PCI unit
- scratch: scratchpad reference
- sdram: SDRAM reference
- sram: SRAM reference
- t_fifo_wr: write to the transmit FIFO
Local Register Instructions
- find_bset, find_bset_with_mask: determine the position number of the first bit set in an arbitrary 16-bit field of a register
- immed: load immediate word and sign-extend or zero-fill with shift
- immed_b0, immed_b1, immed_b2, immed_b3: load immediate byte to a field
- immed_w0, immed_w1: load immediate word to a field
- ld_field, ld_field_w_clr: load byte(s) into specified field(s)
- load_addr: load instruction address
- load_bset_result1, load_bset_result2: load the result of a find_bset or find_bset_with_mask instruction
Miscellaneous Instructions
- ctx_arb: perform context swap and wake on event
- nop: perform no operation
- hash1_48, hash2_48, hash3_48: perform 48-bit hash
- hash1_64, hash2_64, hash3_64: perform 64-bit hash
39 SRAM Latency (locked). [CDF plot: cumulative percentage of locked SRAM access latency in cycles, one curve per Microengine 0-5.]
40 FBI Architecture. [Block diagram: the FBI unit connects the AMBA (core) command bus and the Microengine command bus to the IX bus interface. It contains the TFIFO and RFIFO (16 elements of 10 quadwords each), a 1K x 32 scratchpad, the hash unit, CSRs, and push and pull engines with arbiters and 8-entry pull, hash, and push command queues; data moves between Microengine read/write transfer registers, SRAM, and SDRAM. The IX bus side comprises the ready bus sequencer with the ready bus, transmit and receive state machines, and the IX bus arbiter on the 64-bit IX bus. fast_wr requests bypass the queues.]
41 Ready Bus and Ready Flags
42 Theoretical IP Throughput (packets per second)

Media | 64-byte (46-byte IP) | 594-byte (576-byte IP) | 1518-byte (1500-byte IP) | Mixture (avg 406-byte; avg 388-byte IP)
100 Mbps Ethernet   | 148,810    | 20,358    | 8,127   | 29,343
Gigabit Ethernet    | 1,488,095  | 203,583   | 81,274  | 293,427
10 Gigabit Ethernet | 14,880,952 | 2,035,830 | 812,744 | 2,934,272
OC-3 POS CRC-16     | 348,491    | 31,681    | 12,256  | 46,759
OC-12 POS CRC-16    | 1,412,830  | 128,439   | 49,688  | 189,570
OC-24 POS CRC-16    | 2,825,660  | 256,878   | 99,376  | 379,139
OC-48 POS CRC-16    | 5,651,321  | 513,757   | 198,752 | 758,278
OC-192 POS CRC-16   | 22,605,283 | 2,055,026 | 795,010 | 3,033,114
OC-3 POS CRC-32     | 335,818    | 31,573    | 12,240  | 46,524
OC-12 POS CRC-32    | 1,361,455  | 128,000   | 49,622  | 188,615
OC-24 POS CRC-32    | 2,722,909  | 256,000   | 99,245  | 377,229
OC-48 POS CRC-32    | 5,445,818  | 512,000   | 198,489 | 754,458
OC-192 POS CRC-32   | 21,783,273 | 2,048,000 | 793,956 | 3,017,834
ATM OC-3            | 174,245    | 26,807    | 10,890  | 38,721
ATM OC-12           | 706,415    | 108,679   | 44,151  | 156,981
ATM OC-24           | 1,412,830  | 217,358   | 88,302  | 313,962
ATM OC-48           | 2,825,660  | 434,716   | 176,604 | 627,925
ATM OC-192          | 11,302,642 | 1,738,856 | 706,415 | 2,511,698
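The Ethernet rows of the table follow one formula: packets per second = line rate / (8 x (IP packet size + per-packet overhead)). A minimal sketch for the Ethernet rows (the 38-byte overhead is the figure from the earlier throughput note; the POS and ATM rows use their own overheads and SONET payload rates):

```python
# Theoretical packet rate given line rate and per-packet byte overhead.
def pps(line_rate_bps, ip_bytes, overhead_bytes):
    return line_rate_bps / (8 * (ip_bytes + overhead_bytes))

eth_overhead = 38  # Ethernet framing + IFG + preamble/SFD, per the throughput note
print(round(pps(100e6, 46, eth_overhead)))    # 148810, the 100 Mbps Ethernet row
print(round(pps(1e9, 46, eth_overhead)))      # 1488095, the Gigabit Ethernet row
print(round(pps(100e6, 1500, eth_overhead)))  # 8127
```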
43 NetVortex Extended Instruction Set
Context-Control Instructions
- MYCX: read my context
- POSTCX: post event to a context
- CSW: context switch
- LW.CSW: load word with context switch
- LT.CSW: load twinword* with context switch
- WD: write descriptor to device
- WD.CSW: write descriptor to device with context switch
- WDLW.CSW: write descriptor to device, load word with context switch
- WDLT.CSW: write descriptor to device, load twinword with context switch
Bit-Field Instructions
- SETI: set subfield to ones
- CLRI: clear subfield to zeroes
- EXTIV: extract subfield and prepare for insertion
- INSV: insert extracted subfield
- ACS2: dual 16-bit ones-complement add for checksum
Cross-Context Access Instructions
- MFCXG: move from a context general-purpose register
- MTCXG: move to a context general-purpose register
- MFCXC: move from a context-control register
- MTCXC: move to a context-control register
44 NetVortex Context Switch Mechanism
Thread 1 program:
  I1(T1), I2(T1): LW.CSW (reg, addr), I3(T1): delay-slot instruction, I4(T1): next instruction, I5(T1): ...
  -> context switch to Thread 2
Thread 2 program:
  I1(T2), I2(T2): LW.CSW (reg, addr), I3(T2): delay-slot instruction, I4(T2): next instruction, I5(T2): ...
  -> context switch to the next available thread
General-purpose register file: Thread Context 1 (r0 - r31), Thread Context 2 (r0 - r31)
Context registers after the switch: Thread1 CXPC = I4(T1), Thread1 CXSTATUS = Wait; Thread2 CXPC = PC, Thread2 CXSTATUS = Active; PC = I1(T2)
45 PowerNP Context Switch Example
IF Reduction_OR(mask16(i) = coprocessor.Busy(i)) THEN
  PC <= stall
ELSE
  PC <= PC + 1
END IF;
IF p = 1 THEN
  PriorityOwner(other thread) <= TRUE
ELSE
  PriorityOwner(other thread) <= PriorityOwner(other thread)
END IF;
VISIT: Course Code : MCS-042 Course Title : Data Communication and Computer Network Assignment Number : MCA (4)/042/Assign/2014-15 Maximum Marks : 100 Weightage : 25% Last Dates for Submission : 15 th
More informationInstruction Level Parallelism. Appendix C and Chapter 3, HP5e
Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationIBM POWER8 100 GigE Adapter Best Practices
Introduction IBM POWER8 100 GigE Adapter Best Practices With higher network speeds in new network adapters, achieving peak performance requires careful tuning of the adapters and workloads using them.
More informationASSEMBLY LANGUAGE MACHINE ORGANIZATION
ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction
More informationI/O Systems. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)
I/O Systems Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) I/O Systems 1393/9/15 1 / 57 Motivation Amir H. Payberah (Tehran
More informationSummary of MAC protocols
Summary of MAC protocols What do you do with a shared media? Channel Partitioning, by time, frequency or code Time Division, Code Division, Frequency Division Random partitioning (dynamic) ALOHA, S-ALOHA,
More informationCommunications and Computer Engineering II: Lecturer : Tsuyoshi Isshiki
Communications and Computer Engineering II: Microprocessor 2: Processor Micro-Architecture Lecturer : Tsuyoshi Isshiki Dept. Communications and Computer Engineering, Tokyo Institute of Technology isshiki@ict.e.titech.ac.jp
More informationCPU Structure and Function. Chapter 12, William Stallings Computer Organization and Architecture 7 th Edition
CPU Structure and Function Chapter 12, William Stallings Computer Organization and Architecture 7 th Edition CPU must: CPU Function Fetch instructions Interpret/decode instructions Fetch data Process data
More informationChapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)
Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3) Hardware Speculation and Precise
More informationCopyright 2016 Xilinx
Zynq Architecture Zynq Vivado 2015.4 Version This material exempt per Department of Commerce license exception TSU Objectives After completing this module, you will be able to: Identify the basic building
More informationEECS 122: Introduction to Computer Networks Switch and Router Architectures. Today s Lecture
EECS : Introduction to Computer Networks Switch and Router Architectures Computer Science Division Department of Electrical Engineering and Computer Sciences University of California, Berkeley Berkeley,
More informationECE4110 Internetwork Programming. Introduction and Overview
ECE4110 Internetwork Programming Introduction and Overview 1 EXAMPLE GENERAL NETWORK ALGORITHM Listen to wire Are signals detected Detect a preamble Yes Read Destination Address No data carrying or noise?
More informationDynamic Control Hazard Avoidance
Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>
More informationTCP/IP Performance ITL
TCP/IP Performance ITL Protocol Overview E-Mail HTTP (WWW) Remote Login File Transfer TCP UDP IP ICMP ARP RARP (Auxiliary Services) Ethernet, X.25, HDLC etc. ATM 4/30/2002 Hans Kruse & Shawn Ostermann,
More informationAn Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki
An Ultra High Performance Scalable DSP Family for Multimedia Hot Chips 17 August 2005 Stanford, CA Erik Machnicki Media Processing Challenges Increasing performance requirements Need for flexibility &
More informationJim Keller. Digital Equipment Corp. Hudson MA
Jim Keller Digital Equipment Corp. Hudson MA ! Performance - SPECint95 100 50 21264 30 21164 10 1995 1996 1997 1998 1999 2000 2001 CMOS 5 0.5um CMOS 6 0.35um CMOS 7 0.25um "## Continued Performance Leadership
More informationCS425 Computer Systems Architecture
CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls
More informationWhat is Pipelining? Time per instruction on unpipelined machine Number of pipe stages
What is Pipelining? Is a key implementation techniques used to make fast CPUs Is an implementation techniques whereby multiple instructions are overlapped in execution It takes advantage of parallelism
More informationThomas Polzer Institut für Technische Informatik
Thomas Polzer tpolzer@ecs.tuwien.ac.at Institut für Technische Informatik Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =
More informationPacketExpert PDF Report Details
PacketExpert PDF Report Details July 2013 GL Communications Inc. 818 West Diamond Avenue - Third Floor Gaithersburg, MD 20878 Phone: 301-670-4784 Fax: 301-670-9187 Web page: http://www.gl.com/ E-mail:
More informationATM-DB Firmware Specification E. Hazen Updated January 4, 2007
ATM-DB Firmware Specification E. Hazen Updated January 4, 2007 This document describes the firmware operation of the Ethernet Daughterboard for the ATM for Super- K (ATM-DB). The daughterboard is controlled
More informationDepartment of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri
Department of Computer and IT Engineering University of Kurdistan Computer Architecture Pipelining By: Dr. Alireza Abdollahpouri Pipelined MIPS processor Any instruction set can be implemented in many
More informationCS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07
CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07 Objectives ---------- 1. To introduce the basic concept of CPU speedup 2. To explain how data and branch hazards arise as
More informationThe Tofu Interconnect 2
The Tofu Interconnect 2 Yuichiro Ajima, Tomohiro Inoue, Shinya Hiramoto, Shun Ando, Masahiro Maeda, Takahide Yoshikawa, Koji Hosoe, and Toshiyuki Shimizu Fujitsu Limited Introduction Tofu interconnect
More informationPractice Problems (Con t) The ALU performs operation x and puts the result in the RR The ALU operand Register B is loaded with the contents of Rx
Microprogram Control Practice Problems (Con t) The following microinstructions are supported by each CW in the CS: RR ALU opx RA Rx RB Rx RB IR(adr) Rx RR Rx MDR MDR RR MDR Rx MAR IR(adr) MAR Rx PC IR(adr)
More information1 Hazards COMP2611 Fall 2015 Pipelined Processor
1 Hazards Dependences in Programs 2 Data dependence Example: lw $1, 200($2) add $3, $4, $1 add can t do ID (i.e., read register $1) until lw updates $1 Control dependence Example: bne $1, $2, target add
More informationComputer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM
Computer Architecture Computer Science & Engineering Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware
More informationAdvanced Network Design
Advanced Network Design Organization Whoami, Book, Wikipedia www.cs.uchicago.edu/~nugent/cspp54015 Grading Homework/project: 60% Midterm: 15% Final: 20% Class participation: 5% Interdisciplinary Course
More informationINT 1011 TCP Offload Engine (Full Offload)
INT 1011 TCP Offload Engine (Full Offload) Product brief, features and benefits summary Provides lowest Latency and highest bandwidth. Highly customizable hardware IP block. Easily portable to ASIC flow,
More informationData Link Layer. Our goals: understand principles behind data link layer services: instantiation and implementation of various link layer technologies
Data Link Layer Our goals: understand principles behind data link layer services: link layer addressing instantiation and implementation of various link layer technologies 1 Outline Introduction and services
More informationDesign Space Exploration for Memory Subsystems of VLIW Architectures
E University of Paderborn Dr.-Ing. Mario Porrmann Design Space Exploration for Memory Subsystems of VLIW Architectures Thorsten Jungeblut 1, Gregor Sievers, Mario Porrmann 1, Ulrich Rückert 2 1 System
More informationLECTURE 3: THE PROCESSOR
LECTURE 3: THE PROCESSOR Abridged version of Patterson & Hennessy (2013):Ch.4 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU
More information