Workload Characterization and Performance for a Network Processor
1 Workload Characterization and Performance for a Network Processor. Mitsuhiro Miyazaki, Princeton Architecture Laboratory for Multimedia and Security (PALMS), May
2 Objectives: To evaluate an NP from the computer architect's point of view, rather than from the network-infrastructure point of view. To understand the effect of hardware multithreading on NPs. To guide the architectural design of future NPs.
3 Outline: Router Processing Characterization; Workload Characterization; Intel's IXP1200 Architecture; Simulation Setup; IXP1200 Evaluation (instruction mix; latency; executing, aborted, stalled and idle ratio; CPI; throughput); Other NPs; Conclusion and Future Work.
4 Router Processing Characterization. [Block diagram: Input Port -> Input Scheduler (IS) -> RFIFO -> Classifier & Filter (CF/FL) -> Forwarder (FW), which consults the RPB and FIB -> Queuing Assignment (QA) -> TQB -> Output Scheduler & Load Balancing (OS/LB) -> TFIFO -> Output Port, with a Packet Discard path along the way.]
5 Frequently occurring packets in the real Internet

Packet Size   | Description                                                                 | Packets Distribution | Internet Traffic
1) 40 Bytes   | TCP packets with an IP header but no payload (i.e., only a 20-byte IP header plus a 20-byte TCP header), typically sent at the start of a new TCP session. | 35%   | 3.5%
2) 576 Bytes  | The default IP Maximum Datagram Size (MDS) packets without fragmentation, including the default TCP Maximum Segment Size (MSS) 536-byte packets. | 11.5% | 16.5%
3) 1500 Bytes | Packets corresponding to the Maximum Transmission Unit (MTU) size of an Ethernet connection. | 10%   | 37%

Note: Based on data collected by the National Laboratory for Applied Network Research (NLANR) project located at the San Diego Supercomputer Center.
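The packet and traffic columns above are mutually consistent: a class's share of the bytes on the wire is its share of the packets weighted by its size. A quick sanity check (illustrative Python, not from the slides; anchoring the scale to the 1500-byte class is an assumption, since the three listed classes do not cover all traffic):

```python
# Sanity check: traffic share ~ packet share x packet size.
sizes = [40, 576, 1500]          # bytes
pkt_share = [0.35, 0.115, 0.10]  # fraction of all packets (other sizes omitted)
byte_weight = [s * p for s, p in zip(sizes, pkt_share)]

# Scale so the 1500-byte class carries 37% of total traffic, as stated.
scale = 0.37 / byte_weight[2]
traffic_share = [round(w * scale, 3) for w in byte_weight]
print(traffic_share)  # [0.035, 0.163, 0.37], matching the 3.5% / 16.5% / 37% column
```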
6 Workloads of fixed-size packets
1) 64 Bytes: the minimum-size Ethernet packet, consisting of a 14-byte Ethernet header, 20-byte IP header, 26-byte payload, and 4-byte Ethernet trailer (FCS); expected to be used for the TCP handshake.
2) 594 Bytes: an Ethernet packet including a 14-byte Ethernet header, 20-byte IP header, 556-byte payload (assuming a 20-byte TCP header plus the 536-byte MSS), and 4-byte Ethernet trailer (FCS).
3) 1518 Bytes: the maximum-size Ethernet packet, consisting of a 14-byte Ethernet header, 20-byte IP header, 1480-byte payload, and 4-byte Ethernet trailer (FCS).
Note: The workloads use Ethernet packets because the simulation assumes a router with 16 x 100 Mbps Ethernet ports.
7 Workload of mixture packets

Packet Size (Bytes) | Proportion of Total Traffic | Load
64   | 50% (6 parts)    | 7.881%
594  | 41.67% (5 parts) | 60.96%
1518 | 8.33% (1 part)   | 31.16%

Note: The average size of packets is 406 bytes.
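The mixture's average size and per-class load shares follow directly from the 6:5:1 part counts; a short check (illustrative Python, not part of the original deck):

```python
# The 6:5:1 mixture: average size and per-class share of the byte load.
parts = {64: 6, 594: 5, 1518: 1}             # packet size (bytes) -> parts per 12 packets
total_pkts = sum(parts.values())
total_bytes = sum(size * n for size, n in parts.items())

avg_size = total_bytes / total_pkts           # 406.0 bytes, as the note states
load_share = {size: size * n / total_bytes for size, n in parts.items()}
print(avg_size, {s: round(v, 4) for s, v in load_share.items()})
```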
8 IXP1200 Architecture
- Intel StrongARM SA-1 core: 16 KB I-cache, 8 KB D-cache, 512-byte mini-Dcache, write buffer, read buffer
- JTAG; PCI unit (32-bit bus); UART, 4 timers, GPIO, RTC
- SDRAM unit (64-bit bus); SRAM unit (32-bit bus)
- FBI unit: scratchpad memory (4 KB), hash unit, IX bus interface (64-bit bus)
- Six Microengines (Microengine 1 through Microengine 6)
Notes: 32-bit data bus; 32-bit ARM system bus.
9 Microengine Pipelining. Note: Context switching is enabled by 4 PCs, 128 GPRs, 64 SDRAM transfer registers, 64 SRAM transfer registers, and other CSRs.
10 Hardware Multithreading. Multithreading keeps the Microengine execution pipeline active without numerous stalled cycles. [Timeline: Thread 0 runs until it stalls, then Thread 1 runs until it stalls, then Thread 2, then Thread 3, each in turn.] Note: Thread stalls are caused by memory accesses.
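The latency-hiding effect can be sketched with a toy utilization model (an illustration with assumed cycle counts, not IXP1200-measured values): each thread computes for C cycles and then blocks on memory for M cycles, so N interleaved threads keep the pipeline busy for at most N*C out of every C+M cycles.

```python
# Toy model: pipeline utilization with N threads, each alternating
# `compute_cycles` of work with `mem_latency` cycles of memory stall.
def utilization(n_threads, compute_cycles, mem_latency):
    return min(1.0, n_threads * compute_cycles / (compute_cycles + mem_latency))

# One thread hides little of a 60-cycle memory access; four threads hide it all.
print(utilization(1, 20, 60))  # 0.25
print(utilization(4, 20, 60))  # 1.0
```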
11 Memory Access Flow
12 Branch and Context Switch Instructions
Class 3: br_bclr and br_bset; br=byte and br!=byte; jump; rtn; br_!signal; br_inp_state
Class 2: br=0; br!=0; br>0; br>=0; br<0; br<=0; br=cout; br!=cout
Class 1: br; br=ctx; br!=ctx; ctx_arb; csr; r_fifo_rd; t_fifo_wr; scratch; sdram; sram; hash1_48; hash2_48; hash3_48; hash1_64; hash2_64; hash3_64
Note: Blue-colored instructions (ctx_arb and the reference instructions sdram, sram, csr, scratch, r_fifo_rd, t_fifo_wr, and the hash instructions) indicate context switch instructions.
13 Branch pipeline example with Class 3 Instruction
14 Branch pipeline example with Class 2 Instruction Case 1 Case 2
15 Branch/Context switch pipeline example with Class 1 Instruction
16 Solutions for branch penalties: deferred branch instructions; guess branch instructions; setting the condition code earlier.
17 Deferred branch Instruction
18 Guess Branch Instruction
19 Combination of Guess and Deferred Branch
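The benefit of deferred branches can be illustrated with simple cycle accounting (the penalty and slot counts below are assumptions for illustration, not IXP1200 specifications): a taken branch aborts the already-fetched instructions behind it, and each deferred slot filled with useful work removes one of those aborted cycles.

```python
# Illustrative accounting of aborted cycles due to taken branches.
def aborted_cycles(branches_taken, penalty, deferred_slots_filled):
    # Each filled deferred slot converts one aborted cycle into useful work.
    per_branch = max(0, penalty - deferred_slots_filled)
    return branches_taken * per_branch

print(aborted_cycles(1000, 2, 0))  # 2000 aborted cycles
print(aborted_cycles(1000, 2, 1))  # 1000: one defer slot filled per branch
print(aborted_cycles(1000, 2, 2))  # 0: penalty fully hidden
```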
20 Simulation Setup
- Workbench: GUI interface to all Microengine tools (microcode assembler, microcode linker)
- Transactor: debug and simulation engine with the IXP1200 architectural model and memory
- A Verilog model of an IX bus device (i.e., a MAC device)
- Reference program (l2l3fwd16)
21 Simulation Image. [Diagram: the IXP1200 (six Microengines, FBI unit, SRAM unit, SDRAM unit) with SRAM and SDRAM attached, connected over 32-bit IX bus links to two IXF440 MAC devices of 8 ports each, giving 16 ports of 100 Mbps full duplex.]
22 Thread Assignment & Simulation Conditions
- Receive threads are assigned to Microengines 0-3
- Transmit threads are assigned to Microengines 4-5; one thread per Microengine in 4-5 works as an output scheduler
- Operating frequencies: the Microengines run at 232 MHz, the IX bus transfers packets at 104 MHz, and the SRAM and SDRAM buses transfer data at 116 MHz
- Each simulation forwards 3000 packets
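Under these conditions the worst-case offered load can be estimated as follows (a back-of-the-envelope sketch, not from the slides; the 20 bytes of preamble/SFD and inter-frame gap per frame are standard Ethernet assumptions):

```python
# Worst-case offered load: minimum-size frames on all 16 ports.
ports, port_rate = 16, 100e6                 # 16 x 100 Mbps
wire_bytes = 64 + 8 + 12                     # frame + preamble/SFD + inter-frame gap
pps_per_port = port_rate / (8 * wire_bytes)
total_mpps = ports * pps_per_port / 1e6
print(round(pps_per_port), round(total_mpps, 2))  # 148810 pps/port, 2.38 Mpps total
```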
23 Instruction Mix for Receive Processing

Packet Type | Arithmetic/Rotate/Shift | Branch/Jump | Reference | Local Register | Miscellaneous
Mixture | 31.9% | 39.8% | 7.3%  | 15.2% | 5.8%
1518B   | 30.3% | 40.8% | 7.2%  | 16.4% | 5.3%
594B    | 32.5% | 37.8% | 10.0% | 14.2% | 5.6%
64B     | 40.8% | 28.0% | 7.6%  | 16.6% | 6.9%
24 Instruction Mix for Transmit Processing

Packet Type | Arithmetic/Rotate/Shift | Branch/Jump | Reference | Local Register | Miscellaneous
Mixture | 50.7% | 31.0% | 8.5%  | 8.7% | 1.1%
1518B   | 50.9% | 31.1% | 8.5%  | 8.6% | 0.9%
594B    | 51.3% | 30.7% | 8.2%  | 8.6% | 1.2%
64B     | 48.2% | 30.7% | 10.6% | 8.2% | 2.4%
25 Instruction Mix for Overall Processing

Packet Type | Arithmetic/Rotate/Shift | Branch/Jump | Reference | Local Register | Miscellaneous
Mixture | 39.2% | 36.4% | 6.9% | 12.7% | 4.9%
1518B   | 38.4% | 37.0% | 6.5% | 13.4% | 4.7%
594B    | 39.8% | 35.1% | 6.6% | 12.0% | 6.6%
64B     | 43.4% | 29.0% | 8.6% | 11.7% | 7.4%
26 SDRAM Latency. [CDF plot: cumulative percentage (up to 100%) of SDRAM access latency in cycles, one curve per Microengine 0-3.]
27 SRAM Latency (unlocked). [CDF plot: cumulative percentage of unlocked SRAM access latency in cycles, one curve per Microengine 0-5.]
28 Execution, Aborted, Stalled and Idle Ratio on 64-byte packets. [Stacked bar chart: executing, aborted, stalled, and idle cycle ratios (0-100%) for Microengines 0-5.]
29 Execution, Aborted, Stalled and Idle Ratio on 594-byte packets. [Stacked bar chart: executing, aborted, stalled, and idle cycle ratios (0-100%) for Microengines 0-5.]
30 Execution, Aborted, Stalled and Idle Ratio on 1518-byte packets. [Stacked bar chart: executing, aborted, stalled, and idle cycle ratios (0-100%) for Microengines 0-5.]
31 Execution, Aborted, Stalled and Idle Ratio on Mixture packets. [Stacked bar chart: executing, aborted, stalled, and idle cycle ratios (0-100%) for Microengines 0-5.]
32 Cycles per Instruction (CPI). [Bar chart: CPI for each Microengine (uEngine 0-5) under the 64B, 594B, 1518B, and Mixture packet workloads.]
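CPI on a multithreaded Microengine can be read off the cycle breakdown of the preceding slides: only executing cycles retire instructions, so aborted, stalled, and idle cycles push CPI above 1. A minimal sketch with assumed (not measured) cycle counts:

```python
# CPI from a cycle breakdown, assuming one instruction retires per executing cycle.
def cpi(executing, aborted, stalled, idle):
    total = executing + aborted + stalled + idle
    return total / executing

# Example: 60% executing, 10% aborted, 20% stalled, 10% idle -> CPI ~ 1.67
print(round(cpi(executing=60, aborted=10, stalled=20, idle=10), 2))
```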
33 Throughput (bounded). [Bar chart: simulated forwarding rate (Mpps) versus the ideal rate and the OC-24 (CRC-16) rate for the 64-byte, 594-byte, 1518-byte, and Mixture workloads.]
Note: OC-24 is higher than the simulated rate because of the difference in protocol overhead:
- Ethernet protocol overhead: 38 bytes per packet (82.6% overhead for a 46-byte IP packet); protocol header and trailer (18 bytes) + IFG (12 bytes) + preamble/SFD (8 bytes) = 38 bytes
- OC-24 POS overhead: 7 bytes per packet (15.2% overhead for a 46-byte IP packet)
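The overhead percentages in the note can be reproduced directly (a small check using only the figures given above):

```python
# Per-packet overhead as a percentage of a 46-byte IP packet.
ip_bytes = 46
eth_overhead = 18 + 12 + 8   # header/trailer + IFG + preamble/SFD = 38 bytes
pos_overhead = 7             # OC-24 POS, per the note

print(round(eth_overhead / ip_bytes * 100, 1))  # 82.6
print(round(pos_overhead / ip_bytes * 100, 1))  # 15.2
```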
34 Throughput (unbounded). [Bar chart: simulated rate (Mpps) for the Mixture, 1518-byte, 594-byte, and 64-byte workloads against 1.244 Gb/s Ethernet (OC-24 class) and 2.488 Gb/s Ethernet (OC-48 class) line rates.]
Note: These throughputs don't include the 12-byte IFG overhead.
35 Features of Other NPs
Lexra's NetVortex
- 32-bit MIPS-I instruction set plus 18 extended instructions for context control and bit-field operations
- Supports up to 8 contexts per processor; each context includes 32 GPRs, its own PC, and a status register
- Uses the delay slot of a memory reference for context switching (e.g., LW.CSW reg, addr)
- Performs in a similar way to the IXP1200
Motorola's C-5
- A subset of the MIPS-I instruction set (excluding multiply, divide, floating point, and Coprocessor Zero (CP0))
- Provides its own special-purpose CP0 instructions for context switching (e.g., MTC0 $1, $3)
- 16 Channel Processor RISC Cores (CPRCs), each supporting up to 4 contexts and 32 GPRs
IBM's PowerNP
- 16 picoprocessors executing operation codes, each supporting 2 contexts; 4 threads perform context switching within a cluster
- 4 opcode categories: 1) ALU opcodes, 2) control opcodes, 3) data-movement opcodes, 4) coprocessor-execution opcodes (supporting context switching)
- Context switching occurs when the picoprocessor is waiting for a shared resource (e.g., waiting for one of the coprocessors to complete an operation, a memory access, etc.)
36 Conclusion and Future Work
- Hardware multithreading can hide large latencies effectively, but another issue emerges: the aborted cycles caused by branches and context switches are not small
- Some form of dynamic hardware branch prediction or speculation may be necessary to reduce these penalties in future NPs, but the cost must be considered
- The IXP1200 achieves OC-24-class router processing, but this is not enough for OC-48-class router processing
37 Backup Slide
38 Instruction Categories
Arithmetic, Rotate, and Shift Instructions
- alu: perform an ALU operation
- alu_shf: perform an ALU and shift operation
- dbl_shf: concatenate two longwords, shift the result, and save a longword
Branch and Jump Instructions
- br, br=0, br!=0, br>0, br>=0, br<0, br<=0, br=cout, br!=cout: branch on condition code
- br_bset, br_bclr: branch on bit set or bit clear
- br=byte, br!=byte: branch on byte equal or not equal
- br=ctx, br!=ctx: branch on current context
- br_inp_state: branch on event state (e.g., sram done)
- br_!signal: branch if signal deasserted
- jump: jump to label
- rtn: return from a branch or a jump
Reference Instructions
- csr: CSR reference
- fast_wr: write immediate data to thd_done CSRs
- local_csr_rd, local_csr_wr: read and write CSRs
- r_fifo_rd: read the receive FIFO
- pci_dma: issue a request to the PCI unit
- scratch: scratchpad reference
- sdram: SDRAM reference
- sram: SRAM reference
- t_fifo_wr: write to the transmit FIFO
Local Register Instructions
- find_bset, find_bset_with_mask: determine the position number of the first bit set in an arbitrary 16-bit field of a register
- immed: load immediate word and sign-extend or zero-fill with shift
- immed_b0, immed_b1, immed_b2, immed_b3: load immediate byte to a field
- immed_w0, immed_w1: load immediate word to a field
- ld_field, ld_field_w_clr: load byte(s) into specified field(s)
- load_addr: load instruction address
- load_bset_result1, load_bset_result2: load the result of a find_bset or find_bset_with_mask instruction
Miscellaneous Instructions
- ctx_arb: perform context swap and wake on event
- nop: perform no operation
- hash1_48, hash2_48, hash3_48: perform 48-bit hash
- hash1_64, hash2_64, hash3_64: perform 64-bit hash
39 SRAM Latency (locked). [CDF plot: cumulative percentage of locked SRAM access latency in cycles, one curve per Microengine 0-5.]
40 FBI Architecture. [Block diagram: the FBI unit connects the AMBA (core) command bus and the Microengine command bus to the IX bus interface. It contains the TFIFO and RFIFO (16 elements of 10 quadwords each), a 1K x 32 scratchpad, the hash unit, CSRs, and push and pull engines with arbiters and 8-entry pull, hash, and push command queues; data moves between Microengine read/write transfer registers, SRAM, and SDRAM. The IX bus side comprises the ready bus sequencer with the ready bus, transmit and receive state machines, and the IX bus arbiter on the 64-bit IX bus. fast_wr requests bypass the queues.]
41 Ready Bus and Ready Flags
42 Theoretical IP Throughput (packets per second)

Media | 64-byte (46-byte IP) | 594-byte (576-byte IP) | 1518-byte (1500-byte IP) | Mixture (avg 406-byte; avg 388-byte IP)
100 Mbps Ethernet   | 148,810    | 20,358    | 8,127   | 29,343
Gigabit Ethernet    | 1,488,095  | 203,583   | 81,274  | 293,427
10 Gigabit Ethernet | 14,880,952 | 2,035,830 | 812,744 | 2,934,272
OC-3 POS CRC-16     | 348,491    | 31,681    | 12,256  | 46,759
OC-12 POS CRC-16    | 1,412,830  | 128,439   | 49,688  | 189,570
OC-24 POS CRC-16    | 2,825,660  | 256,878   | 99,376  | 379,139
OC-48 POS CRC-16    | 5,651,321  | 513,757   | 198,752 | 758,278
OC-192 POS CRC-16   | 22,605,283 | 2,055,026 | 795,010 | 3,033,114
OC-3 POS CRC-32     | 335,818    | 31,573    | 12,240  | 46,524
OC-12 POS CRC-32    | 1,361,455  | 128,000   | 49,622  | 188,615
OC-24 POS CRC-32    | 2,722,909  | 256,000   | 99,245  | 377,229
OC-48 POS CRC-32    | 5,445,818  | 512,000   | 198,489 | 754,458
OC-192 POS CRC-32   | 21,783,273 | 2,048,000 | 793,956 | 3,017,834
ATM OC-3            | 174,245    | 26,807    | 10,890  | 38,721
ATM OC-12           | 706,415    | 108,679   | 44,151  | 156,981
ATM OC-24           | 1,412,830  | 217,358   | 88,302  | 313,962
ATM OC-48           | 2,825,660  | 434,716   | 176,604 | 627,925
ATM OC-192          | 11,302,642 | 1,738,856 | 706,415 | 2,511,698
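The Ethernet rows of the table follow one formula: packets per second = line rate / (8 x (IP packet size + per-packet overhead)). A minimal sketch for the Ethernet rows (the 38-byte overhead is the figure from the earlier throughput note; the POS and ATM rows use their own overheads and SONET payload rates):

```python
# Theoretical packet rate given line rate and per-packet byte overhead.
def pps(line_rate_bps, ip_bytes, overhead_bytes):
    return line_rate_bps / (8 * (ip_bytes + overhead_bytes))

eth_overhead = 38  # Ethernet framing + IFG + preamble/SFD, per the throughput note
print(round(pps(100e6, 46, eth_overhead)))    # 148810, the 100 Mbps Ethernet row
print(round(pps(1e9, 46, eth_overhead)))      # 1488095, the Gigabit Ethernet row
print(round(pps(100e6, 1500, eth_overhead)))  # 8127
```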
43 NetVortex Extended Instruction Set
Context-Control Instructions
- MYCX: read my context
- POSTCX: post event to a context
- CSW: context switch
- LW.CSW: load word with context switch
- LT.CSW: load twinword* with context switch
- WD: write descriptor to device
- WD.CSW: write descriptor to device with context switch
- WDLW.CSW: write descriptor to device, load word with context switch
- WDLT.CSW: write descriptor to device, load twinword with context switch
Bit-Field Instructions
- SETI: set subfield to ones
- CLRI: clear subfield to zeroes
- EXTIV: extract subfield and prepare for insertion
- INSV: insert extracted subfield
- ACS2: dual 16-bit ones-complement add for checksum
Cross-Context Access Instructions
- MFCXG: move from a context general-purpose register
- MTCXG: move to a context general-purpose register
- MFCXC: move from a context-control register
- MTCXC: move to a context-control register
44 NetVortex Context Switch Mechanism
Thread 1 program:
  I1(T1), I2(T1): LW.CSW (reg, addr), I3(T1): delay-slot instruction, I4(T1): next instruction, I5(T1): ...
  -> context switch to Thread 2
Thread 2 program:
  I1(T2), I2(T2): LW.CSW (reg, addr), I3(T2): delay-slot instruction, I4(T2): next instruction, I5(T2): ...
  -> context switch to the next available thread
General-purpose register file: Thread Context 1 (r0 - r31), Thread Context 2 (r0 - r31)
Context registers after the switch: Thread1 CXPC = I4(T1), Thread1 CXSTATUS = Wait; Thread2 CXPC = PC, Thread2 CXSTATUS = Active; PC = I1(T2)
45 PowerNP Context Switch Example
IF Reduction_OR(mask16(i) = coprocessor.Busy(i)) THEN
  PC <= stall
ELSE
  PC <= PC + 1
END IF;
IF p = 1 THEN
  PriorityOwner(other thread) <= TRUE
ELSE
  PriorityOwner(other thread) <= PriorityOwner(other thread)
END IF;
VISIT: Course Code : MCS-042 Course Title : Data Communication and Computer Network Assignment Number : MCA (4)/042/Assign/2014-15 Maximum Marks : 100 Weightage : 25% Last Dates for Submission : 15 th
More informationInstruction Level Parallelism. Appendix C and Chapter 3, HP5e
Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationIBM POWER8 100 GigE Adapter Best Practices
Introduction IBM POWER8 100 GigE Adapter Best Practices With higher network speeds in new network adapters, achieving peak performance requires careful tuning of the adapters and workloads using them.
More informationASSEMBLY LANGUAGE MACHINE ORGANIZATION
ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction
More informationI/O Systems. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)
I/O Systems Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) I/O Systems 1393/9/15 1 / 57 Motivation Amir H. Payberah (Tehran
More informationSummary of MAC protocols
Summary of MAC protocols What do you do with a shared media? Channel Partitioning, by time, frequency or code Time Division, Code Division, Frequency Division Random partitioning (dynamic) ALOHA, S-ALOHA,
More informationCommunications and Computer Engineering II: Lecturer : Tsuyoshi Isshiki
Communications and Computer Engineering II: Microprocessor 2: Processor Micro-Architecture Lecturer : Tsuyoshi Isshiki Dept. Communications and Computer Engineering, Tokyo Institute of Technology isshiki@ict.e.titech.ac.jp
More informationCPU Structure and Function. Chapter 12, William Stallings Computer Organization and Architecture 7 th Edition
CPU Structure and Function Chapter 12, William Stallings Computer Organization and Architecture 7 th Edition CPU must: CPU Function Fetch instructions Interpret/decode instructions Fetch data Process data
More informationChapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)
Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3) Hardware Speculation and Precise
More informationCopyright 2016 Xilinx
Zynq Architecture Zynq Vivado 2015.4 Version This material exempt per Department of Commerce license exception TSU Objectives After completing this module, you will be able to: Identify the basic building
More informationEECS 122: Introduction to Computer Networks Switch and Router Architectures. Today s Lecture
EECS : Introduction to Computer Networks Switch and Router Architectures Computer Science Division Department of Electrical Engineering and Computer Sciences University of California, Berkeley Berkeley,
More informationECE4110 Internetwork Programming. Introduction and Overview
ECE4110 Internetwork Programming Introduction and Overview 1 EXAMPLE GENERAL NETWORK ALGORITHM Listen to wire Are signals detected Detect a preamble Yes Read Destination Address No data carrying or noise?
More informationDynamic Control Hazard Avoidance
Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>
More informationTCP/IP Performance ITL
TCP/IP Performance ITL Protocol Overview E-Mail HTTP (WWW) Remote Login File Transfer TCP UDP IP ICMP ARP RARP (Auxiliary Services) Ethernet, X.25, HDLC etc. ATM 4/30/2002 Hans Kruse & Shawn Ostermann,
More informationAn Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki
An Ultra High Performance Scalable DSP Family for Multimedia Hot Chips 17 August 2005 Stanford, CA Erik Machnicki Media Processing Challenges Increasing performance requirements Need for flexibility &
More informationJim Keller. Digital Equipment Corp. Hudson MA
Jim Keller Digital Equipment Corp. Hudson MA ! Performance - SPECint95 100 50 21264 30 21164 10 1995 1996 1997 1998 1999 2000 2001 CMOS 5 0.5um CMOS 6 0.35um CMOS 7 0.25um "## Continued Performance Leadership
More informationCS425 Computer Systems Architecture
CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls
More informationWhat is Pipelining? Time per instruction on unpipelined machine Number of pipe stages
What is Pipelining? Is a key implementation techniques used to make fast CPUs Is an implementation techniques whereby multiple instructions are overlapped in execution It takes advantage of parallelism
More informationThomas Polzer Institut für Technische Informatik
Thomas Polzer tpolzer@ecs.tuwien.ac.at Institut für Technische Informatik Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =
More informationPacketExpert PDF Report Details
PacketExpert PDF Report Details July 2013 GL Communications Inc. 818 West Diamond Avenue - Third Floor Gaithersburg, MD 20878 Phone: 301-670-4784 Fax: 301-670-9187 Web page: http://www.gl.com/ E-mail:
More informationATM-DB Firmware Specification E. Hazen Updated January 4, 2007
ATM-DB Firmware Specification E. Hazen Updated January 4, 2007 This document describes the firmware operation of the Ethernet Daughterboard for the ATM for Super- K (ATM-DB). The daughterboard is controlled
More informationDepartment of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri
Department of Computer and IT Engineering University of Kurdistan Computer Architecture Pipelining By: Dr. Alireza Abdollahpouri Pipelined MIPS processor Any instruction set can be implemented in many
More informationCS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07
CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07 Objectives ---------- 1. To introduce the basic concept of CPU speedup 2. To explain how data and branch hazards arise as
More informationThe Tofu Interconnect 2
The Tofu Interconnect 2 Yuichiro Ajima, Tomohiro Inoue, Shinya Hiramoto, Shun Ando, Masahiro Maeda, Takahide Yoshikawa, Koji Hosoe, and Toshiyuki Shimizu Fujitsu Limited Introduction Tofu interconnect
More informationPractice Problems (Con t) The ALU performs operation x and puts the result in the RR The ALU operand Register B is loaded with the contents of Rx
Microprogram Control Practice Problems (Con t) The following microinstructions are supported by each CW in the CS: RR ALU opx RA Rx RB Rx RB IR(adr) Rx RR Rx MDR MDR RR MDR Rx MAR IR(adr) MAR Rx PC IR(adr)
More information1 Hazards COMP2611 Fall 2015 Pipelined Processor
1 Hazards Dependences in Programs 2 Data dependence Example: lw $1, 200($2) add $3, $4, $1 add can t do ID (i.e., read register $1) until lw updates $1 Control dependence Example: bne $1, $2, target add
More informationComputer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM
Computer Architecture Computer Science & Engineering Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware
More informationAdvanced Network Design
Advanced Network Design Organization Whoami, Book, Wikipedia www.cs.uchicago.edu/~nugent/cspp54015 Grading Homework/project: 60% Midterm: 15% Final: 20% Class participation: 5% Interdisciplinary Course
More informationINT 1011 TCP Offload Engine (Full Offload)
INT 1011 TCP Offload Engine (Full Offload) Product brief, features and benefits summary Provides lowest Latency and highest bandwidth. Highly customizable hardware IP block. Easily portable to ASIC flow,
More informationData Link Layer. Our goals: understand principles behind data link layer services: instantiation and implementation of various link layer technologies
Data Link Layer Our goals: understand principles behind data link layer services: link layer addressing instantiation and implementation of various link layer technologies 1 Outline Introduction and services
More informationDesign Space Exploration for Memory Subsystems of VLIW Architectures
E University of Paderborn Dr.-Ing. Mario Porrmann Design Space Exploration for Memory Subsystems of VLIW Architectures Thorsten Jungeblut 1, Gregor Sievers, Mario Porrmann 1, Ulrich Rückert 2 1 System
More informationLECTURE 3: THE PROCESSOR
LECTURE 3: THE PROCESSOR Abridged version of Patterson & Hennessy (2013):Ch.4 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU
More information