The Impact of Parallel and Multithread Mechanism on Network Processor Performance


Chunqing Wu, Xiangquan Shi, Xuejun Yang, Jinshu Su
Computer School, National University of Defense Technology, Changsha, Hunan, China 410073

Abstract

Network processors are becoming a predominant feature in the field of network hardware due to their high performance and flexibility. The performance of a network processor depends mainly on its architecture. This paper studies the parallel architecture and multithread mechanism of network processors. We discuss the causes of thread stalls and the principle of hiding the latencies caused by the various stalls using a multithread mechanism. Finally, we present test results based on an analysis of the relationship between the number of active threads and network processor performance.

1. Introduction

The ceaseless rise of network link rates demands that network devices process packets in remarkably short times. For example, the arrival interval of a 40-byte packet is 35 ns on 10 Gbps links, and 8 ns on 40 Gbps links. It is quite difficult to complete QoS processing and route-table lookup at line rate in so short a time. Network processors are favored by network device manufacturers for their flexibility and for processing performance close to that of an ASIC. The architecture of a network processor is crucial to its processing power, but its memory capacity and computing capability are the basic factors. As an SoC, it is impractical to run its clock as fast as a general-purpose CPU's. For example, the arrival interval of a packet is about 160 ns on OC-48 links. That is, a network processor with a frequency of 133 MHz and a clock cycle of 7.5 ns must complete the processing of a 40-byte packet within 21.3 clock cycles to avoid dropping packets. Network processors usually employ DRAM rather than SRAM as external memory because of SRAM's high price, although the access delay of SRAM can reach 10 ns.
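The cycle-budget arithmetic above can be reproduced with a short script. The packet size, link rate, and clock frequency are taken from the text; framing overhead is ignored, so the raw figure comes out slightly below the paper's ~160 ns / 21.3-cycle budget:

```python
def arrival_interval_ns(packet_bytes: int, link_gbps: float) -> float:
    """Back-to-back arrival interval of packets of a given size on a link."""
    return packet_bytes * 8 / link_gbps  # bits / (Gbit/s) gives nanoseconds

def cycle_budget(packet_bytes: int, link_gbps: float, clock_mhz: float) -> float:
    """Clock cycles available to process one packet without dropping any."""
    cycle_ns = 1000.0 / clock_mhz
    return arrival_interval_ns(packet_bytes, link_gbps) / cycle_ns

# 40-byte packets on an OC-48 (~2.5 Gbps) link, 133 MHz clock (7.5 ns cycle):
print(arrival_interval_ns(40, 2.5))          # 128.0 ns (excluding framing)
print(round(cycle_budget(40, 2.5, 133), 1))  # ~17 cycles
```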
Since DRAM access delay is about 55-70 ns (DDR, RDRAM), an external memory access takes about 10 clock cycles. This makes it impossible for a single thread to complete packet processing in 21 clock cycles. [1] and [2] evaluate the required processing capacity using the computational demands of route-table lookup (RTR) and IP fragmentation (FRAG). The computational complexity of RTR is 2.1 instructions per byte, so for a 2.5 Gbps link the required capacity is 2.5 Gbps / 8 × 2.1 = 656 MIPS. The complexity of FRAG is 7.7 instructions per byte, giving a required capacity of 2.5 Gbps / 8 × 7.7 ≈ 2406 MIPS. The state-of-the-art processing capacity of a single processing element in a network processor is about 150 MIPS, so the architecture of the network processor must be studied carefully. We can meet the demands of high-speed networks through an appropriate design or arrangement of processing elements (PEs) to overcome the limited capacity of a single PE. It has been amply verified that parallel and multithread mechanisms are effective approaches to raising the performance of computer systems. This paper focuses on the impact of the parallel and multithread mechanisms on network processor performance. Section 2 describes related work. Thread stalls and the introduction of multithreading are presented in Section 3. In Section 4 we discuss the relationship between the parallel mechanism and network processor performance. Finally, we give our implementation and test results.

2. Related work

Usually, the core of a network processor is a set of PEs. Each PE is a simple microprocessor running multiple threads, and each packet is assigned to one thread for processing. The essential idea is to exploit the parallelism of packet processing by assigning uncorrelated packets to different threads.
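The capacity figures above follow from a single formula, sketched below; the per-PE MIPS figure is the one quoted in the text, and the PE-count helper is an illustrative extrapolation:

```python
import math

def required_mips(link_gbps: float, instr_per_byte: float) -> float:
    """MIPS needed to sustain an application at line rate."""
    bytes_per_microsecond = link_gbps / 8 * 1000
    return bytes_per_microsecond * instr_per_byte  # instructions per us = MIPS

def pes_needed(link_gbps: float, instr_per_byte: float, pe_mips: float = 150) -> int:
    """Processing elements required, assuming pe_mips per PE (150 per the text)."""
    return math.ceil(required_mips(link_gbps, instr_per_byte) / pe_mips)

print(required_mips(2.5, 2.1))  # RTR:  656.25 MIPS on a 2.5 Gbps link
print(required_mips(2.5, 7.7))  # FRAG: 2406.25 MIPS
print(pes_needed(2.5, 7.7))     # 17 PEs at 150 MIPS each
```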
The hardware architecture of the network
This work is supported by the National Natural Science Foundation of China (NSFC) under agreement no. 9646.
Proceedings of the Fifth International Conference on Grid and Cooperative Computing (GCC'06), 2006 IEEE.

processor is shown in Fig. 1. It contains a group of PEs, multiple coprocessors, and multiple hardware logic blocks.

Fig. 1. The hardware architecture of a network processor

Network processors can be partitioned into two classes according to their architecture [3]:

Pipelined: each processor is designed for a particular packet-processing task, and the processors communicate in a pipelined fashion. Examples of this architectural style include Cisco's PXF, Motorola's C-5 DCP, and Xelerated Packet Devices.

Parallel: each PE performs similar functionality. This approach is commonly coupled with numerous coprocessors that accelerate specific types of computation. Since these coprocessors are shared across many PEs, an arbitration unit is often required. The Agere PayloadPlus, Intel IXP1200, IBM PowerNP, and Lexra NetVortex are examples of this macro-architecture. The parallelism of this type of architecture includes parallelism across PEs [4], parallelism among multiple threads within a PE [5], task/packet-level parallelism, and data/instruction-level parallelism within a PE.

Tilman Wolf and his colleagues [6] studied how PE organization impacts system performance. They analyzed network processor performance for four architectures: parallel, serial, pipelined, and mixed. The simulation results indicate that PE organization has an important impact on system performance: pipeline depth can raise system throughput, but contention for memory access limits how much pipeline width contributes to throughput. Access to off-chip memory reduces system throughput and increases packet queuing delay; this may be alleviated by introducing a latency-hiding mechanism. The cost of communication and synchronization has a greater impact on system throughput than processing time does. Venkatachalam and his colleagues [7] studied how to use a configurable microengine architecture and programming model to develop serial pipelined and parallel architectures on the Intel IXP2400.
They evaluated the efficiency of the IXP2400 network processor on two applications: ATM AAL2 and ATM flow management. Current research results indicate that a pipelined architecture can raise processing performance by increasing the number of pipeline stages, but it is difficult to develop software that avoids system bottlenecks and drives the processing engines efficiently. Comparatively, software development is easier in a parallel architecture, though problems such as memory-access conflicts between processors and resource sharing remain. Packets processed in different PEs are not always independent; sometimes packets depend on one another. This dependence appears in two aspects: service order and resource operation. The synchronization problem caused by service order can be solved by maintaining packet-processing state, while a lock mechanism must be introduced to avoid resource-operation conflicts. Although some packets depend on each other, independent packets make up the majority of Internet traffic, and processing packets in parallel is highly effective in raising system throughput. Many studies have indicated that packet processing in a parallel architecture with multiple PEs outperforms a pure pipelined architecture. Additionally, various kinds of thread stalls waste time while threads run inside a PE, which degrades network processor performance. Next we focus on the impact of thread stalls, the introduction of multiple threads, and parallel PEs on network processor performance.

3. Thread stalls and the introduction of multithreading

We have noted that thread stalls can reduce system processing capacity. This section discusses the types of thread stall, their impact on network processor
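The lock requirement for resource-operation conflicts can be sketched as follows. The flow table and its update function are hypothetical, not part of any particular network processor's API:

```python
import threading

# Hypothetical shared flow table updated by several PE threads. Without the
# lock, the read-modify-write on an entry could interleave and lose updates --
# exactly the resource-operation conflict described above.
flow_table = {}
flow_lock = threading.Lock()

def account_packet(flow_id: str, length: int) -> None:
    with flow_lock:  # exclusive access while the entry is read and rewritten
        count, octets = flow_table.get(flow_id, (0, 0))
        flow_table[flow_id] = (count + 1, octets + length)

workers = [threading.Thread(target=account_packet, args=("flow-a", 64))
           for _ in range(100)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(flow_table["flow-a"])  # (100, 6400)
```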

performance, and the stall-hiding mechanism based on multithreading.

3.1. Types of thread stall

Resource sharing and exclusive access cause running threads to stall. These stalls have a great impact on the performance of a multithreaded network processor. The main stall types are:

(1) Coprocessor stall: the thread is stalled waiting for a coprocessor to finish executing. Examples: a synchronous coprocessor command is issued and the thread stalls until the coprocessor is done executing; an asynchronous coprocessor command is issued while the coprocessor is already in use; a wait instruction is executed while a coprocessor is still executing a previous asynchronous coprocessor command.

(2) Data stall: an instruction must wait for a specific general-purpose register (GPR) to receive data being loaded across the data bus.

(3) Instruction stall: the thread is stalled waiting for an instruction fetch to complete, for example after a branch instruction is executed.

(4) Bus stall: the thread is stalled waiting to access a data bus. Contention arises when another instruction, executed by this thread or another thread, is already using the bus during the cycle in which the CLP requests it for this instruction, or when a coprocessor is already using the bus in that cycle.

(5) GPR stall: two operations attempt to copy data into any GPR in the same cycle.

3.2. Impact of thread stalls on performance

Using Npprofile, a network processor performance-analysis toolkit, we analyzed the trace message log files produced by the Layer 3 and Layer 2 packet-forwarding picocode. Figures 2 and 3 present the results on thread stalls.
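The bus-stall case (4) can be illustrated with a toy single-cycle-bus arbiter; the model and its names are illustrative, not the CLP's actual arbitration logic:

```python
def simulate_bus(requests):
    """Count bus-stall cycles per thread for a bus that serves one
    one-cycle transaction at a time, granted in request order.
    requests: list of (cycle, thread_id) pairs."""
    stalls = {}
    busy_until = -1  # last cycle in which the bus was occupied
    for cycle, thread_id in sorted(requests):
        start = max(cycle, busy_until + 1)  # wait while the bus is busy
        stalls[thread_id] = stalls.get(thread_id, 0) + (start - cycle)
        busy_until = start
    return stalls

# Threads 0 and 1 both request the data bus in cycle 5:
# thread 0 is granted, thread 1 records one bus-stall cycle.
print(simulate_bus([(5, 0), (5, 1)]))  # {0: 0, 1: 1}
```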
In Figure 2, the first column means that the threads were stalled about 120 times, each stall lasting 1-10 cycles (coprocessor stalls: 10, data stalls: 35, instruction stalls: 12, bus stalls: 6). The second column indicates that there were 10 stalls lasting 10-20 cycles (coprocessor stalls). Among the various thread stalls, bus stalls are dominant; it is also notable that coprocessor stalls occupy the greatest number of cycles.

Figure 2. Thread-stall frequency and cycles, tested with the Layer 3 forwarding picocode

Figure 3. Thread-stall frequency and cycles, tested with the Layer 2 forwarding picocode

Using Npprofile, we further analyzed the trace message log file produced by the Layer 3 forwarding picocode. Figure 4 presents the distribution of the various stalls and the running cycles of a single thread. It shows that the stalls occupy 59% of the thread's running period, while CLP execution accounts for the remaining 41%. It is therefore necessary to hide these stalls by adopting a parallel multithread mechanism to achieve higher processor utilization.

Figure 4. Distribution of the various stalls and running cycles of a single thread (CLP EXEC 41%; stalls 21%, 13%, 13%, 12%)
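The 59%/41% split gives a quick upper bound on how many threads a PE needs: with a 41% execute fraction, utilization can at best grow linearly in the thread count until it saturates. This is a best-case sketch assuming the stalls of different threads overlap perfectly:

```python
def pe_utilization(n_threads: int, exec_fraction: float = 0.41) -> float:
    """Best-case PE utilization with n interleaved threads, each executing
    exec_fraction of its lifetime (41% per Figure 4) and stalling otherwise."""
    return min(1.0, n_threads * exec_fraction)

for n in (1, 2, 3):
    print(n, pe_utilization(n))  # 1 -> 0.41, 2 -> 0.82, 3 -> saturated at 1.0
```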

3.3. The principle of hiding stalls using multithreading

By introducing the multithread model, we can avoid the waiting caused by the various stalls of a single thread in the microengine. Figure 5 shows stall hiding in the multithread model.

Figure 5. Stall hiding in the multithread model

When Thread 0 accesses the tree search engine, it can hand the PE over to Thread 1; when Thread 1 accesses external memory, it can hand the PE over to Thread 2. Thus, by adopting the multithread model and thread-switching technology, the delays caused by the various stalls are effectively hidden, and the PE does not waste processing cycles waiting for stalls to end.

4. The relationship between the parallel mechanism and network processor performance

Suppose m is the number of PEs and n is the number of packets. For a network processor with a single PE, the time to process a single packet of length L is given by expression (1):

    S_t = C_t + f(m, n) + P_t    (1)

where C_t is the time spent keeping packets in order, f(m, n) is the stall time (here m = 1, n = 1), and P_t is the time to process a single packet in a PE without any stall. P_t depends on the packet length L: the larger L is, the larger P_t. P_t is approximately a linear function of L, as in expression (2):

    P_t = β·L    (2)

For a network processor with a single PE, the time to process n packets of length L is given by expression (3):

    NS_t = C_t + f(m, n) + n·β·L    (3)

where m = 1. For a network processor with m PEs, the time to process n packets of length L is given by expression (4):

    NT_t = g(m)·C_t + f(m, n) + n·β·L / m    (4)

In a network processor with m parallel PEs, packet ordering is maintained by special hardware, so g(m) is approximately a linear function of m and independent of the number of packets; since the ordering cost is small, we can treat g(m) as a constant. For a network processor with m parallel PEs, f(m, n) depends on the number m of active PEs: the larger m is, the larger f(m, n).
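The cycle-level benefit of the switching in Figure 5 can be quantified with a tiny best-case model; the exec/stall pattern below is invented for illustration:

```python
def single_thread_time(segments) -> int:
    """Cycles for one thread to process one packet with no overlap.
    segments: list of (exec_cycles, stall_cycles) pairs."""
    return sum(e + s for e, s in segments)

def interleaved_time(segments, n_threads: int) -> int:
    """Best-case cycles for n threads (one packet each) on one PE when every
    stall is fully hidden by switching to another ready thread."""
    total_exec = sum(e for e, _ in segments) * n_threads
    return max(total_exec, single_thread_time(segments))

work = [(10, 15), (8, 20), (12, 0)]   # exec/stall pattern of one packet
print(single_thread_time(work))       # 65 cycles for 1 packet
print(interleaved_time(work, 3))      # 90 cycles for 3 packets
print(3 * single_thread_time(work))   # 195 cycles if the 3 packets run serially
```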
According to the principle of hiding stalls with multithreading, f(m, n) increases very slowly. We know that f(1, n) < f(m, n), but n·P_t >> n·P_t/m: if m is 32, n·P_t is 32 times n·P_t/m, while f(m, n) grows far more slowly than that. The packet length L also affects the number of PEs needed: the smaller L is, the shorter the packet arrival interval, so S_t must be shorter. For short packets, increasing the number of PEs effectively reduces the pressure of packet processing. It is difficult to shorten P_t greatly, and thereby reduce n·P_t/m, merely by raising the frequency of a single PE, which is limited by chip technology. Comparatively, it is easier to increase the number of PEs within a certain range. We note that when m increases, f(m, n) increases too, which impairs the performance of a multithreaded network processor; m is also limited by chip technology. It is therefore important to trade off f(m, n) against P_t.

5. Implementation and test results

We have implemented a high-performance core router using a network processor with multiple PEs. The test was performed on a 2.5 Gbps network interface of this core router. The relationship between throughput and the number of PEs needed is shown in Figure 6. When the packet length is 1024 bytes, line-rate forwarding needs only 2-3 PEs; when the packet length is 64 bytes, line-rate forwarding needs 20 PEs.

Figure 6. Relationship between throughput (packets sent, in millions) and the number of active threads
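The tradeoff between f(m, n) and n·β·L/m in expressions (3) and (4) can be explored numerically. In this sketch, β, C_t, g(m), and the linear-growth model for f(m, n) are all assumed values, not measurements from the paper:

```python
def single_pe_time(n, L, beta=0.2, Ct=50.0, f1=100.0):
    """NS_t = C_t + f(1, n) + n*beta*L  -- expression (3), assumed parameters."""
    return Ct + f1 + n * beta * L

def multi_pe_time(m, n, L, beta=0.2, Ct=50.0, g=1.5, f_per_pe=10.0):
    """NT_t = g(m)*C_t + f(m, n) + n*beta*L/m  -- expression (4),
    with f(m, n) modeled as growing slowly (linearly) in m."""
    return g * Ct + (100.0 + f_per_pe * m) + n * beta * L / m

n, L = 1000, 64
serial = single_pe_time(n, L)       # 12950.0 cycles
parallel = multi_pe_time(8, n, L)   # 1855.0 cycles
print(serial, parallel, round(serial / parallel, 1))  # speedup ~7.0 with m = 8
```

Under these assumptions the speedup with m = 8 is close to, but below, the ideal factor of 8, because the growing f(m, n) and the g(m)·C_t term eat into the n·β·L/m saving.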

Figure 7. The relationship between maximum bandwidth rate and the number of active threads without packet loss

Figure 8. Performance of our core router using a network processor with multiple PEs (throughput in Gbit/s and Mpps versus packet size)

The test results indicate that performance rises greatly as the number of active threads increases in a network processor with multiple PEs. Short packets require more parallelism because of their higher arrival rate. But performance is not a linear function of the number of active threads, because thread stalls increase in a multithread environment.

6. Summary and conclusion

Network processors are an emerging technology in the network industry. The performance of a network processor is closely associated with its architecture. In our studies we focused our efforts on the relation between the number of PEs and network processor performance. Experiments and simulations show that the parallel and multithread mechanisms play a very important role in the network processor. On the other hand, stalls caused by resource competition greatly impact network processor performance. Besides the hardware work, we could increase the system's parallelism through software, such as parallel searching of multiple route tables.

7. References

[1] Tan Zhang-Xi, Lin Chuang. Analysis and research on network processors. Journal of Software, Vol. 14, No. 2, 2003.
[2] Wolf T, Franklin MA. CommBench - a telecommunications benchmark for network processors. In: Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, Austin, TX, ~162.
[3] Niraj Shah. Understanding network processors. Technical report, University of California, Berkeley.
[4] Ning Weng and Tilman Wolf. Pipelining vs. multiprocessors - choosing the right network processor system topology.
In Proceedings of the Advanced Networking and Communications Hardware Workshop (ANCHOR 2004), Munich, Germany, June 2004.
[5] Patrick Crowley, Marc E. Fiuczynski, Jean-Loup Baer. On the performance of multithreaded architectures for network processors. Technical Report 2-1-, University of Washington.
[6] L. Kencl, J.-Y. Le Boudec, T. Wolf et al. Adaptive load sharing for network processors. In IEEE INFOCOM 2002, New York.
[7] Muthu Venkatachalam, Prashant Chandra, Raj Yavatkar. A highly flexible, distributed multiprocessor architecture for network processing. Computer Networks, Vol. 41, 2003.


Simultaneous Multithreading: a Platform for Next Generation Processors Simultaneous Multithreading: a Platform for Next Generation Processors Paulo Alexandre Vilarinho Assis Departamento de Informática, Universidade do Minho 4710 057 Braga, Portugal paulo.assis@bragatel.pt

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

Cycles Per Instruction For This Microprocessor

Cycles Per Instruction For This Microprocessor What Is The Average Number Of Machine Cycles Per Instruction For This Microprocessor Wikipedia's Instructions per second page says that an i7 3630QM deliver ~110,000 It does reduce the number of "wasted"

More information

An Energy Consumption Analytic Model for A Wireless Sensor MAC Protocol

An Energy Consumption Analytic Model for A Wireless Sensor MAC Protocol An Energy Consumption Analytic Model for A Wireless Sensor MAC Protocol Hung-Wei Tseng, Shih-Hsien Yang, Po-Yu Chuang,Eric Hsiao-Kuang Wu, and Gen-Huey Chen Dept. of Computer Science and Information Engineering,

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

Computer parallelism Flynn s categories

Computer parallelism Flynn s categories 04 Multi-processors 04.01-04.02 Taxonomy and communication Parallelism Taxonomy Communication alessandro bogliolo isti information science and technology institute 1/9 Computer parallelism Flynn s categories

More information

Embedded processors. Timo Töyry Department of Computer Science and Engineering Aalto University, School of Science timo.toyry(at)aalto.

Embedded processors. Timo Töyry Department of Computer Science and Engineering Aalto University, School of Science timo.toyry(at)aalto. Embedded processors Timo Töyry Department of Computer Science and Engineering Aalto University, School of Science timo.toyry(at)aalto.fi Comparing processors Evaluating processors Taxonomy of processors

More information

An Integration of Imprecise Computation Model and Real-Time Voltage and Frequency Scaling

An Integration of Imprecise Computation Model and Real-Time Voltage and Frequency Scaling An Integration of Imprecise Computation Model and Real-Time Voltage and Frequency Scaling Keigo Mizotani, Yusuke Hatori, Yusuke Kumura, Masayoshi Takasu, Hiroyuki Chishiro, and Nobuyuki Yamasaki Graduate

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,

More information

G-NET: Effective GPU Sharing In NFV Systems

G-NET: Effective GPU Sharing In NFV Systems G-NET: Effective Sharing In NFV Systems Kai Zhang*, Bingsheng He^, Jiayu Hu #, Zeke Wang^, Bei Hua #, Jiayi Meng #, Lishan Yang # *Fudan University ^National University of Singapore #University of Science

More information

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building

More information

Research Article MFT-MAC: A Duty-Cycle MAC Protocol Using Multiframe Transmission for Wireless Sensor Networks

Research Article MFT-MAC: A Duty-Cycle MAC Protocol Using Multiframe Transmission for Wireless Sensor Networks Distributed Sensor Networks Volume 2013, Article ID 858765, 6 pages http://dx.doi.org/10.1155/2013/858765 Research Article MFT-MAC: A Duty-Cycle MAC Protocol Using Multiframe Transmission for Wireless

More information

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4. Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that

More information

Multi-gigabit Switching and Routing

Multi-gigabit Switching and Routing Multi-gigabit Switching and Routing Gignet 97 Europe: June 12, 1997. Nick McKeown Assistant Professor of Electrical Engineering and Computer Science nickm@ee.stanford.edu http://ee.stanford.edu/~nickm

More information

PCnet-FAST Buffer Performance White Paper

PCnet-FAST Buffer Performance White Paper PCnet-FAST Buffer Performance White Paper The PCnet-FAST controller is designed with a flexible FIFO-SRAM buffer architecture to handle traffic in half-duplex and full-duplex 1-Mbps Ethernet networks.

More information

Negotiating the Maze Getting the most out of memory systems today and tomorrow. Robert Kaye

Negotiating the Maze Getting the most out of memory systems today and tomorrow. Robert Kaye Negotiating the Maze Getting the most out of memory systems today and tomorrow Robert Kaye 1 System on Chip Memory Systems Systems use external memory Large address space Low cost-per-bit Large interface

More information

The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350):

The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350): The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350): Motivation for The Memory Hierarchy: { CPU/Memory Performance Gap The Principle Of Locality Cache $$$$$ Cache Basics:

More information

Architecture Tuning Study: the SimpleScalar Experience

Architecture Tuning Study: the SimpleScalar Experience Architecture Tuning Study: the SimpleScalar Experience Jianfeng Yang Yiqun Cao December 5, 2005 Abstract SimpleScalar is software toolset designed for modeling and simulation of processor performance.

More information

Integrated EPON-LTE Network DBA Algorithm s Real Time Performance Analysis Using Network Processors

Integrated EPON-LTE Network DBA Algorithm s Real Time Performance Analysis Using Network Processors Integrated EPON-LTE Network DBA Algorithm s Real Time Performance Analysis Using Network Processors S. Ramya Dr. N. Nagarajan Dr. B. Kaarthick Mettler-Toledo Turing Software, Coimbatore, INDIA Department

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Nagi N. Mekhiel Department of Electrical and Computer Engineering Ryerson University, Toronto, Ontario M5B 2K3

More information

Chapter 5. A Closer Look at Instruction Set Architectures

Chapter 5. A Closer Look at Instruction Set Architectures Chapter 5 A Closer Look at Instruction Set Architectures Chapter 5 Objectives Understand the factors involved in instruction set architecture design. Gain familiarity with memory addressing modes. Understand

More information

,e-pg PATHSHALA- Computer Science Computer Architecture Module 25 Memory Hierarchy Design - Basics

,e-pg PATHSHALA- Computer Science Computer Architecture Module 25 Memory Hierarchy Design - Basics ,e-pg PATHSHALA- Computer Science Computer Architecture Module 25 Memory Hierarchy Design - Basics The objectives of this module are to discuss about the need for a hierarchical memory system and also

More information

Transparent TCP Acceleration Through Network Processing

Transparent TCP Acceleration Through Network Processing Transparent TCP Acceleration Through Network Processing Tilman Wolf, Shulin You, and Ramaswamy Ramaswamy Department of Electrical and Computer Engineering University of Massachusetts, Amherst, MA 3 {wolf,syou,rramaswa}@ecs.umass.edu

More information

A Parallel Decoding Algorithm of LDPC Codes using CUDA

A Parallel Decoding Algorithm of LDPC Codes using CUDA A Parallel Decoding Algorithm of LDPC Codes using CUDA Shuang Wang and Samuel Cheng School of Electrical and Computer Engineering University of Oklahoma-Tulsa Tulsa, OK 735 {shuangwang, samuel.cheng}@ou.edu

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing

A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing 727 A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing 1 Bharati B. Sayankar, 2 Pankaj Agrawal 1 Electronics Department, Rashtrasant Tukdoji Maharaj Nagpur University, G.H. Raisoni

More information

Evaluating Compiler Support for Complexity Effective Network Processing

Evaluating Compiler Support for Complexity Effective Network Processing Evaluating Compiler Support for Complexity Effective Network Processing Pradeep Rao and S.K. Nandy Computer Aided Design Laboratory. SERC, Indian Institute of Science. pradeep,nandy@cadl.iisc.ernet.in

More information

TCP performance experiment on LOBS network testbed

TCP performance experiment on LOBS network testbed Wei Zhang, Jian Wu, Jintong Lin, Wang Minxue, Shi Jindan Key Laboratory of Optical Communication & Lightwave Technologies, Ministry of Education Beijing University of Posts and Telecommunications, Beijing

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Cross Clock-Domain TDM Virtual Circuits for Networks on Chips

Cross Clock-Domain TDM Virtual Circuits for Networks on Chips Cross Clock-Domain TDM Virtual Circuits for Networks on Chips Zhonghai Lu Dept. of Electronic Systems School for Information and Communication Technology KTH - Royal Institute of Technology, Stockholm

More information

EECS 570 Final Exam - SOLUTIONS Winter 2015

EECS 570 Final Exam - SOLUTIONS Winter 2015 EECS 570 Final Exam - SOLUTIONS Winter 2015 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 21 2 / 32

More information

Portland State University ECE 588/688. Cray-1 and Cray T3E

Portland State University ECE 588/688. Cray-1 and Cray T3E Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector

More information

ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING 16 MARKS CS 2354 ADVANCE COMPUTER ARCHITECTURE 1. Explain the concepts and challenges of Instruction-Level Parallelism. Define

More information

Ruler: High-Speed Packet Matching and Rewriting on Network Processors

Ruler: High-Speed Packet Matching and Rewriting on Network Processors Ruler: High-Speed Packet Matching and Rewriting on Network Processors Tomáš Hrubý Kees van Reeuwijk Herbert Bos Vrije Universiteit, Amsterdam World45 Ltd. ANCS 2007 Tomáš Hrubý (VU Amsterdam, World45)

More information

The Nios II Family of Configurable Soft-core Processors

The Nios II Family of Configurable Soft-core Processors The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year & Semester : III/VI Section : CSE-1 & CSE-2 Subject Code : CS2354 Subject Name : Advanced Computer Architecture Degree & Branch : B.E C.S.E. UNIT-1 1.

More information

Computer System Components

Computer System Components Computer System Components CPU Core 1 GHz - 3.2 GHz 4-way Superscaler RISC or RISC-core (x86): Deep Instruction Pipelines Dynamic scheduling Multiple FP, integer FUs Dynamic branch prediction Hardware

More information

A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors

A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors Murali Jayapala 1, Francisco Barat 1, Pieter Op de Beeck 1, Francky Catthoor 2, Geert Deconinck 1 and Henk Corporaal

More information

Architectural Considerations for Network Processor Design. EE 382C Embedded Software Systems. Prof. Evans

Architectural Considerations for Network Processor Design. EE 382C Embedded Software Systems. Prof. Evans Architectural Considerations for Network Processor Design EE 382C Embedded Software Systems Prof. Evans Department of Electrical and Computer Engineering The University of Texas at Austin David N. Armstrong

More information

Memory Hierarchies. Instructor: Dmitri A. Gusev. Fall Lecture 10, October 8, CS 502: Computers and Communications Technology

Memory Hierarchies. Instructor: Dmitri A. Gusev. Fall Lecture 10, October 8, CS 502: Computers and Communications Technology Memory Hierarchies Instructor: Dmitri A. Gusev Fall 2007 CS 502: Computers and Communications Technology Lecture 10, October 8, 2007 Memories SRAM: value is stored on a pair of inverting gates very fast

More information

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY 1 Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored

More information

Network Processors Evolution and Current Trends May 1, Nazar Zaidi RMI Corporation, USA

Network Processors Evolution and Current Trends May 1, Nazar Zaidi RMI Corporation, USA Network Processors Evolution and Current Trends May 1, 2008 Nazar Zaidi RMI Corporation, USA Network Processors: Evolution & Trends Overview of Network Processing Drivers & Demands for Network Processing

More information

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola 1. Microprocessor Architectures 1.1 Intel 1.2 Motorola 1.1 Intel The Early Intel Microprocessors The first microprocessor to appear in the market was the Intel 4004, a 4-bit data bus device. This device

More information

[1] C. Moura, \SuperDLX A Generic SuperScalar Simulator," ACAPS Technical Memo 64, School

[1] C. Moura, \SuperDLX A Generic SuperScalar Simulator, ACAPS Technical Memo 64, School References [1] C. Moura, \SuperDLX A Generic SuperScalar Simulator," ACAPS Technical Memo 64, School of Computer Science, McGill University, May 1993. [2] C. Young, N. Gloy, and M. D. Smith, \A Comparative

More information

Fundamental Network Processor Performance Bounds. Hao Che, Chethan Kumar, and Basavaraj Menasinahal

Fundamental Network Processor Performance Bounds. Hao Che, Chethan Kumar, and Basavaraj Menasinahal Fundamental Network Processor Performance Bounds Hao Che, Chethan Kumar, and Basavaraj Menasinahal Department of Computer Science and Engineering University of Texas at Arlington (hche@cseutaedu, chethan@utaedu,

More information

DESIGN A APPLICATION OF NETWORK-ON-CHIP USING 8-PORT ROUTER

DESIGN A APPLICATION OF NETWORK-ON-CHIP USING 8-PORT ROUTER G MAHESH BABU, et al, Volume 2, Issue 7, PP:, SEPTEMBER 2014. DESIGN A APPLICATION OF NETWORK-ON-CHIP USING 8-PORT ROUTER G.Mahesh Babu 1*, Prof. Ch.Srinivasa Kumar 2* 1. II. M.Tech (VLSI), Dept of ECE,

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information