The Impact of Parallel and Multithread Mechanism on Network Processor Performance
Chunqing Wu, Xiangquan Shi, Xuejun Yang, Jinshu Su
Computer School, National University of Defense Technology, Changsha, Hunan, China 410073

Abstract

Network processors are becoming a predominant feature in the field of network hardware due to their high performance and flexibility. The performance of a network processor depends mainly on its architecture. This paper studies the parallel architecture and multithread mechanism in network processors. We discuss the causes of thread stalls and the principle of hiding the latencies caused by various stalls using a multithread mechanism. Finally, we present test results based on an analysis of the relationship between the number of active threads and network processor performance.

1. Introduction

The ceaseless rise of network link rates demands that network devices process packets in extremely short times. For example, the arrival interval of a 40-byte packet is 320 ns on a 1 Gbps link, and only 80 ns on a 4 Gbps link. It is very difficult to complete QoS processing and route table lookup at line rate in so short a time. Network processors are favored by network device manufacturers for their flexibility and for processing performance close to that of an ASIC. The architecture of a network processor is crucial to its processing power, but its memory capacity and computing capability are the basic factors. As an SoC, it is impractical to make its clock frequency as high as that of a general-purpose CPU. For example, the packet arrival interval is about 160 ns on an OC-48 link. That is, a network processor with a frequency of 133 MHz and a clock cycle of 7.5 ns must complete the processing of a 40-byte packet within 21.3 clock cycles to avoid dropping packets. Network processors usually employ DRAM rather than SRAM as external memory because of SRAM's high price, although the access delay of SRAM can reach 1 ns.
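As a quick sanity check on the arithmetic above, the per-packet cycle budget follows directly from the arrival interval and the clock period (the function name here is illustrative):

```python
def cycle_budget(arrival_interval_ns: float, clock_cycle_ns: float) -> float:
    """Clock cycles available to process one packet before the next arrives."""
    return arrival_interval_ns / clock_cycle_ns

# A 160 ns arrival interval on an OC-48 link, with a 7.5 ns clock cycle,
# leaves roughly 21.3 cycles per packet.
budget = cycle_budget(160, 7.5)
```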
As the DRAM access delay is about 55-70 ns (DDR, RDRAM), an external memory access needs about 10 clock cycles. This indicates the impossibility of completing packet processing within 21 clock cycles using only a single thread. [1] and [2] evaluate the required network processing capability for the route table lookup (RTR) and IP fragmentation (FRAG) algorithms. The computational complexity of RTR is 2.1 instructions per byte; for a 2.5 Gbps link, the required computing capability is 2.5 Gbps / 8 × 2.1 = 656 MIPS. The computational complexity of FRAG is 7.7 instructions per byte, so the required computing capability is 2.5 Gbps / 8 × 7.7 = 2406 MIPS. The processing capability of a single state-of-the-art processing element (PE) in a network processor is currently about 150 MIPS, so the architecture of the network processor must be studied carefully. We can meet the requirements of high-speed networks through appropriate design and arrangement of processing elements (PEs) to overcome the limited processing capability of a single PE. It has been thoroughly verified that parallel and multithread mechanisms are effective approaches to raising the performance of computer systems. This paper focuses on the impact of the parallel and multithread mechanisms on network processor performance. Section 2 describes related work. Thread stalls and the introduction of multithreading are presented in Section 3. In Section 4, we discuss the relationship between the parallel mechanism and network processor performance. Finally, we give the implementation and testing results.

2. Related work

Usually, the core portion of a network processor consists of multiple PEs. Each PE is a simple microprocessor on which multiple threads run, and each packet is assigned to one thread for processing. The essential idea is to exploit the parallelism of packet processing by assigning uncorrelated packets to different threads.
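The MIPS requirements above can be reproduced with a one-line calculation (the per-byte instruction counts are those reported for RTR and FRAG):

```python
def required_mips(link_gbps: float, instructions_per_byte: float) -> float:
    """MIPS needed to keep up with a link, given a per-byte instruction cost."""
    bytes_per_second = link_gbps * 1e9 / 8
    return bytes_per_second * instructions_per_byte / 1e6

rtr_mips = required_mips(2.5, 2.1)   # 656.25 MIPS for route table lookup
frag_mips = required_mips(2.5, 7.7)  # 2406.25 MIPS for IP fragmentation
```

Both figures are far beyond what a single PE delivers, which is the motivation for the parallel designs discussed next.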
This work is supported by the National Natural Science Foundation of China (NSFC), under agreement no. 9646.

Proceedings of the Fifth International Conference on Grid and Cooperative Computing (GCC'06), 2006 IEEE.

The hardware architecture of a network processor is shown in Fig. 1. It contains a group of PEs, multiple co-processors (CoP), and multiple hardware logic blocks.

Fig. 1. The hardware architecture of a network processor (PEs and co-processors connected to shared resources)

We can partition network processors into two classes according to their architecture [3]:

Pipelined: each processor is designed for a particular packet processing task, and the processors communicate in a pipelined fashion. Examples of this architectural style include Cisco's PXF, Motorola's C-5 DCP, and Xelerated Packet Devices.

Parallel: each PE performs similar functionality. This approach is commonly coupled with numerous co-processors that accelerate specific types of computation. Since these co-processors are shared across many PEs, an arbitration unit is often required. The Agere PayloadPlus, Intel IXP1200, IBM PowerNP, and Lexra NetVortex are examples of this type of macro-architecture. The parallelism of this type of architecture includes the parallelism of PEs [4], the parallelism of multiple threads within a PE [5], task/packet-level parallelism, and data/instruction-level parallelism within a PE, etc.

Tilman Wolf and his colleagues [6] researched how PE organization impacts system performance. They analyzed the network processor performance of four architectures: parallel, serial, pipelined, and mixed. The simulation results indicate that PE organization has an important impact on system performance: the depth of the pipeline can raise system throughput, but contention for memory access limits the contribution of pipeline width to system throughput; access to off-chip memory reduces system throughput and increases packet queuing delay, which may be alleviated by introducing a latency hiding mechanism; and the cost of communication and synchronization has a greater impact on system throughput than processing time does. Venkatachalam and his colleagues [7] researched how to use a configurable micro-engine architecture and programming model to develop serial pipeline and parallel architectures on the Intel IXP2400.
They evaluated the efficiency of the IXP2400 network processor on two applications: ATM AAL2 and ATM flow management. Research results to date indicate that a pipeline architecture may raise processing performance by increasing the number of pipeline stages, but it is difficult to develop software that avoids system bottlenecks and efficiently drives the processing engines. Comparatively, software development is easier in a parallel architecture, though problems such as memory access conflicts between processors and resource sharing remain. Packets processed in different PEs are not always independent; sometimes packets depend on each other. This dependence appears in two aspects: service order and resource operation. The synchronization problem caused by service order can be solved by maintaining the status of packet processing, while a lock mechanism must be introduced to avoid resource operation conflicts. Although some packets may depend on each other, independent packets form the majority of Internet traffic, and processing packets in parallel has a great effect in raising system throughput. Many studies have indicated that packet processing performance in a parallel architecture with multiple PEs is higher than in a pure pipeline architecture. Additionally, time can be wasted by various kinds of thread stalls while threads run inside a PE, which impacts network processor performance. Next we focus on the impact on network processor performance of thread stalls, the introduction of multiple threads, and parallel PEs.

3. Thread stalls and the introduction of multithreading

We have mentioned that thread stalls can reduce the system's processing capability. This section discusses the types of thread stall, the impact of thread stalls on network processor performance, and the stall-hiding mechanism that utilizes multiple threads.

3.1. Types of thread stall

Resource sharing and exclusive access cause running threads to stall, and these stalls have a great impact on the performance of a multithreaded network processor. The main stall types are:

(1) Coprocessor stall: occurs when the thread is stalled waiting for a coprocessor to finish executing. Examples: a synchronous coprocessor command is issued and the thread stalls until the coprocessor is done executing; an asynchronous coprocessor command is issued while the coprocessor is already in use; a wait instruction is executed while a coprocessor is still executing a previous asynchronous coprocessor command.

(2) Data stall: occurs when an instruction must wait for a specific general-purpose register (GPR) to receive data being loaded across the data bus.

(3) Instruction stall: occurs when the thread is stalled waiting for an instruction fetch to complete, for example when a branch instruction is executed.

(4) Bus stall: occurs when the thread is stalled waiting to access a data bus. Contention for a data bus can arise because another instruction, executed by this thread or another thread, is already using the bus during the cycle in which the CLP requests it, or because a coprocessor is already using the bus on that cycle.

(5) GPR stall: occurs when two operations attempt to copy data into the same GPR on the same cycle.

3.2. Impact of thread stalls on performance

Using Npprofile, a network processor performance analysis toolkit, we analyzed the trace message log files produced by the Layer 3 and Layer 2 packet forwarding picocode. Figure 2 and Figure 3 present the results for thread stalls.
In Figure 2, the first column indicates that the threads stalled about 120 times for 1-10 cycles each (coprocessor stalls: 10, data stalls: 35, instruction stalls: 12, bus stalls: 60). The second column indicates that there was one stall lasting 10-20 cycles (a coprocessor stall). Among the various thread stalls, bus stalls are dominant in number, while coprocessor stalls occupy the greatest number of cycles.

Figure 2. Thread stall frequency vs. number of cycles stalled, tested with the Layer 3 forwarding picocode

Figure 3. Thread stall frequency vs. number of cycles stalled, tested with the Layer 2 forwarding picocode

Using Npprofile, we analyzed the trace message log file produced by the Layer 3 packet forwarding picocode. Figure 4 presents the distribution of the various stalls and running cycles for a single thread. It shows that stalls occupy 59% of the thread's running period. It is therefore necessary to hide these stalls by adopting a parallel multithread mechanism to achieve higher processor utilization.

Fig. 4. Distribution of stalls and running cycles for a single thread (CLP EXEC: 41%; stalls: 21%, 13%, 13%, 12%)
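A rough way to see why a 59% single-thread stall fraction calls for multithreading: if each thread is stalled a fraction p of its time, and stall periods are treated as independent across threads, the PE is busy whenever at least one thread is runnable. This is a simplifying assumption for illustration, not a model from the measurements:

```python
def pe_utilization(stall_fraction: float, num_threads: int) -> float:
    """Probability that at least one of num_threads threads is runnable,
    assuming each thread is independently stalled with probability stall_fraction."""
    return 1.0 - stall_fraction ** num_threads

single = pe_utilization(0.59, 1)  # 0.41, matching the 41% CLP EXEC share
quad = pe_utilization(0.59, 4)    # roughly 0.88 with four threads
```

Even this crude model shows utilization rising quickly with a handful of threads, which is the effect the next subsection exploits.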
3.3. The principle of hiding stalls using multithreading

By introducing the multithread model, we can avoid the waiting caused by the various stalls of a single thread in the micro-engine. Figure 5 shows stall hiding in the multithread model.

Figure 5. Stall hiding in the multithread model (threads 0-2 alternate between run and stall phases over the thread lifetime t, each processing its own packet)

When thread 0 accesses the tree search engine, it hands the PE over to thread 1. When thread 1 accesses external memory, it hands the PE over to thread 2. Thus, by adopting the multithread model and thread switching, the delays caused by the various stalls are effectively hidden, and the PE does not waste processing cycles waiting for stalls to end.

4. The relationship between the parallel mechanism and network processor performance

Suppose m is the number of PEs and n is the number of packets. For a network processor with a single PE, the time to process a single packet of length L is given by expression (1):

St = Ct + f(m, n) + Pt    (1)

where Ct is the time to keep packets in order, f(m, n) is the stall time (here m = 1, n = 1), and Pt is the time to process a single packet in the PE without any stall. Pt is associated with the packet length L: the larger L is, the larger Pt is. Pt is approximately a linear function of L, as in expression (2):

Pt = β * L    (2)

For a network processor with a single PE, the time to process n packets of length L is given by expression (3):

NSt = Ct + f(m, n) + n * β * L    (3)

where m = 1. For a network processor with m PEs, the time to process n packets of length L is given by expression (4):

NTt = g(m) * Ct + f(m, n) + n * β * L / m    (4)

For a network processor with m parallel PEs, keeping packets in order is performed by special hardware, so g(m) is approximately a linear function of m and independent of the packet number; we can therefore treat g(m) as a constant. For a network processor with m parallel PEs, f(m, n) is associated with the number m of active PEs: the larger m is, the larger f(m, n) is.
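Expressions (1)-(4) can be sketched directly in code, with the stall time f(m, n) left as a caller-supplied value since the paper gives it no closed form:

```python
def packet_time_single(C_t: float, stall: float, beta: float, L: float) -> float:
    """Expression (1): St = Ct + f(1, 1) + Pt, with Pt = beta * L per expression (2)."""
    return C_t + stall + beta * L

def n_packet_time_single(C_t: float, stall: float, beta: float, L: float, n: int) -> float:
    """Expression (3): NSt = Ct + f(1, n) + n * beta * L."""
    return C_t + stall + n * beta * L

def n_packet_time_parallel(C_t: float, g_m: float, stall: float,
                           beta: float, L: float, n: int, m: int) -> float:
    """Expression (4): NTt = g(m) * Ct + f(m, n) + n * beta * L / m."""
    return g_m * C_t + stall + n * beta * L / m

# With m = 1 and g(1) = 1, expression (4) reduces to expression (3).
```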
According to the principle of hiding stalls by multithreading, f(m, n) increases very slowly. We know that f(1, n) < f(m, n), but n * Pt >> n * Pt / m; supposing m is 32, n * Pt is 32 times n * Pt / m. The packet length L also affects the number of PEs needed: the smaller L is, the shorter the packet arrival interval, so St must be shorter. For short packets, increasing the number of PEs can effectively reduce the pressure of packet processing. It is difficult to greatly shorten Pt, and thus reduce n * Pt / m, only by raising the frequency of a single PE, which is limited by chip technology. Comparatively, it is easier to increase the number of PEs within a certain range. We note that when m increases, f(m, n) increases too, which impairs the performance of a multithreaded network processor; and m is also limited by chip technology. It is therefore important to trade off between f(m, n) and Pt.

5. Implementation and testing results

We have implemented a high-performance core router using a network processor with multiple PEs. The test was performed on a 2.5 Gbps network interface of this core router. The relationship between throughput and the number of PEs needed is shown in Figure 6. When the packet length is 1024 bytes, line-rate forwarding needs only 2-3 PEs; when the packet length is 60 bytes, line-rate forwarding needs 20 PEs.

Figure 6. Relationship between throughput (million packets sent) and the number of active threads
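The dependence of PE count on packet length can be estimated with a back-of-the-envelope model. The per-packet instruction count and per-PE MIPS used below are illustrative placeholders, not measured values from our tests:

```python
import math

def pes_needed(link_gbps: float, packet_bytes: int,
               instructions_per_packet: float, pe_mips: float) -> int:
    """Minimum PEs to sustain line rate, ignoring stalls and ordering overhead."""
    packets_per_second = link_gbps * 1e9 / (packet_bytes * 8)
    required_mips = packets_per_second * instructions_per_packet / 1e6
    return math.ceil(required_mips / pe_mips)

# Shorter packets arrive more often, so they need more PEs on the same link:
few = pes_needed(2.5, 1024, 500, 150)   # long packets: a handful of PEs
many = pes_needed(2.5, 64, 500, 150)    # short packets: many more PEs
```

This matches the qualitative trend measured in Figure 6: the PE requirement grows roughly in inverse proportion to packet length.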
Figure 7. Relationship between maximum bandwidth rate and the number of active threads without packet loss

Figure 8. Performance of our core router using a network processor with multiple PEs (throughput in Gbit/s and Mpps versus packet size)

The test results indicate that performance rises greatly as the number of active threads increases for a network processor with multiple PEs. Short packets require more parallelism because of their higher arrival rate. But performance is not a linear function of the number of active threads, because thread stalls increase in a multithreaded environment.

6. Summary and conclusion

Network processors are an emerging technology in the network industry. The performance of a network processor is closely associated with its architecture. In our studies we focused our efforts on the relationship between the number of PEs and network processor performance. Experiments and simulations show that parallel and multithread mechanisms play a very important role in the network processor. On the other hand, stalls caused by resource competition greatly impact the performance of the network processor. Besides the hardware work, we could increase the system's parallelism through software techniques, such as parallel searching of multiple route tables.

7. References

[1] Tan Zhang-Xi, Lin Chuang. Analysis and Research on Network Processors. Journal of Software, Vol. 14, No. 2, 2003.
[2] Wolf T, Franklin MA. CommBench: a telecommunications benchmark for network processors. In: Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, Austin, TX, ~162.
[3] Niraj Shah. Understanding Network Processors. Technical Report, University of California, Berkeley.
[4] Ning Weng and Tilman Wolf. Pipelining vs. multiprocessors: choosing the right network processor system topology.
In Proceedings of the Advanced Networking and Communications Hardware Workshop (ANCHOR 2004), Munich, Germany, June 2004.
[5] Patrick Crowley, Marc E. Fiuczynski, Jean-Loup Baer. On the Performance of Multithreaded Architectures for Network Processors. Technical Report 2-1-, University of Washington.
[6] L. Kencl, J.-Y. Le Boudec, T. Wolf et al. Adaptive Load Sharing for Network Processors. In IEEE INFOCOM 2002, New York.
[7] Muthu Venkatachalam, Prashant Chandra, Raj Yavatkar. A highly flexible, distributed multiprocessor architecture for network processing. IEEE Computer Networks, Vol. 41, 2003.
More informationEI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)
EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building
More informationResearch Article MFT-MAC: A Duty-Cycle MAC Protocol Using Multiframe Transmission for Wireless Sensor Networks
Distributed Sensor Networks Volume 2013, Article ID 858765, 6 pages http://dx.doi.org/10.1155/2013/858765 Research Article MFT-MAC: A Duty-Cycle MAC Protocol Using Multiframe Transmission for Wireless
More information4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.
Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that
More informationMulti-gigabit Switching and Routing
Multi-gigabit Switching and Routing Gignet 97 Europe: June 12, 1997. Nick McKeown Assistant Professor of Electrical Engineering and Computer Science nickm@ee.stanford.edu http://ee.stanford.edu/~nickm
More informationPCnet-FAST Buffer Performance White Paper
PCnet-FAST Buffer Performance White Paper The PCnet-FAST controller is designed with a flexible FIFO-SRAM buffer architecture to handle traffic in half-duplex and full-duplex 1-Mbps Ethernet networks.
More informationNegotiating the Maze Getting the most out of memory systems today and tomorrow. Robert Kaye
Negotiating the Maze Getting the most out of memory systems today and tomorrow Robert Kaye 1 System on Chip Memory Systems Systems use external memory Large address space Low cost-per-bit Large interface
More informationThe Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350):
The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350): Motivation for The Memory Hierarchy: { CPU/Memory Performance Gap The Principle Of Locality Cache $$$$$ Cache Basics:
More informationArchitecture Tuning Study: the SimpleScalar Experience
Architecture Tuning Study: the SimpleScalar Experience Jianfeng Yang Yiqun Cao December 5, 2005 Abstract SimpleScalar is software toolset designed for modeling and simulation of processor performance.
More informationIntegrated EPON-LTE Network DBA Algorithm s Real Time Performance Analysis Using Network Processors
Integrated EPON-LTE Network DBA Algorithm s Real Time Performance Analysis Using Network Processors S. Ramya Dr. N. Nagarajan Dr. B. Kaarthick Mettler-Toledo Turing Software, Coimbatore, INDIA Department
More informationCISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP
CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationUnderstanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures
Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Nagi N. Mekhiel Department of Electrical and Computer Engineering Ryerson University, Toronto, Ontario M5B 2K3
More informationChapter 5. A Closer Look at Instruction Set Architectures
Chapter 5 A Closer Look at Instruction Set Architectures Chapter 5 Objectives Understand the factors involved in instruction set architecture design. Gain familiarity with memory addressing modes. Understand
More information,e-pg PATHSHALA- Computer Science Computer Architecture Module 25 Memory Hierarchy Design - Basics
,e-pg PATHSHALA- Computer Science Computer Architecture Module 25 Memory Hierarchy Design - Basics The objectives of this module are to discuss about the need for a hierarchical memory system and also
More informationTransparent TCP Acceleration Through Network Processing
Transparent TCP Acceleration Through Network Processing Tilman Wolf, Shulin You, and Ramaswamy Ramaswamy Department of Electrical and Computer Engineering University of Massachusetts, Amherst, MA 3 {wolf,syou,rramaswa}@ecs.umass.edu
More informationA Parallel Decoding Algorithm of LDPC Codes using CUDA
A Parallel Decoding Algorithm of LDPC Codes using CUDA Shuang Wang and Samuel Cheng School of Electrical and Computer Engineering University of Oklahoma-Tulsa Tulsa, OK 735 {shuangwang, samuel.cheng}@ou.edu
More informationDonn Morrison Department of Computer Science. TDT4255 Memory hierarchies
TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,
More informationA Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing
727 A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing 1 Bharati B. Sayankar, 2 Pankaj Agrawal 1 Electronics Department, Rashtrasant Tukdoji Maharaj Nagpur University, G.H. Raisoni
More informationEvaluating Compiler Support for Complexity Effective Network Processing
Evaluating Compiler Support for Complexity Effective Network Processing Pradeep Rao and S.K. Nandy Computer Aided Design Laboratory. SERC, Indian Institute of Science. pradeep,nandy@cadl.iisc.ernet.in
More informationTCP performance experiment on LOBS network testbed
Wei Zhang, Jian Wu, Jintong Lin, Wang Minxue, Shi Jindan Key Laboratory of Optical Communication & Lightwave Technologies, Ministry of Education Beijing University of Posts and Telecommunications, Beijing
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationCross Clock-Domain TDM Virtual Circuits for Networks on Chips
Cross Clock-Domain TDM Virtual Circuits for Networks on Chips Zhonghai Lu Dept. of Electronic Systems School for Information and Communication Technology KTH - Royal Institute of Technology, Stockholm
More informationEECS 570 Final Exam - SOLUTIONS Winter 2015
EECS 570 Final Exam - SOLUTIONS Winter 2015 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 21 2 / 32
More informationPortland State University ECE 588/688. Cray-1 and Cray T3E
Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector
More informationROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING 16 MARKS CS 2354 ADVANCE COMPUTER ARCHITECTURE 1. Explain the concepts and challenges of Instruction-Level Parallelism. Define
More informationRuler: High-Speed Packet Matching and Rewriting on Network Processors
Ruler: High-Speed Packet Matching and Rewriting on Network Processors Tomáš Hrubý Kees van Reeuwijk Herbert Bos Vrije Universiteit, Amsterdam World45 Ltd. ANCS 2007 Tomáš Hrubý (VU Amsterdam, World45)
More informationThe Nios II Family of Configurable Soft-core Processors
The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture
More informationCS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS
CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight
More informationComputer Architecture
Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,
More informationDEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year & Semester : III/VI Section : CSE-1 & CSE-2 Subject Code : CS2354 Subject Name : Advanced Computer Architecture Degree & Branch : B.E C.S.E. UNIT-1 1.
More informationComputer System Components
Computer System Components CPU Core 1 GHz - 3.2 GHz 4-way Superscaler RISC or RISC-core (x86): Deep Instruction Pipelines Dynamic scheduling Multiple FP, integer FUs Dynamic branch prediction Hardware
More informationA Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors
A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors Murali Jayapala 1, Francisco Barat 1, Pieter Op de Beeck 1, Francky Catthoor 2, Geert Deconinck 1 and Henk Corporaal
More informationArchitectural Considerations for Network Processor Design. EE 382C Embedded Software Systems. Prof. Evans
Architectural Considerations for Network Processor Design EE 382C Embedded Software Systems Prof. Evans Department of Electrical and Computer Engineering The University of Texas at Austin David N. Armstrong
More informationMemory Hierarchies. Instructor: Dmitri A. Gusev. Fall Lecture 10, October 8, CS 502: Computers and Communications Technology
Memory Hierarchies Instructor: Dmitri A. Gusev Fall 2007 CS 502: Computers and Communications Technology Lecture 10, October 8, 2007 Memories SRAM: value is stored on a pair of inverting gates very fast
More informationChapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY
Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY 1 Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored
More informationNetwork Processors Evolution and Current Trends May 1, Nazar Zaidi RMI Corporation, USA
Network Processors Evolution and Current Trends May 1, 2008 Nazar Zaidi RMI Corporation, USA Network Processors: Evolution & Trends Overview of Network Processing Drivers & Demands for Network Processing
More information1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola
1. Microprocessor Architectures 1.1 Intel 1.2 Motorola 1.1 Intel The Early Intel Microprocessors The first microprocessor to appear in the market was the Intel 4004, a 4-bit data bus device. This device
More information[1] C. Moura, \SuperDLX A Generic SuperScalar Simulator," ACAPS Technical Memo 64, School
References [1] C. Moura, \SuperDLX A Generic SuperScalar Simulator," ACAPS Technical Memo 64, School of Computer Science, McGill University, May 1993. [2] C. Young, N. Gloy, and M. D. Smith, \A Comparative
More informationFundamental Network Processor Performance Bounds. Hao Che, Chethan Kumar, and Basavaraj Menasinahal
Fundamental Network Processor Performance Bounds Hao Che, Chethan Kumar, and Basavaraj Menasinahal Department of Computer Science and Engineering University of Texas at Arlington (hche@cseutaedu, chethan@utaedu,
More informationDESIGN A APPLICATION OF NETWORK-ON-CHIP USING 8-PORT ROUTER
G MAHESH BABU, et al, Volume 2, Issue 7, PP:, SEPTEMBER 2014. DESIGN A APPLICATION OF NETWORK-ON-CHIP USING 8-PORT ROUTER G.Mahesh Babu 1*, Prof. Ch.Srinivasa Kumar 2* 1. II. M.Tech (VLSI), Dept of ECE,
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per
More informationEN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)
EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering
More informationOutline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??
Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross
More information