# Scalable Packet Classification Using Interpreting: A Cross-platform Multi-core Solution


…to the first element of array eqid0, whose bitmap is 0001. Each bit in a bitmap corresponds to a rule, with the most significant bit corresponding to R1 and the least significant bit to R4; each bit records whether the corresponding rule matches for a given input. Thus, bitmap 0001 means that only rule R4 matches when index 0 of Chunk 0 is used as the F1 input. Similarly, the first entry of Chunk 2 has value 0 and points to the first entry of eqid2, whose bitmap is 0011, indicating that only rules R3 and R4 match if index 0 of Chunk 2 is used as the input for field F3. In the second phase (Phase 1), a cross-producting table (CPT) and its accompanying eqid table are constructed from the eqid tables built in Phase 0. Each CPT entry is also an index, pointing into the final eqid table, whose entry records all the rules matched for the corresponding combination of eqid0, eqid1, and eqid2 values. For instance, the index of the first entry of the CPT is 0, obtained from the combination in which all three eqid values are 0. The rules matched can be computed as the intersection of eqid0[0] (0001), eqid1[0] (0001), and eqid2[0] (0011). The result is 0001, indicating that rule R4 matches when this combination is used as input for the three fields F1, F2, and F3. The lookup process for the sample packet P(010, 100, 100) in Figure 1 is as follows: 1) use each field P1, P2, and P3 (i.e., 010, 100, 100) to look up Chunks 0-2 and compute the index into cross-producting table A as Chunk0[2]*3*2 + Chunk1[4]*2 + Chunk2[4], which is 16; 2) the value of CPT[16] is 3, which is used as an index into eqid3. The result, 0011, indicates that rules R3 and R4 match the input packet P. Finally, R3 is returned, as it has higher priority than R4 according to the longest-match principle.
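This two-phase construction and lookup can be sketched in a few lines of Python. The eqid bitmaps below are hypothetical stand-ins (the figure's actual tables are not reproduced in this excerpt), but they are chosen to be consistent with the values quoted above: CPT[16] is 3 and eqid3[3] is 0011.

```python
NRULES = 4  # bit 3 = R1 (most significant) ... bit 0 = R4

# Hypothetical Phase-0 equivalence-class bitmaps for fields F1, F2, F3.
eqid0 = [0b0001, 0b1001, 0b0111]   # 3 classes for F1
eqid1 = [0b0001, 0b0101, 0b1011]   # 3 classes for F2
eqid2 = [0b0011, 0b1101]           # 2 classes for F3

def build_cpt(e0, e1, e2):
    """Phase 1: cross-product the eqid tables. Each CPT entry indexes the
    final eqid table, whose bitmap is the AND of the per-field bitmaps."""
    cpt, eqid3 = [], []
    for b0 in e0:
        for b1 in e1:
            for b2 in e2:
                bm = b0 & b1 & b2
                if bm not in eqid3:        # deduplicate equivalence classes
                    eqid3.append(bm)
                cpt.append(eqid3.index(bm))
    return cpt, eqid3

def lookup(i0, i1, i2, cpt, eqid3):
    """Flatten the per-chunk indices exactly as in the text:
    index = i0*3*2 + i1*2 + i2 (3 = |eqid1|, 2 = |eqid2|)."""
    return eqid3[cpt[i0 * 3 * 2 + i1 * 2 + i2]]

def best_rule(bitmap):
    """Highest-priority rule = most significant set bit (R1 outranks R4)."""
    return NRULES - bitmap.bit_length() + 1

cpt, eqid3 = build_cpt(eqid0, eqid1, eqid2)
assert cpt[16] == 3 and eqid3[3] == 0b0011   # matches the walkthrough above
bm = lookup(2, 2, 0, cpt, eqid3)             # sample packet P maps to index 16
print(bin(bm), f"-> R{best_rule(bm)}")       # 0b11 -> R3 (beats R4)
```

The deduplicated eqid3 table is what keeps later phases compact: the CPT stores only small class indices rather than full rule bitmaps.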
Figure 1. A two-phase RFC reduction tree (sample packet P = (010, 100, 100); OIndex = eqid0 * 6 + eqid1 * 2 + eqid2). Figure 2. A 4-phase RFC reduction tree.

A 4-phase RFC reduction tree for 5-tuple IP packet classification is shown in Figure 2. There are seven chunks in phase 0 (corresponding to the lower/upper 16 bits of the source and destination IP addresses, the 16-bit source and destination port numbers, and the 8-bit protocol type), three CPTs in phase 1, two CPTs in phase 2, and one CPT in phase 3. It achieves the fastest classification speed, at 13 memory accesses per packet. Astute readers may notice that the size of the CPTs increases non-linearly in later reduction phases; the high speed of RFC is thus obtained at the cost of memory explosion.

4.2 TIC Algorithm Description

Since RFC's space explodes when the number of rules becomes large, two-stage classification algorithms [6][11] were proposed to balance classification speed against memory space. In a two-stage classification algorithm, the IP address fields and the TCP port numbers are searched separately. The advantages of this separation are as follows. Firstly, the IP address fields are normally represented as prefixes and the TCP port fields as ranges. Because range-to-prefix transformation increases the number of actual rules, it increases memory space accordingly. Furthermore, matching a prefix is a memory-intensive operation while matching a range is an ALU-intensive operation, so it is more efficient to handle them separately, especially on a multi-core processor, where idle CPU cycles are available as more cores are added. Secondly, previous studies on classifier database characteristics [6] have revealed that 99.9% of the time the number of classifier rules that match a given source-destination prefix pair is no more than 5.
Our analysis of the synthetic classifiers generated by ClassBench [4] finds that 95.8% of the time the number of matching rules is no more than 10. Therefore, the number of ranges to be matched after the first stage is limited. Our Two-stage Interpreting-based Classification (TIC) algorithm consists of the following two stages. The first stage searches an RFC reduction tree composed from the source and destination addresses. As shown in the upper part of Figure 2, there are three phases of search with 7 memory accesses. Phase 0 contains 4 chunks: chunks 0 and 1 search the low and high 16 bits of the source IP address, and chunks 2 and 3 search the low and high 16 bits of the destination IP address, respectively. Phase 1 contains 2 CPTs made from the results of the 16-bit address matches. In phase 2, a list of range expressions is returned. The second stage searches a list of range expressions made from the port numbers and protocol fields. Ranges in the list are encoded as a sequence of ALU instructions to further reduce memory space, and an interpreter executes those instructions to find a match. The Range Interpreter (RI) is a simple virtual machine: it fetches a block of code from external memory into internal memory (cache), then decodes and executes each instruction sequentially. The interpreter exploits data locality by matching the block size to the cache-line size or internal-memory block size; by doing so, external memory accesses are dramatically reduced. Our analysis of the experimental classifiers shows that 94.9% of the lists of candidate ranges can be encoded into one block of 64 bytes, the same as the cache-line size of an x86 multi-core. Therefore, 94.9% of the time the number of external memory accesses in the second stage is one. Moreover, since the interpreter accesses only this cache line until all instructions in the block finish execution, it enables highly efficient ALU execution.

4.3 Instruction Encoding

In preprocessing, after the reduction tree of the first stage is constructed, it presents lists of potentially matching rules. For the rule example in Table 1, four lists are available: (R4), (R1, R4), (R2, R4), and (R3, R4). For each list, a code block is formed by encoding the range information of the candidate rules; if one block cannot accommodate the whole list, more blocks are used. ClassBench [4] gives five classes of port range: WC (wildcard), HI ([1024:65535]), LO ([0:1023]), AR (arbitrary range), and EM (exact match), and two classes of protocol range: WC and EM. So we have about 50 (5*5*2) operators for the CISC-style instructions.
There are three instruction formats, with one operand, three operands, and five operands, as shown in Figure 3. Since 8 bits are used to encode the operator, 16 bits for the rule ID, and the protocol field is 8 bits, the minimum instruction size is 4 bytes. For example, the operator `EM-WC-WC` means that the protocol field requires an exact match while the source and destination port fields match any port number, since they are wildcards. Therefore, an `EM-WC-WC` instruction can be encoded in 4 bytes, in which only the protocol field is required for matching. Because an arbitrary range requires two 16-bit numbers to represent, the 8-byte and 12-byte instructions serve instructions having one or two AR operands. We use 16 bits to store the rule ID, which means that the maximum number of rules in a classifier is 64K. When a port-number specification in a classifier is WC, LO, or HI, no operand is needed; at least one operand is needed for EM, and two for AR. Tables 2 and 3 show the distribution of port-number specifications in real classifier seeds of ClassBench. For both source and destination port numbers the proportions of AR are very limited, so 12-byte instructions appear very infrequently; in fact, the average instruction length is less than 1.8 long-words.

Figure 3. Three instruction formats with 4, 8, and 12 bytes:
- 4-byte instruction: operator, operand0, rule ID
- 8-byte instruction: operator, operand0, rule ID, operand1, operand2/reserved
- 12-byte instruction: operator, operand0, rule ID, operand1, operand2/reserved, operand3, operand4/reserved

Table 2. Distribution of source port number specifications (WC, HI, LO, EM, AR) across classifier seeds seed1-seed5; seed1 and seed4 are 100% WC.

Table 3. Distribution of destination port number specifications (WC, HI, LO, EM, AR) across classifier seeds seed1-seed5.

In addition, the following instructions are supported. NOP avoids handling a partial instruction in a block: an NOP instruction is padded at the end of each instruction block to align the block size to the cache-line size, which simplifies interpreter execution by eliminating partial instructions. HI handles well-known TCP port-number comparison. TCP port numbers less than 1024 are assigned directly by IANA. The two ranges LO ([0:1023]) and HI ([1024:65535]) are widely used in classification rules, so two special instructions are added to denote them; the size of these instructions is reduced from 3 long-words to 1, since the two well-known ranges do not need to be stored in the instructions. As Tables 2 and 3 show, wildcards appear very frequently. This suggests encoding this information in the operator field rather than the operand field, since the 8-bit operator field allows as many as 256 operators.

4.4 The Range Interpreter

The pseudocode of the range interpreter is listed in Figure 4. All instruction blocks are stored in external memory, and after the first stage we obtain the address of the first code block. In lines 1-2, block i is fetched from external memory into internal memory whose size equals that of a cache line. Then (lines 3-4) the current instruction is decoded and executed. If true is returned, the best matching rule has been found and its rule ID is returned (lines 5-6); otherwise the search continues (lines 7-8). If all instructions in block i have executed and no matching rule has been found, the next block i+1 is fetched (line 10). Please note that the loop (lines 3-9) accesses only the current data cache line.
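As a concrete illustration, the three instruction formats can be sketched with Python's `struct` module. The field order, opcode numbering, and the `encode` helper below are assumptions for illustration only; the paper's actual binary layout is given by Figure 3.

```python
import struct

# Hypothetical opcode numbers for three of the ~50 protocol/port operators.
OPS = {"EM-WC-WC": 1, "EM-AR-WC": 2, "EM-AR-AR": 3}

def encode(op, proto, rule_id, *bounds):
    """Pack one instruction: 8-bit operator, 8-bit protocol operand,
    16-bit rule ID, then zero, one, or two 16-bit range-bound pairs."""
    insn = struct.pack("<BBH", OPS[op], proto, rule_id)
    for lo, hi in zip(bounds[::2], bounds[1::2]):  # each AR costs 4 bytes
        insn += struct.pack("<HH", lo, hi)
    return insn

# Protocol-only match (ports are wildcards): the minimum 4-byte format.
assert len(encode("EM-WC-WC", 6, 42)) == 4
# One arbitrary source-port range: the 8-byte format.
assert len(encode("EM-AR-WC", 6, 42, 1024, 2048)) == 8
# Arbitrary ranges on both ports: the maximum 12-byte format.
assert len(encode("EM-AR-AR", 6, 42, 1024, 2048, 80, 443)) == 12
```

Because WC, LO, and HI consume no operands, most instructions stay at the 4-byte (one long-word) format, which is consistent with the average instruction length staying under 1.8 long-words.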

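Although Figure 4 itself is not reproduced in this excerpt, the interpreter loop it describes (fetch a block, decode and execute each instruction, return on the first match, otherwise fetch the next block) can be sketched as follows. Instructions are modelled as tuples rather than packed bytes, and the per-operator semantics are illustrative assumptions.

```python
LO, HI, EM, AR, NOP = range(5)   # operator classes named in the text
PER_BLOCK = 16                   # e.g. sixteen 4-byte instructions per 64B block

def match(insn, port):
    """Execute one decoded instruction against a port number."""
    op, _rule, lo, hi = insn
    if op == LO: return port <= 1023           # well-known range [0:1023]
    if op == HI: return port >= 1024           # range [1024:65535]
    if op == EM: return port == lo             # exact match
    if op == AR: return lo <= port <= hi       # arbitrary range
    return False                               # NOP padding never matches

def interpret(blocks, port):
    """Figure 4's loop: fetch block i (lines 1-2), decode and execute each
    instruction (lines 3-4); on a match return the rule ID (lines 5-6),
    otherwise continue (lines 7-9) and fetch block i+1 (line 10)."""
    for block in blocks:            # each iteration touches one cache line
        for insn in block:
            if match(insn, port):
                return insn[1]      # rules are assumed listed by priority
    return None

# A candidate list from stage one, NOP-padded to the block size.
blk = [(EM, 3, 80, 0), (AR, 4, 1024, 2048)] + [(NOP, 0, 0, 0)] * (PER_BLOCK - 2)
print(interpret([blk], 1500))   # -> 4: the arbitrary range matches
```

Since 94.9% of candidate lists fit in one block, the outer loop usually runs once, so the second stage typically costs a single external memory access.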
…indicates that the time a TIC SRAM operation stays in the FIFO is longer than that of an RFC SRAM operation, which eventually slows down TIC's classification speed. Second, some of the 32B reads are wasteful, and we analyze their impact in the following section.

Table 10. Classification speeds (Mpps) of IXP2800 for RFC and TIC under 1 to 32 threads, for databases DB1-DB5, with TIC's improvement over RFC in percent.

Table 11. Memory access behavior of RFC vs. TIC (number of memory accesses and number of long-words accessed; TIC accesses 15 long-words).

Table 12. The SRAM read FIFO average and maximum size for DB1 across channels #0-#3, for TIC and RFC.

Figure 6. Classification speeds (Mpps) and speedups for 32B and 64B block sizes for DB5.

6.5 Block Size Impact on IXP2800

Figure 6 shows the classification speeds and speedups of TIC on IXP2800 when the block size is chosen as 32B and 64B, respectively, for DB5. When the block size is 32B, both the classification speed and the speedup are higher, for two reasons: (1) fetching a 64B code block is wasteful, because on average the number of ranges evaluated is 1.17 (as shown in Table 9); (2) fetching 64 bytes from SRAM raises the utilization of the SRAM channel and enlarges the SRAM read FIFO, which badly increases the latency of SRAM accesses. As shown in Table 12, the average and maximum sizes of the SRAM read FIFO are larger for TIC than for RFC, which suggests that an SRAM operation stays in the FIFO longer for TIC than for RFC; therefore it takes TIC more time to access SRAM data. Choosing the right code-block size thus influences classification performance.

7. Programming Guidance on Multi-core Architectures

We have presented the implementation and performance analysis of TIC on two representative multi-core processors: the Intel Core 2 Duo and the Intel IXP2800.
Based on our experience, we provide the following guidelines for creating an efficient network application on a multi-core architecture:

1) Both application and multi-core architectural features must be exploited to design and implement a highly efficient networking algorithm.
2) Compress data structures and store them as high in the memory hierarchy as possible (e.g., SRAM, cache, on-chip memory) to bridge the speed gap between CPU and memory.
3) A proper compression scheme must be chosen to balance decompression complexity against space reduction, so that decompression does not become a new performance bottleneck.
4) In general, a multi-core architecture has many different shared resources; pay attention to how those shared resources are used, because they may become a bottleneck in the algorithm implementation.
5) Exploit as many latency-hiding techniques as possible to hide memory access latency or reduce memory references.

8. Conclusions and Future Work

This paper proposed a scalable packet classification algorithm, TIC, that can be efficiently implemented on multi-core architectures with or without caches. We implemented the algorithm on both the Intel IXP2800 network processor and the Core 2 Duo x86 architecture. We studied the interaction between parallel algorithm design and architecture mapping to facilitate efficient algorithm implementation on multi-core architectures, and we experimented with an architecture-aware design principle to guarantee the scalability and high performance of the resulting algorithm. Furthermore, we investigated the main software design issues that have the most significant performance impacts on…


### Multi-gigabit Switching and Routing

Multi-gigabit Switching and Routing Gignet 97 Europe: June 12, 1997. Nick McKeown Assistant Professor of Electrical Engineering and Computer Science nickm@ee.stanford.edu http://ee.stanford.edu/~nickm

### COMP 273 Winter physical vs. virtual mem Mar. 15, 2012

Virtual Memory The model of MIPS Memory that we have been working with is as follows. There is your MIPS program, including various functions and data used by this program, and there are some kernel programs

### A paralleled algorithm based on multimedia retrieval

A paralleled algorithm based on multimedia retrieval Changhong Guo Teaching and Researching Department of Basic Course, Jilin Institute of Physical Education, Changchun 130022, Jilin, China Abstract With

### COSC 6385 Computer Architecture. - Memory Hierarchies (I)

COSC 6385 Computer Architecture - Hierarchies (I) Fall 2007 Slides are based on a lecture by David Culler, University of California, Berkley http//www.eecs.berkeley.edu/~culler/courses/cs252-s05 Recap

### Overview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy

Overview SMD149 - Operating Systems - Multiprocessing Roland Parviainen Multiprocessor systems Multiprocessor, operating system and memory organizations December 1, 2005 1/55 2/55 Multiprocessor system

### Analysis of Sorting as a Streaming Application

1 of 10 Analysis of Sorting as a Streaming Application Greg Galloway, ggalloway@wustl.edu (A class project report written under the guidance of Prof. Raj Jain) Download Abstract Expressing concurrency

### Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw

### Fault-Tolerant Routing in Fault Blocks. Planarly Constructed. Dong Xiang, Jia-Guang Sun, Jie. and Krishnaiyan Thulasiraman. Abstract.

Fault-Tolerant Routing in Fault Blocks Planarly Constructed Dong Xiang, Jia-Guang Sun, Jie and Krishnaiyan Thulasiraman Abstract A few faulty nodes can an n-dimensional mesh or torus network unsafe for

### Chapter 1: Fundamentals of Quantitative Design and Analysis

1 / 12 Chapter 1: Fundamentals of Quantitative Design and Analysis Be careful in this chapter. It contains a tremendous amount of information and data about the changes in computer architecture since the

### MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

### ADVANCES IN PROCESSOR DESIGN AND THE EFFECTS OF MOORES LAW AND AMDAHLS LAW IN RELATION TO THROUGHPUT MEMORY CAPACITY AND PARALLEL PROCESSING

ADVANCES IN PROCESSOR DESIGN AND THE EFFECTS OF MOORES LAW AND AMDAHLS LAW IN RELATION TO THROUGHPUT MEMORY CAPACITY AND PARALLEL PROCESSING Evan Baytan Department of Electrical Engineering and Computer

### Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed

Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448 1 The Greed for Speed Two general approaches to making computers faster Faster uniprocessor All the techniques we ve been looking

### 1 Connectionless Routing

UCSD DEPARTMENT OF COMPUTER SCIENCE CS123a Computer Networking, IP Addressing and Neighbor Routing In these we quickly give an overview of IP addressing and Neighbor Routing. Routing consists of: IP addressing

### Dynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle

Dynamic Fine Grain Scheduling of Pipeline Parallelism Presented by: Ram Manohar Oruganti and Michael TeWinkle Overview Introduction Motivation Scheduling Approaches GRAMPS scheduling method Evaluation

### One instruction specifies multiple operations All scheduling of execution units is static

VLIW Architectures Very Long Instruction Word Architecture One instruction specifies multiple operations All scheduling of execution units is static Done by compiler Static scheduling should mean less

### Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java Threads

Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java s Devrim Akgün Computer Engineering of Technology Faculty, Duzce University, Duzce,Turkey ABSTRACT Developing multi

### Memory hierarchy Outline

Memory hierarchy Outline Performance impact Principles of memory hierarchy Memory technology and basics 2 Page 1 Performance impact Memory references of a program typically determine the ultimate performance

### Fast Firewall Implementations for Software and Hardware-based Routers

Fast Firewall Implementations for Software and Hardware-based Routers Lili Qiu George Varghese Subhash Suri liliq@microsoft.com varghese@cs.ucsd.edu suri@cs.ucsb.edu Microsoft Research University of California,

### A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE

A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE Abu Asaduzzaman, Deepthi Gummadi, and Chok M. Yip Department of Electrical Engineering and Computer Science Wichita State University Wichita, Kansas, USA

### Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

### Caches Concepts Review

Caches Concepts Review What is a block address? Why not bring just what is needed by the processor? What is a set associative cache? Write-through? Write-back? Then we ll see: Block allocation policy on

### PARALLEL ALGORITHMS FOR IP SWITCHERS/ROUTERS

THE UNIVERSITY OF NAIROBI DEPARTMENT OF ELECTRICAL AND INFORMATION ENGINEERING FINAL YEAR PROJECT. PROJECT NO. 60 PARALLEL ALGORITHMS FOR IP SWITCHERS/ROUTERS OMARI JAPHETH N. F17/2157/2004 SUPERVISOR:

### EE 4683/5683: COMPUTER ARCHITECTURE

EE 4683/5683: COMPUTER ARCHITECTURE Lecture 6A: Cache Design Avinash Kodi, kodi@ohioedu Agenda 2 Review: Memory Hierarchy Review: Cache Organization Direct-mapped Set- Associative Fully-Associative 1 Major