Network Interface Architecture and Prototyping for Chip and Cluster Multiprocessors

University of Crete, School of Sciences & Engineering, Computer Science Department

Master's Thesis by Michael Papamichael

Network Interface Architecture and Prototyping for Chip and Cluster Multiprocessors

Supervisor: Prof. Manolis G.H. Katevenis

Heraklion, Crete, July 2007

Presentation Outline
- NI Queue Manager
  - Introduction
  - Key Concepts
  - Architecture & Implementation
- NI Design for CMPs
  - NI Design Goals
  - NI Design Issues
  - Proposed NI Design
- Conclusions & Future Work

NI Queue Manager - Introduction
FPGA-based prototyping platform:
- PCI-X RDMA-capable NIC in a cluster environment
- Buffered crossbar switch
Goals:
- Confirm buffered crossbar behavior
- Interprocessor communication research

NI Queue Manager - Key Concepts
Head-of-Line (HOL) Blocking: with a single FIFO per input, a cell that cannot be switched (because its output is busy) blocks all cells queued behind it, even those destined for idle outputs. HOL blocking reduces switch throughput.

NI Queue Manager - Key Concepts
Virtual Output Queues (VOQs): each input maintains a separate queue per output, so a cell held up by a busy output no longer blocks cells destined for idle outputs. VOQs eliminate HOL blocking.
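The effect can be illustrated with a toy crossbar simulation (a hedged sketch of the general principle, not part of the thesis prototype; all names are ours). Each slot, every input may send one cell and every output may accept one; with a single FIFO only the head cell is eligible, while with VOQs the head of every per-output queue is:

```python
from collections import deque

def drain_slots(inputs, n_outputs, voq):
    """Time slots needed to deliver every cell through an ideal crossbar.
    inputs: one list of cells per input port; a cell is just its destination.
    Each slot, every input may send one cell and every output accept one."""
    if voq:
        # one queue per (input, output) pair: every VOQ head is eligible
        qs = [[deque(c for c in cells if c == o) for o in range(n_outputs)]
              for cells in inputs]
        def eligible(i, o): return bool(qs[i][o])
        def pop(i, o): qs[i][o].popleft()
        def done(): return not any(any(v) for v in qs)
    else:
        # one FIFO per input: only the head cell is eligible (HOL blocking)
        qs = [deque(cells) for cells in inputs]
        def eligible(i, o): return bool(qs[i]) and qs[i][0] == o
        def pop(i, o): qs[i].popleft()
        def done(): return not any(qs)
    slots = 0
    while not done():
        slots += 1
        used = set()                    # inputs already matched this slot
        for o in range(n_outputs):      # greedy output-major matching
            for i in range(len(inputs)):
                if i not in used and eligible(i, o):
                    pop(i, o)
                    used.add(i)
                    break
    return slots

# Input 1 queues a cell for contended output 0 ahead of one for idle output 1:
TRACE = [[0], [0, 1]]
```

With this trace the single-FIFO switch needs 3 slots while VOQs drain it in 2: input 1's cell for the idle output can bypass the head cell that lost arbitration for output 0.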

NI Queue Manager - Key Concepts
Traffic Segmentation Schemes: traffic is segmented to optimize switching; Variable-Size MultiPacket (VSMP) segmentation is well suited to the buffered crossbar. For an example flow of packets of 40, 300, 160, and 80 bytes:
- Fixed-size Unipacket segments (S = 64 B): 11 segments, 704 bytes
- Variable-size Unipacket segments (S <= 256 B): 5 segments, 580 bytes
- Fixed-size Multipacket segments (S = 256 B): 3 segments, 768 bytes
- Variable-size Multipacket segments (S <= 256 B): 3 segments, 580 bytes
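The four counts follow directly from the four policies; a small sketch (function names are ours) reproduces the slide's arithmetic:

```python
from math import ceil

PKTS = [40, 300, 160, 80]   # example packet sizes in bytes

def fixed_unipacket(pkts, S):
    """Segments never span packets and are padded up to exactly S bytes."""
    segs = sum(ceil(p / S) for p in pkts)
    return segs, segs * S               # padding inflates the byte count

def variable_unipacket(pkts, S):
    """Segments never span packets, at most S bytes each, no padding."""
    return sum(ceil(p / S) for p in pkts), sum(pkts)

def fixed_multipacket(pkts, S):
    """Segments of exactly S bytes that may span packet boundaries."""
    segs = ceil(sum(pkts) / S)
    return segs, segs * S               # only the tail segment carries padding

def variable_multipacket(pkts, S):
    """VSMP: segments of at most S bytes that may span packets, no padding."""
    return ceil(sum(pkts) / S), sum(pkts)
```

Unipacket schemes pay one partially filled segment per packet, fixed-size schemes pay padding bytes; VSMP avoids both, which is why it moves the 580 payload bytes in only 3 segments with no padding.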

NI Queue Manager - Architecture & Implementation
Overview:
- Virtual Output Queues (VOQs)
- VOQ migration to external memory
- Hardware-managed linked lists
- VSMP segmentation
- Scheduling
- Flow control
3 major versions implemented.

NI Queue Manager - Architecture
The Queue Manager sits between the host (PCI-X) and the network (RocketIO). A Packet Sorter feeds the On-Chip VOQs; a Linked List Manager spills VOQs to external memory (off-chip VOQs) through a Memory Controller; a Scheduler, driven by Flow Control, selects segments for the Packet Processor, which transmits them to the network.

NI Queue Manager - Packet Sorter
- Sorts packets according to destination and other criteria (e.g. priority)
- Notifies the Scheduler about incoming traffic
- Light-weight packet processing, e.g. enforcing a maximum packet size
- 2 versions implemented: with and without packet segmentation

NI Queue Manager - On-Chip VOQs
Accumulates traffic in VOQs, implemented either as circular buffers in a single statically partitioned on-chip memory or as Xilinx FIFOs.
2 versions implemented: VOQs in BRAM; VOQs in Xilinx FIFOs.

Linked List Manager
Performs segment transfers (variable-size and fixed-size segments).
Manages linked lists:
- Head and Tail pointers in on-chip memory
- Next-Block pointers in DRAM (alongside the data)
Optimization techniques: Free-Block Preallocation, Free-List Bypass.
FSM-based implementation.

Linked List Manager - Segment Transfers
Traffic flows from the Packet Sorter into the On-Chip VOQs (variable-size segments) and may spill to off-chip VOQs (fixed-size blocks; maximum segment size = 1 block). Three kinds of transfers:
1. From On-Chip VOQs to the Packet Processor
2. From On-Chip VOQs to Off-Chip VOQs
3. From Off-Chip VOQs to the Packet Processor


Linked List Manager - Linked List Management
- Large VOQs migrate to DRAM; traffic is stored in linked lists of fixed-size blocks, with external memory allocated dynamically.
- The block size must be large enough to benefit from the DRAM burst length, yet small enough to keep the On-Chip VOQs small.
- 2 basic operations: Enqueue and Dequeue.
- Free blocks are kept in a Free-Block List.

Linked List Manager - Basic Linked List Operations
Enqueue:
1. Get a free block from the Free-Block List
2. Write the data into the new block
3. Update the Next-Block pointer of the previous last block
4. Update the VOQ Tail pointer
Dequeue:
1. Read the data from the first block
2. Read the Next-Block pointer from the first block
3. Update the VOQ Head pointer
4. Put the freed block back on the Free-Block List
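The steps above can be sketched in software (a hedged model of the hardware-managed lists; the class and field names are ours). Per-block next pointers and data model DRAM, while the per-VOQ and free-list head/tail pointers model the on-chip state:

```python
class VOQLinkedLists:
    """Every VOQ is a list of fixed-size DRAM blocks; head/tail pointers
    live on-chip and each block stores a next-block pointer with its data."""
    def __init__(self, n_blocks, n_voqs):
        self.next = [None] * n_blocks       # per-block next pointer (in DRAM)
        self.data = [None] * n_blocks       # per-block payload (in DRAM)
        for b in range(n_blocks - 1):       # chain all blocks into the free list
            self.next[b] = b + 1
        self.free_head, self.free_tail = 0, n_blocks - 1
        self.head = [None] * n_voqs         # on-chip per-VOQ head pointers
        self.tail = [None] * n_voqs         # on-chip per-VOQ tail pointers

    def enqueue(self, voq, payload):
        b = self.free_head                  # 1. get a free block
        assert b is not None, "out of blocks"
        self.free_head = self.next[b]
        if self.free_head is None:
            self.free_tail = None
        self.data[b] = payload              # 2. write data into the block
        self.next[b] = None
        if self.tail[voq] is None:          # 3. link behind the old tail
            self.head[voq] = b
        else:
            self.next[self.tail[voq]] = b
        self.tail[voq] = b                  # 4. update the VOQ tail pointer

    def dequeue(self, voq):
        b = self.head[voq]
        payload = self.data[b]              # 1. read the data
        self.head[voq] = self.next[b]       # 2-3. follow next, update head
        if self.head[voq] is None:
            self.tail[voq] = None
        if self.free_tail is None:          # 4. return block to the free list
            self.free_head = b
        else:
            self.next[self.free_tail] = b
        self.next[b] = None
        self.free_tail = b
        return payload
```

Because every block carries its own next pointer, enqueue and dequeue each touch a bounded number of words, which is what makes the FSM implementation in hardware practical.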

Linked List Manager - Enqueue/Dequeue Example
(Figure: two enqueues into VOQ 5 followed by a dequeue from VOQ 5, showing how the on-chip VOQ Head/Tail pointers and the Free-List Head/Tail pointers are updated as blocks are taken from and returned to the free list in DRAM.)

Linked List Manager - Finite State Machine (FSM)
(Figure: the full FSM. From Idle, transitions lead through ENQUEUE (states ENQ1, ENQ2), DEQUEUE (DEQ1, DEQ2), POP FREE BLOCK (POPFB2), PUSH FREE BLOCK (PUSHFB2), and the SRAM-to-crossbar transfer states SRAM2XBAR1, SRAM2XBAR2.)

Linked List Manager - Finite State Machine (FSM), simplified view
(Figure: from IDLE, the FSM enters DEQUEUE followed by PUSH FREE BLOCK, POP FREE BLOCK followed by ENQUEUE, or the direct On-Chip VOQs to Packet Processor transfer.)

Scheduler
- Keeps track of each VOQ's on-chip and off-chip occupancy
- Employs flow control (network and local): tracks the number of sent data words and the number of credits
- Implements scheduling: builds VOQ eligibility masks, enforces the scheduling policy, and instructs the Linked List Manager
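Credit accounting of the kind the Scheduler performs can be sketched as follows (a minimal model, not the thesis RTL; names are ours, and one credit is taken to stand for one buffer slot at the downstream crossbar port):

```python
class CreditCounter:
    """Per-VOQ credit state: a VOQ is eligible to send only while it holds
    credits; each departure consumes one, each credit return restores one."""
    def __init__(self, initial_credits):
        self.credits = initial_credits
        self.sent_words = 0          # words sent, as tracked by the Scheduler

    def eligible(self):
        return self.credits > 0

    def on_send(self, words):
        assert self.eligible()       # the Scheduler never picks a dry VOQ
        self.credits -= 1
        self.sent_words += words

    def on_credit_return(self):
        self.credits += 1
```

A VOQ that exhausts its credits is simply masked out of the eligibility vector until credits flow back from the crossbar.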

Scheduler - Scheduling Issues
Determining eligibility: one eligibility mask for each kind of transfer; the masks can be maintained eagerly or lazily.
Scheduling policy: Round-Robin, Weighted Round-Robin, Deficit Round-Robin, or Strict Priority (which risks starvation of low-priority VOQs).
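As one example of the policies listed, Deficit Round-Robin can be sketched as follows (a generic DRR sketch, not the thesis implementation; queues here hold segment sizes in bytes and the function records the service order):

```python
def deficit_round_robin(voqs, quantum):
    """Each non-empty VOQ earns `quantum` bytes of credit per round and may
    send its head segment only when its accumulated deficit covers it.
    Mutates `voqs`; returns the order in which VOQs were served."""
    deficit = [0] * len(voqs)
    order = []
    while any(voqs):
        for i, q in enumerate(voqs):
            if not q:
                deficit[i] = 0              # empty queues keep no credit
                continue
            deficit[i] += quantum           # earn this round's quantum
            while q and q[0] <= deficit[i]: # send while credit covers the head
                deficit[i] -= q.pop(0)
                order.append(i)
    return order
```

Unlike plain round-robin, DRR stays fair for variable-size segments: a VOQ with a large segment at its head simply waits until its deficit has grown enough to pay for it, instead of stealing bandwidth.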

Packet Processor
- Processes network traffic: receives variable-size segments and creates autonomous network packets
- Performs 3 basic operations on headers: insert, modify, delete
- Implemented as a 3-stage pipeline
- Complexity depends heavily on the nature of the packets; RDMA packets are well suited

Packet Processor - Example of Packet Processing
(Figure: traffic passing through the Packet Processor. A stream of segments (Segment 1, Segment 2, Seg 3, ...) is turned into autonomous packets (Packet 1 through Packet 5) by a sequence of Insert, Modify, and Delete operations on packet headers; end-of-packet markers delimit packet bodies.)

NI Queue Manager - Implementation
3 major versions: Full; No external memory; No VSMP segmentation.
Variations of individual modules:
- Packet Processor with/without segmentation
- On-Chip VOQs with BRAM/Xilinx FIFOs
- Linked List Manager with/without external memory support

NI Queue Manager - FPGA Hardware Cost Results (8 VOQs)

Queue Mgr Version     LUTs        Slices      Flip-Flops   BRAMs      Gate Count
Full                  2962 (7%)   2106 (10%)  1571 (3%)    34 (17%)   2263151
No External Memory    2713 (6%)   1900 (9%)   1467 (3%)    34 (17%)   2260321
No VSMPS              3430 (8%)   2639 (13%)  2286 (5%)    37 (19%)   2471047

Module                              LUTs       Slices      Flip-Flops   BRAMs      Gate Count
Packet Sorter (with segmentation)   179 (1%)   135 (1%)    80 (1%)      0 (0%)     1909
Packet Sorter (no segmentation)     42 (1%)    25 (1%)     12 (1%)      0 (0%)     392
On-Chip VOQs (BRAM)                 320 (1%)   236 (1%)    170 (1%)     31 (16%)   2035342
On-Chip VOQs (Xilinx FIFOs)         904 (2%)   1408 (7%)   1648 (4%)    32 (16%)   2119015
Scheduler                           2240 (5%)  1256 (6%)   428 (1%)     1 (1%)     86226
Linked List Mgr (with ext. mem.)    680 (1%)   365 (1%)    425 (1%)     0 (0%)     7069
Linked List Mgr (no ext. mem.)      33 (1%)    23 (1%)     18 (1%)      0 (0%)     426
Packet Processor                    521 (1%)   511 (2%)    617 (1%)     0 (0%)     9983

NI Queue Manager - Network Performance Results*
Verified previous theoretical and simulation results on the behavior and performance of the buffered crossbar: throughput measurements under unbalanced traffic patterns, and average delay vs. input load under uniform traffic, with a maximum sustained load of 96%.
* The experiments were conducted by Vassilis Papaefstathiou, member of the CARV team.

IPC: NI Design Issues - NI Design Goals
- High performance: low latency, high bandwidth
- Scalability
- Reliability
- Protection & security
- Communication overhead minimization

NI Design Issues - NI Placement
Possible attachment points, from lower to higher performance:
1. I/O Bus
2. Memory Bus
3. Cache
4. CPU Registers
Placing the NI far from the processor (I/O bus) offers standard interfaces and more buffer space at lower performance; processor proximity brings higher performance and resource sharing, but proprietary interfaces and less buffer space.

NI Design Issues - NI Data Transfer Mechanisms
Programmed I/O (PIO):
- Uncached stores/loads or I/O instructions: low bandwidth
- Cached stores/loads: data transferred in cache blocks, but requires coherence
- Occupies the processor to copy data
Direct Memory Access (DMA):
- Decouples the processor
- Requires virtual-to-physical address translation

NI Design Issues - NI Virtualization & Protection
Traditional approach: a system call on every NI access - very costly (high latency).
User-level NIs (ULNIs): bypass the OS; the NI is mapped into virtual memory, and the OS paging mechanism provides protection.

Proposed NI Design - Design Goals / Desired Features
- Targeted at future Chip Multiprocessors with thousands to millions of nodes
- Tightly coupled with the processor
- Low complexity/cost compared to the processor and its local memory
- Resources (e.g. memory) shared with the processor and dynamically allocated (only to active connections)
- Low-overhead transfer initiation
- Powerful communication primitives for both bulk data transfers and control & synchronization traffic

Proposed NI Design - Target Hardware/System
Xilinx XUP board: PowerPC CPU running Linux; on-chip BRAM (scratchpad, possibly cache); external DRAM; 10 Gbps network over RocketIO.

Proposed NI Design - Target Hardware/System
(Figure: the Xilinx XUP FPGA floorplan. PowerPC at 266 MHz, possibly with OCM; DRAM at 133 MHz on the PLB, which is shared between the CPU and the NI; on-chip memory of 8 dual-port BRAM banks at 166 MHz (low latency, high throughput); the NI itself, simple and small compared to the CPU and its local memory, drives the 10 Gbps RocketIO network.)

Proposed NI Design - Communication Primitives
Remote DMA:
- Bulk data transfers, producer-consumer patterns
- Facilitates zero-copy protocols
- Requires virtual-to-physical translation
Message Queues:
- Low-latency, minimal-overhead communication for small data transfers
- One-to-One, One-to-Many, Many-to-One
- Powerful synchronization primitive

Proposed NI Design - Message Queues
- One-to-One: e.g. small data transfers
- One-to-Many: e.g. job dispatching
- Many-to-One: e.g. synchronization queues

Proposed NI Design - Connections
A connection is the mechanism for sending messages and initiating RDMA operations. A connection consists of:
- a Queue Pair (Incoming & Outgoing)
- the information needed for RDMA
- various queue and flow-control metadata
Connection types: One-to-One (lower overhead, higher security) and Many-to-Many (better resource utilization).

Proposed NI Design - Connection Types
- One-to-One: both the Incoming and the Outgoing Queue are One-to-One
- Many-to-Many: the Incoming Queue is Many-to-One and/or the Outgoing Queue is One-to-Many

Proposed NI Design - Connection Types Example
One-to-One: local communication. Many-to-Many: global communication.

Proposed NI Design - Connection Table (CT)
Connections reside in a Connection Table, kept in node scratchpad memory, in the form of Connection Table Entries (CTEs); each connection is identified by a Connection ID (CID) that indexes its CTE.

Proposed NI Design - Connection Table Entry (CTE)
Each CTE stores:
- Destination info (for one-to-one connections)
- Protection info (for many-to-many connections)
- Incoming/Outgoing Queue info
- Incoming/Outgoing RDMA info
- Flow control info
- Other info, e.g. notification type

Proposed NI Design - Connection Table Entry for the Xilinx XUP
(Figure: the CTE packed into eight 32-bit words B1-B8, indexed by CID. Field widths in bits: Remote Node ID (24), Remote CID (16), PGID (16); per queue: Base (6), Start (8), End (8), Head (10), Tail (10). The words hold system-level connection info, system-level outgoing and incoming queue info, outgoing and incoming RDMA info, and the user-level outgoing (Tail) and incoming (Head) queue pointers.)

Proposed NI Design - Software Interface
Send a message or initiate RDMA:
1. Read the Head/Tail of the Outgoing Queue
2. Post a descriptor
3. Write the Tail of the Outgoing Queue
(Poll the Head of the Outgoing Queue for completion)
Receive a message or RDMA notification:
1. Poll the Tail of the Incoming Queue (or take an interrupt)
2. Read the message (or descriptor)
3. Write the Head of the Incoming Queue
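The send-side steps can be sketched as a circular queue shared between software and the NI (a minimal model of the mechanism; the class and method names are ours):

```python
class OutgoingQueue:
    """User-level send path: software posts a descriptor at the tail of a
    circular outgoing queue; the NI consumes descriptors from the head."""
    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.head = 0    # advanced by the NI as it drains descriptors
        self.tail = 0    # advanced by software as it posts descriptors

    def post(self, descriptor):
        """Software side: returns False when the queue is full."""
        nxt = (self.tail + 1) % self.size
        if nxt == self.head:                 # read head: queue full, back off
            return False
        self.slots[self.tail] = descriptor   # post the descriptor
        self.tail = nxt                      # write tail last (the doorbell)
        return True

    def ni_consume(self):
        """NI side: drain one descriptor and advance the head, or None."""
        if self.head == self.tail:
            return None
        d = self.slots[self.head]
        self.head = (self.head + 1) % self.size
        return d
```

Reading the head before posting implements the fullness check, and writing the tail only after the descriptor is in place is what makes the post visible to the NI atomically, with no system call on the fast path.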

Proposed NI Design - Software Interface: Descriptors
(Figure: descriptor formats.)

Proposed NI Design - Protection: Intranode
Isolation of malicious processes. A process must not be able to:
- read or write data of other processes' connections
- corrupt other processes' connections

Proposed NI Design - Protection: Internode
Isolation of a compromised node. A compromised node must not be able to:
- compromise other nodes
- corrupt connections of other processes

Proposed NI Design - Protection in the Proposed NI
The Connection Table banks are split into 3 protection zones:
- High protection: banks written only by special NI hardware or the run-time system
- Moderate protection: banks written only by system-level software (e.g. the kernel driver)
- Low protection: banks written by everyone, including user-level software

Proposed NI Design - Controlling Access to the CT
Distinguishing user-level from kernel-level accesses: shadow address spaces. Each CTE appears twice in the physical address space - once in the normal range, mapped to user-level processes, and once in a shadow range, mapped to system-level processes - at the cost of doubling the required address space.

Proposed NI Design - Controlling Access to the CT
Distinguishing different user-level processes: fine-grain protection through paging. Each process's virtual page maps to a different physical page of the Connection Table, and each page holds 32 CTEs, so the ordinary virtual memory mechanism confines a process to its own 32-CTE window.
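The page-granularity arithmetic can be made concrete (a sketch under the assumption of 4 KiB pages, which the slide does not state; the 32-CTEs-per-page figure is from the slide, the resulting 128-byte CTE size and the names below are our inference):

```python
PAGE_SIZE = 4096                        # assumed 4 KiB pages
CTES_PER_PAGE = 32                      # from the slide: 32 CTEs per page
CTE_SIZE = PAGE_SIZE // CTES_PER_PAGE   # -> 128 bytes per CTE, given the above

def cte_location(cid):
    """Which Connection Table page (the protection granule) holds a CID's
    CTE, and at what byte offset within that page."""
    return cid // CTES_PER_PAGE, (cid % CTES_PER_PAGE) * CTE_SIZE
```

A process granted CIDs 32-63 is simply given a mapping for page 1 of the table and can touch no other process's entries.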

Presentation Outline
- NI Queue Manager: Key Concepts, Architecture, Implementation
- IPC: NI Design Issues
- Proposed NI Design
- Conclusions & Future Work

Conclusions
NI Queue Manager:
- Demonstrated the feasibility and effectiveness of VSMP segmentation
- Confirmed buffered crossbar results
- Novel packet processing mechanism that eliminates the need for reassembly
Proposed NI Design:
- Lightweight, well suited to future CMPs
- Powerful communication primitives: Message Queues, Remote DMA, and the notion of Connections/Queues
- Versatile protection & security mechanisms

Future Work
NI Queue Manager:
- Scalability
- Dynamic memory allocation for the On-Chip VOQs
Proposed NI Design:
- Process migration
- Flow control
- Efficient cache coherence support

Thank You!