Hardware Accelerators for Real-Time Scheduling in Packet Processing Systems

Abstract

Fast job scheduling is vital for real-time systems in which task execution times are short. This paper focuses on packet processing systems, e.g. multimedia streaming applications, in which it is critical that scheduling delays are no more than a few clock cycles. To meet this challenge, we propose hardware accelerators which implement scheduling algorithms directly in custom hardware, thereby achieving very low scheduling latencies. We first present a generic framework for such scheduling accelerators, then instantiate several designs by implementing different scheduling algorithms as plug-ins. Our proposed hardware implementation uses an asynchronous, or clockless, circuit style because of its significant benefits over clocked circuits, particularly in the area of modularity. Simulations of the resulting scheduler hardware indicate scheduling latencies as low as 14 processor cycles for the highest-priority tasks.

1. Introduction

This paper targets a particular class of real-time systems, called packet processors, in which quick job turnover is a critical benchmark for performance. Such systems operate on a stream of packets, each packet typically executing for a limited number of clock cycles (e.g., 10-100 cycles). For packet scheduling overhead to be acceptable, it is vital that scheduling occurs rapidly, with delays of no more than a few clock cycles. This scenario differs from traditional real-time operating systems, where task execution times are typically orders of magnitude greater (and therefore scheduling speed is less critical). Our work introduces a scheduling approach that uses custom hardware accelerators to provide the scheduling speeds required by packet processing systems. Building schedulers directly in hardware has two benefits: significantly shortening scheduling delays, and freeing the processor from the burden of scheduling.

While these proposed hardware accelerators are applicable to any real-time packet processing system, an application of particular interest is real-time multimedia processing. Devices with multimedia components have become commonplace in consumer electronics, reflecting a widespread interest in interactive audio and video applications. To continue this trend, the breadth of technology offered by individual products is expanding: e.g., cellular phones now record video, play music, and access the internet, in addition to handling voice.

Figure 1. Different scheduler implementations: (a) scheduling in software, (b) scheduling on a hardware accelerator, (c) scheduling on hardware in parallel with job execution.

For such streaming multimedia systems, it is often a challenge to efficiently manage the scheduling of several tasks of different priorities. First, to maintain usability, certain real-time requirements must be met. In audio and video communication, for example, a maximum two-way delay of 300 ms [8] can be tolerated. Therefore, in addition to the requirement that the processing of each multimedia packet be fast, there is also a requirement that the scheduling be fast and effective, so that deadlines are maintained. Second, it is typically desirable that the overhead of task management itself does not impose a significant burden on the processor; the greater the amount of useful work performed by the processor, the higher the quality of the multimedia experience.
In this paper, we propose dedicated hardware accelerators for real-time scheduling, with the twin objectives of (i) increasing the scheduling speed, and (ii) removing the burden of scheduling from the processor. Figure 1 illustrates the benefit of our proposed approach. Figure 1(a) shows a timeline corresponding to conventional software-based scheduling: the scheduler imposes significant overheads in systems where task execution times are short. In Figure 1(b), a hardware accelerator is introduced to substantially shorten scheduling delays. In this scheme, scheduling decisions are made after the previous job has completed execution. Finally, Figure 1(c) shows a more concurrent scheduling approach, where the scheduler decides which task to execute next before the current task has finished execution. While (c) has the benefit of moving scheduling delays off of the critical path, (b) may provide better scheduling quality for late-arriving high-priority jobs. The performance benefits of hardware accelerators are most obvious when the cost of scheduling jobs is a nonnegligible portion of the execution time of a processor.

However, even in systems where scheduling overheads are negligible compared to processing costs, using a hardware scheduler is still advantageous. In particular, hardware-based schedulers allow reduced complexity and overhead in software. They also provide a more rapid response to dynamic behavior in the system or environment, e.g., job scheduling in a bursty environment.

For the design of the scheduling hardware, an asynchronous or clockless circuit paradigm was chosen [5]. Asynchronous circuits dispense with centralized clocking, and instead use local handshakes to coordinate communication. As a result, they offer the potential benefits of lower power consumption, higher speed, and greater modularity and ease of design [3]. This work exploits their design modularity to enable the rapid and low-effort construction of custom scheduler hardware from high-level algorithmic descriptions. In particular, the lack of centralized clocking frees the designer from low-level timing considerations, allowing the design to be easily composed from reusable building blocks. This modularity allows us to propose a general framework, or hardware template, for a generic scheduling accelerator. Different scheduling policies have been designed as distinct hardware components (i.e., plug-ins), any of which can be selected during physical design for insertion into the generic template, for either silicon fabrication or FPGA implementation.

A wide range of scheduling algorithms were implemented in hardware, down to the gate level, using our approach. The schedulers implemented include well-known static and dynamic schedulers, including rate-monotonic (RM) and earliest-deadline-first (EDF) [10]. The resulting hardware implementations were simulated, at the gate level, and their performance evaluated assuming they were part of a system containing a typical embedded compute processor running at 40 MHz. Simulations indicate promising scheduling speeds for our dynamic schedulers: only 14-31 processor cycle latencies for the highest-priority jobs to be scheduled. If scheduling is instead performed in software, the scheduling latencies are several orders of magnitude greater.

The remainder of this paper is organized as follows. Section 2 provides background on asynchronous design, including a brief description of the specific synthesis flow used in our approach. Section 3 then presents the design and implementation of the proposed scheduling accelerator; several distinct scheduling policies are covered. Section 4 presents synthesis and simulation results for each of the schedulers implemented, including the scheduling speeds obtained, as well as the throughput and area overheads incurred. Finally, Section 5 gives conclusions and future research directions.

2. Previous Work and Background

This section discusses previous work in the area of hardware accelerators for real-time scheduling, and provides a brief introduction to asynchronous design, including an overview of the synthesis flow that is used to implement the scheduler hardware proposed in this paper.

Figure 2. Clocked vs. clockless FIFOs: (a) clocked implementation, (b) asynchronous implementation with request/acknowledge handshakes.

2.1 Previous Work on Scheduler Accelerators

Several hardware-based approaches to real-time scheduling have been reported in the literature, but each has significant limitations. Most of these approaches are based on hardware implementations using binary comparator trees [12], shift registers [14], and systolic arrays [9].
Each of these approaches has significant drawbacks, including high design complexity, lack of scalability and flexibility, and limited scheduling performance [11]. The approach of [4], however, is both fast and scalable. A novel pipelined heap architecture is introduced, which is capable of high-speed enqueueing and dequeueing. Further, the architecture can be scaled to arbitrary priority levels without performance degradation. Our approach, however, has several advantages over that of [4]. First, our approach provides a generalized framework for scheduling accelerators, along with modular scheduling plug-in components. In contrast, their approach provides only one specific scheduling strategy. Second, their implementation is memory-based and relies on efficient manipulation of data in memory. In contrast, our approach is dataflow-based, which avoids bottlenecks due to centralized memory accesses. Finally, our approach allows much more concurrent job insertion: jobs are inserted at the leaf nodes of the structure, thereby allowing the concurrent insertion of as many jobs as there are leaf nodes. In contrast, their approach inserts jobs at the root node, and therefore only a single queueing operation can be initiated at a time.

2.2 Background: Asynchronous Design

The current practice of synchronous hardware design is facing increasing difficulties as clock speeds approach 10 GHz, chip complexity approaches a billion transistors, and the demands for low power consumption and modular design become paramount [2].

Asynchronous or clockless design is emerging as an attractive alternative, with the promise of alleviating these challenges by dispensing with centralized clocking altogether [5, 3]. Instead, events inside an asynchronous system are coordinated in a much more distributed fashion, using local handshakes between communicating components. Figure 2 illustrates this difference between synchronous and asynchronous design for the simple example of a FIFO. In the synchronous implementation, data is computed in one stage and transferred to the next on every clock tick; thus, coordination between communicating stages is governed implicitly by the clock. In contrast, the asynchronous implementation of the FIFO replaces the global clock signal by a pair of request-acknowledge signals for each pair of communicating stages. (Note that in asynchronous real-time systems the presence of a clock may be necessary to verify that all real-time deadlines are met; however, the purpose of such a clock is to provide a timing reference, rather than to govern the pace of all logic activity.) A stage initiates computation only when it receives new data along with a request from its left neighbor. Once the data has been processed by the stage, the left neighbor is acknowledged, and the data, along with a request, is relayed to the right neighbor.

    fifo = proc (IN? chan packet & OUT! chan packet).
    begin
        x : var packet
        forever do
            IN?x ; OUT!x
        od
    end

Figure 3. Haste example.

Advantages of Asynchronous Design. Asynchronous circuits have several key advantages over synchronous circuits (see [5, 3]).

1) Greater Energy Efficiency. In synchronous systems, idle components generally process garbage unless a clock gating protocol is in place. Asynchronous circuits inherently avoid unnecessary computation: components are only activated upon receipt of a handshake, thereby consuming little energy when idle [3].

2) Better Electromagnetic Compatibility. While synchronous circuits produce noise spikes at the clock frequency and its harmonics, asynchronous circuits not only produce less total noise energy, but that energy is also spread across the spectrum [3]. The benefit of lower noise emissions was the key motivation for Philips to develop fully asynchronous microprocessors, which have been used in tens of millions of commercial pagers, cell phones and smartcards throughout Europe [7].

3) Higher Speed. An asynchronous system can exploit data-dependent computation times: when a result in a component is produced early, it can be immediately communicated to the next stage. In contrast, in a synchronous system, that component would have to wait for the clock cycle to finish before the result can be communicated. Thus, while synchronous implementations are limited by worst-case clocking, asynchronous designs can potentially obtain average-case behavior [3].

4) More Robust Arbitration. Arbitration and mutual exclusion, which are fundamental to real-time systems, can often lead to metastability in hardware. While metastability can have drastic effects on clocked designs, its impact is less drastic in asynchronous designs. In particular, should a metastable state arise in a synchronous circuit, it must be resolved before the next clock edge to ensure correct circuit-level behavior. In asynchronous systems, on the other hand, the metastable state is allowed to persist as long as necessary for its resolution; because there is no clock deadline to meet, the rest of the circuitry simply waits for the state to be resolved.

5) Greater Modularity and Ease of Design. Asynchronous handshake protocols promote modularity, allowing components to be developed and reused in multiple designs [3]. Moreover, the time-consuming task of designing a low-skew high-speed clock distribution network is no longer necessary. As long as the communication protocol at a module's interface is met, the module will operate correctly regardless of the environment it is embedded within. This greater modularity of asynchronous components is key to design reuse, which is likely to become critical as chip complexities increase to a billion transistors over the next few years.
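As a concrete illustration of the request/acknowledge protocol of Figure 2, the C sketch below models a four-phase handshake channel, with threads standing in for concurrently operating pipeline stages. This is only an expository model under our own names (channel_t, channel_send, channel_recv); the designs in this paper are synthesized directly to handshake circuits, not software.

    /* Expository C model of a four-phase req/ack handshake channel.
       Threads stand in for concurrent hardware stages; names are ours. */
    #include <pthread.h>

    typedef struct {
        pthread_mutex_t m;
        pthread_cond_t  cv;
        int req, ack;          /* handshake wires */
        int data;              /* data wires      */
    } channel_t;

    void channel_init(channel_t *c) {
        pthread_mutex_init(&c->m, NULL);
        pthread_cond_init(&c->cv, NULL);
        c->req = c->ack = 0;
    }

    /* Sender: raise req along with data, wait for ack, return to zero. */
    void channel_send(channel_t *c, int v) {
        pthread_mutex_lock(&c->m);
        c->data = v;
        c->req = 1;                                        /* request up     */
        pthread_cond_broadcast(&c->cv);
        while (!c->ack) pthread_cond_wait(&c->cv, &c->m);  /* await ack up   */
        c->req = 0;                                        /* request down   */
        pthread_cond_broadcast(&c->cv);
        while (c->ack) pthread_cond_wait(&c->cv, &c->m);   /* await ack down */
        pthread_mutex_unlock(&c->m);
    }

    /* Receiver: wait for req, latch the data, then acknowledge. */
    int channel_recv(channel_t *c) {
        pthread_mutex_lock(&c->m);
        while (!c->req) pthread_cond_wait(&c->cv, &c->m);  /* await req up   */
        int v = c->data;                                   /* latch data     */
        c->ack = 1;
        pthread_cond_broadcast(&c->cv);
        while (c->req) pthread_cond_wait(&c->cv, &c->m);   /* await req down */
        c->ack = 0;
        pthread_cond_broadcast(&c->cv);
        pthread_mutex_unlock(&c->m);
        return v;
    }

A FIFO stage is then just an endless receive-then-send loop between its IN and OUT channels, which is exactly the behavior the one-line Haste program of Figure 3 expresses.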
2.3 High-Level Asynchronous Synthesis Flow

The designs discussed in Section 3 were synthesized and simulated using the Haste/TiDE design flow (formerly Tangram), a product of Philips/Handshake Solutions [1]. Haste is one of the few mature asynchronous design flows currently available; the toolset focuses on the rapid design of custom asynchronous hardware. It targets medium-speed low-power implementations running at or below 400 MHz (in 0.13 µm technology). The Haste toolset is essentially a silicon compiler: it accepts specifications written in a high-level hardware description language, and compiles them, via syntax-driven translation, into a gate-level circuit.

The high-level language is a close variant of the CSP behavioral modeling language [6], which allows complex behaviors to be specified in a few lines of code. Figure 3 shows the Haste specification of a single stage of a FIFO. The stage has an input channel IN, through which it receives a packet from its left neighbor, and an output channel OUT, through which it transmits the packet to its right neighbor. Each channel consists of a pair of request-acknowledge wires along with the data wires. In the specification, x is a storage variable, which corresponds to the storage element (i.e., latch or flip-flop) of the FIFO stage. Finally, the main construct in the body of the specification is a forever do loop, which specifies the following: (i) read a packet from channel IN and store it into variable x; then (ii) write the value stored in x to the output channel OUT; and (iii) perform this sequence of actions repeatedly, forever.

The compiler parses the specification, and syntactically maps each construct onto a predefined library component to generate the hardware implementation, as shown at the bottom of Figure 3. In particular, there is a predefined component that implements the forever do construct: it repeatedly initiates handshakes with its target. Similarly, there is a predefined component that implements sequencing, called the sequencer and denoted by ";". The sequencer, upon receiving a handshake from its parent, performs a handshake with its left child followed by a handshake with its right child. The variable x maps to a storage element. Finally, the read and write operations, i.e., read from channel IN and write to x, and subsequently, read from x and write to channel OUT, map to predefined components called transferrers.

In summary, the compilation approach is quite simple but very powerful: fairly complex algorithms can be easily mapped to hardware. Gate-level implementations for fairly complex designs, such as a complete microcontroller, can be generated in as little as a few minutes. Specifications of large systems are naturally decomposed into subsystems or smaller components (i.e., individual procedures). This was a key motivation for this paper's use of the Haste/TiDE synthesis flow: individual scheduling policies can be independently implemented as separate components, which can then be plugged into a generic hardware template to rapidly implement a scheduling accelerator.

3. Design and Implementation

This section introduces several designs of a scheduling accelerator for a real-time packet processing system. A simple FIFO queue and our generalized replacement are described in subsection 3.1. A static scheduler is discussed in subsection 3.2. Dynamic schedulers are presented in subsection 3.3; both rate-monotonic (RM) and earliest-deadline-first (EDF) designs have been implemented. Finally, a multiprocessor configuration using our architecture is proposed in subsection 3.4.

3.1 Overview

In the following subsections, we explore the design space of asynchronous scheduling by introducing several potential schedulers. Many tradeoffs exist in this domain: algorithm complexity versus ease of design, latency versus throughput, correctness versus effectiveness; any one design may be optimized for one metric but suffer in another.
Quantitative results such as latency and throughput are presented in Section 4, but the discussion in this section includes a qualitative comparison of the different designs in terms of their implementation complexity.

In order to analyze the effectiveness of the proposed scheduler hardware, a base case is necessary for comparison. Here we use a simple first-in first-out (FIFO) queue as a reference. A FIFO queue essentially models a packet processing system with no job prioritization. The length of the queue must be sufficient for bursts of activity to be absorbed without packet loss. Any incoming job may be blocked by up to n jobs, where n is the number of stages in the pipeline. High-priority jobs are likely to miss deadlines in such a system.

In contrast to the FIFO queue, which ignores job priorities, our implementations allow packet prioritization by breaking up the single queue into several priority queues and attaching a scheduler between the queues and the processor. The generalized design is shown in Figure 4. On the left is a distributor that quickly analyzes incoming packets and places them in the appropriate queue. After being routed and traveling through its assigned queue, a job is visible to the scheduler and may be selected for processing. If selected, it is nonpreemptively executed by the processor. The designs in the following subsections implement this general structure.

Figure 4. General structure: distributor, scheduler, processor.
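To make this general structure concrete, here is a minimal C sketch of it: a distributor routing incoming jobs into per-priority queues, with the scheduling policy abstracted behind a plug-in function type. All names (job_t, queue_t, distribute, sched_plugin_t) and the fixed sizes are our own illustrative assumptions; in the actual design these are asynchronous hardware blocks, not software.

    /* Sketch of the generic accelerator template (Figure 4); names ours. */
    #define NUM_PRIO 8            /* number of priority levels / queues   */
    #define QDEPTH   16           /* per-queue capacity to absorb bursts  */

    typedef struct { int id; int priority; int deadline; } job_t;

    typedef struct {
        job_t slot[QDEPTH];
        int head, tail, count;    /* circular-buffer state */
    } queue_t;

    static queue_t prio_q[NUM_PRIO];

    /* Distributor: quickly examine an incoming packet and enqueue it
       in the queue matching its priority level (0 = highest here). */
    void distribute(job_t j) {
        queue_t *q = &prio_q[j.priority];
        if (q->count == QDEPTH) return;        /* queue full: drop */
        q->slot[q->tail] = j;
        q->tail = (q->tail + 1) % QDEPTH;
        q->count++;
    }

    /* Scheduling-policy "plug-in": given the queue states, return the
       index of the queue to serve next, or -1 if all are empty. */
    typedef int (*sched_plugin_t)(const queue_t q[NUM_PRIO]);

The sketches in the following subsections reuse these definitions, each filling in the same plug-in interface with a different policy.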

3.2 Static Scheduling

The behavior of a static scheduler is determined prior to execution. For a packet processing system, where incoming jobs may enter at arbitrary times, purely static schedulers such as cyclic executives are a poor choice. Here we have modified the cyclic executive to allow more dynamic behavior. In our design, all priority levels are assigned a share of the processor's time. A schedule is generated that maintains these shares and spaces processor access evenly. The schedule is written to a fast on-chip ROM. At runtime, the scheduler cycles through the ROM, reading from the queue indicated by the schedule. The behavior of this implementation differs from a cyclic executive when a queue scheduled to be read is empty. Rather than block or no-op, the schedule advances until a non-empty queue is read. Note that the shares are then no longer accurately maintained; this can be remedied in an even more dynamic implementation. In environments where the priority queues are generally nonempty, this implementation can be effective. The flaws of the scheduler are apparent when traffic occurs in widely-spaced bursts or when priority assignments have been poorly mapped. Scaling the scheduler to a large priority set has a detrimental effect on performance; as the number of priority levels increases, queues are more likely to be vacant, reducing scheduler quality.
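Reusing the definitions of the sketch above, a software model of this modified cyclic executive might look as follows. The table contents are a hypothetical share assignment (level 0 gets the largest share), not a schedule from the paper, and the array stands in for the on-chip ROM.

    /* Static scheduling plug-in: cycle through a precomputed schedule,
       skipping entries whose queue is currently empty (Section 3.2). */
    #define SCHED_LEN 16

    static const int rom_schedule[SCHED_LEN] =   /* hypothetical shares */
        {0, 1, 0, 2, 0, 1, 3, 0, 2, 1, 4, 0, 5, 1, 6, 7};

    static int rom_pos = 0;

    int static_schedule(const queue_t q[NUM_PRIO]) {
        for (int tries = 0; tries < SCHED_LEN; tries++) {
            int level = rom_schedule[rom_pos];
            rom_pos = (rom_pos + 1) % SCHED_LEN;   /* advance the schedule */
            if (q[level].count > 0) return level;  /* non-empty: serve it  */
        }
        return -1;                                 /* every queue empty    */
    }

Note how skipping empty entries is precisely what breaks the accuracy of the shares, as discussed above.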
3.3 Dynamic Scheduling

Dynamic schedulers react more appropriately to the behavior of a packet processing system. Here we discuss two simple RM schedulers and a more advanced scheduler which handles both RM and EDF.

3.3.1 RM: Sequential Selection

The simple RM scheme mimics a basic software approach to scheduling. When the processor becomes idle, it queries the scheduler for the next available job. The scheduler then checks to see if any jobs are available, starting with the highest-priority queue. If a job is available, it is scheduled for execution. If not, the next-highest-priority queue is checked, and the process repeats. The general structure becomes a large series of if-then-else statements. The highest-priority job can be prevented from executing for the maximum execution time of any lower-priority process. Choosing the highest-priority job to execute is not a constant-time decision; the time taken by the scheduler depends on the priority of the job chosen. However, jobs of high priority take the least amount of time to be selected, leaving jobs of low priority with the greatest overhead. For systems with many priority levels, jobs with low priority can see significant overheads, affecting the throughput of the scheduler. A notable property of this scheduler is its lack of arbitration: all the other dynamic implementations rely on mutual exclusion elements to make decisions.

3.3.2 RM: Parallel Selection

One downside of the simple RM scheme just described is its non-constant decision time. To remedy this, we introduce a parallel approach to queue probing, which is analogous to the case construct in software. Instead of sequentially checking each priority queue, all queues can be checked in parallel, yielding a constant-time decision. Parallel decision is performed in Haste by the use of the select (sel) construct. Each guard is evaluated concurrently; if a guard is true, it is executed. When several guards are true, we select the highest-priority job for execution. (Both selection styles are sketched in code after Section 3.3.3.) Compared to the previous scheduler, parallel RM improves both throughput and latency for almost all jobs. As the number of priority levels increases, this scheduler maintains a good response time. However, jobs using sequential selection will see quicker response for the top several priority levels in large-scale systems.

3.3.3 RM: Early Selection

The previous scheduler can be further parallelized by removing the scheduling decision from the critical path. Instead of the processor querying the scheduler for the next available job after completion, this decision can be made in parallel with execution. Should a higher-priority job enter the system after the scheduling decision is made, it will be unable to execute in the next time slot. Using early selection, a job can be blocked by a maximum of two other jobs, rather than just one when the scheduler is in the critical path. Gains in overall throughput and in the latency of low-priority jobs are achieved at the expense of high-priority jobs.
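The sketch referenced in Section 3.3.2 follows: two RM selection plug-ins in the same C model, one mirroring the sequential if-then-else chain of Section 3.3.1 and one approximating the parallel guarded select of Section 3.3.2 with a guard bitmask and a find-first-set. The hardware evaluates the guards concurrently rather than building a mask in a loop; the sketch only shows the decision logic.

    /* Sequential selection (3.3.1): probe queues from highest priority
       down; decision time grows with the priority level selected. */
    int rm_sequential(const queue_t q[NUM_PRIO]) {
        for (int level = 0; level < NUM_PRIO; level++)
            if (q[level].count > 0) return level;
        return -1;
    }

    /* Parallel selection (3.3.2): conceptually, all guards are evaluated
       at once; a priority encoder then picks the best true guard. */
    int rm_parallel(const queue_t q[NUM_PRIO]) {
        unsigned mask = 0;
        for (int level = 0; level < NUM_PRIO; level++)
            if (q[level].count > 0) mask |= 1u << level;      /* guard bits  */
        return mask ? __builtin_ctz(mask) : -1;  /* first set bit (GCC/Clang) */
    }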

3.3.4 EDF and RM: Binary Heap

The basic design is shown in Figure 5. Each priority queue empties into a child node of the heap structure. Pairs of nodes are connected by internal nodes, recursively up to the root. The root node interacts with the processor, providing the next job scheduled for execution.

Figure 5. Heap scheduler: distributor, heap, processor.

Operation of the structure is as follows. First, jobs enter the system and are routed to the appropriate queue. For an RM scheme, jobs are placed in the queue corresponding to their priority level. In EDF, the jobs are distributed evenly across the queues to balance the heap. Some time after entering the queue, a job will arrive at a child node of the heap. If the node is ready to accept a new job, it reads the job from the queue. Each node keeps track of its own job and the priority/deadline of its parent's job. Should the new job arriving at a node be of higher priority/earlier deadline than the parent's, the jobs are swapped. In this way, the highest-priority job in the system will bubble up to the root node. Since several jobs may be advancing through the system simultaneously, arbitration is necessary at the internal nodes. A parent node will arbitrate between two children requesting swaps concurrently. When one swap is accepted, the parent updates each child with the priority of its new job. This design aims for high throughput and high schedulability. Of the designs presented, it is the only one capable of an EDF scheme. Furthermore, the system can easily be designed to select between EDF and RM scheduling at runtime, with little degradation in performance.
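A rough C sketch of the node-local swap rule just described is shown below, again with our own names. Each node holds at most one job; a swap occurs when a child's job beats its parent's under the active discipline (fixed priority for RM, deadline for EDF). In the hardware, the nodes run concurrently and a parent arbitrates between its two children; here the rule is applied sequentially for clarity.

    #include <stdbool.h>

    typedef struct node {
        job_t job;
        bool  valid;                   /* does this node hold a job?   */
        struct node *left, *right;     /* children; NULL at the leaves */
    } node_t;

    /* RM compares fixed priorities; EDF compares deadlines. */
    static bool beats(const job_t *a, const job_t *b, bool edf) {
        return edf ? (a->deadline < b->deadline)
                   : (a->priority < b->priority);
    }

    /* One swap step between a parent and one child: if the child's job
       beats the parent's (or the parent is empty), exchange contents.
       Repeated application bubbles the best job up toward the root. */
    void maybe_swap(node_t *parent, node_t *child, bool edf) {
        if (!child || !child->valid) return;
        if (parent->valid && !beats(&child->job, &parent->job, edf)) return;
        job_t tj = parent->job;      bool tv = parent->valid;
        parent->job = child->job;    parent->valid = child->valid;
        child->job  = tj;            child->valid  = tv;
    }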
3.4 Multiprocessor Configurations

Adapting software schedulers to a multiprocessor environment can be a complex process. A simple approach is to map certain priority levels to each processor, e.g., even priorities map to one processor, odd priorities to another. The process can be made easier if the job distribution is known beforehand, which may not be the case in a packet processor. In contrast, every hardware scheduler discussed here can easily be mapped to a multiprocessor configuration. Between the scheduler and the processors, a distributor can be introduced. When a processor is empty, it queries the distributor for the next available job. This design prevents processors from idling when jobs are available.

4. Results

4.1 Experimental Setup

Each scheduler was designed, implemented, tested, and simulated using the Haste/TiDE toolflow (formerly Tangram) from Philips/Handshake Solutions [1], described earlier in subsection 2.3. Jobs were generated to simulate a bursty environment. Latencies were measured as the time between the introduction of a job to the system and the time the job becomes available for execution at the processor. The properties of the jobs are listed in Table 2. The execution cost of a job was randomized, generally between 20 and 30 clock cycles, depending on priority. Over 2500 jobs were executed during a 0.5 ms interval for this simulation. For this example, eight priority levels were specified. For packet processing, a typical embedded multimedia processor running at 40 MHz was assumed, which implies a processor cycle of 25 ns duration. All latencies reported are in units of this cycle time.

Table 2. Packet distribution and execution time in cycles

    Priority level          1   2   3   4   5   6   7   8
    Min execution time     15  20  15  15  20  25  20  30
    Max execution time     20  30  25  30  30  45  40  50
    Percent of all jobs    10  10  15  15  15  15  10  10

The following schedulers were implemented: (i) Queue: a simple FIFO queue; (ii) Static: the static scheduler of Section 3.2; (iii) Sequential: a dynamic RM scheduler using a series of sequential if-then-else constructs; (iv) Parallel: a dynamic RM scheduler using a select construct (a parallel construct similar to a case statement); (v) Early: the same as Parallel, but the scheduling of a job is done in parallel with the execution of the current job; and (vi) Heap: the heap scheduler of Section 3.3.4. In addition, for comparison, a software scheduler was also simulated. The decision time for the software scheduler was assumed to be constant at 20 cycles. This assumption is actually quite favorable to the software scheduler; scheduling decisions in real-world implementations can take hundreds to thousands of cycles. Therefore, the relative gains obtained by our hardware schedulers will in practice be substantially higher than reported here.

4.2 Performance: Scheduling Latencies

Table 1 shows the average latency of jobs at each priority level. A lower number indicates a higher priority level (i.e., level 1 is the highest priority). Latencies listed are in units of the embedded processor's cycle time, and indicate the average time from the moment a job is enqueued to the instant it reaches the processor (but before it is executed).

Table 1. Job latency to the processor (in cycles) for each scheduler and priority level

    Scheduler      1     2     3     4     5     6     7     8     Avg
    Queue        104   103   107   103   109   109   107   108    107
    Static        52    54    55    72    95   113   194   184     99
    Sequential    14    17    21    31    52   117   214   303     90
    Parallel      14    16    21    31    51   115   214   300     89
    Early         31    34    38    49    65   116   183   246     91
    Heap          15    17    22    32    52   116   214   293     89
    Software      95   118   160   254   470  1090  2460  7724   1363

The basic queue, which was used as a baseline, has a latency of around 107 cycles per job. The software scheduler has a much higher job latency: 1363 cycles per job. The table demonstrates the effectiveness of our scheduler hardware in prioritizing jobs. The highest-priority jobs experience significantly shortened latencies: from 14 to 31 cycles for the dynamic schedulers, compared with 52 cycles for the static scheduler and 104 cycles for the simple queue. Since these times include a blocking delay due to nonpreemptive scheduling, i.e., the time for which the processor is tied up with the previous job, the actual scheduler latencies are even shorter than 14-31 cycles. A key observation is that all schedulers see a decrease in latency for jobs until priority level 6. With the exception of Early, every dynamic scheduler had an average latency reduction of 86% or more for its highest-priority jobs. The Early scheduler showed a latency reduction of 70%. This was at the expense of lower-priority jobs, which saw up to a 3x increase in latency.

The rightmost column in Table 1 shows the average latency of all tasks in the system. Amongst our schedulers, the worst performer in this category is the static scheduler: queries to empty queues slowed it down, giving it an average latency worse than that of the dynamic schedulers. Performing best in this category are the Heap and Parallel schedulers.

4.3 Area and Throughput Overheads

The cost of using a hardware scheduler manifests itself primarily in area and throughput. Table 3 shows the throughput degradation and area increase due to the additional scheduling logic in hardware.

Table 3. Maximum throughput and total area for each scheduler

    Scheduler     Throughput (items/µs)    Area (µm²)
    Queue               10.22                 2729
    Static               9.58                11359
    Sequential           9.58                10995
    Parallel             9.80                10816
    Early               10.22                10821
    Heap                10.05                24768

The static scheduler and the sequential scheduler each saw a 6% reduction in overall throughput; both are limited by the probing of empty queues. Because the scheduling time of Parallel occurs between the processing of jobs, its throughput is also degraded. Heap saw a less significant decline in throughput because most scheduling occurs in the background, although individual jobs have a longer path from entrance to execution than in the basic queue. Early saw no change in throughput, as scheduling occurred during execution. The final column in the table lists the chip area consumed by each of the hardware accelerator implementations. Compared to the size of most embedded processors (e.g., 5-20 mm²) [13], this area overhead (less than 0.03 mm²) is quite negligible.

5. Conclusion

This paper proposed several asynchronous hardware schedulers as a replacement for software scheduling in real-time packet processing systems. Our results showed improved response times in all cases for high-priority tasks. Each scheduler occupies a niche in the design space, achieving high performance in one area at the expense of another (area, throughput, latency). In ongoing and future work, we plan to conduct further analysis and refinement of the scheduler designs. In addition, we are developing performance bounds and heuristics to help choose the optimal scheduler. We also plan to expand our simulations to include multiprocessor implementations. An intriguing extension to this work is to explore the potential for dynamic voltage scaling in a hardware scheduling system. We will extend our scheduling approach to take this new dimension into account.

References

[1] Handshake Solutions, a Philips subsidiary. http://www.handshakesolutions.com/.
[2] Int. Technology Roadmap for Semiconductors. Overall Roadmap Technology Characteristics. http://public.itrs.net.
[3] C. H. K. van Berkel, M. B. Josephs, and S. M. Nowick. Scanning the technology: Applications of asynchronous circuits. Proceedings of the IEEE, 87(2):223-233, Feb. 1999.
[4] R. Bhagwan and B. Lin. Fast and scalable priority queue architecture for high-speed network switches. In Proc. INFOCOM, pages 538-547, 2000.
[5] A. Davis and S. M. Nowick. An introduction to asynchronous circuit design. In A. Kent and J. G. Williams, editors, The Encyclopedia of Computer Science and Technology, volume 38. Marcel Dekker, New York, Feb. 1998.
[6] C. A. R. Hoare. Communicating Sequential Processes. Prentice-Hall, 1985.
[7] J. Kessels, T. Kramer, G. den Besten, A. Peeters, and V. Timm. Applying asynchronous circuits in contactless smart cards. In Proc. Int. Symp. on Advanced Research in Asynchronous Circuits and Systems, pages 36-44. IEEE Computer Society Press, Apr. 2000.
[8] T. Kurita, S. Iai, and N. Kitawaki. Effects of transmission delay in audiovisual communication. Electronics and Communications in Japan, 77(3):63-74, 1995.
[9] P. Lavoie and Y. Savaria. A systolic architecture for fast stack sequential decoders. IEEE Trans. on Communications, 42(2-4):324-334, Feb.-Apr. 1994.
[10] C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogramming in a hard-real-time environment. J. ACM, 20(1):46-61, 1973.
[11] S.-W. Moon, J. Rexford, and K. G. Shin. Scalable hardware priority queue architectures for high-speed packet switches. IEEE Transactions on Computers, 49(11):1215-1227, 2000.
[12] D. Picker and R. Fellman. A VLSI priority packet queue with inheritance and overwrite. IEEE Transactions on VLSI Systems, 3(2):245-253, 1995.
[13] S. Segars. The ARM9 family: high performance microprocessors for embedded applications. In Proc. Intl. Conference on Computer Design, pages 230-235, 1998.
[14] K. Toda, K. Nishida, E. Takahashi, N. Michell, and Y. Yamaguchi. Design and implementation of a priority forwarding router chip for real-time interconnection networks. International Journal on Mini and Microcomputers, 17(1):42-51, 1995.