Design Trade-offs in Customized On-chip Crossbar Schedulers

Size: px

Start display at page:

Download "Design Trade-offs in Customized On-chip Crossbar Schedulers"

Reginald Hicks
5 years ago
Views:

1 J Sign Process Syst () 8:9 8 DOI.7/s-8--x Design Trade-offs in Customized On-chi Crossbar Schedulers Jae Young Hur Stehan Wong Todor Stefanov Received: October 7 / Revised: June 8 / cceted: ugust 8 / Published online: Setember 8 The uthor(s) 8. This article is ublished with oen access at Sringerlink.com bstract In this aer, we resent a design and an analysis of customized crossbar schedulers for reconfigurable on-chi crossbar networks. In order to alleviate the scalability roblem in a conventional crossbar network, we roose adative schedulers on customized crossbar orts. Secifically, we resent a scheduler with a weighted round robin arbitration scheme that takes into account the bandwidth requirements of secific alications. In addition, we roose the sharing of schedulers among multile orts in order to reduce the imlementation cost. The roosed schedulers arbitrate on-demand (at design time) interconnects and adhere to the link bandwidth requirements, where hysical toologies are identical to logical toologies for given alications. Considering conventional crossbar schedulers as reference designs, a comarative erformance analysis is conducted. The hardware scheduler modules are arameterized. Exeriments with ractical alications show that our custom schedulers occuy u to 8% less area, and maintain better erformance comared to the reference schedulers. J. Y. Hur (B) S. Wong T. Stefanov Comuter Engineering Lab., TU Delft, The Netherlands J.Y.Hur@ewi.tudelft.nl URL: htt://ce.et.tudelft.nl S. Wong J.S.S.M.Wong@tudelft.nl T. Stefanov Leiden Embedded Research Center, Leiden University, Leiden, The Netherlands stefanov@liacs.nl URL: htt:// Keywords Interconnection architectures Schedulers Network toology Reconfigurable hardware Introduction It is a well-known fact that a crossbar network rovides high network erformance and minimum network congestion. The non-blocking nature of communication and the relatively simle imlementation makes the crossbar oular as an internet switch (Cisco Systems, Inc., htt:// tyical crossbar consists of a scheduler and a switch fabric. The scheduler lays a key role in achieving high network erformance and becomes more imortant as the size of the network increases. commercial crossbar scheduler tyically accommodates an arbiter er inut/outut ort and each arbiter concurrently arbitrates the incoming ackets []. In these fully arallel schedulers, all-to-all connections are required to be accommodated since traffic atterns are in most cases unknown. Nevertheless, a major bottleneck of the fully arallel scheduler is the high cost due to the increasing amount of wires as the number of orts grows. Figure deicts the area of the islip crossbar scheduler [], which is widely used for the commercial crossbar switches. s the number of orts increases, the area of the scheduler increases in an unscalable manner. This is mainly due to the all-to-all interconnects inside the scheduler module. In addition, the crossbar scheduler is an imortant basic building block for modern networks-on-chi (NoC) []. The scheduler in an on-chi router accommodates all-toall connections. In many cases, such schedulers contain a single central arbiter that sequentially arbitrates a single request at a time, while data transmission can be

2 7 J Sign Process Syst () 8:9 8 Number of gates ( x ) 7 rea of i SLIP Scheduler 8 8 Number of orts Figure rea of crossbar scheduler []. arallel. Due to this, the arbitration latency increases as the number of crossbar orts increases. This work alleviates these roblems by establishing on-demand toologies using a crossbar in a reconfigurable multirocessor system-on-chi latform. This work is motivated by the following observations: The logical toology and traffic information can be derived from the arallel alication secification. Communication atterns of different alications reresent different logical toologies. Figure deicts task grahs of realistic alications taken from [ ], where numbers on links reresent the traffic loads (or required bandwidth). The numbers between braces in the deicted toologies indicate the number of nodes and the number of required links. s an examle, MJPEG{,} indicates that the MJPEG alication requires nodes and links. s deicted in Fig., logical toologies are alication-secific and require only a small ortion of all-to-all communications. It can be noted that we focus on communication erformance. Detailed descritions of the comutation nodes in Fig. are not relevant in this work. Single alications can be secified differently as observed in the MJPEG [Fig. (7) and (8)] and MPEG secifications [Fig. () and ()]. Moreover, the traffic load is differently distributed for different communication links as deicted in Fig. (see the numbers on the links). More than 7% of modern reconfigurable hardware, such as field-rogrammable gate arrays (FP- Gs), is reoccuied by millions of wire segments. s an examle, Virtex-II Pro device has nine metal layers to enhance the routing flexibility and for a designer to realize on-demand reconfigurability. In this aer, we resent a systematic design, an analysis, and an imlementation of novel alicationsecific crossbar schedulers. Our custom schedulers arbitrate only the necessary interconnects instead of all-to-all interconnects. In addition, our weighted round-robin scheduler can be adated to the traffic requirements known at design time. The resented vu au cu rast. 9 mem mem mem idct ds us bab risc idct iq mc () MPEG {, } transfer Decoder mbintra 88 () MPEG {, } 7 Predict _acdc 8 In mem hs In mem vs Jug Jug mem O dis () PIP {8, 8} Coy 8 Coy 8 Init 8 HPF LPF HPF Coy 9 9 Coy Coy 9 LPF Coy Coy HPF Sink Coy 9 Coy 9 Coy Sink LPF Sink 9 Coy Sink () Wavelet {, } Run le vld 7 7 dec iquan idct 7 Inv U cd arm scan sam red Strie 9 Vo mem rec ad Vo mem 9 () VOPD {, } Video in VLE Video out 9 in nr mem 8 9 vs hs mem 9 hvs mem 9 jug jug 9 se blend () MWD {, } Video in (7) MJPEG {,7} (8) MJPEG {, } 9 VLE Video out Figure Logical toologies with different traffic load distribution in ractical alications [ ].

3 J Sign Process Syst () 8:9 8 7 schedulers combine the high erformance of a arallel scheduler and the reduced area of customized interconnects. The main contributions of this work are: We roose a weighted round robin custom arallel scheduler (WCPS). The resented scheduler is imlemented with on-demand interconnects, where the hysical toologies are identical to the logical toologies for an alication. Exeriments on realistic alications indicate that our WCPS scheme erforms better than reference schedulers with moderate area overheads. We roose a custom arallel scheduler with shared arbitration scheme (SCPS). Exeriments show that the imlementation cost can be reduced with moderate erformance overheads, when the number of links er ort is sufficiently large. The organization of this aer is as follows. In Section, related work is resented. The roosed scheduler designs are described in Section. In Section, we resent the erformance evaluation. The hardware imlementation results are resented in Section. Finally, conclusions are drawn in Section. Related Work Our work is based on the general aroach for ondemand reconfigurable networks [, ]. In this aer, we resent a custom crossbar scheduler utilizing on-demand reconfigurable interconnects. Numerous NoCs targeting SICs (surveyed in []) emloy rigid underlying hysical networks. Tyically, acket routers constitute tiled NoC architectures. Each acket router accommodates a crossbar switch fabric and a scheduler with internally all-to-all hysical interconnects. Our scheduler is different from the schedulers in SIC-targeted NoC routers, since on-demand toology is established in our scheduler by utilizing the reconfigurability of a reconfigurable hardware. NoCs targeting FPGs [ 8] emloy fixed toologies defined at design-time. In [ 8], acket switched networks are resented and each router contains a full crossbar fabric. The toology in this related work is defined by the interconnections between routers or switches. The scheduler in [7] accommodates an arbiter er ort, which is similar to our aroach. In [7], single Dmesh acket router for an 8-bit flit occuies slices and block memories (BRMs) in a Virtex-II Pro (xcv) device. Our work is close to [8], in which a toology adative arameterized network comonent is resented. While the crossbar interconnects inside a router of [7, 8] are still all-to-all, the hysical toology of our crossbar network is identical to the on-demand toology of the alication. In addition, the toology of our work is defined by the direct interconnection between rocessors. In other words, our custom crossbar constitutes a system network. In [9], the crossbar network is imlemented utilizing a modern artial and/or dynamic reconfiguration technology. In [9], a large-sized (98 98 bits) crossbar network utilizing native rogrammable interconnects and look-u tables (LUTs) is resented. However, allto-all crossbars in [9] are not scalable, esecially for large-sized networks. Our work differs from [9] inthat our network is customized in order to alleviate the scalability roblem. In addition, our network is imlemented in a fully synthesizable VHDL. Our imlemented network is technology-indeendent, though we target FPGs in this work. Recently, a coule of alication-secific NoC designs were roosed. In [], a multi-ho router network customization is resented, whereas our network is single-ho based. In [, ], an internal STbus crossbar customization is resented. Our work is similar to [, ] in that the crossbar is customized. In [], a design method to generate a crossbar is resented, in which an alication is traced (in simulation) and the grah clustering technique is used to generate the crossbar network. Our design method differs from [ ] in that our custom crossbar is generated, where the hysical toology is identical to the on-demand toology of an alication. Our arbitration scheme can also be configured with regards to the traffic demand, where the hysical toology inside the scheduler is again identical to the required toology. In [], a design method to synthesize an alication-secific bus matrix is resented. Our work is similar to [] inthat the alication-secific interconnects are established between masters and slaves. Furthermore, our work is similar in that a shared arbitration method is resented. However, our design method differs in that the on-demand toology information and the traffic bandwidth information are extracted from the alication secification ste. On the other hand, in [], a simulation-based traffic trace is conducted to synthesize the alication-secific toology, which requires hours of design sace exloration time. Finally, a erformance comarison between the full crossbar and the artial bus matrix is not resented. Finally, our crossbar network is similar to the virtual outut queues (VO) in that the hysical FIFOs are established at the inut orts. The VO-based switch such as islip [] scheduler is not oular in the multi-ho NoC due to the high cost. We highlight that

4 7 J Sign Process Syst () 8:9 8 our crossbar interconnects are single-ho, not multiho NoC. In other words, a crossbar (with single-ho latency) constitutes a system network in the multirocessor system on chi latform. Our alicationsecific network combines high erformance of a VO-based crossbar and the low cost of customized interconnects. In [], a fast round-robin crossbar scheduler is designed and imlemented for high erformance comuter networks. In the conventional (weighted or riority-based) round-robin schedulers and [], allto-all interconnects are established because the traffic attern is unknown. These conventional full crossbar schedulers accommodate the circuitry to arbitrate all ossible requests from all orts. The major difference with our work is that our scheduler is alicationsecific. In the resented design flow, only necessary interconnects are systematically established based on the traffic atterns that an alication exhibits. Moreover, the arbitration is only erformed over actually established interconnects, instead of conventional allto-all interconnects. Our custom scheduling scheme differs from traditional traffic-secific scheduling schemes, such as a weighted round-robin, in that our scheduler does not arbitrate unnecessary interconnects. ccordingly, the cost is reduced by systematically establishing only necessary FIFOs. Customized On-chi Crossbar Scheduler s mentioned earlier, our objective is the design of customized reconfigurable crossbar schedulers in order to reduce the area comared to a fully arallel scheduler and increase the erformance comared to a sequential scheduler. The same crossbar scheduler is required to dynamically generate the control signals to configure the switch fabric within a crossbar.. Design Flow Our scheduler has been develoed to be integrated as modular communication comonents in the ESPM tool chain as deicted in Fig.. Details of the ES- PM design methodology can be found in [ 8]. In ESPM, three inut secifications are required, namely alication/maing/latform secification in XML. n alication is secified as a Kahn Process Network (KPN). In this work, the KPN is considered as a rogramming model. KPN is a network of concurrent rocesses that communicate over unbounded FIFO channels and synchronize by a blocking read on an emty FIFO. The KPN is a suitable model of Platform sec. P P P P Parameters set Data width = bits Number of rocessors = Crossbar network Write into FIFOs Processor Processor Processor Processor Maing sec. P P P P Video in/out VLE lication sec. (KPN) Video in/out Parameters set Network tye = crossbar Scheduler = custom arallel Toology table ESPM / XPS Tools Read request Data scheduler rbiter rbiter rbiter rbiter Control signals switch VLE Customized scheduler (netlist) Figure ESPM design flow [ 8]. FIFOs FIFOs FIFOs FIFOs lication (Matlab/C) COMPN YPI Parameters set Scheduler weight table bitstream IP cores library FPG comutation on to of the resented crossbar interconnection network, due to the following facts: The synchronization scheme is relatively simle. Processors synchronize only with the full/emty status of the hardware FIFOs. Subsequently, a arallel rogrammer does not need to exlicitly handle the synchronization, since the synchronization is inherently suorted by the hardware FIFOs. The system eliminates the traditional head of the line (HOL) blocking roblem that all ackets must wait behind the contended acket in the inut queue. This is due to the fact that a hysical FIFO is established for a logical channel in a dedicated manner. KPN secification is automatically generated from a sequential Matlab rogram using the COMPN tool [9]. We define the network toology in the KPN secification as the logical toology for an alication, which consists of tasks and logical channels. Exemlary logical toologies are deicted in Fig.. Single task or multile tasks are assigned to a hysical rocessor in the maing secification. The network tye and the ort maing information are secified in the latform secification. Figure deicts how custom crossbars can be imlemented from a four-node MJPEG alication secification. In the latform secification, four rocessors are ort-maed on a crossbar. From the maing and latform secifications, ort-maed network toology is extracted as a static arameter and assed to ESPM. We define the extracted ortmaed network toology as the on-demand toology

5 J Sign Process Syst () 8:9 8 7 that an alication requires, which consists of rocessors and hysical links. dditionally, the traffic bandwidth information is obtained from the YPI tool []. Subsequently, ESPM refines the abstract latform model to an elaborate arameterized RTL (hardware) and C/C++ (software) models, which are inuts to the commercial Xilinx Platform Studio (XPS) tool. The XPS tool generates the on-demand netlist (deicted in Fig. ) with regard to the arameters assed from the inut secifications. Our system model is based on the local write, remote read scheme. The rocessor issues the request that designates the target ort index and target FIFO index. Subsequently, the FIFO resonds with a data stream to the rocessor. The crossbar network transfers the requests (from rocessors to FIFOs) and thedata (from FIFOs to rocessors). The directions of requests (from left to right) and the direction of the data transfers (from right to left) are deicted in Fig.. Finally, a bitstream is generated for the FPG rototye board to check the functionality and the erformance.. Reference Scheduling Schemes We consider a sequential scheduler (SS) and a fully arallel scheduler (FPS) as references to comare with our custom schedulers. Figure deicts the behavior of the SS, FPS, and our custom schedulers for the MJPEG alication in Fig. (8). Figure () deicts the logical toology (or data flow grah) with nodes and channels. In our model of comutation, each communication channel is maed onto the FIFOs labeled by F F. Figure () deicts the system model. The hysical system consists of nodes and links (or FIFOs). The ith node is connected to rocessor ort i and FIFO ort i. For examle, in the sixth node, rocessor P is connected to the rocessor ort and FIFO F is connected to the FIFO ort, as deicted in Fig. (). FIFO index corresonds to the channel index, as deicted in Fig. (). The crossbar network transfers the requests (from rocessors to FIFOs) and the data (from FIFOs to rocessors). For examle, a bold line in Fig. () reresents that rocessor P sends the request signal to FIFO F. Subsequently, FIFO F sends data to rocessor P. Note that data in FIFO F is from rocessor P. Figure () deicts the biartite grah based on the system model. The thick and thin lines deict all ossible requests according to the toology in Fig. (). The thick lines reresent an examle request attern. In this examle, four rocessors request to four FIFOs. The numbers on the link reresent the relative amount of traffic. For examle, the link between and is times less utilized than the link between and. Figures () (8) deict the cycle-by-cycle behavior of five different schedulers for the request atterns (bold links) in Fig. (). In order to establish the link between the rocessor and the target FIFO, three stes are required. First, a rocessor requests to an arbiter. Second, the arbiter grants the request when the target ort is idle and the round-robin ointer oints to the requesting rocessor. Third, the request is acceted when the target FIFO contains data. These oerations are denoted by R, G, and. The round-robin ointer is denoted by the oval. For the sake of simlicity, the data is assumed to be requested by a rocessor in the first cycle. The arbiter is assumed to erform a circular (weighted) round-robin arbitration in the order of,,, and so on. fter the request is granted, a link between a rocessor and a FIFO ort is established using a handshaking rotocol, which is assumed to take cycles. The bold lines in Fig. () (8) reresent actual data transmission, which is assumed to take cycles. Figure () deicts the behavior of a SS, where one ort is arbitrated at a time. crossbar contains a central arbiter that sequentially grants a single request at a time, while data transmission can be arallel. request is served after a request in the revious ort index is arbitrated. Subsequently, 8 cycles are required to serve those requests, as deicted in Fig. (). Figure () deicts FPS, where homogeneous arbiters are located in each ort. Unlike the SS, multile requests can be arbitrated in arallel. Each arbiter checks for all orts whether there is a request or not. Consequently, cycles are required in total, as deicted in Fig. (). Our FPS imlementation is similar to the islip scheduler [], in that circular round-robin ointer is udated when the request is granted. Similar to islip scheduler, allto-all interconnects are established. The islip scheduler is designed for the inut-queued acket switch. Our FPS differs from the islip scheduler in that the FPS has been imlemented for on-chi multirocessor systems with distributed memories. dditionally, while the islip scheduler [] requires two stages of arbiter arrays, our FPS requires a single stage of the arbiter array. s Fig. () deicts, FPS erforms better than SS, since concurrent requests can be served in arallel.. Round Robin Custom Scheduler The round-robin custom arallel scheduler (CPS) scheme is resented in []. The CPS is similar to the FPS in that the scheduler consists of arbiter arrays. In the CPS, however, the round-robin ointer udate oeration is erformed only for on-demand interconnects.

6 7 J Sign Process Syst () 8:9 8 Ports P F F F P P P Processors P P P Read request to remote FIFOs Data transfer P P F F F P P F F F P F Index F Interconnection network P i i th rocessor Write into FIFOs i i th rocessor ort () Logical toology () System model i i th FIFO ort () Data requests examle ( nodes, channels) F i i th FIFO ( nodes, FIFOs) (Requests in bold arrows) R : data request G : request grant : accet : actual data transfer : round -robin ointer (by rocessor) (by arbiter) (by target FIFO) (between rocessor and FIFO) cycles scheme ointer R () Single arbiter G Processor P (SS) requests to FIFO F 8 that is connected to ort. R G Single request R G R G is arbitrated R G at a time. R G R G () Full arallel R G R G R G (FPS) Multile requests can be R G R G R G arbitrated in arallel. R G R G R G ll-to-all interconnects are established. R G R G () () Custom arallel R G R G R G (CPS) RG RG Only necessary interconnects RG RG are established. RG (7) Weighted round-robin custom (WCPS) Established RG RG interconnects R G RG RG The round-robin ointer oints to. Parallel arbitration logic that arbitrates only established interconnects (8) Shared custom ointer (SCPS) ointer R G Partly arallel arbitration ointer Processor P requests Request from rocessor P is granted, Request from rocessor P is acceted, because ort Physical link between and is established. to FIFO F that is because the round-robin ointer oints to Processor P reads data from FIFO F. connected to ort. is idle and target FIFO F contains data.. Ports Ports Ports Ports F F F7 F8 F9 F F F Figure Different scheduling schemes. F F F F7 F8 F9 FIFOs Processor orts FIFO orts We exloit the fact that the alication is secified by a directed grah, in which each task has ossibly a different number of connected channels. s a design technique, each arbiter is arameterized with regard to the on-demand logical toology. lication-secific and differently sized arbiters ensure that the toology of the hysical interconnects is identical to the on-demand toology secified by the alication artitioning. Given the on-demand toologies from the alication secifications, our CPS oerates as follows:. Request: rocessor issues a request, by designating the target FIFO ort and FIFO index.. Grant: If there is a request in the round-robin ointer and the target FIFO ort is idle, the request is validated.. ccet: The target FIFO status is checked. If the target FIFO is not emty, the request is acceted and the hysical link is established. The roundrobin ointer is udated to the one that aears next in the round-robin schedule, where the roundrobin schedule is determined by the on-demand toology for an alication. If the ointed request is a Clear_Request, the link is cleared. If there is no request, the round-robin ointer

7 J Sign Process Syst () 8:9 8 7 is incremented. Figure () deicts our examle scenario for the CPS. Each arbiter checks if there is a request for required links. s an examle, has five robable requests from. Therefore, the CPS arbiter at services five links. Similarly, has two robable requests from and. Therefore, the CPS arbiter at services two links. For comarison, the FPS arbiter [Fig. ()] services links at each ort. s Fig. () indicates, cycles are required to serve the request atterns in our examle. In general, CPS erforms better than FPS, since the request search sace of CPS is a subset of the full search sace of the FPS. rea reduction also can be exected, since only on-demand links are hysically established. Moreover, only one link is often connected to an arbiter, indicating that the arbiter erforms only handshaking oeration and imlementation cost can be reduced. dditionally, CPS erforms significantly better than SS, since the arbitration is erformed in arallel. In many cases, CPS occuies more area than SS. The area overhead issue is discussed in Section.. Weighted Round Robin Scheduler In the revious section, we resented a toologically customized scheduler (CPS) scheme. The CPS erforms better and rovides lower cost for an on-demand toology, comared to reference schedulers []. In the CPS scheme, an arbitration is erformed in a simle round-robin fashion and the differently utilized traffic bandwidth is not considered. In this section, we roose a weighted round-robin custom scheduling scheme (WCPS) in order to adat the scheduler to the traffic requirements... Weighted Round Robin Custom Scheduler (WCPS) Our roosed WCPS scheme is similar to the CPS in that the round-robin ointer is udated only for on-demand interconnects. Similarly, the toology of the hysical interconnects is also identical to the ondemand toology secified in the ESPM design flow. The difference between WCPS and CPS is that the ointer udate oeration in WCPS is erformed with regard to weights. The weight refers to the relative number of utilized tokens (or ackets), which corresonds to traffic bandwidth requirements. In this context, we define the weight by the relative number of requests for tokens utilized in a communication link. token refers to a set of data words, which is our rimitive communication unit. We exloit the fact that the weights can be determined by alication rofiling. In this work, we use the YPI tool [] (see Fig. ). s a design technique, each arbiter is arameterized with resect to the on-demand toology together with a weight distribution. By using differently sized arbiters and alication-secific weight information, we can increase the network erformance. This is due to the fact that the scheduler checks links with a high robability of requests more frequently. The main idea of the WCPS to match hysical network bandwidth to logical traffic bandwidth. Only when the traffic is uniformly distributed, our WCPS is identical to CPS. The novelty of our WCPS lies in that the arbitration is erformed over only actually established on-demand interconnects. This means that no hysical circuitry is established for unnecessary interconnects. It can be noted that our scheme of assigning weights is static. We are motivated by the fact that the traffic information can be statically derived from the alication secification. The traffic variation is not significant esecially in data-streaming alications, since the traffic attern is rather regular and eriodic. The resented design flow raidly instantiates on-demand interconnects and scheduler IPs, based on the extracted traffic information. In this work, we assume that the token (or acket) sizes for the alications in Fig. are similar (or same). Though the token size can be different in the ESPM design flow, the token sizes are similar for the alications considered in this work. s an examle, YPI [] rofiling result indicates that 97% of the Wavelet traffic and 8% of the MJPEG traffic contain identical-sized tokens... WCPS Examle In Fig. (), weights for each link are deicted as an examle. The number on the channels refers to the relative number of requests for tokens. s an examle, the weight of the links from to is, resectively. The weight of the link from to is. This means that the links from to are five times more utilized than the link from to. Therefore, the WCPS arbiter at checks five times more often than. We imlement WCPS using the weight table which contains a list with the sequence of rocessor orts that an arbiter checks. Figure () deicts the weight table for ort. The sequence is ermutated such that the same rocessor ort index is sarsely distributed. We use a weight counter for each rocessor ort index to handle the ermutation. s an examle, an arbitration sequence for ort is deicted in Fig. (). In this examle, the maximum weight is, indicating that there are five sub-rounds. Initially, the weight counters are

8 7 J Sign Process Syst () 8:9 8 rbiter for ort with weighted round-robin ointer () WCPS arbiter () Weight table for ort = = = = = Iterate st nd rd th th sub-round = = = = = = = = = = = = = = = = = = = = () Request check sequence Figure Weighted round robin custom arallel scheduler. = = = = = set by the weight table, as deicted in the rectangle in Fig. (). In each sub-round, the arbiter searches rocessor orts, which have the same counter values. Subsequently, the weight counter value is decremented. When all the weight counter values are zero, or all the sub-rounds are finished, then the counter values are set by the initial weight table. The rocess is iterated, similarly to a round robin aroach. Figure (7) deicts an examle scenario for the WCPS. The weight from to is five times greater than the weight from to. In this case, cycles are required to serve the request atterns. For comarison, the CPS requires cycles. The WCPS requires less cycles than the CPS, because the arbitration time at ort is reduced. In most cases, WCPS erforms better than CPS, since the arbitration is adatively erformed with regard to the traffic information. WCPS occuies more area than CPS, since the internal logic is relatively more comlex. The area overhead issue is also discussed in Section. scheme in order to rovide a design trade-off between erformance and cost... Shared Custom Parallel Scheduler (SCPS) In this work, we roose a custom scheduler with the shared arbitration (SCPS) scheme. In SCPS, multile arbiters are shared, such that each arbiter sequentially erforms arbitration while multile arbiters oerate in arallel. Our SCPS scheme combines the advantages of the low cost in SS and increased erformance in CPS. Figure deicts an examle of CPS and SCPS for a six-node MJPEG alication. Figure () deicts CPS, where an arbiter is established er ort. Figure () deicts SCPS, where arbiters for ort (, ), (, ), and (, ) are shared. The traffic requirement is also deicted in Fig., derived from the YPI tool []. is the total incoming traffic bandwidth or an arrival rate to the ort. We aim to reduce the imlementation cost, by sharing the arbiter resources. Figure indicates that the number of arbiters in SCPS is, while the number of arbiters in CPS is. This means the imlementation cost can be reduced. Our SCPS is also constructed over on-demand interconnects. Comared to CPS, the network erformance can be likely decreased. This is due to the fact that a shared arbiter is accommodated for multile orts and the round-robin search sace can be increased for the shared arbiters. Figure (8) deicts the examle scenario for the SCPS, where sequential arbiters oerate in arallel. s Fig. (8) shows, 9 cycles are required to serve the request atterns. For comarison, the CPS requires cycles and the WCPS requires cycles... Clustering Method In this section, we resent a method to cluster arbiters. Our aim is to minimize the erformance degradation by balancing the traffic bandwidth, comared to CPS.. Shared rbiters for Custom Crossbars In our (W)CPS scheme, an arbiter is accommodated at each crossbar ort in order to erform arallel arbitration. In addition, we considered that a single task is maed onto a single rocessor. However, the number of tasks (or rocesses) is often greater than the number of orts (or rocessors). In this case, it is required to ma multile tasks onto a single rocessor. Subsequently, the number of links er node increases and the imlementation cost of (W)CPS increases accordingly. In this section, we roose a novel shared arbitration () CPS /9 /9 /9 9/9 Figure Shared custom arallel scheduler. () SCPS /9 9/9

9 J Sign Process Syst () 8: In this work, we consider clusters of two arbiters only. Given a CPS scheme, our method can be described as follows:. Calculate cost function: Calculate the individual cost using a cost function for arbiter j,where j. is the number of CPS arbiters.. Sort: Sort arbiters in an increasing order, based on the derived cost function.. Iterate clustering: Iteratively cluster arbiters with the lowest and the highest cost function, in the sorted list. The cost function is a relative metric to reresent the utilization of an arbiter. We define the cost function for the arbiter j as the following: C { } links j j = U j, () where C{ j } refers to the cost function of a CPS arbiter in the j th ort. links j refers to the number of hysical links which are connected to the arbiter j. links j corresonds to the relative arbitration latency, as described in the next section. U j refers to the summation of the traffic for the arbiter j and corresonds to relative arbiter utilization. The individual channel utilization for U j is deicted in Fig.. s an examle, U is derived by ++++ [see Fig. ()]. We multily U 9 j by links j in order to reflect the actual utilization of an arbiter. s an examle in Fig., we cluster arbiters in the following way:. Calculate cost function: We calculate the cost function for each arbiter. s an examle, links = and U = ++++ for arbiter. Therefore C{} = ( ++++ ) =.. Similarly, 9 9 C{} = C{} = C{} = C{} =.9, and C{} =... Sort: rbiters are sorted in an increasing order. We obtain ({},{},{},{},{},{}). Iterate clustering: We cluster {,}, {,}, and {,}. In this way, a highly utilized arbiter and a less utilized arbiter can be clustered. lso, more than arbiters (or variable-sized arbiters) can be shared in a similar way, from the sorted list. Performance nalysis In this section, we resent comarative erformance evaluations for different schedulers. In Section., we define the erformance metric. Different crossbar schedulers are comared in terms of the service rates. In Section., we conduct the queueing modeling and comare the entire network erformance for considered alications.. Scheduler Performance.. Performance Metric In order to comare the relative scheduler erformance for given alications, we consider the following erformance metric for different schedulers: M scheduler = N i= μ i u i N, () where M scheduler is a erformance metric for each scheduler and refers to the relative service rate in tokens/s. μ i is the service rate of the arbiter for the ith logical channel. N is the total number of logical channels. u i refers to the normalized channel utilization for the ith channel. The u i is multilied by the arbiter service rate μ i, in order to reflect the actual amount of traffic for an alication. The service rate μ i can be modeled as: μ i = (T arbit + T transmit ), () where T arbit is an arbitration latency to establish a link. T transmit is the actual data transmission latency after the link is established. T transmit can be derived as Num_Word Clk sys, where Num_Word refers to the number of data words, or the token size. Clk sys refers to the system clock frequency... Crossbar Schedulers In this subsection, we resent a erformance comarison between five different schedulers. We can fairly comare different scheduling schemes, since only the arbitration latencies are different. T arbit in Eq. for different crossbar schedulers can be aroximated as follows: #orts C hand T arbit_ss = k (a) T arbit_fps = k #orts Clk sys + C hand Clk sys #links + Chand T arbit_cps = k Clk sys (b) (c)

10 78 J Sign Process Syst () 8:9 8 #links ( ) W std W max + C hand T arbit_wcps = k #links T arbit_scps = k Clk sys { (α Chand )+( α)c hand } (d) Clk sys, (e) T arbit_ss refers to the arbitration latency (in seconds) for the sequential scheduler (SS). k is the scaling factor to calibrate the hardware imlementation. The request check latency is modeled by #orts cycles. We divide by, since the circular round-robin ointer is statistically located in the middle of the search sace. In the SS, there is only a single arbiter in the system. Only after the current rocessor ort is served, the next ort is served. C hand refers to the handshaking latency in number of cycles. Therefore, we model these sequential oerations by multilying the handshaking latency C hand by #orts. T arbit_fps refers to the arbitration latency in the fully arallel scheduler (FPS). In the FPS, the arbiter at each ort obliviously checks for all orts. The request check latency is modeled by #orts cycles, similarly to the SS. However, multile requests can be concurrently served. Therefore, we model these arallel arbitration oerations, by adding #orts and the Chand. T arbit_cps refers to the arbitration latency for the custom arallel scheduler (CPS). Similarly to the FPS, we model the arbitration latency by adding the request check latency and the C hand. However, the request check latency is modeled by #links, since the actual arbitration is erformed for the only established hysical links, instead of all links. #links is equal or less than #orts. Therefore, it is obvious that T arbit_cps is less than T arbit_ss and T arbit_fps. Only if the required toology is all-to-all, then the T arbit_cps is equal to T arbit_fps. T arbit_wcps refers to the arbitration latency for the weighted round-robin scheduler (WCPS). In Eq. d, W std refers to the standard deviation of weights for connected links in an arbiter. W max refers to the maximum weight in the connected links to an arbiter. We divide by W max in order to normalize the weight. Comared to the roundrobin CPS, the arbitration latency can be reduced. s an examle, in Fig. (), for ort is W std W max.8, or.. We used the term W std W max to statistically model the reduced arbitration latency, due to the following facts. First, the higher weight deviation means that the variation of link bandwidths (that an alication requires) is high. Second, our scheduler allocates higher bandwidth (or time slots) to the links that require high bandwidth. Third, the traffic attern of our streamed multimedia alications is highly regular and eriodic. s an examle, the MJPEG alication is erformed block-wise, where each block contains 8 8 image ixels. Due to the regularity of the traffic attern, the statistical model can be simlified as shown in Eq. d. It must be noted that the W std is calculated for only connected links. In other words, the WCPS arbiter at each ort checks for only the necessary orts with different weights. T arbit_scps refers to the arbitration latency for the shared custom arallel scheduler (SCPS). In Eq. e, α is if the number of arbiters is. In this case, SCPS is the same as SS. α is if the number of arbiters is greater than. In this case, the SCPS oerates in a same way as CPS. The difference between CPS and SCPS is that the #links of a SCPS arbiter is in many cases greater than the #links of CPS arbiter. For examle, as deicted in Fig. (), #links for (,, ) are (,,). For comarison, as deicted in Fig. (), #links for ( ) are (,,,,,)... MJPEG Examle We consider the six-node MJPEG alication, deicted in Fig. (8). The different scheduling schemes are given in Fig. 7. The service rates for each scheme are derived as follows. We assume that the schedulers oerate at MHz and k k are. C hand is assumed to be cycles er token, since each of the request and the acknowledgement requires cycle, resectively. To obtain T transmit, it is assumed that each word takes cycle to transmit. SS: The service rate μ s er token for the centralized SS arbiter is derived as follows. Considering the single-word token communications, cycle or Num_Word =, T transmit is (in seconds). MHz Since C hand is two cycles and number of orts is, T arbit_ss is obtained from Eq. a and it is ( ) cycles. By substituting T MHz arbit_ss to Eq., μ s MHz can be derived by =. ( +)cycles tokens/s. The erformance metric M SS is derived from Eq.. s Fig. 7() deicts, there are channels, or N =. The relative channel utilization u is also deicted in Fig. 7(). s a result, M SS is derived by (. )( ) or. tokens/s, from Eq.. The derived M SS indicates the relative servicerateforthemjpegtraffic.

11 J Sign Process Syst () 8: µ s () SS µ (w)c µ (w)c µ (w)c µ (w)c µ (w)c µ(w)c () (W)CPS /9 /9 /9 /9 /9 /9 /9 /9 Figure 7 Different schedulers. µ f µ f µ f µ f µ f µ f () FPS µ sc µ sc µ sc () SCPS /9 /9 /9 /9 /9 /9 /9 /9 FPS: Similarly, a service rate μ f for each FPS MHz arbiter can be derived by =.7 ( ++)cycles tokens/s [see Fig. 7()]. M FPS is derived by (.7 )( ) or.8 tokens/s, from Eq.. CPS: service rate μ c for each CPS arbiter is determined by the toology [see Fig. 7()]. μ c for MHz an arbiter is = tokens/s. ( ++)cycles MHz Similarly, μ c, μ c, μ c, μ c is = ( ++)cycles tokens/s. μ c is tokens/s. s a result, M CPS is 7. tokens/s and is derived by the following equation: M CPS ( )( ) ( )( 8 9 ) + ( )( ) 9 9 =. WCPS: service rate μ wc for each WCPS arbiter is determined by the toology as well as the weight distribution [see Fig. 7()]. The weight distribution is deicted in Fig. (8). s an examle, the standard deviation W std of (,,,, ) is.9 for arbiter in Video_in node. The maximum W weight W max is. Subsequently, std W max is calculated by.9, or.. Therefore, μ wc for arbiter is MHz = ( (.)++)cycles tokens/s. For comarison, the service rate in CPS for arbiter is tokens/s. This means that the service rate of WCPS is % higher than the service rate of CPS for arbiter. On the other hand, the weight distribution is uniform for arbiters.for the orts, the standard deviation W std is, indicating that the arbitration is erformed in the same way as in the round-robin CPS. In case of, only single link is established. Subsequently, the service rates, μ wc μ wc are the same for both CPS and WCPS. M WCPS is 7. tokens/s and is derived by the following equation: M WCPS = ( )( ) ( )( 8 9 ) + ( )( ) 9 9 SCPS: service rate μ sc for each SCPS arbiter can be derived in a similar way as CPS [see Fig. 7()]. s described earlier, the difference between CPS and SCPS is the number of links that each arbiter handles. μ sc for an arbiter (for orts, ) MHz is = ( ++)cycles tokens/s. Similarly, μ sc = μ sc = tokens/s. M SCPS is. tokens/s and is derived by the following equation: ( )( ) ( )( ) M SCPS =. Similarly, the erformance metric for different schedulers with different alications is derived, as deicted in Fig. 8. The alication task grahs of the - node MPEG, the eight-node PIP, the -node VOPD, the -node MWD were taken from []. The task grah of the six-node MPEG secification was taken from []. The task grahs of the MJPEG and Wavelet alications were taken from []. When the token size is -word, our WCPS rovides better service rate by factor of comared to SS and comared to FPS. Our WCPS erforms % better than CPS for MPEG and Wavelet alications. The erformance imrovement of WCPS over CPS is relatively small for other alications due to the following reasons. First, the toology in ractical alications is simle in the sense that the average number of links er node is only.. In many cases, only a single link is connected to the crossbar ort, indicating that no arbitration is necessary for those orts. The SCPS is times better than SS and.7 times better than FPS. When the token size is - word, the CPS rovides on average % better service rate than SCPS. This erformance degradation of SCPS over CPS is due to the sharing arbiter resources. s Fig. 8() shows, when the token size is -words, our WCPS erforms still better than the reference schedulers, while the imrovement is significantly less than the case of the small-sized tokens. This is due.

12 8 J Sign Process Syst () 8:9 8 x tokens/s Relative service rate SS FPS CPS WCPS SCPS x tokens/s Relative service rate SS FPS CPS WCPS SCPS MJPEG{,7} MPEG {,} MJPEG {,} PIP {8,8} MWD {,} VOPD {,} MPEG {,} () Token size = word () Token size = words Figure 8 Relative service rate for different schedulers. Wavelet {,} MJPEG{,7} MPEG {,} MJPEG {,} PIP {8,8} MWD {,} VOPD {,} MPEG {,} Wavelet {,} to the fact that the token transmission latency is a dominant term to determine the service rate.. Network Performance In the revious section, the comarative erformance of different scheduler modules is resented. In this section, the entire network erformance is comared in terms of latency and throughut... ueueing Modeling We formulated a network erformance model to comare the relative latency and throughut. Our analysis is based on the queuing model [], since the queuing model rovides a reasonable fit to the reality with relatively simle formulation. Based on the general queuing model, the following assumtions are made: The system network conforms to the Jackson model []. Each queue behaves as an indeendent single server and the total network latency can be modeled as the combination of each service latency. Each server is analyzed by an (M/M/) queuing model. In other words, the incoming traffic obeys the Poisson distribution. The data arrivals occur randomly and the service time distribution is exonential. If the server is idle, a data in the queue is served immediately. The queue size is adequately large to avoid the stall of the data flow. The Jackson s oen queueing model is based on the network of queues [] and can be utilized as a modeling method in general, since most of NoC-based systems accommodate buffers (or queues) for the communication. The queueing model can be suitably alied to our system due to the following facts: The KPN model and the actual system are indeed a network of queues. The incoming data stream attern is statistically random. token in the FIFO is indeendently served by a single scheduler (or server) at each crossbar ort. In addition, the hysical queue size is sufficiently large. s an examle, we imlement a FIFO using (multile) embedded block RMs, where single block RM rimitive in our target device can accommodate -bit words, which is large enough. Consequently, the general network latency can be modeled as: T network = N i= i μ i i, () where T network is the total latency of the system network. N is the number of queueing systems. is the total incoming arrival token rate to the network (or outgoing rate from the network). i is the incoming arrival token rate to the ith queue. μ i is the service rate of the arbiter for the ith queue... Case Studies We studied four cases with two different network sizes and two different token sizes. First, we consider the sixnode MJPEG alication, deicted in Fig. (8) which requires a small network size. The ort-maed system

13 J Sign Process Syst () 8:9 8 8 Video in P P rocessor rocessor ort FIFO ort, ' ' ' ' ' ' µ µ µ µ µ µ c c c c c c /9 /9 /9 /9 / /9 () Network model for (W)CPS Figure 9 case study. 8 /9 /9 P P P P () Network of queues,, 7 9 ' µ sc legend Video out P queueing system 7 /9, 8 9 µ ' sc,, µ ' sc, () Network model for SCPS model is deicted in Fig. 9(). Considering rocessor P as a streamed data source, P generates the data inarateof (tokens/s). token rate in each queue is derived from the YPI rofiler [], as deicted in Fig. 9(). Figure 9() and () deict the network model for (W)CPS and SCPS. The service rate μ for each scheduler is derived in the same way as in Section.. with Eq.. The total network latency can be derived by substituting the service rate μ to Eq.. s a result, the network latencies are derived and deicted in Fig. (a). The erformance analysis indicates that the WCPS reduces latency by at least % than SS and at least % better than the FPS for all token rate ranges. lso, the erformance is better imroved as the token rate increases. Moreover, the network with our WCPS saturates at the bandwidth of tokens/s and the network with SS saturates at the bandwidth of tokens/s. Therefore, WCPS is. better than SS in terms of throughut. Similarly, our WCPS is better than FPS in terms of throughut. Comared to CPS, WCPS rovides marginal erformance imrovement. This is due to the fact that the weight variation is not high, as deicted in Fig. 9(). Comared to SCPS, WCPS reduces the latency by at least % and rovides % better throughut. Second, we consider the MJPEG case study for a large token size with Num_Word =. The network erformance can be derived in the similar way as in the revious case, while only the service rate differs. cycles MHz T transmit is derived by, because the Num_Word is. s a result, Fig. (b) shows that the WCPS erforms at least % better than SS and.% better than FPS for all ranges. The erformance imrovement Figure Network erformance Network latency (us) Token rate (X tokens/s) (a) Token size = word CPS FPS SS WCPS SCPS () -node MJPEG alication in Figure (8 ) 8 Network latency (us) 9 Token rate (X tokens/s) (b) Token size = words Network latency (us). Token rate (X tokens/s) (a) Token size = word CPS FPS SS WCPS SCPS () -node Wavelet alication in Figure ( ) 8 Network latency (us) 8 Token rate (X tokens/s) (b) Token size = words

14 8 J Sign Process Syst () 8:9 8 is smaller than the case of single-word token transactions, since T transmit is a dominant factor for the network latency, comared to T arbit. Third, we consider the -node Wavelet alication deicted in Fig. (), which requires a large-sized network and a single-word token size. s the number of crossbar orts increases, T arbit for SS and FPS roortionally increases. However, T arbit for (W)CPS does not increase, since the average number of orts for the round-robin ointer is.. In other words, on average. orts are only required to be arbitrated by an arbiter, instead of orts. Figure (a) deicts the network latency for single-word token transactions. The network latency of CPS is reduced by at least 8% comared to SS and at least 7% comared to FPS. The WCPS rovides % higher throughut than CPS. Finally, Fig. (b) deicts the network latency for -word token transactions in the Wavelet alication. CPS erforms at least % better than SS and % better than FPS. It can be suggested that the resented CPS, WCPS, SCPS schemes are more beneficial for small sized tokens communicated over large networks. Imlementation Results The aforementioned scheduler modules were imlemented in VHDL to integrate the resented network comonents in the ESPM design environment [ 8]. The resented schedulers are imlemented with arameterized arbiter arrays. The scheduler modules are generic in terms of data width, number of orts, on-demand toologies, and a traffic weight table. The switch module in [] is used as a common interconnects fabric and the communication controller in [7] is used as a common network interface. The functionality of the network is verified by VHDL simulations. In order to fairly comare different schedulers, we imlemented the schedulers such that k k = for Eq.. Imlementation details can be found in []. The exemlary behaviors of the imlemented schedulers are deicted in Fig.. The imlemented scheduler modules are comared in terms of area utilization. We exerimented with different task grah toologies of realistic alications, deicted in Fig.. The imlemented schedulers were synthesized using the Xilinx ISE 8. tool targeting the Virtex-II Pro (xcv-7-89) FPG and the areas were obtained and deicted in Fig. (). Our CPS requires on average 8% less area comared to the FPS. The CPS requires on average 8% more area comared to the network with SS. We consider that the area overhead of our CPS over SS is relatively less significant, since our target xcv device contains 9,8 slices and chi-wise overhead of CPS over SS is on average %. The area of our network is not only deendent on the number of nodes that determine its size but also on the network toology. It is observed that the higher area reduction is obtained as the network size increases. This is due to the fact that the average number of links er node is. and does not increase as the number of nodes increases. WCPS occuies on average % more area than CPS. The SCPS occuies % more area than CPS. s described earlier, this is due to the fact that in many cases only one link is connected to arbiters in CPS scheme. Due to this, the MJPEG {,7} SS FPS CPS WCPS SCPS MPEG {,} Number of slices Number of slices Number of slices MJPEG {,} PIP {8, 8} MWD {,} VOPD {,} MPEG {,} Wavelet {,} VERGE {, } SS FPS CPS WCPS SCPS () One-task to one-rocessor maing () Multile-to-one maing 8 7 SS FPS CPS WCPS SCPS Wavelet {, } MJPEG+Wavelet Figure Exerimental results.

15 J Sign Process Syst () 8:9 8 8 Init HPF Coy LPF () node-wavelet {, } Sink Video in Init Coy HPF LPF VLE Video out Sink () -node MJPEG + -node Wavelet {, } Figure Toology after multile tasks are maed onto a rocessor. requires 7% less area than the CPS scheme. Figure deicts the erformance metric defined in Section. for the five-node Wavelet reresentation. s Fig. () deicts, the WCPS rovides % better service rate for -word token sizes. The WCPS scheme also occuies % more area than the CPS, whereas the chi-wise overhead is %, for five-node Wavelet reresentation. The exerimental result suggests that when the number of links er ort increases, the SCPS can be beneficial, roviding a design trade-offs between cost and erformance. CPS arbiter logic can be greatly simlified, esecially when a single task is maed onto a rocessor. In addition, the single SCPS arbiter contains relatively more comlex decoding logic than single CPS arbiter. The exeriment shows that CPS scheme is suitable when single task is maed onto a hysical rocessor. In many cases, however, the number of tasks is often greater than the number of hysical rocessors for given alications. Therefore, multile tasks are required to be maed onto a hysical rocessor because a target device is limited in size. Subsequently, the required number of links er ort increases. Figure () deicts the five-node reresentation for the -node Wavelet alication. Figure () is derived by clustering the same tasks onto a single rocessor. s a result, the number of links er node is, or.. Note that the number of links er node is, or. when a single task is maed onto one rocessor. Figure () deicts a toology of a synthetic alication that combines the six-node MJPEG and the five-node Wavelet secification. In this case, the number of links er node is, or.. In both cases, the number of links er node increases when comared to a one-to-one maing. We imlemented different schedulers for these cases and comared the area cost. s deicted in Fig. (), the SCPS scheme x tokens/s 9 Relative service rate SS FPS CPS WCPS SCPS x tokens/s Relative service rate SS FPS CPS WCPS SCPS () Token size = word () Token size = words Figure Performance for five-node wavelet secification. Conclusions The erformance of a arallel alication increases when the underlying network can be matched to the on-demand logical toology and adhered to the bandwidth requirements of an alication. In this aer, we resented crossbar schedulers designed for reconfigurable on-chi-networks caable of adjusting themselves to custom toologies and custom bandwidth requirements. We conducted a queueing analysis to determine the overall network erformance. From the erformance evaluation and imlementation, the following conclusions can be drawn: The resented custom schedulers (CPS, WCPS, SCPS) rovide better erformance than conventional schedulers, esecially for small-sized tokens communicated over large-size network. In addition, our schedulers efficiently utilize the on-chi resources by adating themselves to the logical toologies. When the tasks are one-to-one maed onto rocessors, the custom arallel scheduler (CPS) erforms significantly better than SS, FPS, SCPS. The weighted round-robin custom scheduler (WCPS) can increase the network erformance, by assigning different network bandwidth to different traffic requirements. The WCPS is beneficial when the traffic load is not evenly distributed. The shared custom scheduler (SCPS) rovides a design trade-offs for cost and erformance when comared to the mentioned schedulers. The SCPS can be beneficial when the number of links er ort increases. The sequential scheduler (SS) rovides adequate erformance and cost when the token size is large and the size of a network is fairly small. Our schedulers were imlemented using arameterized arbiter arrays. By utilizing the on-demand

16 8 J Sign Process Syst () 8:9 8 toology and traffic requirements as design arameters, the schedulers were adated to given alications without modifying the network imlementation. We showed that our schedulers erform better and occuy significantly less area than conventional fully arallel schedulers. Our schedulers erform significantly better and occuy moderately more area than sequential schedulers. Oen ccess This article is distributed under the terms of the Creative Commons ttribution Noncommercial License which ermits any noncommercial use, distribution, and reroduction in any medium, rovided the original author(s) and source are credited. References. Mckeown, N. (999). The islip scheduling algorithm for inut-queued switches. IEEE/CM Transaction on Networking (TON), 7(), 88, ril.. Bjerregaard, T., & Mahadevan, S. (). survey of research and ractices of network-on-chi. CM Comuting Surveys, 8(),, March.. Vassiliadis, S., & Sourdis, I. (). FLUX networks: Interconnects on demand. In Proceedings of international conference on comuter systems architectures modelling and simulation (IC-SMOS ) (. 7), July.. Vassiliadis, S., & Sourdis, I. (7). FLUX interconnection networks on demand. Journal of Systems rchitecture, (), , October.. Moraes, F., Calazans, N., Mello,., Möller, L., & Ost, L. (). HERMES: n infrastructure for low area overhead acket-switching netwoks on chi. Integration, The VLSI Journal, 8(), 9 9, October.. Marescaux, T., Nollet, V., Mignolet, J.-Y., Moffat,. B. W., vasare, P., Coene, P., et al. (). Run-time suort for heterogeneous multitasking on reconfigurable SoCs. Integration, The VLSI Journal, 8(), Sethuraman, B., Bhattacharya, P., Khan, J., & Vemuri, R. (). LiPaR: lightweight arallel router for FPG based networks on chi. In Proceedings of the great lakes symosium on VLSI (GLSVLSI ) (. 7), ril. 8. Bartic, T.., Mignolet, J.-Y., Nollet, V., Marescaux, T., Verkest, D., Vernalde, S., et al. (). Toology adative network-on-chi design and imlementation. IEE Proceedings. Comuters and Digital Techniques, (), 7 7, July. 9. Brebner, G., & Levi, D. (). Networking on chi with latform FPGs. In Proceedings of the IEEE international conference on field-rogrammable technology (FPT ) (. ), December.. Srinivasan, K., & Chatha, K. S. () low comlexity heuristic for design of custom network-on-chi architectures. In Proceedings of international conference on design, automation and test in Euroe (DTE ) (. ), March.. Murali, S., & Micheli, G. D. () n alication-secific design methodology for STbus crossbar generation. In Proceedings of international conference on design, automation and test in Euroe (DTE ) (. 7 8), March.. Loghi, M., ngiolini, F., Bertozzi, D., Benini, L., & Zafalon, R. (). nalyzing on-chi communication in a MPSoC environment. In Proceedings of international conference on design, automation and test in Euroe (DTE ) (. 7 77), February.. Murali, S., Benini, L., & Micheli, G. D. (7). n alication-secific design methodology for on-chi crossbar generation. IEEE Transactions on Comuter-ided Design of Integrated Circuits and Systems, (7), 8 9, July.. Pasricha, S., Dutt, N., & Ben-Romdhane, M., (). Constraint-driven bus matrix synthesis for MPSoC. In Proceedings of th sia and South Pacific design automation conference (SP-DC ) (. ), January.. Guta, P., & McKeown, N. (999). Designing and imlementing a fast crossbar scheduler. IEEE Micro, 9(), 89, January February.. Nikolov, H., Stefanov, T., & Derettere, E. (). Multirocessor system design with ESPM. In Proceedings of the th IEEE/CM/IFIP international conference on HW/SW codesign and system synthesis (CODES-ISSS ) (. ), October. 7. Nikolov, H., Stefanov, T., & Derettere, E. (). Efficient automated synthesis, rogramming, and imlementation of multi-rocessor latforms on FPG chis. In Proceedings of th international conference on field rogrammable logic and alications (FPL ) (. 8), ugust. 8. Nikolov, H., Stefanov, T., & Derettere, E. (8). Systematic and automated multi-rocessor system design, rogramming, and imlementation. IEEE Transactions on Comuter-ided Design of Integrated Circuits, 7(),, March. 9. Kienhuis, B., Rijkema, E., & Derettere, E. (). COM- PN: Deriving rocess networks from Matlab for embedded signal rocessing architectures. In Proceedings of 8th international worksho on hardware/software codesign (CODES ) (. 7), May.. de Kock, E.., Essink, G., Smits, W. J. M., van der Wolf, P., Brunel, J.-Y., Kruijtzer, W. M., et al. (). YPI: lication modeling for signal rocessing systems. In Proceedings of the 7th design automation conference (DC ) (. ), June.. Hur, J. Y., Stefanov, T., Wong, S., & Vassiliadis, S. (7). Customizing reconfigurable on-chi crossbar scheduler. In Proceedings of IEEE 8th international conference on alication-secific systems, architectures and rocessors (SP 7) (. ), July.. Bertozzi, D., Jalabert,., Murali, S., Tamhankar, R., Stergiou, S., Benini, L., et al. (). NoC synthesis flow for customized domain secific multirocessor systems-onchi. IEEE Transactions on Parallel and Distributed Systems, (), 9, February.. Chang, K.-C., Shen, J.-S., & Chen, T.-F. (). Evaluation and design trade-offs between circuit-switched and acketswitched NOCs for alication-secific SOCs. In Proceedings of the th design automation conference (DC ) (. 8), July.. Hur, J. Y., Stefanov, T., Wong, S., & Vassiliadis, S. (7). Systematic customization of on-chi crossbar interconnects. In Proceedings of international worksho on alied reconfigurable comuting (RC 7) (. 7), March.. Baldwin, R. O., Davis IV, N. J., Midkiff, S. F., & Kobza, J. E. (). ueueing network analysis: Concets, terminology, and methods. The Journal of Systems and Software, (), 99 7, May.

J Sign Process Syst () 8:9 8 8 Netherlands, in December.

He has considerable exerience in the design of embedded reconfigurable media rocessors.

His research interests include embedded systems, multimedia rocessors, comlex instruction set architectures, reconfigurable and arallel rocessing, microcoded machines, and distributed/grid rocessing.

17 J Sign Process Syst () 8:9 8 8 Netherlands, in December. He is currently working as an assistant rofessor at the Comuter Engineering Laboratory at the Delft University of Technology (TU Delft), The Netherlands. He has considerable exerience in the design of embedded reconfigurable media rocessors. He has worked also on microcoded FPG comlex instruction engines and the modeling of arallel rocessor communication networks. His research interests include embedded systems, multimedia rocessors, comlex instruction set architectures, reconfigurable and arallel rocessing, microcoded machines, and distributed/grid rocessing. Jae Young Hur received the B.S. degree in electronics engineering from Cheju National University, Cheju, South Korea in 99. He received the M.S. degrees in electronics engineering from Sogang University, Seoul, South Korea in 998 and from Munich University of Technology, Munich, Germany in. From 999 to, he was a semiconductor engineer in Samsung Electronics, Ltd., South Korea. Since November, he has been with the comuter engineering laboratory, Delft University of Technology, The Netherlands, where he is currently a research assistant (PhD student). His research interests include embedded systems design, networks-on-chi, VLSI design, and reconfigurable comuting. Stehan Wong received his PhD in Comuter Engineering from the Electrical Engineering, Mathematics and Comuter Science faculty of the Delft University of Technology (TU Delft), The Todor Stefanov received the Dil.Ing. and M.S. degrees in comuter engineering from The Technical University of Sofia, Sofia, Bulgaria, in 998 and the Ph.D. degree in comuter science from Leiden University, Leiden, The Netherlands, in. From 998 to May, he was a Research and Develoment Engineer with Innovative Micro Systems, Ltd., Sofia. From June to ugust 7, he was with the Leiden Institute of dvanced Comuter Science, Leiden University, where he was a Research ssistant (PhD student) and a PostDoc Researcher at the Leiden Embedded Research Center. From Setember 7 to ugust 8, he was a Senior Researcher at the Comuter Engineering Lab, Delft University of Technology, Delft, The Netherlands. Since Setember, 8, Todor Stefanov has been an ssistant Professor with the Leiden Institute of dvanced Comuter Science, Leiden University where he erforms research at the Leiden Embedded Research Center. His research interests include several asects of embedded systems design, with articular emhasis on systemlevel design automation, multirocessor systems-on-chi design, and hardware/software codesign.

AUTOMATIC GENERATION OF HIGH THROUGHPUT ENERGY EFFICIENT STREAMING ARCHITECTURES FOR ARBITRARY FIXED PERMUTATIONS. Ren Chen and Viktor K.

inuts er clock cycle Streaming ermutation oututs er clock cycle AUTOMATIC GENERATION OF HIGH THROUGHPUT ENERGY EFFICIENT STREAMING ARCHITECTURES FOR ARBITRARY FIXED PERMUTATIONS Ren Chen and Viktor K.