Design Trade-offs in Customized On-chip Crossbar Schedulers

Size: px
Start display at page:

Download "Design Trade-offs in Customized On-chip Crossbar Schedulers"

Transcription

1 J Sign Process Syst () 8:9 8 DOI.7/s-8--x Design Trade-offs in Customized On-chi Crossbar Schedulers Jae Young Hur Stehan Wong Todor Stefanov Received: October 7 / Revised: June 8 / cceted: ugust 8 / Published online: Setember 8 The uthor(s) 8. This article is ublished with oen access at Sringerlink.com bstract In this aer, we resent a design and an analysis of customized crossbar schedulers for reconfigurable on-chi crossbar networks. In order to alleviate the scalability roblem in a conventional crossbar network, we roose adative schedulers on customized crossbar orts. Secifically, we resent a scheduler with a weighted round robin arbitration scheme that takes into account the bandwidth requirements of secific alications. In addition, we roose the sharing of schedulers among multile orts in order to reduce the imlementation cost. The roosed schedulers arbitrate on-demand (at design time) interconnects and adhere to the link bandwidth requirements, where hysical toologies are identical to logical toologies for given alications. Considering conventional crossbar schedulers as reference designs, a comarative erformance analysis is conducted. The hardware scheduler modules are arameterized. Exeriments with ractical alications show that our custom schedulers occuy u to 8% less area, and maintain better erformance comared to the reference schedulers. J. Y. Hur (B) S. Wong T. Stefanov Comuter Engineering Lab., TU Delft, The Netherlands J.Y.Hur@ewi.tudelft.nl URL: htt://ce.et.tudelft.nl S. Wong J.S.S.M.Wong@tudelft.nl T. Stefanov Leiden Embedded Research Center, Leiden University, Leiden, The Netherlands stefanov@liacs.nl URL: htt:// Keywords Interconnection architectures Schedulers Network toology Reconfigurable hardware Introduction It is a well-known fact that a crossbar network rovides high network erformance and minimum network congestion. The non-blocking nature of communication and the relatively simle imlementation makes the crossbar oular as an internet switch (Cisco Systems, Inc., htt:// tyical crossbar consists of a scheduler and a switch fabric. The scheduler lays a key role in achieving high network erformance and becomes more imortant as the size of the network increases. commercial crossbar scheduler tyically accommodates an arbiter er inut/outut ort and each arbiter concurrently arbitrates the incoming ackets []. In these fully arallel schedulers, all-to-all connections are required to be accommodated since traffic atterns are in most cases unknown. Nevertheless, a major bottleneck of the fully arallel scheduler is the high cost due to the increasing amount of wires as the number of orts grows. Figure deicts the area of the islip crossbar scheduler [], which is widely used for the commercial crossbar switches. s the number of orts increases, the area of the scheduler increases in an unscalable manner. This is mainly due to the all-to-all interconnects inside the scheduler module. In addition, the crossbar scheduler is an imortant basic building block for modern networks-on-chi (NoC) []. The scheduler in an on-chi router accommodates all-toall connections. In many cases, such schedulers contain a single central arbiter that sequentially arbitrates a single request at a time, while data transmission can be

2 7 J Sign Process Syst () 8:9 8 Number of gates ( x ) 7 rea of i SLIP Scheduler 8 8 Number of orts Figure rea of crossbar scheduler []. arallel. Due to this, the arbitration latency increases as the number of crossbar orts increases. This work alleviates these roblems by establishing on-demand toologies using a crossbar in a reconfigurable multirocessor system-on-chi latform. This work is motivated by the following observations: The logical toology and traffic information can be derived from the arallel alication secification. Communication atterns of different alications reresent different logical toologies. Figure deicts task grahs of realistic alications taken from [ ], where numbers on links reresent the traffic loads (or required bandwidth). The numbers between braces in the deicted toologies indicate the number of nodes and the number of required links. s an examle, MJPEG{,} indicates that the MJPEG alication requires nodes and links. s deicted in Fig., logical toologies are alication-secific and require only a small ortion of all-to-all communications. It can be noted that we focus on communication erformance. Detailed descritions of the comutation nodes in Fig. are not relevant in this work. Single alications can be secified differently as observed in the MJPEG [Fig. (7) and (8)] and MPEG secifications [Fig. () and ()]. Moreover, the traffic load is differently distributed for different communication links as deicted in Fig. (see the numbers on the links). More than 7% of modern reconfigurable hardware, such as field-rogrammable gate arrays (FP- Gs), is reoccuied by millions of wire segments. s an examle, Virtex-II Pro device has nine metal layers to enhance the routing flexibility and for a designer to realize on-demand reconfigurability. In this aer, we resent a systematic design, an analysis, and an imlementation of novel alicationsecific crossbar schedulers. Our custom schedulers arbitrate only the necessary interconnects instead of all-to-all interconnects. In addition, our weighted round-robin scheduler can be adated to the traffic requirements known at design time. The resented vu au cu rast. 9 mem mem mem idct ds us bab risc idct iq mc () MPEG {, } transfer Decoder mbintra 88 () MPEG {, } 7 Predict _acdc 8 In mem hs In mem vs Jug Jug mem O dis () PIP {8, 8} Coy 8 Coy 8 Init 8 HPF LPF HPF Coy 9 9 Coy Coy 9 LPF Coy Coy HPF Sink Coy 9 Coy 9 Coy Sink LPF Sink 9 Coy Sink () Wavelet {, } Run le vld 7 7 dec iquan idct 7 Inv U cd arm scan sam red Strie 9 Vo mem rec ad Vo mem 9 () VOPD {, } Video in VLE Video out 9 in nr mem 8 9 vs hs mem 9 hvs mem 9 jug jug 9 se blend () MWD {, } Video in (7) MJPEG {,7} (8) MJPEG {, } 9 VLE Video out Figure Logical toologies with different traffic load distribution in ractical alications [ ].

3 J Sign Process Syst () 8:9 8 7 schedulers combine the high erformance of a arallel scheduler and the reduced area of customized interconnects. The main contributions of this work are: We roose a weighted round robin custom arallel scheduler (WCPS). The resented scheduler is imlemented with on-demand interconnects, where the hysical toologies are identical to the logical toologies for an alication. Exeriments on realistic alications indicate that our WCPS scheme erforms better than reference schedulers with moderate area overheads. We roose a custom arallel scheduler with shared arbitration scheme (SCPS). Exeriments show that the imlementation cost can be reduced with moderate erformance overheads, when the number of links er ort is sufficiently large. The organization of this aer is as follows. In Section, related work is resented. The roosed scheduler designs are described in Section. In Section, we resent the erformance evaluation. The hardware imlementation results are resented in Section. Finally, conclusions are drawn in Section. Related Work Our work is based on the general aroach for ondemand reconfigurable networks [, ]. In this aer, we resent a custom crossbar scheduler utilizing on-demand reconfigurable interconnects. Numerous NoCs targeting SICs (surveyed in []) emloy rigid underlying hysical networks. Tyically, acket routers constitute tiled NoC architectures. Each acket router accommodates a crossbar switch fabric and a scheduler with internally all-to-all hysical interconnects. Our scheduler is different from the schedulers in SIC-targeted NoC routers, since on-demand toology is established in our scheduler by utilizing the reconfigurability of a reconfigurable hardware. NoCs targeting FPGs [ 8] emloy fixed toologies defined at design-time. In [ 8], acket switched networks are resented and each router contains a full crossbar fabric. The toology in this related work is defined by the interconnections between routers or switches. The scheduler in [7] accommodates an arbiter er ort, which is similar to our aroach. In [7], single Dmesh acket router for an 8-bit flit occuies slices and block memories (BRMs) in a Virtex-II Pro (xcv) device. Our work is close to [8], in which a toology adative arameterized network comonent is resented. While the crossbar interconnects inside a router of [7, 8] are still all-to-all, the hysical toology of our crossbar network is identical to the on-demand toology of the alication. In addition, the toology of our work is defined by the direct interconnection between rocessors. In other words, our custom crossbar constitutes a system network. In [9], the crossbar network is imlemented utilizing a modern artial and/or dynamic reconfiguration technology. In [9], a large-sized (98 98 bits) crossbar network utilizing native rogrammable interconnects and look-u tables (LUTs) is resented. However, allto-all crossbars in [9] are not scalable, esecially for large-sized networks. Our work differs from [9] inthat our network is customized in order to alleviate the scalability roblem. In addition, our network is imlemented in a fully synthesizable VHDL. Our imlemented network is technology-indeendent, though we target FPGs in this work. Recently, a coule of alication-secific NoC designs were roosed. In [], a multi-ho router network customization is resented, whereas our network is single-ho based. In [, ], an internal STbus crossbar customization is resented. Our work is similar to [, ] in that the crossbar is customized. In [], a design method to generate a crossbar is resented, in which an alication is traced (in simulation) and the grah clustering technique is used to generate the crossbar network. Our design method differs from [ ] in that our custom crossbar is generated, where the hysical toology is identical to the on-demand toology of an alication. Our arbitration scheme can also be configured with regards to the traffic demand, where the hysical toology inside the scheduler is again identical to the required toology. In [], a design method to synthesize an alication-secific bus matrix is resented. Our work is similar to [] inthat the alication-secific interconnects are established between masters and slaves. Furthermore, our work is similar in that a shared arbitration method is resented. However, our design method differs in that the on-demand toology information and the traffic bandwidth information are extracted from the alication secification ste. On the other hand, in [], a simulation-based traffic trace is conducted to synthesize the alication-secific toology, which requires hours of design sace exloration time. Finally, a erformance comarison between the full crossbar and the artial bus matrix is not resented. Finally, our crossbar network is similar to the virtual outut queues (VO) in that the hysical FIFOs are established at the inut orts. The VO-based switch such as islip [] scheduler is not oular in the multi-ho NoC due to the high cost. We highlight that

4 7 J Sign Process Syst () 8:9 8 our crossbar interconnects are single-ho, not multiho NoC. In other words, a crossbar (with single-ho latency) constitutes a system network in the multirocessor system on chi latform. Our alicationsecific network combines high erformance of a VO-based crossbar and the low cost of customized interconnects. In [], a fast round-robin crossbar scheduler is designed and imlemented for high erformance comuter networks. In the conventional (weighted or riority-based) round-robin schedulers and [], allto-all interconnects are established because the traffic attern is unknown. These conventional full crossbar schedulers accommodate the circuitry to arbitrate all ossible requests from all orts. The major difference with our work is that our scheduler is alicationsecific. In the resented design flow, only necessary interconnects are systematically established based on the traffic atterns that an alication exhibits. Moreover, the arbitration is only erformed over actually established interconnects, instead of conventional allto-all interconnects. Our custom scheduling scheme differs from traditional traffic-secific scheduling schemes, such as a weighted round-robin, in that our scheduler does not arbitrate unnecessary interconnects. ccordingly, the cost is reduced by systematically establishing only necessary FIFOs. Customized On-chi Crossbar Scheduler s mentioned earlier, our objective is the design of customized reconfigurable crossbar schedulers in order to reduce the area comared to a fully arallel scheduler and increase the erformance comared to a sequential scheduler. The same crossbar scheduler is required to dynamically generate the control signals to configure the switch fabric within a crossbar.. Design Flow Our scheduler has been develoed to be integrated as modular communication comonents in the ESPM tool chain as deicted in Fig.. Details of the ES- PM design methodology can be found in [ 8]. In ESPM, three inut secifications are required, namely alication/maing/latform secification in XML. n alication is secified as a Kahn Process Network (KPN). In this work, the KPN is considered as a rogramming model. KPN is a network of concurrent rocesses that communicate over unbounded FIFO channels and synchronize by a blocking read on an emty FIFO. The KPN is a suitable model of Platform sec. P P P P Parameters set Data width = bits Number of rocessors = Crossbar network Write into FIFOs Processor Processor Processor Processor Maing sec. P P P P Video in/out VLE lication sec. (KPN) Video in/out Parameters set Network tye = crossbar Scheduler = custom arallel Toology table ESPM / XPS Tools Read request Data scheduler rbiter rbiter rbiter rbiter Control signals switch VLE Customized scheduler (netlist) Figure ESPM design flow [ 8]. FIFOs FIFOs FIFOs FIFOs lication (Matlab/C) COMPN YPI Parameters set Scheduler weight table bitstream IP cores library FPG comutation on to of the resented crossbar interconnection network, due to the following facts: The synchronization scheme is relatively simle. Processors synchronize only with the full/emty status of the hardware FIFOs. Subsequently, a arallel rogrammer does not need to exlicitly handle the synchronization, since the synchronization is inherently suorted by the hardware FIFOs. The system eliminates the traditional head of the line (HOL) blocking roblem that all ackets must wait behind the contended acket in the inut queue. This is due to the fact that a hysical FIFO is established for a logical channel in a dedicated manner. KPN secification is automatically generated from a sequential Matlab rogram using the COMPN tool [9]. We define the network toology in the KPN secification as the logical toology for an alication, which consists of tasks and logical channels. Exemlary logical toologies are deicted in Fig.. Single task or multile tasks are assigned to a hysical rocessor in the maing secification. The network tye and the ort maing information are secified in the latform secification. Figure deicts how custom crossbars can be imlemented from a four-node MJPEG alication secification. In the latform secification, four rocessors are ort-maed on a crossbar. From the maing and latform secifications, ort-maed network toology is extracted as a static arameter and assed to ESPM. We define the extracted ortmaed network toology as the on-demand toology

5 J Sign Process Syst () 8:9 8 7 that an alication requires, which consists of rocessors and hysical links. dditionally, the traffic bandwidth information is obtained from the YPI tool []. Subsequently, ESPM refines the abstract latform model to an elaborate arameterized RTL (hardware) and C/C++ (software) models, which are inuts to the commercial Xilinx Platform Studio (XPS) tool. The XPS tool generates the on-demand netlist (deicted in Fig. ) with regard to the arameters assed from the inut secifications. Our system model is based on the local write, remote read scheme. The rocessor issues the request that designates the target ort index and target FIFO index. Subsequently, the FIFO resonds with a data stream to the rocessor. The crossbar network transfers the requests (from rocessors to FIFOs) and thedata (from FIFOs to rocessors). The directions of requests (from left to right) and the direction of the data transfers (from right to left) are deicted in Fig.. Finally, a bitstream is generated for the FPG rototye board to check the functionality and the erformance.. Reference Scheduling Schemes We consider a sequential scheduler (SS) and a fully arallel scheduler (FPS) as references to comare with our custom schedulers. Figure deicts the behavior of the SS, FPS, and our custom schedulers for the MJPEG alication in Fig. (8). Figure () deicts the logical toology (or data flow grah) with nodes and channels. In our model of comutation, each communication channel is maed onto the FIFOs labeled by F F. Figure () deicts the system model. The hysical system consists of nodes and links (or FIFOs). The ith node is connected to rocessor ort i and FIFO ort i. For examle, in the sixth node, rocessor P is connected to the rocessor ort and FIFO F is connected to the FIFO ort, as deicted in Fig. (). FIFO index corresonds to the channel index, as deicted in Fig. (). The crossbar network transfers the requests (from rocessors to FIFOs) and the data (from FIFOs to rocessors). For examle, a bold line in Fig. () reresents that rocessor P sends the request signal to FIFO F. Subsequently, FIFO F sends data to rocessor P. Note that data in FIFO F is from rocessor P. Figure () deicts the biartite grah based on the system model. The thick and thin lines deict all ossible requests according to the toology in Fig. (). The thick lines reresent an examle request attern. In this examle, four rocessors request to four FIFOs. The numbers on the link reresent the relative amount of traffic. For examle, the link between and is times less utilized than the link between and. Figures () (8) deict the cycle-by-cycle behavior of five different schedulers for the request atterns (bold links) in Fig. (). In order to establish the link between the rocessor and the target FIFO, three stes are required. First, a rocessor requests to an arbiter. Second, the arbiter grants the request when the target ort is idle and the round-robin ointer oints to the requesting rocessor. Third, the request is acceted when the target FIFO contains data. These oerations are denoted by R, G, and. The round-robin ointer is denoted by the oval. For the sake of simlicity, the data is assumed to be requested by a rocessor in the first cycle. The arbiter is assumed to erform a circular (weighted) round-robin arbitration in the order of,,, and so on. fter the request is granted, a link between a rocessor and a FIFO ort is established using a handshaking rotocol, which is assumed to take cycles. The bold lines in Fig. () (8) reresent actual data transmission, which is assumed to take cycles. Figure () deicts the behavior of a SS, where one ort is arbitrated at a time. crossbar contains a central arbiter that sequentially grants a single request at a time, while data transmission can be arallel. request is served after a request in the revious ort index is arbitrated. Subsequently, 8 cycles are required to serve those requests, as deicted in Fig. (). Figure () deicts FPS, where homogeneous arbiters are located in each ort. Unlike the SS, multile requests can be arbitrated in arallel. Each arbiter checks for all orts whether there is a request or not. Consequently, cycles are required in total, as deicted in Fig. (). Our FPS imlementation is similar to the islip scheduler [], in that circular round-robin ointer is udated when the request is granted. Similar to islip scheduler, allto-all interconnects are established. The islip scheduler is designed for the inut-queued acket switch. Our FPS differs from the islip scheduler in that the FPS has been imlemented for on-chi multirocessor systems with distributed memories. dditionally, while the islip scheduler [] requires two stages of arbiter arrays, our FPS requires a single stage of the arbiter array. s Fig. () deicts, FPS erforms better than SS, since concurrent requests can be served in arallel.. Round Robin Custom Scheduler The round-robin custom arallel scheduler (CPS) scheme is resented in []. The CPS is similar to the FPS in that the scheduler consists of arbiter arrays. In the CPS, however, the round-robin ointer udate oeration is erformed only for on-demand interconnects.

6 7 J Sign Process Syst () 8:9 8 Ports P F F F P P P Processors P P P Read request to remote FIFOs Data transfer P P F F F P P F F F P F Index F Interconnection network P i i th rocessor Write into FIFOs i i th rocessor ort () Logical toology () System model i i th FIFO ort () Data requests examle ( nodes, channels) F i i th FIFO ( nodes, FIFOs) (Requests in bold arrows) R : data request G : request grant : accet : actual data transfer : round -robin ointer (by rocessor) (by arbiter) (by target FIFO) (between rocessor and FIFO) cycles scheme ointer R () Single arbiter G Processor P (SS) requests to FIFO F 8 that is connected to ort. R G Single request R G R G is arbitrated R G at a time. R G R G () Full arallel R G R G R G (FPS) Multile requests can be R G R G R G arbitrated in arallel. R G R G R G ll-to-all interconnects are established. R G R G () () Custom arallel R G R G R G (CPS) RG RG Only necessary interconnects RG RG are established. RG (7) Weighted round-robin custom (WCPS) Established RG RG interconnects R G RG RG The round-robin ointer oints to. Parallel arbitration logic that arbitrates only established interconnects (8) Shared custom ointer (SCPS) ointer R G Partly arallel arbitration ointer Processor P requests Request from rocessor P is granted, Request from rocessor P is acceted, because ort Physical link between and is established. to FIFO F that is because the round-robin ointer oints to Processor P reads data from FIFO F. connected to ort. is idle and target FIFO F contains data.. Ports Ports Ports Ports F F F7 F8 F9 F F F Figure Different scheduling schemes. F F F F7 F8 F9 FIFOs Processor orts FIFO orts We exloit the fact that the alication is secified by a directed grah, in which each task has ossibly a different number of connected channels. s a design technique, each arbiter is arameterized with regard to the on-demand logical toology. lication-secific and differently sized arbiters ensure that the toology of the hysical interconnects is identical to the on-demand toology secified by the alication artitioning. Given the on-demand toologies from the alication secifications, our CPS oerates as follows:. Request: rocessor issues a request, by designating the target FIFO ort and FIFO index.. Grant: If there is a request in the round-robin ointer and the target FIFO ort is idle, the request is validated.. ccet: The target FIFO status is checked. If the target FIFO is not emty, the request is acceted and the hysical link is established. The roundrobin ointer is udated to the one that aears next in the round-robin schedule, where the roundrobin schedule is determined by the on-demand toology for an alication. If the ointed request is a Clear_Request, the link is cleared. If there is no request, the round-robin ointer

7 J Sign Process Syst () 8:9 8 7 is incremented. Figure () deicts our examle scenario for the CPS. Each arbiter checks if there is a request for required links. s an examle, has five robable requests from. Therefore, the CPS arbiter at services five links. Similarly, has two robable requests from and. Therefore, the CPS arbiter at services two links. For comarison, the FPS arbiter [Fig. ()] services links at each ort. s Fig. () indicates, cycles are required to serve the request atterns in our examle. In general, CPS erforms better than FPS, since the request search sace of CPS is a subset of the full search sace of the FPS. rea reduction also can be exected, since only on-demand links are hysically established. Moreover, only one link is often connected to an arbiter, indicating that the arbiter erforms only handshaking oeration and imlementation cost can be reduced. dditionally, CPS erforms significantly better than SS, since the arbitration is erformed in arallel. In many cases, CPS occuies more area than SS. The area overhead issue is discussed in Section.. Weighted Round Robin Scheduler In the revious section, we resented a toologically customized scheduler (CPS) scheme. The CPS erforms better and rovides lower cost for an on-demand toology, comared to reference schedulers []. In the CPS scheme, an arbitration is erformed in a simle round-robin fashion and the differently utilized traffic bandwidth is not considered. In this section, we roose a weighted round-robin custom scheduling scheme (WCPS) in order to adat the scheduler to the traffic requirements... Weighted Round Robin Custom Scheduler (WCPS) Our roosed WCPS scheme is similar to the CPS in that the round-robin ointer is udated only for on-demand interconnects. Similarly, the toology of the hysical interconnects is also identical to the ondemand toology secified in the ESPM design flow. The difference between WCPS and CPS is that the ointer udate oeration in WCPS is erformed with regard to weights. The weight refers to the relative number of utilized tokens (or ackets), which corresonds to traffic bandwidth requirements. In this context, we define the weight by the relative number of requests for tokens utilized in a communication link. token refers to a set of data words, which is our rimitive communication unit. We exloit the fact that the weights can be determined by alication rofiling. In this work, we use the YPI tool [] (see Fig. ). s a design technique, each arbiter is arameterized with resect to the on-demand toology together with a weight distribution. By using differently sized arbiters and alication-secific weight information, we can increase the network erformance. This is due to the fact that the scheduler checks links with a high robability of requests more frequently. The main idea of the WCPS to match hysical network bandwidth to logical traffic bandwidth. Only when the traffic is uniformly distributed, our WCPS is identical to CPS. The novelty of our WCPS lies in that the arbitration is erformed over only actually established on-demand interconnects. This means that no hysical circuitry is established for unnecessary interconnects. It can be noted that our scheme of assigning weights is static. We are motivated by the fact that the traffic information can be statically derived from the alication secification. The traffic variation is not significant esecially in data-streaming alications, since the traffic attern is rather regular and eriodic. The resented design flow raidly instantiates on-demand interconnects and scheduler IPs, based on the extracted traffic information. In this work, we assume that the token (or acket) sizes for the alications in Fig. are similar (or same). Though the token size can be different in the ESPM design flow, the token sizes are similar for the alications considered in this work. s an examle, YPI [] rofiling result indicates that 97% of the Wavelet traffic and 8% of the MJPEG traffic contain identical-sized tokens... WCPS Examle In Fig. (), weights for each link are deicted as an examle. The number on the channels refers to the relative number of requests for tokens. s an examle, the weight of the links from to is, resectively. The weight of the link from to is. This means that the links from to are five times more utilized than the link from to. Therefore, the WCPS arbiter at checks five times more often than. We imlement WCPS using the weight table which contains a list with the sequence of rocessor orts that an arbiter checks. Figure () deicts the weight table for ort. The sequence is ermutated such that the same rocessor ort index is sarsely distributed. We use a weight counter for each rocessor ort index to handle the ermutation. s an examle, an arbitration sequence for ort is deicted in Fig. (). In this examle, the maximum weight is, indicating that there are five sub-rounds. Initially, the weight counters are

8 7 J Sign Process Syst () 8:9 8 rbiter for ort with weighted round-robin ointer () WCPS arbiter () Weight table for ort = = = = = Iterate st nd rd th th sub-round = = = = = = = = = = = = = = = = = = = = () Request check sequence Figure Weighted round robin custom arallel scheduler. = = = = = set by the weight table, as deicted in the rectangle in Fig. (). In each sub-round, the arbiter searches rocessor orts, which have the same counter values. Subsequently, the weight counter value is decremented. When all the weight counter values are zero, or all the sub-rounds are finished, then the counter values are set by the initial weight table. The rocess is iterated, similarly to a round robin aroach. Figure (7) deicts an examle scenario for the WCPS. The weight from to is five times greater than the weight from to. In this case, cycles are required to serve the request atterns. For comarison, the CPS requires cycles. The WCPS requires less cycles than the CPS, because the arbitration time at ort is reduced. In most cases, WCPS erforms better than CPS, since the arbitration is adatively erformed with regard to the traffic information. WCPS occuies more area than CPS, since the internal logic is relatively more comlex. The area overhead issue is also discussed in Section. scheme in order to rovide a design trade-off between erformance and cost... Shared Custom Parallel Scheduler (SCPS) In this work, we roose a custom scheduler with the shared arbitration (SCPS) scheme. In SCPS, multile arbiters are shared, such that each arbiter sequentially erforms arbitration while multile arbiters oerate in arallel. Our SCPS scheme combines the advantages of the low cost in SS and increased erformance in CPS. Figure deicts an examle of CPS and SCPS for a six-node MJPEG alication. Figure () deicts CPS, where an arbiter is established er ort. Figure () deicts SCPS, where arbiters for ort (, ), (, ), and (, ) are shared. The traffic requirement is also deicted in Fig., derived from the YPI tool []. is the total incoming traffic bandwidth or an arrival rate to the ort. We aim to reduce the imlementation cost, by sharing the arbiter resources. Figure indicates that the number of arbiters in SCPS is, while the number of arbiters in CPS is. This means the imlementation cost can be reduced. Our SCPS is also constructed over on-demand interconnects. Comared to CPS, the network erformance can be likely decreased. This is due to the fact that a shared arbiter is accommodated for multile orts and the round-robin search sace can be increased for the shared arbiters. Figure (8) deicts the examle scenario for the SCPS, where sequential arbiters oerate in arallel. s Fig. (8) shows, 9 cycles are required to serve the request atterns. For comarison, the CPS requires cycles and the WCPS requires cycles... Clustering Method In this section, we resent a method to cluster arbiters. Our aim is to minimize the erformance degradation by balancing the traffic bandwidth, comared to CPS.. Shared rbiters for Custom Crossbars In our (W)CPS scheme, an arbiter is accommodated at each crossbar ort in order to erform arallel arbitration. In addition, we considered that a single task is maed onto a single rocessor. However, the number of tasks (or rocesses) is often greater than the number of orts (or rocessors). In this case, it is required to ma multile tasks onto a single rocessor. Subsequently, the number of links er node increases and the imlementation cost of (W)CPS increases accordingly. In this section, we roose a novel shared arbitration () CPS /9 /9 /9 9/9 Figure Shared custom arallel scheduler. () SCPS /9 9/9

9 J Sign Process Syst () 8: In this work, we consider clusters of two arbiters only. Given a CPS scheme, our method can be described as follows:. Calculate cost function: Calculate the individual cost using a cost function for arbiter j,where j. is the number of CPS arbiters.. Sort: Sort arbiters in an increasing order, based on the derived cost function.. Iterate clustering: Iteratively cluster arbiters with the lowest and the highest cost function, in the sorted list. The cost function is a relative metric to reresent the utilization of an arbiter. We define the cost function for the arbiter j as the following: C { } links j j = U j, () where C{ j } refers to the cost function of a CPS arbiter in the j th ort. links j refers to the number of hysical links which are connected to the arbiter j. links j corresonds to the relative arbitration latency, as described in the next section. U j refers to the summation of the traffic for the arbiter j and corresonds to relative arbiter utilization. The individual channel utilization for U j is deicted in Fig.. s an examle, U is derived by ++++ [see Fig. ()]. We multily U 9 j by links j in order to reflect the actual utilization of an arbiter. s an examle in Fig., we cluster arbiters in the following way:. Calculate cost function: We calculate the cost function for each arbiter. s an examle, links = and U = ++++ for arbiter. Therefore C{} = ( ++++ ) =.. Similarly, 9 9 C{} = C{} = C{} = C{} =.9, and C{} =... Sort: rbiters are sorted in an increasing order. We obtain ({},{},{},{},{},{}). Iterate clustering: We cluster {,}, {,}, and {,}. In this way, a highly utilized arbiter and a less utilized arbiter can be clustered. lso, more than arbiters (or variable-sized arbiters) can be shared in a similar way, from the sorted list. Performance nalysis In this section, we resent comarative erformance evaluations for different schedulers. In Section., we define the erformance metric. Different crossbar schedulers are comared in terms of the service rates. In Section., we conduct the queueing modeling and comare the entire network erformance for considered alications.. Scheduler Performance.. Performance Metric In order to comare the relative scheduler erformance for given alications, we consider the following erformance metric for different schedulers: M scheduler = N i= μ i u i N, () where M scheduler is a erformance metric for each scheduler and refers to the relative service rate in tokens/s. μ i is the service rate of the arbiter for the ith logical channel. N is the total number of logical channels. u i refers to the normalized channel utilization for the ith channel. The u i is multilied by the arbiter service rate μ i, in order to reflect the actual amount of traffic for an alication. The service rate μ i can be modeled as: μ i = (T arbit + T transmit ), () where T arbit is an arbitration latency to establish a link. T transmit is the actual data transmission latency after the link is established. T transmit can be derived as Num_Word Clk sys, where Num_Word refers to the number of data words, or the token size. Clk sys refers to the system clock frequency... Crossbar Schedulers In this subsection, we resent a erformance comarison between five different schedulers. We can fairly comare different scheduling schemes, since only the arbitration latencies are different. T arbit in Eq. for different crossbar schedulers can be aroximated as follows: #orts C hand T arbit_ss = k (a) T arbit_fps = k #orts Clk sys + C hand Clk sys #links + Chand T arbit_cps = k Clk sys (b) (c)

10 78 J Sign Process Syst () 8:9 8 #links ( ) W std W max + C hand T arbit_wcps = k #links T arbit_scps = k Clk sys { (α Chand )+( α)c hand } (d) Clk sys, (e) T arbit_ss refers to the arbitration latency (in seconds) for the sequential scheduler (SS). k is the scaling factor to calibrate the hardware imlementation. The request check latency is modeled by #orts cycles. We divide by, since the circular round-robin ointer is statistically located in the middle of the search sace. In the SS, there is only a single arbiter in the system. Only after the current rocessor ort is served, the next ort is served. C hand refers to the handshaking latency in number of cycles. Therefore, we model these sequential oerations by multilying the handshaking latency C hand by #orts. T arbit_fps refers to the arbitration latency in the fully arallel scheduler (FPS). In the FPS, the arbiter at each ort obliviously checks for all orts. The request check latency is modeled by #orts cycles, similarly to the SS. However, multile requests can be concurrently served. Therefore, we model these arallel arbitration oerations, by adding #orts and the Chand. T arbit_cps refers to the arbitration latency for the custom arallel scheduler (CPS). Similarly to the FPS, we model the arbitration latency by adding the request check latency and the C hand. However, the request check latency is modeled by #links, since the actual arbitration is erformed for the only established hysical links, instead of all links. #links is equal or less than #orts. Therefore, it is obvious that T arbit_cps is less than T arbit_ss and T arbit_fps. Only if the required toology is all-to-all, then the T arbit_cps is equal to T arbit_fps. T arbit_wcps refers to the arbitration latency for the weighted round-robin scheduler (WCPS). In Eq. d, W std refers to the standard deviation of weights for connected links in an arbiter. W max refers to the maximum weight in the connected links to an arbiter. We divide by W max in order to normalize the weight. Comared to the roundrobin CPS, the arbitration latency can be reduced. s an examle, in Fig. (), for ort is W std W max.8, or.. We used the term W std W max to statistically model the reduced arbitration latency, due to the following facts. First, the higher weight deviation means that the variation of link bandwidths (that an alication requires) is high. Second, our scheduler allocates higher bandwidth (or time slots) to the links that require high bandwidth. Third, the traffic attern of our streamed multimedia alications is highly regular and eriodic. s an examle, the MJPEG alication is erformed block-wise, where each block contains 8 8 image ixels. Due to the regularity of the traffic attern, the statistical model can be simlified as shown in Eq. d. It must be noted that the W std is calculated for only connected links. In other words, the WCPS arbiter at each ort checks for only the necessary orts with different weights. T arbit_scps refers to the arbitration latency for the shared custom arallel scheduler (SCPS). In Eq. e, α is if the number of arbiters is. In this case, SCPS is the same as SS. α is if the number of arbiters is greater than. In this case, the SCPS oerates in a same way as CPS. The difference between CPS and SCPS is that the #links of a SCPS arbiter is in many cases greater than the #links of CPS arbiter. For examle, as deicted in Fig. (), #links for (,, ) are (,,). For comarison, as deicted in Fig. (), #links for ( ) are (,,,,,)... MJPEG Examle We consider the six-node MJPEG alication, deicted in Fig. (8). The different scheduling schemes are given in Fig. 7. The service rates for each scheme are derived as follows. We assume that the schedulers oerate at MHz and k k are. C hand is assumed to be cycles er token, since each of the request and the acknowledgement requires cycle, resectively. To obtain T transmit, it is assumed that each word takes cycle to transmit. SS: The service rate μ s er token for the centralized SS arbiter is derived as follows. Considering the single-word token communications, cycle or Num_Word =, T transmit is (in seconds). MHz Since C hand is two cycles and number of orts is, T arbit_ss is obtained from Eq. a and it is ( ) cycles. By substituting T MHz arbit_ss to Eq., μ s MHz can be derived by =. ( +)cycles tokens/s. The erformance metric M SS is derived from Eq.. s Fig. 7() deicts, there are channels, or N =. The relative channel utilization u is also deicted in Fig. 7(). s a result, M SS is derived by (. )( ) or. tokens/s, from Eq.. The derived M SS indicates the relative servicerateforthemjpegtraffic.

11 J Sign Process Syst () 8: µ s () SS µ (w)c µ (w)c µ (w)c µ (w)c µ (w)c µ(w)c () (W)CPS /9 /9 /9 /9 /9 /9 /9 /9 Figure 7 Different schedulers. µ f µ f µ f µ f µ f µ f () FPS µ sc µ sc µ sc () SCPS /9 /9 /9 /9 /9 /9 /9 /9 FPS: Similarly, a service rate μ f for each FPS MHz arbiter can be derived by =.7 ( ++)cycles tokens/s [see Fig. 7()]. M FPS is derived by (.7 )( ) or.8 tokens/s, from Eq.. CPS: service rate μ c for each CPS arbiter is determined by the toology [see Fig. 7()]. μ c for MHz an arbiter is = tokens/s. ( ++)cycles MHz Similarly, μ c, μ c, μ c, μ c is = ( ++)cycles tokens/s. μ c is tokens/s. s a result, M CPS is 7. tokens/s and is derived by the following equation: M CPS ( )( ) ( )( 8 9 ) + ( )( ) 9 9 =. WCPS: service rate μ wc for each WCPS arbiter is determined by the toology as well as the weight distribution [see Fig. 7()]. The weight distribution is deicted in Fig. (8). s an examle, the standard deviation W std of (,,,, ) is.9 for arbiter in Video_in node. The maximum W weight W max is. Subsequently, std W max is calculated by.9, or.. Therefore, μ wc for arbiter is MHz = ( (.)++)cycles tokens/s. For comarison, the service rate in CPS for arbiter is tokens/s. This means that the service rate of WCPS is % higher than the service rate of CPS for arbiter. On the other hand, the weight distribution is uniform for arbiters.for the orts, the standard deviation W std is, indicating that the arbitration is erformed in the same way as in the round-robin CPS. In case of, only single link is established. Subsequently, the service rates, μ wc μ wc are the same for both CPS and WCPS. M WCPS is 7. tokens/s and is derived by the following equation: M WCPS = ( )( ) ( )( 8 9 ) + ( )( ) 9 9 SCPS: service rate μ sc for each SCPS arbiter can be derived in a similar way as CPS [see Fig. 7()]. s described earlier, the difference between CPS and SCPS is the number of links that each arbiter handles. μ sc for an arbiter (for orts, ) MHz is = ( ++)cycles tokens/s. Similarly, μ sc = μ sc = tokens/s. M SCPS is. tokens/s and is derived by the following equation: ( )( ) ( )( ) M SCPS =. Similarly, the erformance metric for different schedulers with different alications is derived, as deicted in Fig. 8. The alication task grahs of the - node MPEG, the eight-node PIP, the -node VOPD, the -node MWD were taken from []. The task grah of the six-node MPEG secification was taken from []. The task grahs of the MJPEG and Wavelet alications were taken from []. When the token size is -word, our WCPS rovides better service rate by factor of comared to SS and comared to FPS. Our WCPS erforms % better than CPS for MPEG and Wavelet alications. The erformance imrovement of WCPS over CPS is relatively small for other alications due to the following reasons. First, the toology in ractical alications is simle in the sense that the average number of links er node is only.. In many cases, only a single link is connected to the crossbar ort, indicating that no arbitration is necessary for those orts. The SCPS is times better than SS and.7 times better than FPS. When the token size is - word, the CPS rovides on average % better service rate than SCPS. This erformance degradation of SCPS over CPS is due to the sharing arbiter resources. s Fig. 8() shows, when the token size is -words, our WCPS erforms still better than the reference schedulers, while the imrovement is significantly less than the case of the small-sized tokens. This is due.

12 8 J Sign Process Syst () 8:9 8 x tokens/s Relative service rate SS FPS CPS WCPS SCPS x tokens/s Relative service rate SS FPS CPS WCPS SCPS MJPEG{,7} MPEG {,} MJPEG {,} PIP {8,8} MWD {,} VOPD {,} MPEG {,} () Token size = word () Token size = words Figure 8 Relative service rate for different schedulers. Wavelet {,} MJPEG{,7} MPEG {,} MJPEG {,} PIP {8,8} MWD {,} VOPD {,} MPEG {,} Wavelet {,} to the fact that the token transmission latency is a dominant term to determine the service rate.. Network Performance In the revious section, the comarative erformance of different scheduler modules is resented. In this section, the entire network erformance is comared in terms of latency and throughut... ueueing Modeling We formulated a network erformance model to comare the relative latency and throughut. Our analysis is based on the queuing model [], since the queuing model rovides a reasonable fit to the reality with relatively simle formulation. Based on the general queuing model, the following assumtions are made: The system network conforms to the Jackson model []. Each queue behaves as an indeendent single server and the total network latency can be modeled as the combination of each service latency. Each server is analyzed by an (M/M/) queuing model. In other words, the incoming traffic obeys the Poisson distribution. The data arrivals occur randomly and the service time distribution is exonential. If the server is idle, a data in the queue is served immediately. The queue size is adequately large to avoid the stall of the data flow. The Jackson s oen queueing model is based on the network of queues [] and can be utilized as a modeling method in general, since most of NoC-based systems accommodate buffers (or queues) for the communication. The queueing model can be suitably alied to our system due to the following facts: The KPN model and the actual system are indeed a network of queues. The incoming data stream attern is statistically random. token in the FIFO is indeendently served by a single scheduler (or server) at each crossbar ort. In addition, the hysical queue size is sufficiently large. s an examle, we imlement a FIFO using (multile) embedded block RMs, where single block RM rimitive in our target device can accommodate -bit words, which is large enough. Consequently, the general network latency can be modeled as: T network = N i= i μ i i, () where T network is the total latency of the system network. N is the number of queueing systems. is the total incoming arrival token rate to the network (or outgoing rate from the network). i is the incoming arrival token rate to the ith queue. μ i is the service rate of the arbiter for the ith queue... Case Studies We studied four cases with two different network sizes and two different token sizes. First, we consider the sixnode MJPEG alication, deicted in Fig. (8) which requires a small network size. The ort-maed system

13 J Sign Process Syst () 8:9 8 8 Video in P P rocessor rocessor ort FIFO ort, ' ' ' ' ' ' µ µ µ µ µ µ c c c c c c /9 /9 /9 /9 / /9 () Network model for (W)CPS Figure 9 case study. 8 /9 /9 P P P P () Network of queues,, 7 9 ' µ sc legend Video out P queueing system 7 /9, 8 9 µ ' sc,, µ ' sc, () Network model for SCPS model is deicted in Fig. 9(). Considering rocessor P as a streamed data source, P generates the data inarateof (tokens/s). token rate in each queue is derived from the YPI rofiler [], as deicted in Fig. 9(). Figure 9() and () deict the network model for (W)CPS and SCPS. The service rate μ for each scheduler is derived in the same way as in Section.. with Eq.. The total network latency can be derived by substituting the service rate μ to Eq.. s a result, the network latencies are derived and deicted in Fig. (a). The erformance analysis indicates that the WCPS reduces latency by at least % than SS and at least % better than the FPS for all token rate ranges. lso, the erformance is better imroved as the token rate increases. Moreover, the network with our WCPS saturates at the bandwidth of tokens/s and the network with SS saturates at the bandwidth of tokens/s. Therefore, WCPS is. better than SS in terms of throughut. Similarly, our WCPS is better than FPS in terms of throughut. Comared to CPS, WCPS rovides marginal erformance imrovement. This is due to the fact that the weight variation is not high, as deicted in Fig. 9(). Comared to SCPS, WCPS reduces the latency by at least % and rovides % better throughut. Second, we consider the MJPEG case study for a large token size with Num_Word =. The network erformance can be derived in the similar way as in the revious case, while only the service rate differs. cycles MHz T transmit is derived by, because the Num_Word is. s a result, Fig. (b) shows that the WCPS erforms at least % better than SS and.% better than FPS for all ranges. The erformance imrovement Figure Network erformance Network latency (us) Token rate (X tokens/s) (a) Token size = word CPS FPS SS WCPS SCPS () -node MJPEG alication in Figure (8 ) 8 Network latency (us) 9 Token rate (X tokens/s) (b) Token size = words Network latency (us). Token rate (X tokens/s) (a) Token size = word CPS FPS SS WCPS SCPS () -node Wavelet alication in Figure ( ) 8 Network latency (us) 8 Token rate (X tokens/s) (b) Token size = words

14 8 J Sign Process Syst () 8:9 8 is smaller than the case of single-word token transactions, since T transmit is a dominant factor for the network latency, comared to T arbit. Third, we consider the -node Wavelet alication deicted in Fig. (), which requires a large-sized network and a single-word token size. s the number of crossbar orts increases, T arbit for SS and FPS roortionally increases. However, T arbit for (W)CPS does not increase, since the average number of orts for the round-robin ointer is.. In other words, on average. orts are only required to be arbitrated by an arbiter, instead of orts. Figure (a) deicts the network latency for single-word token transactions. The network latency of CPS is reduced by at least 8% comared to SS and at least 7% comared to FPS. The WCPS rovides % higher throughut than CPS. Finally, Fig. (b) deicts the network latency for -word token transactions in the Wavelet alication. CPS erforms at least % better than SS and % better than FPS. It can be suggested that the resented CPS, WCPS, SCPS schemes are more beneficial for small sized tokens communicated over large networks. Imlementation Results The aforementioned scheduler modules were imlemented in VHDL to integrate the resented network comonents in the ESPM design environment [ 8]. The resented schedulers are imlemented with arameterized arbiter arrays. The scheduler modules are generic in terms of data width, number of orts, on-demand toologies, and a traffic weight table. The switch module in [] is used as a common interconnects fabric and the communication controller in [7] is used as a common network interface. The functionality of the network is verified by VHDL simulations. In order to fairly comare different schedulers, we imlemented the schedulers such that k k = for Eq.. Imlementation details can be found in []. The exemlary behaviors of the imlemented schedulers are deicted in Fig.. The imlemented scheduler modules are comared in terms of area utilization. We exerimented with different task grah toologies of realistic alications, deicted in Fig.. The imlemented schedulers were synthesized using the Xilinx ISE 8. tool targeting the Virtex-II Pro (xcv-7-89) FPG and the areas were obtained and deicted in Fig. (). Our CPS requires on average 8% less area comared to the FPS. The CPS requires on average 8% more area comared to the network with SS. We consider that the area overhead of our CPS over SS is relatively less significant, since our target xcv device contains 9,8 slices and chi-wise overhead of CPS over SS is on average %. The area of our network is not only deendent on the number of nodes that determine its size but also on the network toology. It is observed that the higher area reduction is obtained as the network size increases. This is due to the fact that the average number of links er node is. and does not increase as the number of nodes increases. WCPS occuies on average % more area than CPS. The SCPS occuies % more area than CPS. s described earlier, this is due to the fact that in many cases only one link is connected to arbiters in CPS scheme. Due to this, the MJPEG {,7} SS FPS CPS WCPS SCPS MPEG {,} Number of slices Number of slices Number of slices MJPEG {,} PIP {8, 8} MWD {,} VOPD {,} MPEG {,} Wavelet {,} VERGE {, } SS FPS CPS WCPS SCPS () One-task to one-rocessor maing () Multile-to-one maing 8 7 SS FPS CPS WCPS SCPS Wavelet {, } MJPEG+Wavelet Figure Exerimental results.

15 J Sign Process Syst () 8:9 8 8 Init HPF Coy LPF () node-wavelet {, } Sink Video in Init Coy HPF LPF VLE Video out Sink () -node MJPEG + -node Wavelet {, } Figure Toology after multile tasks are maed onto a rocessor. requires 7% less area than the CPS scheme. Figure deicts the erformance metric defined in Section. for the five-node Wavelet reresentation. s Fig. () deicts, the WCPS rovides % better service rate for -word token sizes. The WCPS scheme also occuies % more area than the CPS, whereas the chi-wise overhead is %, for five-node Wavelet reresentation. The exerimental result suggests that when the number of links er ort increases, the SCPS can be beneficial, roviding a design trade-offs between cost and erformance. CPS arbiter logic can be greatly simlified, esecially when a single task is maed onto a rocessor. In addition, the single SCPS arbiter contains relatively more comlex decoding logic than single CPS arbiter. The exeriment shows that CPS scheme is suitable when single task is maed onto a hysical rocessor. In many cases, however, the number of tasks is often greater than the number of hysical rocessors for given alications. Therefore, multile tasks are required to be maed onto a hysical rocessor because a target device is limited in size. Subsequently, the required number of links er ort increases. Figure () deicts the five-node reresentation for the -node Wavelet alication. Figure () is derived by clustering the same tasks onto a single rocessor. s a result, the number of links er node is, or.. Note that the number of links er node is, or. when a single task is maed onto one rocessor. Figure () deicts a toology of a synthetic alication that combines the six-node MJPEG and the five-node Wavelet secification. In this case, the number of links er node is, or.. In both cases, the number of links er node increases when comared to a one-to-one maing. We imlemented different schedulers for these cases and comared the area cost. s deicted in Fig. (), the SCPS scheme x tokens/s 9 Relative service rate SS FPS CPS WCPS SCPS x tokens/s Relative service rate SS FPS CPS WCPS SCPS () Token size = word () Token size = words Figure Performance for five-node wavelet secification. Conclusions The erformance of a arallel alication increases when the underlying network can be matched to the on-demand logical toology and adhered to the bandwidth requirements of an alication. In this aer, we resented crossbar schedulers designed for reconfigurable on-chi-networks caable of adjusting themselves to custom toologies and custom bandwidth requirements. We conducted a queueing analysis to determine the overall network erformance. From the erformance evaluation and imlementation, the following conclusions can be drawn: The resented custom schedulers (CPS, WCPS, SCPS) rovide better erformance than conventional schedulers, esecially for small-sized tokens communicated over large-size network. In addition, our schedulers efficiently utilize the on-chi resources by adating themselves to the logical toologies. When the tasks are one-to-one maed onto rocessors, the custom arallel scheduler (CPS) erforms significantly better than SS, FPS, SCPS. The weighted round-robin custom scheduler (WCPS) can increase the network erformance, by assigning different network bandwidth to different traffic requirements. The WCPS is beneficial when the traffic load is not evenly distributed. The shared custom scheduler (SCPS) rovides a design trade-offs for cost and erformance when comared to the mentioned schedulers. The SCPS can be beneficial when the number of links er ort increases. The sequential scheduler (SS) rovides adequate erformance and cost when the token size is large and the size of a network is fairly small. Our schedulers were imlemented using arameterized arbiter arrays. By utilizing the on-demand

16 8 J Sign Process Syst () 8:9 8 toology and traffic requirements as design arameters, the schedulers were adated to given alications without modifying the network imlementation. We showed that our schedulers erform better and occuy significantly less area than conventional fully arallel schedulers. Our schedulers erform significantly better and occuy moderately more area than sequential schedulers. Oen ccess This article is distributed under the terms of the Creative Commons ttribution Noncommercial License which ermits any noncommercial use, distribution, and reroduction in any medium, rovided the original author(s) and source are credited. References. Mckeown, N. (999). The islip scheduling algorithm for inut-queued switches. IEEE/CM Transaction on Networking (TON), 7(), 88, ril.. Bjerregaard, T., & Mahadevan, S. (). survey of research and ractices of network-on-chi. CM Comuting Surveys, 8(),, March.. Vassiliadis, S., & Sourdis, I. (). FLUX networks: Interconnects on demand. In Proceedings of international conference on comuter systems architectures modelling and simulation (IC-SMOS ) (. 7), July.. Vassiliadis, S., & Sourdis, I. (7). FLUX interconnection networks on demand. Journal of Systems rchitecture, (), , October.. Moraes, F., Calazans, N., Mello,., Möller, L., & Ost, L. (). HERMES: n infrastructure for low area overhead acket-switching netwoks on chi. Integration, The VLSI Journal, 8(), 9 9, October.. Marescaux, T., Nollet, V., Mignolet, J.-Y., Moffat,. B. W., vasare, P., Coene, P., et al. (). Run-time suort for heterogeneous multitasking on reconfigurable SoCs. Integration, The VLSI Journal, 8(), Sethuraman, B., Bhattacharya, P., Khan, J., & Vemuri, R. (). LiPaR: lightweight arallel router for FPG based networks on chi. In Proceedings of the great lakes symosium on VLSI (GLSVLSI ) (. 7), ril. 8. Bartic, T.., Mignolet, J.-Y., Nollet, V., Marescaux, T., Verkest, D., Vernalde, S., et al. (). Toology adative network-on-chi design and imlementation. IEE Proceedings. Comuters and Digital Techniques, (), 7 7, July. 9. Brebner, G., & Levi, D. (). Networking on chi with latform FPGs. In Proceedings of the IEEE international conference on field-rogrammable technology (FPT ) (. ), December.. Srinivasan, K., & Chatha, K. S. () low comlexity heuristic for design of custom network-on-chi architectures. In Proceedings of international conference on design, automation and test in Euroe (DTE ) (. ), March.. Murali, S., & Micheli, G. D. () n alication-secific design methodology for STbus crossbar generation. In Proceedings of international conference on design, automation and test in Euroe (DTE ) (. 7 8), March.. Loghi, M., ngiolini, F., Bertozzi, D., Benini, L., & Zafalon, R. (). nalyzing on-chi communication in a MPSoC environment. In Proceedings of international conference on design, automation and test in Euroe (DTE ) (. 7 77), February.. Murali, S., Benini, L., & Micheli, G. D. (7). n alication-secific design methodology for on-chi crossbar generation. IEEE Transactions on Comuter-ided Design of Integrated Circuits and Systems, (7), 8 9, July.. Pasricha, S., Dutt, N., & Ben-Romdhane, M., (). Constraint-driven bus matrix synthesis for MPSoC. In Proceedings of th sia and South Pacific design automation conference (SP-DC ) (. ), January.. Guta, P., & McKeown, N. (999). Designing and imlementing a fast crossbar scheduler. IEEE Micro, 9(), 89, January February.. Nikolov, H., Stefanov, T., & Derettere, E. (). Multirocessor system design with ESPM. In Proceedings of the th IEEE/CM/IFIP international conference on HW/SW codesign and system synthesis (CODES-ISSS ) (. ), October. 7. Nikolov, H., Stefanov, T., & Derettere, E. (). Efficient automated synthesis, rogramming, and imlementation of multi-rocessor latforms on FPG chis. In Proceedings of th international conference on field rogrammable logic and alications (FPL ) (. 8), ugust. 8. Nikolov, H., Stefanov, T., & Derettere, E. (8). Systematic and automated multi-rocessor system design, rogramming, and imlementation. IEEE Transactions on Comuter-ided Design of Integrated Circuits, 7(),, March. 9. Kienhuis, B., Rijkema, E., & Derettere, E. (). COM- PN: Deriving rocess networks from Matlab for embedded signal rocessing architectures. In Proceedings of 8th international worksho on hardware/software codesign (CODES ) (. 7), May.. de Kock, E.., Essink, G., Smits, W. J. M., van der Wolf, P., Brunel, J.-Y., Kruijtzer, W. M., et al. (). YPI: lication modeling for signal rocessing systems. In Proceedings of the 7th design automation conference (DC ) (. ), June.. Hur, J. Y., Stefanov, T., Wong, S., & Vassiliadis, S. (7). Customizing reconfigurable on-chi crossbar scheduler. In Proceedings of IEEE 8th international conference on alication-secific systems, architectures and rocessors (SP 7) (. ), July.. Bertozzi, D., Jalabert,., Murali, S., Tamhankar, R., Stergiou, S., Benini, L., et al. (). NoC synthesis flow for customized domain secific multirocessor systems-onchi. IEEE Transactions on Parallel and Distributed Systems, (), 9, February.. Chang, K.-C., Shen, J.-S., & Chen, T.-F. (). Evaluation and design trade-offs between circuit-switched and acketswitched NOCs for alication-secific SOCs. In Proceedings of the th design automation conference (DC ) (. 8), July.. Hur, J. Y., Stefanov, T., Wong, S., & Vassiliadis, S. (7). Systematic customization of on-chi crossbar interconnects. In Proceedings of international worksho on alied reconfigurable comuting (RC 7) (. 7), March.. Baldwin, R. O., Davis IV, N. J., Midkiff, S. F., & Kobza, J. E. (). ueueing network analysis: Concets, terminology, and methods. The Journal of Systems and Software, (), 99 7, May.

17 J Sign Process Syst () 8:9 8 8 Netherlands, in December. He is currently working as an assistant rofessor at the Comuter Engineering Laboratory at the Delft University of Technology (TU Delft), The Netherlands. He has considerable exerience in the design of embedded reconfigurable media rocessors. He has worked also on microcoded FPG comlex instruction engines and the modeling of arallel rocessor communication networks. His research interests include embedded systems, multimedia rocessors, comlex instruction set architectures, reconfigurable and arallel rocessing, microcoded machines, and distributed/grid rocessing. Jae Young Hur received the B.S. degree in electronics engineering from Cheju National University, Cheju, South Korea in 99. He received the M.S. degrees in electronics engineering from Sogang University, Seoul, South Korea in 998 and from Munich University of Technology, Munich, Germany in. From 999 to, he was a semiconductor engineer in Samsung Electronics, Ltd., South Korea. Since November, he has been with the comuter engineering laboratory, Delft University of Technology, The Netherlands, where he is currently a research assistant (PhD student). His research interests include embedded systems design, networks-on-chi, VLSI design, and reconfigurable comuting. Stehan Wong received his PhD in Comuter Engineering from the Electrical Engineering, Mathematics and Comuter Science faculty of the Delft University of Technology (TU Delft), The Todor Stefanov received the Dil.Ing. and M.S. degrees in comuter engineering from The Technical University of Sofia, Sofia, Bulgaria, in 998 and the Ph.D. degree in comuter science from Leiden University, Leiden, The Netherlands, in. From 998 to May, he was a Research and Develoment Engineer with Innovative Micro Systems, Ltd., Sofia. From June to ugust 7, he was with the Leiden Institute of dvanced Comuter Science, Leiden University, where he was a Research ssistant (PhD student) and a PostDoc Researcher at the Leiden Embedded Research Center. From Setember 7 to ugust 8, he was a Senior Researcher at the Comuter Engineering Lab, Delft University of Technology, Delft, The Netherlands. Since Setember, 8, Todor Stefanov has been an ssistant Professor with the Leiden Institute of dvanced Comuter Science, Leiden University where he erforms research at the Leiden Embedded Research Center. His research interests include several asects of embedded systems design, with articular emhasis on systemlevel design automation, multirocessor systems-on-chi design, and hardware/software codesign.

AUTOMATIC GENERATION OF HIGH THROUGHPUT ENERGY EFFICIENT STREAMING ARCHITECTURES FOR ARBITRARY FIXED PERMUTATIONS. Ren Chen and Viktor K.

AUTOMATIC GENERATION OF HIGH THROUGHPUT ENERGY EFFICIENT STREAMING ARCHITECTURES FOR ARBITRARY FIXED PERMUTATIONS. Ren Chen and Viktor K. inuts er clock cycle Streaming ermutation oututs er clock cycle AUTOMATIC GENERATION OF HIGH THROUGHPUT ENERGY EFFICIENT STREAMING ARCHITECTURES FOR ARBITRARY FIXED PERMUTATIONS Ren Chen and Viktor K.

More information

An Efficient Coding Method for Coding Region-of-Interest Locations in AVS2

An Efficient Coding Method for Coding Region-of-Interest Locations in AVS2 An Efficient Coding Method for Coding Region-of-Interest Locations in AVS2 Mingliang Chen 1, Weiyao Lin 1*, Xiaozhen Zheng 2 1 Deartment of Electronic Engineering, Shanghai Jiao Tong University, China

More information

Efficient Parallel Hierarchical Clustering

Efficient Parallel Hierarchical Clustering Efficient Parallel Hierarchical Clustering Manoranjan Dash 1,SimonaPetrutiu, and Peter Scheuermann 1 Deartment of Information Systems, School of Comuter Engineering, Nanyang Technological University, Singaore

More information

COMP Parallel Computing. BSP (1) Bulk-Synchronous Processing Model

COMP Parallel Computing. BSP (1) Bulk-Synchronous Processing Model COMP 6 - Parallel Comuting Lecture 6 November, 8 Bulk-Synchronous essing Model Models of arallel comutation Shared-memory model Imlicit communication algorithm design and analysis relatively simle but

More information

A Study of Protocols for Low-Latency Video Transport over the Internet

A Study of Protocols for Low-Latency Video Transport over the Internet A Study of Protocols for Low-Latency Video Transort over the Internet Ciro A. Noronha, Ph.D. Cobalt Digital Santa Clara, CA ciro.noronha@cobaltdigital.com Juliana W. Noronha University of California, Davis

More information

A power efficient flit-admission scheme for wormhole-switched networks on chip

A power efficient flit-admission scheme for wormhole-switched networks on chip A ower efficient flit-admission scheme for wormhole-switched networks on chi Zhonghai Lu, Li Tong, Bei Yin and Axel Jantsch Laboratory of Electronics and Comuter Systems Royal Institute of Technology,

More information

Layered Switching for Networks on Chip

Layered Switching for Networks on Chip Layered Switching for Networks on Chi Zhonghai Lu zhonghai@kth.se Ming Liu mingliu@kth.se Royal Institute of Technology, Sweden Axel Jantsch axel@kth.se ABSTRACT We resent and evaluate a novel switching

More information

Hardware-Accelerated Formal Verification

Hardware-Accelerated Formal Verification Hardare-Accelerated Formal Verification Hiroaki Yoshida, Satoshi Morishita 3 Masahiro Fujita,. VLSI Design and Education Center (VDEC), University of Tokyo. CREST, Jaan Science and Technology Agency 3.

More information

Sensitivity Analysis for an Optimal Routing Policy in an Ad Hoc Wireless Network

Sensitivity Analysis for an Optimal Routing Policy in an Ad Hoc Wireless Network 1 Sensitivity Analysis for an Otimal Routing Policy in an Ad Hoc Wireless Network Tara Javidi and Demosthenis Teneketzis Deartment of Electrical Engineering and Comuter Science University of Michigan Ann

More information

The Scalability and Performance of Common Vector Solution to Generalized Label Continuity Constraint in Hybrid Optical/Packet Networks

The Scalability and Performance of Common Vector Solution to Generalized Label Continuity Constraint in Hybrid Optical/Packet Networks The Scalability and Performance of Common Vector Solution to Generalized abel Continuity Constraint in Hybrid Otical/Pacet etwors Shujia Gong and Ban Jabbari {sgong, bjabbari}@gmuedu George Mason University

More information

SPITFIRE: Scalable Parallel Algorithms for Test Set Partitioned Fault Simulation

SPITFIRE: Scalable Parallel Algorithms for Test Set Partitioned Fault Simulation To aear in IEEE VLSI Test Symosium, 1997 SITFIRE: Scalable arallel Algorithms for Test Set artitioned Fault Simulation Dili Krishnaswamy y Elizabeth M. Rudnick y Janak H. atel y rithviraj Banerjee z y

More information

A Model-Adaptable MOSFET Parameter Extraction System

A Model-Adaptable MOSFET Parameter Extraction System A Model-Adatable MOSFET Parameter Extraction System Masaki Kondo Hidetoshi Onodera Keikichi Tamaru Deartment of Electronics Faculty of Engineering, Kyoto University Kyoto 66-1, JAPAN Tel: +81-7-73-313

More information

Argo Programming Guide

Argo Programming Guide Argo Programming Guide Evangelia Kasaaki, asmus Bo Sørensen February 9, 2015 Coyright 2014 Technical University of Denmark This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International

More information

OMNI: An Efficient Overlay Multicast. Infrastructure for Real-time Applications

OMNI: An Efficient Overlay Multicast. Infrastructure for Real-time Applications OMNI: An Efficient Overlay Multicast Infrastructure for Real-time Alications Suman Banerjee, Christoher Kommareddy, Koushik Kar, Bobby Bhattacharjee, Samir Khuller Abstract We consider an overlay architecture

More information

J. Parallel Distrib. Comput.

J. Parallel Distrib. Comput. J. Parallel Distrib. Comut. 71 (2011) 288 301 Contents lists available at ScienceDirect J. Parallel Distrib. Comut. journal homeage: www.elsevier.com/locate/jdc Quality of security adatation in arallel

More information

Equality-Based Translation Validator for LLVM

Equality-Based Translation Validator for LLVM Equality-Based Translation Validator for LLVM Michael Ste, Ross Tate, and Sorin Lerner University of California, San Diego {mste,rtate,lerner@cs.ucsd.edu Abstract. We udated our Peggy tool, reviously resented

More information

A Metaheuristic Scheduler for Time Division Multiplexed Network-on-Chip

A Metaheuristic Scheduler for Time Division Multiplexed Network-on-Chip Downloaded from orbit.dtu.dk on: Jan 25, 2019 A Metaheuristic Scheduler for Time Division Multilexed Network-on-Chi Sørensen, Rasmus Bo; Sarsø, Jens; Pedersen, Mark Ruvald; Højgaard, Jasur Publication

More information

EP2200 Performance analysis of Communication networks. Topic 3 Congestion and rate control

EP2200 Performance analysis of Communication networks. Topic 3 Congestion and rate control EP00 Performance analysis of Communication networks Toic 3 Congestion and rate control Congestion, rate and error control Lecture material: Bertsekas, Gallager, Data networks, 6.- I. Kay, Stochastic modeling,

More information

Introduction to Parallel Algorithms

Introduction to Parallel Algorithms CS 1762 Fall, 2011 1 Introduction to Parallel Algorithms Introduction to Parallel Algorithms ECE 1762 Algorithms and Data Structures Fall Semester, 2011 1 Preliminaries Since the early 1990s, there has

More information

Multicast in Wormhole-Switched Torus Networks using Edge-Disjoint Spanning Trees 1

Multicast in Wormhole-Switched Torus Networks using Edge-Disjoint Spanning Trees 1 Multicast in Wormhole-Switched Torus Networks using Edge-Disjoint Sanning Trees 1 Honge Wang y and Douglas M. Blough z y Myricom Inc., 325 N. Santa Anita Ave., Arcadia, CA 916, z School of Electrical and

More information

Efficient Processing of Top-k Dominating Queries on Multi-Dimensional Data

Efficient Processing of Top-k Dominating Queries on Multi-Dimensional Data Efficient Processing of To-k Dominating Queries on Multi-Dimensional Data Man Lung Yiu Deartment of Comuter Science Aalborg University DK-922 Aalborg, Denmark mly@cs.aau.dk Nikos Mamoulis Deartment of

More information

Complexity Issues on Designing Tridiagonal Solvers on 2-Dimensional Mesh Interconnection Networks

Complexity Issues on Designing Tridiagonal Solvers on 2-Dimensional Mesh Interconnection Networks Journal of Comuting and Information Technology - CIT 8, 2000, 1, 1 12 1 Comlexity Issues on Designing Tridiagonal Solvers on 2-Dimensional Mesh Interconnection Networks Eunice E. Santos Deartment of Electrical

More information

Implementation of Evolvable Fuzzy Hardware for Packet Scheduling Through Online Context Switching

Implementation of Evolvable Fuzzy Hardware for Packet Scheduling Through Online Context Switching Imlementation of Evolvable Fuzzy Hardware for Packet Scheduling Through Online Context Switching Ju Hui Li, eng Hiot Lim and Qi Cao School of EEE, Block S Nanyang Technological University Singaore 639798

More information

A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing

A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Grah Processing Abdullah Gharaibeh, Lauro Beltrão Costa, Elizeu Santos-Neto, Matei Rieanu Deartment of Electrical and Comuter Engineering, The University

More information

An Indexing Framework for Structured P2P Systems

An Indexing Framework for Structured P2P Systems An Indexing Framework for Structured P2P Systems Adina Crainiceanu Prakash Linga Ashwin Machanavajjhala Johannes Gehrke Carl Lagoze Jayavel Shanmugasundaram Deartment of Comuter Science, Cornell University

More information

Control plane and data plane. Computing systems now. Glacial process of innovation made worse by standards process. Computing systems once upon a time

Control plane and data plane. Computing systems now. Glacial process of innovation made worse by standards process. Computing systems once upon a time Classical work Architecture A A A Intro to SDN A A Oerating A Secialized Packet A A Oerating Secialized Packet A A A Oerating A Secialized Packet A A Oerating A Secialized Packet Oerating Secialized Packet

More information

Figure 8.1: Home age taken from the examle health education site (htt:// Setember 14, 2001). 201

Figure 8.1: Home age taken from the examle health education site (htt://  Setember 14, 2001). 201 200 Chater 8 Alying the Web Interface Profiles: Examle Web Site Assessment 8.1 Introduction This chater describes the use of the rofiles develoed in Chater 6 to assess and imrove the quality of an examle

More information

A New and Efficient Algorithm-Based Fault Tolerance Scheme for A Million Way Parallelism

A New and Efficient Algorithm-Based Fault Tolerance Scheme for A Million Way Parallelism A New and Efficient Algorithm-Based Fault Tolerance Scheme for A Million Way Parallelism Erlin Yao, Mingyu Chen, Rui Wang, Wenli Zhang, Guangming Tan Key Laboratory of Comuter System and Architecture Institute

More information

Collective communication: theory, practice, and experience

Collective communication: theory, practice, and experience CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Comutat.: Pract. Exer. 2007; 19:1749 1783 Published online 5 July 2007 in Wiley InterScience (www.interscience.wiley.com)..1206 Collective

More information

Source Coding and express these numbers in a binary system using M log

Source Coding and express these numbers in a binary system using M log Source Coding 30.1 Source Coding Introduction We have studied how to transmit digital bits over a radio channel. We also saw ways that we could code those bits to achieve error correction. Bandwidth is

More information

An Efficient Video Program Delivery algorithm in Tree Networks*

An Efficient Video Program Delivery algorithm in Tree Networks* 3rd International Symosium on Parallel Architectures, Algorithms and Programming An Efficient Video Program Delivery algorithm in Tree Networks* Fenghang Yin 1 Hong Shen 1,2,** 1 Deartment of Comuter Science,

More information

Semi-Markov Process based Model for Performance Analysis of Wireless LANs

Semi-Markov Process based Model for Performance Analysis of Wireless LANs Semi-Markov Process based Model for Performance Analysis of Wireless LANs Murali Krishna Kadiyala, Diti Shikha, Ravi Pendse, and Neeraj Jaggi Deartment of Electrical Engineering and Comuter Science Wichita

More information

An Efficient VLSI Architecture for Adaptive Rank Order Filter for Image Noise Removal

An Efficient VLSI Architecture for Adaptive Rank Order Filter for Image Noise Removal International Journal of Information and Electronics Engineering, Vol. 1, No. 1, July 011 An Efficient VLSI Architecture for Adative Rank Order Filter for Image Noise Removal M. C Hanumantharaju, M. Ravishankar,

More information

Collective Communication: Theory, Practice, and Experience. FLAME Working Note #22

Collective Communication: Theory, Practice, and Experience. FLAME Working Note #22 Collective Communication: Theory, Practice, and Exerience FLAME Working Note # Ernie Chan Marcel Heimlich Avi Purkayastha Robert van de Geijn Setember, 6 Abstract We discuss the design and high-erformance

More information

Lecture 18. Today, we will discuss developing algorithms for a basic model for parallel computing the Parallel Random Access Machine (PRAM) model.

Lecture 18. Today, we will discuss developing algorithms for a basic model for parallel computing the Parallel Random Access Machine (PRAM) model. U.C. Berkeley CS273: Parallel and Distributed Theory Lecture 18 Professor Satish Rao Lecturer: Satish Rao Last revised Scribe so far: Satish Rao (following revious lecture notes quite closely. Lecture

More information

IMS Network Deployment Cost Optimization Based on Flow-Based Traffic Model

IMS Network Deployment Cost Optimization Based on Flow-Based Traffic Model IMS Network Deloyment Cost Otimization Based on Flow-Based Traffic Model Jie Xiao, Changcheng Huang and James Yan Deartment of Systems and Comuter Engineering, Carleton University, Ottawa, Canada {jiexiao,

More information

An FPGA Implementation of ( )-Regular Low-Density. Parity-Check Code Decoder

An FPGA Implementation of ( )-Regular Low-Density. Parity-Check Code Decoder An FPGA Imlementation of ( )-Regular Low-Density Parity-Check Code Decoder Tong Zhang ECSE Det., RPI Troy, NY tzhang@ecse.ri.edu Keshab K. Parhi ECE Det., Univ. of Minnesota Minneaolis, MN arhi@ece.umn.edu

More information

10. Multiprocessor Scheduling (Advanced)

10. Multiprocessor Scheduling (Advanced) 10. Multirocessor Scheduling (Advanced) Oerating System: Three Easy Pieces AOS@UC 1 Multirocessor Scheduling The rise of the multicore rocessor is the source of multirocessorscheduling roliferation. w

More information

Shuigeng Zhou. May 18, 2016 School of Computer Science Fudan University

Shuigeng Zhou. May 18, 2016 School of Computer Science Fudan University Query Processing Shuigeng Zhou May 18, 2016 School of Comuter Science Fudan University Overview Outline Measures of Query Cost Selection Oeration Sorting Join Oeration Other Oerations Evaluation of Exressions

More information

Space-efficient Region Filling in Raster Graphics

Space-efficient Region Filling in Raster Graphics "The Visual Comuter: An International Journal of Comuter Grahics" (submitted July 13, 1992; revised December 7, 1992; acceted in Aril 16, 1993) Sace-efficient Region Filling in Raster Grahics Dominik Henrich

More information

A Parallel Algorithm for Constructing Obstacle-Avoiding Rectilinear Steiner Minimal Trees on Multi-Core Systems

A Parallel Algorithm for Constructing Obstacle-Avoiding Rectilinear Steiner Minimal Trees on Multi-Core Systems A Parallel Algorithm for Constructing Obstacle-Avoiding Rectilinear Steiner Minimal Trees on Multi-Core Systems Cheng-Yuan Chang and I-Lun Tseng Deartment of Comuter Science and Engineering Yuan Ze University,

More information

Parallel Construction of Multidimensional Binary Search Trees. Ibraheem Al-furaih, Srinivas Aluru, Sanjay Goil Sanjay Ranka

Parallel Construction of Multidimensional Binary Search Trees. Ibraheem Al-furaih, Srinivas Aluru, Sanjay Goil Sanjay Ranka Parallel Construction of Multidimensional Binary Search Trees Ibraheem Al-furaih, Srinivas Aluru, Sanjay Goil Sanjay Ranka School of CIS and School of CISE Northeast Parallel Architectures Center Syracuse

More information

Learning Motion Patterns in Crowded Scenes Using Motion Flow Field

Learning Motion Patterns in Crowded Scenes Using Motion Flow Field Learning Motion Patterns in Crowded Scenes Using Motion Flow Field Min Hu, Saad Ali and Mubarak Shah Comuter Vision Lab, University of Central Florida {mhu,sali,shah}@eecs.ucf.edu Abstract Learning tyical

More information

Distributed Systems (5DV147)

Distributed Systems (5DV147) Distributed Systems (5DV147) Mutual Exclusion and Elections Fall 2013 1 Processes often need to coordinate their actions Which rocess gets to access a shared resource? Has the master crashed? Elect a new

More information

TOPP Probing of Network Links with Large Independent Latencies

TOPP Probing of Network Links with Large Independent Latencies TOPP Probing of Network Links with Large Indeendent Latencies M. Hosseinour, M. J. Tunnicliffe Faculty of Comuting, Information ystems and Mathematics, Kingston University, Kingston-on-Thames, urrey, KT1

More information

Statistical Detection for Network Flooding Attacks

Statistical Detection for Network Flooding Attacks Statistical Detection for Network Flooding Attacks C. S. Chao, Y. S. Chen, and A.C. Liu Det. of Information Engineering, Feng Chia Univ., Taiwan 407, OC. Email: cschao@fcu.edu.tw Abstract In order to meet

More information

Privacy Preserving Moving KNN Queries

Privacy Preserving Moving KNN Queries Privacy Preserving Moving KNN Queries arxiv:4.76v [cs.db] 4 Ar Tanzima Hashem Lars Kulik Rui Zhang National ICT Australia, Deartment of Comuter Science and Software Engineering University of Melbourne,

More information

The Anubis Service. Paul Murray Internet Systems and Storage Laboratory HP Laboratories Bristol HPL June 8, 2005*

The Anubis Service. Paul Murray Internet Systems and Storage Laboratory HP Laboratories Bristol HPL June 8, 2005* The Anubis Service Paul Murray Internet Systems and Storage Laboratory HP Laboratories Bristol HPL-2005-72 June 8, 2005* timed model, state monitoring, failure detection, network artition Anubis is a fully

More information

Ad Hoc Networks. Latency-minimizing data aggregation in wireless sensor networks under physical interference model

Ad Hoc Networks. Latency-minimizing data aggregation in wireless sensor networks under physical interference model Ad Hoc Networks (4) 5 68 Contents lists available at SciVerse ScienceDirect Ad Hoc Networks journal homeage: www.elsevier.com/locate/adhoc Latency-minimizing data aggregation in wireless sensor networks

More information

Skip List Based Authenticated Data Structure in DAS Paradigm

Skip List Based Authenticated Data Structure in DAS Paradigm 009 Eighth International Conference on Grid and Cooerative Comuting Ski List Based Authenticated Data Structure in DAS Paradigm Jieing Wang,, Xiaoyong Du,. Key Laboratory of Data Engineering and Knowledge

More information

EE678 Application Presentation Content Based Image Retrieval Using Wavelets

EE678 Application Presentation Content Based Image Retrieval Using Wavelets EE678 Alication Presentation Content Based Image Retrieval Using Wavelets Grou Members: Megha Pandey megha@ee. iitb.ac.in 02d07006 Gaurav Boob gb@ee.iitb.ac.in 02d07008 Abstract: We focus here on an effective

More information

Optimization of Collective Communication Operations in MPICH

Optimization of Collective Communication Operations in MPICH To be ublished in the International Journal of High Performance Comuting Alications, 5. c Sage Publications. Otimization of Collective Communication Oerations in MPICH Rajeev Thakur Rolf Rabenseifner William

More information

Record Route IP Traceback: Combating DoS Attacks and the Variants

Record Route IP Traceback: Combating DoS Attacks and the Variants Record Route IP Traceback: Combating DoS Attacks and the Variants Abdullah Yasin Nur, Mehmet Engin Tozal University of Louisiana at Lafayette, Lafayette, LA, US ayasinnur@louisiana.edu, metozal@louisiana.edu

More information

10. Parallel Methods for Data Sorting

10. Parallel Methods for Data Sorting 10. Parallel Methods for Data Sorting 10. Parallel Methods for Data Sorting... 1 10.1. Parallelizing Princiles... 10.. Scaling Parallel Comutations... 10.3. Bubble Sort...3 10.3.1. Sequential Algorithm...3

More information

A Novel Iris Segmentation Method for Hand-Held Capture Device

A Novel Iris Segmentation Method for Hand-Held Capture Device A Novel Iris Segmentation Method for Hand-Held Cature Device XiaoFu He and PengFei Shi Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200030, China {xfhe,

More information

A GPU Heterogeneous Cluster Scheduling Model for Preventing Temperature Heat Island

A GPU Heterogeneous Cluster Scheduling Model for Preventing Temperature Heat Island A GPU Heterogeneous Cluster Scheduling Model for Preventing Temerature Heat Island Yun-Peng CAO 1,2,a and Hai-Feng WANG 1,2 1 School of Information Science and Engineering, Linyi University, Linyi Shandong,

More information

AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY

AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY Andrew Lam 1, Steven J.E. Wilton 1, Phili Leong 2, Wayne Luk 3 1 Elec. and Com. Engineering 2 Comuter Science

More information

Extracting Optimal Paths from Roadmaps for Motion Planning

Extracting Optimal Paths from Roadmaps for Motion Planning Extracting Otimal Paths from Roadmas for Motion Planning Jinsuck Kim Roger A. Pearce Nancy M. Amato Deartment of Comuter Science Texas A&M University College Station, TX 843 jinsuckk,ra231,amato @cs.tamu.edu

More information

Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScript Objects

Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScript Objects Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScrit Objects Shiyi Wei and Barbara G. Ryder Deartment of Comuter Science, Virginia Tech, Blacksburg, VA, USA. {wei,ryder}@cs.vt.edu

More information

GDP: Using Dataflow Properties to Accurately Estimate Interference-Free Performance at Runtime

GDP: Using Dataflow Properties to Accurately Estimate Interference-Free Performance at Runtime GDP: Using Dataflow Proerties to Accurately Estimate Interference-Free Performance at Runtime Magnus Jahre Deartment of Comuter Science Norwegian University of Science and Technology (NTNU) Email: magnus.jahre@ntnu.no

More information

Phase Transitions in Interconnection Networks with Finite Buffers

Phase Transitions in Interconnection Networks with Finite Buffers Abstract Phase Transitions in Interconnection Networks with Finite Buffers Yelena Rykalova Boston University rykalova@bu.edu Lev Levitin Boston University levitin@bu.edu This aer resents theoretical models

More information

Face Recognition Based on Wavelet Transform and Adaptive Local Binary Pattern

Face Recognition Based on Wavelet Transform and Adaptive Local Binary Pattern Face Recognition Based on Wavelet Transform and Adative Local Binary Pattern Abdallah Mohamed 1,2, and Roman Yamolskiy 1 1 Comuter Engineering and Comuter Science, University of Louisville, Louisville,

More information

Fast Distributed Process Creation with the XMOS XS1 Architecture

Fast Distributed Process Creation with the XMOS XS1 Architecture Communicating Process Architectures 20 P.H. Welch et al. (Eds.) IOS Press, 20 c 20 The authors and IOS Press. All rights reserved. Fast Distributed Process Creation with the XMOS XS Architecture James

More information

The VEGA Moderately Parallel MIMD, Moderately Parallel SIMD, Architecture for High Performance Array Signal Processing

The VEGA Moderately Parallel MIMD, Moderately Parallel SIMD, Architecture for High Performance Array Signal Processing The VEGA Moderately Parallel MIMD, Moderately Parallel SIMD, Architecture for High Performance Array Signal Processing Mikael Taveniku 2,3, Anders Åhlander 1,3, Magnus Jonsson 1 and Bertil Svensson 1,2

More information

Lecture 2: Fixed-Radius Near Neighbors and Geometric Basics

Lecture 2: Fixed-Radius Near Neighbors and Geometric Basics structure arises in many alications of geometry. The dual structure, called a Delaunay triangulation also has many interesting roerties. Figure 3: Voronoi diagram and Delaunay triangulation. Search: Geometric

More information

A Reconfigurable Architecture for Quad MAC VLIW DSP

A Reconfigurable Architecture for Quad MAC VLIW DSP A Reconfigurable Architecture for Quad MAC VLIW DSP Sangwook Kim, Sungchul Yoon, Jaeseuk Oh, Sungho Kang Det. of Electrical & Electronic Engineering, Yonsei University 132 Shinchon-Dong, Seodaemoon-Gu,

More information

Distributed Estimation from Relative Measurements in Sensor Networks

Distributed Estimation from Relative Measurements in Sensor Networks Distributed Estimation from Relative Measurements in Sensor Networks #Prabir Barooah and João P. Hesanha Abstract We consider the roblem of estimating vectorvalued variables from noisy relative measurements.

More information

Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-On-Chip Platform

Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-On-Chip Platform Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-On-Chi Platform Uzi Vishkin George C. Caragea Bryant Lee Aril 2006 University of Maryland, College Park, MD 20740 UMIACS-TR

More information

10 File System Mass Storage Structure Mass Storage Systems Mass Storage Structure Mass Storage Structure FILE SYSTEM 1

10 File System Mass Storage Structure Mass Storage Systems Mass Storage Structure Mass Storage Structure FILE SYSTEM 1 10 File System 1 We will examine this chater in three subtitles: Mass Storage Systems OERATING SYSTEMS FILE SYSTEM 1 File System Interface File System Imlementation 10.1.1 Mass Storage Structure 3 2 10.1

More information

Simulation Modelling Practice and Theory

Simulation Modelling Practice and Theory Simulation Modelling Practice and Theory 19 (2011) 494 515 Contents lists available at ScienceDirect Simulation Modelling Practice and Theory journal homeage: www.elsevier.com/locate/simat Simulating and

More information

Mitigating the Impact of Decompression Latency in L1 Compressed Data Caches via Prefetching

Mitigating the Impact of Decompression Latency in L1 Compressed Data Caches via Prefetching Mitigating the Imact of Decomression Latency in L1 Comressed Data Caches via Prefetching by Sean Rea A thesis resented to Lakehead University in artial fulfillment of the requirement for the degree of

More information

Modeling Reliable Broadcast of Data Buffer over Wireless Networks

Modeling Reliable Broadcast of Data Buffer over Wireless Networks JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 3, 799-810 (016) Short Paer Modeling Reliable Broadcast of Data Buffer over Wireless Networs Deartment of Comuter Engineering Yeditee University Kayışdağı,

More information

Using Standard AADL for COMPASS

Using Standard AADL for COMPASS Using Standard AADL for COMPASS (noll@cs.rwth-aachen.de) AADL Standards Meeting Aachen, Germany; July 5 8, 06 Overview Introduction SLIM Language Udates COMPASS Develoment Roadma Fault Injections Parametric

More information

Wavelet Based Statistical Adapted Local Binary Patterns for Recognizing Avatar Faces

Wavelet Based Statistical Adapted Local Binary Patterns for Recognizing Avatar Faces Wavelet Based Statistical Adated Local Binary atterns for Recognizing Avatar Faces Abdallah A. Mohamed 1, 2 and Roman V. Yamolskiy 1 1 Comuter Engineering and Comuter Science, University of Louisville,

More information

Autonomic Physical Database Design - From Indexing to Multidimensional Clustering

Autonomic Physical Database Design - From Indexing to Multidimensional Clustering Autonomic Physical Database Design - From Indexing to Multidimensional Clustering Stehan Baumann, Kai-Uwe Sattler Databases and Information Systems Grou Technische Universität Ilmenau, Ilmenau, Germany

More information

Tracking Multiple Targets Using a Particle Filter Representation of the Joint Multitarget Probability Density

Tracking Multiple Targets Using a Particle Filter Representation of the Joint Multitarget Probability Density Tracing Multile Targets Using a Particle Filter Reresentation of the Joint Multitarget Probability Density Chris Kreucher, Keith Kastella, Alfred Hero This wor was suorted by United States Air Force contract

More information

A DEA-bases Approach for Multi-objective Design of Attribute Acceptance Sampling Plans

A DEA-bases Approach for Multi-objective Design of Attribute Acceptance Sampling Plans Available online at htt://ijdea.srbiau.ac.ir Int. J. Data Enveloment Analysis (ISSN 2345-458X) Vol.5, No.2, Year 2017 Article ID IJDEA-00422, 12 ages Research Article International Journal of Data Enveloment

More information

AUTOMATIC 3D SURFACE RECONSTRUCTION BY COMBINING STEREOVISION WITH THE SLIT-SCANNER APPROACH

AUTOMATIC 3D SURFACE RECONSTRUCTION BY COMBINING STEREOVISION WITH THE SLIT-SCANNER APPROACH AUTOMATIC 3D SURFACE RECONSTRUCTION BY COMBINING STEREOVISION WITH THE SLIT-SCANNER APPROACH A. Prokos 1, G. Karras 1, E. Petsa 2 1 Deartment of Surveying, National Technical University of Athens (NTUA),

More information

Patterned Wafer Segmentation

Patterned Wafer Segmentation atterned Wafer Segmentation ierrick Bourgeat ab, Fabrice Meriaudeau b, Kenneth W. Tobin a, atrick Gorria b a Oak Ridge National Laboratory,.O.Box 2008, Oak Ridge, TN 37831-6011, USA b Le2i Laboratory Univ.of

More information

Leak Detection Modeling and Simulation for Oil Pipeline with Artificial Intelligence Method

Leak Detection Modeling and Simulation for Oil Pipeline with Artificial Intelligence Method ITB J. Eng. Sci. Vol. 39 B, No. 1, 007, 1-19 1 Leak Detection Modeling and Simulation for Oil Pieline with Artificial Intelligence Method Pudjo Sukarno 1, Kuntjoro Adji Sidarto, Amoranto Trisnobudi 3,

More information

Distributed Algorithms

Distributed Algorithms Course Outline With grateful acknowledgement to Christos Karamanolis for much of the material Jeff Magee & Jeff Kramer Models of distributed comuting Synchronous message-assing distributed systems Algorithms

More information

Analytical Modelling and Message Delay Performance Evaluation of the IEEE MAC Protocol

Analytical Modelling and Message Delay Performance Evaluation of the IEEE MAC Protocol Analytical Modelling and Message Delay Performance Evaluation of the IEEE 8.6 MAC Protocol Luís F. M. de Moraes and Danielle Loes F. G. Vieira Laboratório de Redes de Alta Velocidade - Universidade Federal

More information

Texture Mapping with Vector Graphics: A Nested Mipmapping Solution

Texture Mapping with Vector Graphics: A Nested Mipmapping Solution Texture Maing with Vector Grahics: A Nested Mimaing Solution Wei Zhang Yonggao Yang Song Xing Det. of Comuter Science Det. of Comuter Science Det. of Information Systems Prairie View A&M University Prairie

More information

A Comprehensive Minimum Energy Routing Scheme for Wireless Ad hoc Networks

A Comprehensive Minimum Energy Routing Scheme for Wireless Ad hoc Networks A Comrehensive Minimum Energy Routing Scheme for Wireless Ad hoc Networks Jinhua Zhu, Chunming Qiao and Xin Wang Deartment of Comuter Science and Engineering State University of New York at Buffalo, Buffalo,

More information

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 133 CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 6.1 INTRODUCTION As the era of a billion transistors on a one chip approaches, a lot of Processing Elements (PEs) could be located

More information

Matlab Virtual Reality Simulations for optimizations and rapid prototyping of flexible lines systems

Matlab Virtual Reality Simulations for optimizations and rapid prototyping of flexible lines systems Matlab Virtual Reality Simulations for otimizations and raid rototying of flexible lines systems VAMVU PETRE, BARBU CAMELIA, POP MARIA Deartment of Automation, Comuters, Electrical Engineering and Energetics

More information

An empirical analysis of loopy belief propagation in three topologies: grids, small-world networks and random graphs

An empirical analysis of loopy belief propagation in three topologies: grids, small-world networks and random graphs An emirical analysis of looy belief roagation in three toologies: grids, small-world networks and random grahs R. Santana, A. Mendiburu and J. A. Lozano Intelligent Systems Grou Deartment of Comuter Science

More information

Auto-Tuning Distributed-Memory 3-Dimensional Fast Fourier Transforms on the Cray XT4

Auto-Tuning Distributed-Memory 3-Dimensional Fast Fourier Transforms on the Cray XT4 Auto-Tuning Distributed-Memory 3-Dimensional Fast Fourier Transforms on the Cray XT4 M. Gajbe a A. Canning, b L-W. Wang, b J. Shalf, b H. Wasserman, b and R. Vuduc, a a Georgia Institute of Technology,

More information

FIELD-programmable gate arrays (FPGAs) are quickly

FIELD-programmable gate arrays (FPGAs) are quickly This article has been acceted for inclusion in a future issue of this journal Content is final as resented, with the excetion of agination IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

More information

PREDICTING LINKS IN LARGE COAUTHORSHIP NETWORKS

PREDICTING LINKS IN LARGE COAUTHORSHIP NETWORKS PREDICTING LINKS IN LARGE COAUTHORSHIP NETWORKS Kevin Miller, Vivian Lin, and Rui Zhang Grou ID: 5 1. INTRODUCTION The roblem we are trying to solve is redicting future links or recovering missing links

More information

PacketScore: A Statistical Packet Filtering Scheme against Distributed Denial-of-Service Attacks

PacketScore: A Statistical Packet Filtering Scheme against Distributed Denial-of-Service Attacks PacketScore: A Statistical Packet Filtering Scheme against Distributed Denial-of-Service Attacks Yoohwan Kim Wing Cheong Lau Mooi Choo Chuah H. Jonathan Chao School of Comuter Science Det. of Information

More information

SEARCH ENGINE MANAGEMENT

SEARCH ENGINE MANAGEMENT e-issn 2455 1392 Volume 2 Issue 5, May 2016. 254 259 Scientific Journal Imact Factor : 3.468 htt://www.ijcter.com SEARCH ENGINE MANAGEMENT Abhinav Sinha Kalinga Institute of Industrial Technology, Bhubaneswar,

More information

Improved heuristics for the single machine scheduling problem with linear early and quadratic tardy penalties

Improved heuristics for the single machine scheduling problem with linear early and quadratic tardy penalties Imroved heuristics for the single machine scheduling roblem with linear early and quadratic tardy enalties Jorge M. S. Valente* LIAAD INESC Porto LA, Faculdade de Economia, Universidade do Porto Postal

More information

Construction of Irregular QC-LDPC Codes in Near-Earth Communications

Construction of Irregular QC-LDPC Codes in Near-Earth Communications Journal of Communications Vol. 9, No. 7, July 24 Construction of Irregular QC-LDPC Codes in Near-Earth Communications Hui Zhao, Xiaoxiao Bao, Liang Qin, Ruyan Wang, and Hong Zhang Chongqing University

More information

An Efficient End-to-End QoS supporting Algorithm in NGN Using Optimal Flows and Measurement Feed-back for Ubiquitous and Distributed Applications

An Efficient End-to-End QoS supporting Algorithm in NGN Using Optimal Flows and Measurement Feed-back for Ubiquitous and Distributed Applications An Efficient End-to-End QoS suorting Algorithm in NGN Using Otimal Flows and Measurement Feed-back for Ubiquitous and Distributed Alications Se Youn Ban, Seong Gon Choi 2, Jun Kyun Choi Information and

More information

Tiling for Performance Tuning on Different Models of GPUs

Tiling for Performance Tuning on Different Models of GPUs Tiling for Performance Tuning on Different Models of GPUs Chang Xu Deartment of Information Engineering Zhejiang Business Technology Institute Ningbo, China colin.xu198@gmail.com Steven R. Kirk, Samantha

More information

Real-Time Streaming of Point-Based 3D Video

Real-Time Streaming of Point-Based 3D Video Real-Time Streaming of Point-Based 3D Video Edouard Lamboray Stehan Würmlin Markus Gross Comuter Grahics Laboratory Swiss Federal Institute of Technology (ETH) Zurich, Switzerland {lamboray, wuermlin,

More information

Fuzzy Logic Controller of Random Early Detection based on Average Queue Length and Packet Loss Rate

Fuzzy Logic Controller of Random Early Detection based on Average Queue Length and Packet Loss Rate Fuzzy Logic Controller of Random Early etection based on Average Queue Length and acket Loss Rate Hussein Abdel-jaber, 2 Mohamed Mahafzah, 2 Fadi Thabtah, Mike Woodward eartment of Comuting, University

More information

PRO: a Model for Parallel Resource-Optimal Computation

PRO: a Model for Parallel Resource-Optimal Computation PRO: a Model for Parallel Resource-Otimal Comutation Assefaw Hadish Gebremedhin Isabelle Guérin Lassous Jens Gustedt Jan Arne Telle Abstract We resent a new arallel comutation model that enables the design

More information

A joint multi-layer planning algorithm for IP over flexible optical networks

A joint multi-layer planning algorithm for IP over flexible optical networks A joint multi-layer lanning algorithm for IP over flexible otical networks Vasileios Gkamas, Konstantinos Christodoulooulos, Emmanouel Varvarigos Comuter Engineering and Informatics Deartment, University

More information