Application Mapping for Express Channel-Based Networks-on-Chip


Di Zhu, Lizhong Chen, Siyu Yue, and Massoud Pedram
University of Southern California
Los Angeles, California, USA 90089
{dizhu, lizhongc, siyuyue,

Abstract

With the emergence of many-core multiprocessor system-on-chips (MPSoCs), on-chip networks are facing serious challenges in providing fast communication for various tasks and cores. One promising solution shown in recent studies is to add express channels to the network as shortcuts to bypass intermediate routers, thereby reducing packet latency. However, this approach also greatly changes the packet delay estimation and traffic behaviors of the network, both of which have not yet been exploited in existing mapping algorithms. In this paper, we explore the opportunities in optimizing application mapping for express channel-based on-chip networks. Specifically, we derive a new delay model for this type of network, identify its unique characteristics, and propose an efficient heuristic mapping algorithm that increases the bypassing opportunities by reducing unnecessary turns that would otherwise impose the entire router pipeline delay on packets. Simulation results show that the proposed algorithm can achieve a significant reduction in the number of turns and a considerable reduction in the average packet delay.

Keywords: network-on-chip; application mapping; express channels

I. INTRODUCTION

With the integration of tens to possibly a hundred cores on a chip [8][18], multiprocessor system-on-chips (MPSoCs) have been provided with tremendous opportunities for parallel execution. A key challenge of the parallel paradigm is the design of a high-performance on-chip network (a.k.a. OCN or NoC) that can connect various IP blocks or tasks running on different cores. However, as network sizes continue to grow, traditional NoC topologies such as mesh or concentrated mesh [1] have been facing serious performance issues due to their inherent nature of hop-by-hop packet forwarding. A more scalable approach that has received increasing attention is to add express channels [7][12][14] to the tile-based NoCs.
These express channels act as shortcuts between non-neighboring tiles to bypass all intermediate routers, thereby accelerating packet transfer. Nevertheless, the addition of express channels significantly changes the traffic patterns and requires different delay calculation models between tiles. For example, packets on express channels cannot make turns, so packets need to get off the express channels and go through all the router pipeline stages in order to make a turn, which slows down packet transport. These and other new characteristics exhibited in express channel-based networks are not captured or exploited in existing application mapping algorithms, which are responsible for mapping tasks to physical tiles.

In this paper, we investigate the opportunity of optimizing application mapping for express channel-based networks. Specifically, we identify the critical differences between traditional networks and express channel-based networks, derive a new delay model reflecting express channels, mathematically formulate the corresponding application mapping problem, and propose an efficient heuristic mapping algorithm based on key observations of the problem characteristics. The proposed algorithm, Turn Reduction Algorithm for Mapping (TRAM), is able not only to map tasks with large communication rates closer to each other, as achieved by previous algorithms, but also to maximize the alignment of heavily communicating tasks in both rows and columns, thus reducing unnecessary turns that would otherwise impose the long delay of the router pipeline on packets.

This work is supported in part by the Software and Hardware Foundations program of the NSF's Directorate for Computer & Information Science & Engineering.

The rest of the paper is organized as follows. Section II provides more background on express channel-based on-chip networks and motivates the need for new mapping algorithms. Section III formulates the problem, and Section IV explains the details of the proposed TRAM algorithm. Sections V and VI describe the evaluation methodology and present simulation results.
Finally, Section VII concludes the paper.

II. BACKGROUND AND MOTIVATION

A. Express Channel-Based On-chip Networks

While mesh topology has traditionally been used for tile-based NoCs, packets in mesh networks must be forwarded hop-by-hop, which exposes the router delay (e.g., 3 or more cycles) and link delay (e.g., 1 cycle) at every hop to the packet latency. To mitigate the latency problem of mesh, particularly for large networks, concentration [1] (Figure 1(b), CMesh) has been proposed, in which multiple IP blocks or tasks are placed on the same tile to form a task cluster. All tasks in a task cluster occupy one tile and share one router. With a concentration degree of 4, the network diameter can be reduced by half. However, due to layout constraints and the increased router complexity, it is difficult to employ high concentration degrees, which limits the latency reduction achievable through this technique.

As more research is conducted to improve NoC performance, recent studies show the promise of adding express channels on top of concentration to accelerate packet transfer [7][12][14]. Figure 1(c) shows an example of the popular flattened butterfly (FB) topology [12], which adds separate links to connect two non-neighboring tiles directly (e.g., from the top-left tile to the top-right tile). To better utilize the link resources, a network with multi-drop express channels (Figure 1(d), MECS) [7] has been proposed to combine separate links into a unified link with multiple drops, so that no additional input or output ports are needed. Packets are routed on the express channels as much as possible and use non-express channels only if contention occurs. In this way, intermediate routers on the same row or column can be bypassed, resulting in only the link latency. However, in order to change dimension, packets need to get off the express channels and enter the normal router/switch pipeline to make the turn. Also, dimension-order routing is typically used in FB and MECS instead of adaptive routing []. This is because adaptive routing may generate a large number of turns, causing most packets to go through normal routers, which defeats the purpose of adding express channels.

Figure 1. On-chip networks without express channels: (a) Mesh and (b) CMesh, and with express channels: (c) Flattened Butterfly and (d) MECS.

/DATE/ 0 EDAA

B. Related Work

Application mapping is an important component in the design of multiprocessor systems. MPSoC applications such as video encoders/decoders typically consist of many tasks that work collaboratively to perform certain functions. By mapping frequently or heavily communicating tasks to physically close tiles, the average packet delay and power consumption can be greatly reduced. Due to the importance of application mapping, a number of mapping algorithms have been proposed. For example, Hu et al. [9] use graphs to model the characteristics of applications and propose a branch-and-bound algorithm to minimize the communication energy of the mapping. A two-step genetic algorithm is proposed in [15] to map applications onto mesh-based NoCs to optimize task graph execution. Murali et al. focus on minimizing communication delay under bandwidth constraints in [16]. Chen et al. present mechanisms for joint optimization of task scheduling, application mapping, data mapping and routing on NoC-based CMPs [2]. Faruque et al. use a distributed approach based on agents for application mapping and greatly lower the monitoring traffic and computational effort compared to centralized schemes [5]. In [10], Jang et al. formulate the mapping of heterogeneous cores on irregular mesh-based MPSoCs as a mixed-integer programming problem and propose two effective heuristic algorithms.
While the above works are very effective in achieving their corresponding objectives, these algorithms are not able to distinguish the differences in tile communication latency between the two types of networks. For instance, in mesh networks, as long as two tiles (e.g., A and B in Figure 1(b)) have the same Manhattan distance from a source tile (e.g., S in Figure 1(b)), the latencies are the same; whereas in express channel-based networks, the tile reached with fewer turns (e.g., A from S in Figure 1(c)) has shorter latency than the tile reached with more turns (e.g., B from S in Figure 1(c)). Therefore, applying existing mapping algorithms to express channel-based NoCs may result in suboptimal or inefficient mapping solutions.

III. PROBLEM STATEMENT

A. Network, Application, and Average Packet Delay

Several important definitions are given below.

Definition 1 (Network Topology):
1) A CMesh network has a network size of $N \times N$ tiles.
2) Concentration degree $C$ is the number of processing elements (PEs) that can be placed on one tile. Therefore, a CMesh-based MPSoC with a concentration degree of $C$ can hold at most $C \cdot N^2$ PEs.

Definition 2 (Application):
3) An application contains a set of tasks $\{t_1, t_2, \ldots\}$, each executed on one PE. Tasks communicate with each other during execution to exchange data, maintain coherency, etc.
4) A task cluster $tc_i$ is a set of tasks that are grouped together to be placed on one tile of a CMesh network. Concentration degree $C$ indicates that a task cluster contains at most $C$ tasks.

Since the partitioning of tasks into task clusters greatly depends on the specific functionalities and restrictions of each task in a particular application, in this paper we assume the task clusters are given for an application, and focus on the main problem of mapping task clusters to tiles on the NoC.

Definition 3: An application mapping solution is a permutation $\pi$, so that task cluster $tc_i$ is mapped to tile $\pi(i)$.

In order to give a formal definition of average packet delay, we define the communication graph of an application and the tile delay graph of a given NoC topology as follows.
Definition 4: A communication graph $G_C$ is a directed graph in which each vertex $v_i$ represents a task cluster $tc_i$ and each edge $e_{ij}$ denotes the communication from $tc_i$ to $tc_j$. The weight $r_{ij}$ associated with edge $e_{ij}$ denotes the communication rate, i.e., the average number of flits sent from $tc_i$ to $tc_j$ per unit time.

Definition 5: A tile delay graph $G_D$ is a complete directed graph in which each vertex represents a tile. There is an edge between any two vertices (tiles). The weight $d_{ij}$ associated with the edge from tile $i$ to tile $j$ represents the delay from tile $i$ to tile $j$ when following the routing path (e.g., the XY routing path) from $i$ to $j$.

Given that task cluster $tc_i$ is mapped to tile $\pi(i)$, the average packet delay of an application can be defined as follows.

Definition 6: The average packet delay (APD) of an application can be calculated by

$$\mathrm{APD} = \frac{\sum_{i,j} r_{ij}\, d_{\pi(i)\pi(j)}}{\sum_{i,j} r_{ij}}$$

Note that this equation is applicable both to CMesh networks and to networks with express channels. The key difference is the tile delay model used in the tile delay graph of Definition 5, which is discussed next.

B. Delay Models

1) Tile delay model for CMesh networks

Definition 7: Unit-length link delay $t_l$ is the number of cycles (typically 1) between neighboring tiles. Delays for long express channels are proportional to their length. Router delay $t_r$ is the number of cycles a packet takes to go through a router, i.e., the number of router pipeline stages.

In CMesh networks without express channels, each packet has to go through the entire router pipeline for each hop it travels. Therefore the tile delay on a CMesh network without express channels can be calculated by:

$$d_{ij} = (h_{ij} + 1)(t_r + t_c) + h_{ij}\, t_l \qquad (1)$$

where $h_{ij}$ is the Manhattan distance between tiles $i$ and $j$, and $t_c$ is the per-router contention latency, which depends on traffic load. In contemporary NoCs, because of the large link width and the low load of real applications, the value of $t_c$ is usually small (as also observed in our simulations). Also note that this delay model already includes the injection router and the ejection router to account for the end-to-end tile delay.

2) Tile delay model for express channel-based networks

To derive the tile delay model for express channel-based networks, we first define an auxiliary turn function as below.

Definition 8: A turn function is used to identify whether packets sent from tile $i$ to tile $j$ need to make a turn, assuming XY routing:

$$\mathrm{turn}(i,j) = \begin{cases} 0, & \text{if tiles } i \text{ and } j \text{ are on the same row or column} \\ 1, & \text{otherwise} \end{cases}$$

The turn function is crucial in determining the packet delay on express-channel networks. If $i$ and $j$ are on the same row or column, the router of $i$ will directly send packets to the express channel from $i$ to $j$, so that packets only go through two router pipelines (the injection router and the ejection router) before reaching the destination tile. Otherwise, packets are first sent to the router of the turning-point tile, which is in the same column as the destination tile. Packets go through three routers in total in this case.
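The two delay models can be illustrated with a minimal Python sketch. It combines the router counts described above (one router per hop plus the injection router on CMesh; two routers without a turn and three with a turn on express channels) with a link delay proportional to the Manhattan distance. Representing tiles as (row, column) pairs, the function and parameter names, and the default values `t_r = 3`, `t_l = 1` and `t_c = 0` are assumptions made for this example only, and the sketch assumes distinct source and destination tiles.

```python
def manhattan(src, dst):
    """Manhattan distance between two tiles given as (row, col) pairs."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def turn(src, dst):
    """Return 1 if a packet must change dimension under XY routing, else 0."""
    same_row = src[0] == dst[0]
    same_col = src[1] == dst[1]
    return 0 if (same_row or same_col) else 1


def cmesh_delay(src, dst, t_r=3, t_l=1, t_c=0.0):
    """CMesh: every hop pays the full router pipeline (h + 1 routers in total)."""
    h = manhattan(src, dst)
    return (h + 1) * (t_r + t_c) + h * t_l


def express_delay(src, dst, t_r=3, t_l=1, t_c=0.0):
    """Express channels: 2 routers without a turn, 3 routers with a turn."""
    h = manhattan(src, dst)
    return (2 + turn(src, dst)) * (t_r + t_c) + h * t_l


if __name__ == "__main__":
    source = (0, 0)
    for dest in [(0, 3), (3, 0), (1, 2), (2, 1)]:  # all at distance 3
        print(dest, cmesh_delay(source, dest), express_delay(source, dest))
```

With these assumed parameters, all four destinations at Manhattan distance 3 cost 15 cycles in the CMesh model, while the express-channel model gives 9 cycles for the two aligned destinations and 12 cycles for the two that require a turn.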
With the above turn function, the tile delay model from tile $i$ to tile $j$ can be expressed by:

$$d_{ij} = \bigl(2 + \mathrm{turn}(i,j)\bigr)(t_r + t_c) + h_{ij}\, t_l$$

Figure 2 exemplifies the base packet latency from a source tile to all other tiles in a CMesh-based NoC and in an express channel-based network, assuming $t_r = 3$ and $t_l = 1$ (the 3-cycle router follows a canonical pipeline design consisting of virtual channel allocation, switch allocation and switch traversal, with the optimization of look-ahead routing to hide routing computation). Figure 2 highlights why algorithms proposed for CMesh-based NoCs are less effective when applied directly to express channel-based NoCs. In the CMesh delay model, four tiles at the same Manhattan distance from the source are considered to have the same packet delay; whereas in the new delay model with express channels, two of them have 33% larger delays than the other two.

Figure 2. Tile delay of packets from the source tile: (a) tile delay on CMesh; (b) tile delay on MECS.

C. Problem Formulation

With the above definitions and delay models, we can formulate the application mapping problem as follows.

Given:
1) An express channel-based network containing $N \times N$ tiles;
2) The application communication graph $G_C$, with communication rate $r_{ij}$ as the edge weight; and
3) The tile delay graph $G_D$, with delay $d_{ij}$ as the edge weight;

Find: A mapping $\pi$ of task clusters to tiles.

Minimize the average packet delay:

$$\min_{\pi}\ \frac{\sum_{i,j} r_{ij}\, d_{\pi(i)\pi(j)}}{\sum_{i,j} r_{ij}}$$

The above formulated problem has the form of a Quadratic Assignment Problem (QAP). A general QAP is NP-hard [6]. Enumerating all possible solutions is costly even for a simple NoC, not to mention larger networks. However, the special characteristics of the tile delay model of express-channel networks may give us some insights for designing effective heuristic algorithms.

IV. PROPOSED ALGORITHM

In this section, we propose an efficient heuristic algorithm that runs in polynomial time for application mapping in express channel-based networks. The proposed algorithm, Turn Reduction Algorithm for Mapping (TRAM), utilizes the following two observations.
First, as tiles on the same row or column have smaller packet delay, aligning task clusters with large communication rates in the same row or column can effectively reduce both delay and turns. Second, similar to mapping methods on CMesh networks, as the link delay depends linearly on the Manhattan distance between source and destination tiles according to the tile delay model, it is still beneficial to put task clusters as close to each other as possible. TRAM contains three main steps to realize these objectives.

Step 1: Partition task clusters into $N$ sets and place each set on one row of the express-channel network.

The partitioning is based on the Kernighan-Lin (KL) algorithm [11], an efficient heuristic algorithm for solving graph partitioning problems. It attempts to partition a graph into two sets $A$ and $B$ of equal size, such that the sum of edge weights between vertices in the two sets is minimized (min-cut):

$$\min_{A,B:\ |A| = |B|}\ \sum_{i \in A,\ j \in B} \bigl(r_{ij} + r_{ji}\bigr)$$
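The min-cut objective can be illustrated with a small Python sketch. Note that this is a simplified pairwise-swap local search and not the full Kernighan-Lin procedure, which performs passes of tentative swaps with vertex locking; it only shows the objective (cut weight between two equal-size sets) and the idea of improving it by exchanging vertices. The names `cut_weight` and `swap_bisect` and the rate-matrix representation are assumptions for this example.

```python
from itertools import product


def cut_weight(rates, part_a, part_b):
    """Total communication rate crossing the two sets, in both directions."""
    return sum(rates[a][b] + rates[b][a] for a, b in product(part_a, part_b))


def swap_bisect(rates):
    """Split vertices 0..n-1 into two equal halves, then greedily swap pairs.

    A swap is kept only if it strictly lowers the cut weight, so the loop
    terminates; the result is a local optimum, not a guaranteed min-cut.
    """
    n = len(rates)
    part_a = list(range(n // 2))
    part_b = list(range(n // 2, n))
    best = cut_weight(rates, part_a, part_b)
    improved = True
    while improved:
        improved = False
        for i, j in product(range(len(part_a)), range(len(part_b))):
            part_a[i], part_b[j] = part_b[j], part_a[i]      # tentative swap
            cost = cut_weight(rates, part_a, part_b)
            if cost < best:
                best, improved = cost, True                  # keep the swap
            else:
                part_a[i], part_b[j] = part_b[j], part_a[i]  # undo the swap
    return sorted(part_a), sorted(part_b), best


if __name__ == "__main__":
    # Hypothetical rates: clusters 0<->2 and 1<->3 communicate heavily.
    rates = [[0, 1, 10, 0],
             [0, 0, 0, 10],
             [0, 0, 0, 1],
             [0, 0, 0, 0]]
    print(swap_bisect(rates))
```

In this hypothetical example the initial split {0, 1} / {2, 3} cuts both heavy edges; the search ends with {1, 3} / {0, 2}, which keeps each heavily communicating pair in the same set.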

We call the KL algorithm in a hierarchical fashion until we get $N$ sets, each with $N$ task clusters, as shown in Figure 3(a). After each two-way partitioning, we use a heuristic to determine the placement of the two sets. Take the partitioning stage $h = 3$ in Figure 3(a) as an example. We name each pair of sets a KL section (i.e., the KL sections are labeled 1 to 4). The order among these four KL sections is decided at the previous stage, and KL has finished the partitioning in the current four KL sections. The order of the pair of sets within each KL section needs to be determined. Consider KL section 2, which contains the third and fourth sets. Let $R^{high}_3$ denote the total communication rate between the third set and all the sets above KL section 2 (i.e., section 1), and $R^{low}_3$ denote the total communication rate between the third set and all the sets below section 2 (i.e., sections 3 and 4). Similarly, we define $R^{high}_4$ and $R^{low}_4$ for the fourth set. We calculate and compare the differences between the high and low communication rates, i.e., $\Delta_3 = R^{high}_3 - R^{low}_3$ and $\Delta_4 = R^{high}_4 - R^{low}_4$, and then place the set with the higher $\Delta$ in the third row and the other in the fourth row, so that the heavier communication is put closer to the outside of the KL section. The orders in other sections are determined similarly. The complete pseudo code for Step 1 is shown below:

    for h from 1 to log2(N)
        // current number of sections is 2^(h-1)
        // in this iteration we get 2^h sets
        for s from 1 to 2^(h-1)
            in current section s, call KL to get the new (2s-1)-th and (2s)-th sets
            if Δ(2s-1) >= Δ(2s)
                place (2s-1)-th set at (2s-1)-th row
                place (2s)-th set at (2s)-th row
            else
                place (2s)-th set at (2s-1)-th row
                place (2s-1)-th set at (2s)-th row

The running time of each KL call depends on the number of vertices in the graph being partitioned, and the time complexity of Step 1 follows from the master theorem [3].

Figure 3. Three steps of TRAM: (a) Step 1: Row Placement; (b) Step 2: Column Arrangement; (c) Step 3: Column Adjustment.

Step 2: Distribute task clusters in each set to the columns of the network.

The first step fixes the positions of rows, whereas the order of task clusters within each row remains unsolved. In Step 2, we iteratively distribute the task clusters within each row to the columns. The order of task clusters in the first row is randomly assigned; the possible performance loss from this choice can be recovered in Step 3. At the $k$-th iteration, with the task clusters in the first $(k-1)$ rows already placed, the placement of the $N$ task clusters of the $k$-th set is determined so as to minimize the average packet delay, considering the communication rate between the current row and the first $(k-1)$ rows, as shown in Figure 3(b). The above problem at each iteration is an assignment problem: find a one-to-one assignment $\sigma$ of the row's task clusters to columns that minimizes $\sum_i c_{i\sigma(i)}$. In the cost matrix, $c_{ij}$ denotes the APD contributed by $tc_i$ when it is placed at the $j$-th column. The problem is solved optimally by the Hungarian algorithm [13]. The pseudo code for Step 2 is shown below:

    Randomly assign task clusters in the first row to each column;
    for k from 2 to N (the k-th row)
        Calculate the cost matrix;
        Call Hungarian with the cost matrix as input;
        Assign task clusters in the k-th row to each column according to the Hungarian assignment results;

The Hungarian algorithm can achieve a time complexity of $O(N^3)$ for an $N \times N$ cost matrix. The time complexity of Step 2 is determined by calculating the cost matrix and running the Hungarian algorithm once for each of the remaining rows.

Step 3: Rearrange the columns to minimize the link delay of communication traffic on horizontal links.

The process is similar to Step 1, except that each column is treated as a node in the input graph of the KL algorithm. Taking into account all three steps, the proposed algorithm runs in polynomial time.

V. EVALUATION METHODOLOGY

A. Schemes Under Comparison

As a mesh network without concentration has much higher latency than the other structures, we use CMesh as the baseline in order to provide a fairer comparison.
The following six application mapping schemes on CMesh and MECS architectures are compared:
1) MC_CMesh (the baseline): Monte Carlo method on CMesh, which picks the mapping with the smallest latency among a large number of randomly generated mapping solutions based on the CMesh structure;
2) SA_CMesh: simulated annealing algorithm on the CMesh structure;
3) MC_MECS: Monte Carlo method on the MECS structure;
4) SA_MECS: simulated annealing algorithm on the MECS structure using the new tile delay model;
5) SA_CMesh(MECS): the mapping solution is first generated by SA_CMesh and then applied to the MECS structure; and
6) TRAM: our proposed approach.
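The Monte Carlo schemes above can be sketched in a few lines of Python: generate random permutations of task clusters onto tiles, evaluate the average packet delay of each, and keep the best. This is only an illustrative sketch, not the evaluation code used in the paper; the function names, the rate-matrix representation, the trial count, and the delay parameters (`t_r = 3`, `t_l = 1`, no contention) are assumptions for this example.

```python
import random


def express_delay(src, dst, t_r=3, t_l=1):
    """Assumed express-channel delay: 2 routers, or 3 if a turn is needed."""
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    needs_turn = src[0] != dst[0] and src[1] != dst[1]
    return (3 if needs_turn else 2) * t_r + hops * t_l


def apd(rates, mapping, tiles, delay=express_delay):
    """Rate-weighted average delay; mapping[i] is the tile index of cluster i."""
    n = len(rates)
    total = sum(sum(row) for row in rates)
    if total == 0:
        raise ValueError("communication graph has no traffic")
    weighted = sum(rates[i][j] * delay(tiles[mapping[i]], tiles[mapping[j]])
                   for i in range(n) for j in range(n) if rates[i][j])
    return weighted / total


def monte_carlo(rates, n_side, trials=1000, seed=0, delay=express_delay):
    """Return the best of `trials` random mappings on an n_side x n_side grid."""
    rng = random.Random(seed)
    tiles = [(r, c) for r in range(n_side) for c in range(n_side)]
    best_map, best_cost = None, float("inf")
    for _ in range(trials):
        mapping = list(range(len(tiles)))
        rng.shuffle(mapping)                 # one random permutation
        cost = apd(rates, mapping, tiles, delay)
        if cost < best_cost:
            best_map, best_cost = mapping, cost
    return best_map, best_cost


if __name__ == "__main__":
    # Hypothetical 4-cluster application on a 2 x 2 network.
    rates = [[0, 5, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
    print(monte_carlo(rates, n_side=2, trials=200))
```

Passing a different `delay` function (for example, a CMesh model) turns the same loop into the CMesh variant of the scheme, which is why the delay model is a parameter in this sketch.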

Since Monte Carlo and simulated annealing are algorithms that trade off runtime against performance, for a fair comparison the runtime of both algorithms is configured to be roughly the same as the runtime of our proposed algorithm.

Figure 4. Communication graph for mpeg4 and toybox: (a) mpeg4 communication graph; (b) toybox communication graph.

Figure 5. Mapping results of mpeg4 and toybox: (a) mpeg4; (b) toybox.

Figure 6. Normalized average packet delay for eight different applications (mpeg4, toybox, vopd, mms, tgff_r1, tgff_r2, tgff_sp1, tgff_sp2, and their average) under MC_CMesh, SA_CMesh, MC_MECS, SA_CMesh(MECS), SA_MECS and TRAM.

B. Simulation Setup

The proposed TRAM algorithm is evaluated quantitatively under both typical and stressed workloads. These include the traces of four real applications, namely mpeg4, toybox, vopd, and mms, as well as four random task graphs generated by TGFF [4], referred to as tgff_r1, tgff_r2, tgff_sp1 and tgff_sp2. Figure 4 shows the communication rate graphs of mpeg4 and toybox (vopd and mms are omitted here due to space limitations). Each node denotes a task cluster, and the edge width indicates the relative magnitude of the communication rate. The tgff_r1 and tgff_r2 inputs are two random graphs, while tgff_sp1 and tgff_sp2 are two series-parallel graphs formed recursively by joining two sub-graphs in series and in parallel, mimicking the stressed behaviors of multithreaded applications. Collectively, these eight inputs comprise a representative set of MPSoC scenarios.

A configuration with concentration is simulated for the majority of the evaluation. In addition, a larger configuration is also evaluated for the scalability discussion. In the simulation results, the APDs are calculated according to our delay model. Runtime is measured on a machine with an Intel Core processor. NoC power is calculated using the NoC power model DSENT [17]. The unit-length link delay $t_l$ is set to 1 and the router delay $t_r$ is set to 3. For each test case, the contention delay is acquired by feeding the trace into a cycle-accurate NoC simulator.
VI. RESULTS AND ANALYSIS

A. Impact on Performance

We first evaluate the effectiveness of TRAM in reducing turns. Table I compares the percentage of communication traffic that needs to make turns in express-channel networks for different algorithms. It can be seen that the proposed TRAM achieves a significant reduction in this percentage on average compared to the other algorithms. Figure 5 presents the mapping results obtained by TRAM for mpeg4 and toybox. A dashed arrow means that packets from the source tile to the destination tile need to take a turn. When TRAM is used, only a small fraction of the traffic needs to make turns for mpeg4 and toybox. It is worth noting that, while the proposed algorithm optimizes for the number of turns, most of the heavily communicating tasks (as indicated by wider edges) are also mapped close to each other, as can be seen from Figure 5.

The reduced turns and closer physical distances result in a considerable improvement in packet latency. Figure 6 plots the average packet delay for the eight different test cases. Compared to the baseline system, the proposed TRAM algorithm reduces the packet delay considerably on average. TRAM also outperforms SA_CMesh(MECS). This indicates that a mapping solution generated for CMesh-based networks is not optimal when applied to express channel-based networks.

B. Impact on Power Consumption

Although the primary objective is to reduce packet delay, the proposed TRAM is also able to slightly reduce power consumption as a side effect, because the algorithm reduces the number of routers and links through which packets need to travel. Table II shows the dynamic power of the solutions produced by different mapping algorithms on various applications. It can be seen that, even though TRAM does not target power optimization, it still achieves the lowest dynamic power consumption among all schemes.

C. Impact of Pipeline Stages

So far we have assumed a 3-stage router pipeline, which is an optimized version of the canonical 4-stage router.
The tile delay model for express channel-based networks indicates that the number of router pipeline stages may affect the latency of express-channel networks. To assess this impact, Figure 7 compares the mapping results of simulated annealing on CMesh networks, simulated annealing on MECS and the proposed TRAM on MECS while varying the number of pipeline stages ($t_r$). As can be seen, the proposed TRAM is effective across different numbers of pipeline stages. This illustrates that TRAM can be useful in a wide range of networks built from more aggressive or more conservative router architectures.

TABLE I. PERCENTAGE OF TRAFFIC THAT NEEDS TO MAKE TURNS.
Systems (rows): MC_MECS, SA_CMesh(MECS), SA_MECS, TRAM.
Percentage (%) columns: mpeg4, toybox, vopd, mms, tgff_r1, tgff_r2, tgff_sp1, tgff_sp2, Average.

TABLE II. DYNAMIC POWER CONSUMPTION.
Systems (rows): MC_CMesh, SA_CMesh, MC_MECS, SA_CMesh(MECS), SA_MECS, TRAM.
Dynamic Power (mW) columns: mpeg4, toybox, vopd, mms, tgff_r1, tgff_r2, tgff_sp1, tgff_sp2.

Figure 7. Average packet delay (APD, in cycles) as a function of router pipeline stages for SA_CMesh, SA_MECS and TRAM: (a) vopd; (b) mms.

D. Scalability

The previous evaluation uses a fixed number of tasks and concentration degree. To further illustrate the scalability of the proposed algorithm, we generate four larger TGFF configurations with the same concentration degree. Simulation results show that, compared with MC_CMesh and SA_MECS, TRAM is able to reduce the average packet delay under the same runtime. This demonstrates that the proposed TRAM can achieve higher improvement for larger networks, indicating its good scalability.

VII. CONCLUSIONS

Express channel-based networks have been proposed in recent studies as a promising approach to support fast on-chip communication for current and future many-core MPSoCs. However, the characteristics of these new topologies have not been exploited in existing application mapping algorithms. In this paper, we propose an efficient heuristic algorithm to explore the application mapping opportunities in express-channel networks. The proposed TRAM algorithm is able to effectively map tasks with large communication rates closer to each other, and aligns heavily communicating tasks in the same rows or columns to reduce unnecessary turns. Simulation results show a significant reduction in the number of turns and a considerable reduction in average packet delay in the generated mapping solutions.

REFERENCES

[1] Balfour, J., & Dally, W. J. (2006). Design tradeoffs for tiled CMP on-chip networks. In ACM International Conference on Supercomputing.
[2] Chen, G., Li, F., Son, S. W., & Kandemir, M. (2008). Application mapping for chip multiprocessors. In Design Automation Conference.
[3] Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2001). Introduction to algorithms. MIT Press.
[4] Dick, R. P., Rhodes, D. L., & Wolf, W. (1998). TGFF: task graphs for free. In Proceedings of the 6th International Workshop on Hardware/Software Codesign (pp. 97-101). IEEE Computer Society.
[5] Faruque, A., Abdullah, M., Krist, R., & Henkel, J. (2008, June). ADAM: run-time agent-based distributed application mapping for on-chip communication. In Proceedings of the 45th Annual Design Automation Conference (pp. 760-765). ACM.
[6] Garey, M. R., & Johnson, D. S. (1979). Computers and Intractability: A Guide to the Theory of NP-Completeness.
[7] Grot, B., Hestness, J., Keckler, S. W., & Mutlu, O. (2009). Express cube topologies for on-chip interconnects. In International Symposium on High Performance Computer Architecture (pp. 163-174).
[8] Howard, J., Dighe, S., Hoskote, Y., et al. (2010). A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS. In IEEE International Solid-State Circuits Conference (pp. 108-109).
[9] Hu, J., & Marculescu, R. (2003). Energy-aware mapping for tile-based NoC architectures under performance constraints. In Proceedings of the ASP-DAC.
[10] Jang, W., & Pan, D. Z. (2012). A3MAP: Architecture-aware analytic mapping for networks-on-chip. ACM Transactions on Design Automation of Electronic Systems (TODAES), 17(3), 26.
[11] Kernighan, B. W., & Lin, S. (1970). An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal, 49.
[12] Kim, J., Balfour, J., & Dally, W. J. (2007). Flattened butterfly topology for on-chip networks. In IEEE/ACM International Symposium on Microarchitecture (pp. 172-182).
[13] Kuhn, H. W. (2005). The Hungarian method for the assignment problem. Naval Research Logistics.
[14] Kumar, A., Peh, L.-S., Kundu, P., & Jha, N. K. (2007). Express virtual channels: Towards the ideal interconnection fabric. In IEEE International Symposium on Computer Architecture.
[15] Lei, T., & Kumar, S. (2003, September). A two-step genetic algorithm for mapping task graphs to a network on chip architecture. In Proceedings of the Euromicro Symposium on Digital System Design (pp. 180-187). IEEE.
[16] Murali, S., & De Micheli, G. (2004). Bandwidth-constrained mapping of cores onto NoC architectures. In Proceedings of the Conference on Design, Automation and Test in Europe.
[17] Sun, C., Chen, C., Kurian, G., et al. (2012). DSENT - A Tool Connecting Emerging Photonics with Electronics for Opto-Electronic Networks-on-Chip Modeling. In International Symposium on Networks-on-Chip.
[18] Tilera Corporation.
