Tailoring Pipeline Bypassing and Functional Unit Mapping to Application in Clustered VLIW Architectures


Tailoring Pipeline Bypassing and Functional Unit Mapping to Application in Clustered VLIW Architectures

Marcio Buss, Rodolfo Azevedo, Paulo Centoducatte and Guido Araujo
IC - UNICAMP, Cx. Postal 6176, Campinas, SP, Brazil
{marciobuss, rjazevedo, ducatte, guido}@ic.unicamp.br

ABSTRACT

In this paper we describe a design exploration methodology for clustered VLIW architectures. The central idea of this work is a set of three techniques aimed at reducing the cost of expensive inter-cluster copy operations. Instruction scheduling is performed using a list-scheduling algorithm that stores operand chains into the same register file. Functional units are assigned to clusters based on the application inter-cluster communication pattern. Finally, a careful insertion of pipeline bypasses is used to increase the number of data-dependencies that can be satisfied by pipeline register operands. Experimental results, using the SPEC95 benchmarks and the IMPACT compiler, reveal a substantial reduction in the number of copies between clusters.

1 INTRODUCTION

The problem of instruction partitioning/scheduling for clustered VLIW architectures has earned considerable attention recently, due to the small area and improved register file latency achieved by these architectures [5]. Register file area/latency is proportional to O(n^2)/O(log m), where n is the total number of input/output ports, and m the number of read-ports. Such features of clustered VLIW architectures are particularly relevant in the design of highly constrained embedded systems, where high performance, reduced die size and low power consumption are premium design goals.

In this paper we describe a design exploration methodology for clustered VLIW architectures. Instruction scheduling is performed using a list-scheduling algorithm that stores chains of operands into the same register file. Functional units are assigned to clusters based on the application inter-cluster communication pattern. Finally, pipeline bypasses are inserted to increase the number of
data-dependencies which can be satisfied by pipeline register operands.

This paper is divided as follows. Section 2 shows some prior art. Section 3 describes the architectural model adopted throughout the paper. Section 4 discusses how operations are scheduled and assigned to functional units. The partitioning of functional units into clusters is discussed in Section 5. Finally, Section 6 shows how functional units are assigned to physical clusters. The SPEC CINT95 and CFP95 benchmarks and the IMPACT compiler [10] were used to evaluate the performance of this strategy (Section 7). In Section 8 we conclude the work.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CASES 2001, November 16-17, 2001, Atlanta, Georgia, USA. Copyright 2001 ACM. $5.00.

Figure 1: Clustered VLIW architecture model with inter-cluster bypass (register files connected by an inter-cluster copy-bus and bypass lines).

2 RELATED WORK

Clustered VLIW architectures have been extensively studied in the literature. The assignment of operation traces to clusters was originally studied in the context of the Bulldog [6] and Multiflow Trace [9] compilers. Separate partitioning and scheduling has been proposed by Capitanio et al. [4] using a limited-connectivity architecture. Ozer et al. [14, 16] integrate partitioning and scheduling in a single phase using the Unified Assign and Schedule (UAS) algorithm. A variation of UAS and modulo scheduling has been proposed by Sanchez et al. [18, 19] as a way to assign different loop iterations to separate clusters. Fisher et al. [7] proposed a Partial Component Clustering technique to divide Data-Flow Graph (DFG) components into clusters in order to avoid copy operations along DFG
critical paths. Ozer and Conte [15] introduced an optimal cluster scheduling for a VLIW machine based on integer linear programming. Their approach is suitable to help the search for a schedule lower

bound, and as a way to evaluate the effectiveness of heuristic-based schemes. Fernandes et al. [8] proposed a queue-based register file to pass operands between clusters. Architectural exploration and VLIW customization for a particular application has been studied by Jacome et al. [11] and Rau et al. [17]. Ahuja et al. [3] showed that the number of forwarding paths in a scalar processor could be reduced without a great performance loss. Unfortunately, not much work has been performed on simultaneously tailoring cluster partitioning and pipeline bypass structures to a specific application.

3 ARCHITECTURE MODEL

The architecture model used in this paper (Figure 1) is a pipelined clustered VLIW architecture, where each cluster is formed by a set of one or more homogeneous functional units (FUs), a multi-ported register file and an inter-cluster data transfer bus (called copy-bus). This model is similar to the one described by Capitanio et al. [5]. Contrary to the work in [5], the copy-bus is driven by the output of the functional unit and not by the output of the register file. By doing so, a copy operation can be scheduled to copy the result of an operation at the output of some FU directly to the register file of another cluster through the copy-bus.

Consider, for example, two dependent operations A and B (B depends on A's result) assigned to datapaths 1 and 2 in two distinct clusters (Figure 2). Figure 3 shows the pipeline timing diagram of these operations. Assume that the result of operation A is available in the EX/MEM pipeline register at the end of stage EX in datapath 1. A copy operation following A in datapath 1 can be used to move A's result to the copy-bus, just in time to be written, during the ID stage, into the register file of cluster 2 (solid arrow). This is not possible with the approach used in [5], which requires one extra NOP operation to transfer the data to cluster 2's register file. The presence of the copy-bus affects the final register file design, but its impact is much smaller than the benefits gained by reducing the number
of the read-ports [5]. Using a heuristic from [5], we assume that the width of the copy-bus is equal to half the number of FUs per cluster, i.e., in the best case only half of the FUs in one cluster can simultaneously execute copies to other clusters. One cluster can receive copies from all other clusters, provided the maximum constraint above is met. Homogeneous FUs have been used for the sake of simplicity; the technique applies to heterogeneous units as well.

Abnous and Bagherzadeh [1, 2] studied some of the design issues that arise in the pipeline structure and bypassing mechanism of pipelined clustered VLIW processors. In our work we use a few bypassing lines to forward operation results stored in pipeline registers to other datapaths. Pipeline bypasses can be added between datapaths within the same cluster or between datapaths in distinct clusters. The goal of inserting a bypass interconnection between two datapaths inside the same cluster is to reduce the number of NOP operations required to solve the data-hazard between the dependent instructions in the datapaths. By assigning a bypass interconnection between two datapaths in distinct clusters, we are also reducing the number of copy operations required to use the copy-bus.

Figure 2: Bypass interconnection between the datapaths of two clusters.

Figure 3: Using inter-cluster forwarding to solve inter-cluster dependencies. A bypass from EX/MEM in datapath 1 forwards the result of instruction A to instruction B in the EX stage of datapath 2; instruction B reads the result of instruction A from the copy-bus.

A copy operation must be issued by the compiler if: (a) no bypass exists between the two datapaths and there is (at least one) data-hazard between the operations in these datapaths; (b) a bypass exists between them, but at least one of the uses of the data is so far from its definition that the bypass cannot
satisfy the dependency. Consider again datapaths 1 and 2 and dependent operations A and B in Figure 2. Moreover, assume that B has been scheduled two slots after A. In this case, the result of A can be forwarded from the EX/MEM register in datapath 1 directly to the ID/EX register of datapath 2 (dotted line) through the bypass interconnect in Figure 2. We are assuming that one bypass interconnection between two datapaths has as many lines as those required to exchange operands (in both directions) between the stages of the datapath pipelines.

The area needed by a bypass interconnection between two pipelines is proportional to the number of comparators required to detect the data-hazards between them. This cost becomes very large if bypasses are allowed between all pipeline pairs, in which case it is proportional to dn^2, with n the number of FUs and d the depth of the pipeline. Instead of allowing full bypassing between all datapaths, we insert only a few carefully chosen bypasses between heavily communicating datapaths, aiming at reducing the inter-cluster communication. These interconnections are selected based on the communication pattern of the application. Pipeline bypasses have a reasonably small

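The two conditions above, under which the compiler must issue an explicit inter-cluster copy, can be sketched as a small predicate. This is an illustrative sketch, not the paper's implementation; in particular, representing the bypass structure as a map from a datapath pair to its maximum forwarding distance (`reach`, in scheduling slots) is our assumption.

```python
def needs_copy(def_dp, use_dp, distance, bypasses):
    """True when an explicit copy operation must be issued for a
    data-dependency defined on datapath `def_dp`, used on datapath
    `use_dp`, with the two operations scheduled `distance` slots apart.
    `bypasses` maps frozenset({dp_a, dp_b}) -> max forwardable distance."""
    if def_dp == use_dp:
        return False                  # resolved inside the same pipeline
    reach = bypasses.get(frozenset((def_dp, use_dp)))
    if reach is None:
        return True                   # case (a): no bypass between the datapaths
    # case (b): the result has already left the pipeline registers
    # by the time the consumer could read it through the bypass
    return distance > reach
```

For example, with a bypass of reach 2 between datapaths 0 and 1, a dependency scheduled two slots apart is forwarded, while one scheduled three slots apart still needs a copy operation.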
impact on the processor cycle time, consisting basically of the delay of the routing lines between datapaths. Thus, for the sake of simplicity, we neglect the impact of bypasses on the cycle time.

Figure 4: IMPACT list scheduling using clustering as a second criterion (DDG, intercommunication table and reservation table).

Our approach is based on three phases. Initially, a variation of list scheduling is used to schedule operations into compacted instructions (Section 4). The algorithm tries to cluster dependent operands into the same FU so as to avoid expensive inter-cluster copy operations. In the second phase (Section 5), we use a partitioning algorithm to assign functional units to clusters, such that the majority of the data dependencies are resolved by the cluster register file. Assigning FUs to clusters based on the application is a central feature of our approach which has not been extensively explored before. Finally, functional units are assigned to physical units inside clusters (Section 6), and bypass interconnections are inserted between the most communicating datapaths.

4 SCHEDULING

Our scheduling approach is a simple extension of the IMPACT compiler list-scheduling algorithm. For a given operation op in the candidate list, IMPACT uses the distance from op to a root of the Data-Dependence Graph (DDG) as the first scheduling criterion, followed by the number of children operations that become candidates if op is scheduled. We use exactly the same algorithm, adding only a small modification to determine which FU will be assigned to execute op. For each candidate operation op removed from the priority list, its FU is determined based on the FUs of its parents in the DDG. If the intersection of the FUs assigned to parents(op) is non-empty, op is assigned the same FU as its parents, if that FU is free. Otherwise, op is assigned the first FU available at the current time step. If the intersection of the FUs in parents(op) is empty, op is
assigned the first free FU, giving a higher priority to the FUs assigned to its parents. The central idea here is to keep the result of an operation in the same register cluster as its operands. By doing so, we avoid increasing the number of inter-cluster copy operations as much as possible (provided that only a few bypasses are inserted).

Consider, for example, the DDG of Figure 4. For the sake of simplicity, we assume in this example that all operations have single-cycle latencies. Moreover, consider that the scheduling priority is such that operations are scheduled in alphabetical order. Initially, operations A-D are assigned to FU1-FU4. After A-D are scheduled, E is the next operation in the working list which is ready to be scheduled. The intersection of the FUs assigned to the parents of E is empty, so E is scheduled to the first free FU that was assigned to one of its parents. Next, F is scheduled, and since it has no parents it is assigned the next free FU. Operation G is then scheduled in the same way, since its parents' FUs are different. The next candidates for scheduling are H, I, J and K, assigned by the same rules. Finally, L is scheduled to the same functional unit as its parents I and K.

Notice from Figure 4 that whenever a data-dependency exists between two operations scheduled to different FUs, some action must be taken to assure that this dependency is satisfied. At this point of our solution, FUs have not been assigned to clusters yet. When FUs are assigned a common register file inside the same cluster, the dependency can be satisfied through the register file, or by some intra-cluster bypass if one exists. On the other hand, if the FUs are located in different clusters, a copy operation will be required if there is no inter-cluster bypass between those two FUs. For example, consider operations J and K, scheduled to different functional units. If those units are assigned to the same cluster, no copy operations will be required to communicate the result of J to K. The same is not true if
these operations are scheduled to FUs in different clusters and there is no bypass between their datapaths. (We assume that intra-pipeline data-hazards are always satisfied.)

In order to evaluate the communication pattern between FUs, we measure the dependencies between each pair of FUs using the communication table shown in Figure 4. Each entry in this table corresponds to the number of data dependencies that need to be satisfied between a pair of FUs. For example, one entry in the table is 2, meaning that the two operations scheduled to one FU (B and F) communicate their results to two operations scheduled to another (E and I).

5 CLUSTER PARTITIONING

After the communication table is computed, our algorithm divides the FUs among clusters such that the most intercommunicating FUs are assigned to the same cluster. Initially, the communication table is reduced to a lower-diagonal matrix, in order to accumulate the dependencies (i, j) and (j, i) into a unique value. As said before in Section 3, one bypass is a bi-directional connection between all stages of datapaths i and j. This is not a requirement of our approach, though, and it can be relaxed if required. The table on the top left corner of Figure 5 shows a reduced communication table. Based on this table, we build a cluster vector the size of the number of FUs. Each entry in this vector stores the number of an FU. The indices of the vector correspond to a physical datapath, and are divided according to the number of clusters. In the case of Figure 5, four functional units FU1-FU4 must be assigned to two clusters, each cluster containing two physical

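The FU-selection rule described above (reuse the parents' FU when they agree and it is free; otherwise take the first free FU, preferring FUs already used by a parent) can be sketched as follows. This is a minimal illustration with assumed names; ties are broken by the order of `free_fus`.

```python
def pick_fu(parent_fus, free_fus):
    """Choose the FU for a candidate operation.
    parent_fus: FU index of each DDG parent of the operation.
    free_fus: FUs idle at the current time step, in priority order.
    Returns the chosen FU, or None if no FU is free."""
    if not free_fus:
        return None
    distinct = set(parent_fus)
    if len(distinct) == 1:
        fu = parent_fus[0]            # non-empty intersection of parents' FUs
        if fu in free_fus:
            return fu                 # keep the result next to its operands
        return free_fus[0]            # parents' FU busy: first available FU
    # empty intersection (or no parents): first free FU,
    # giving priority to FUs already assigned to some parent
    for fu in free_fus:
        if fu in distinct:
            return fu
    return free_fus[0]
```

For instance, an operation whose parents both ran on FU 2 is placed on FU 2 if it is free; an operation with parents on FUs 1 and 3 takes the first free FU among those two, falling back to the first free FU overall.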
datapaths. Cluster 1 contains datapaths 1 and 2, and cluster 2 datapaths 3 and 4.

Figure 5: Selecting the most communicating functional units (panels (a)-(e): the initial partitioning and successive swaps, each with its communication cost).

A variation of the LPK algorithm [12] is then used to swap FUs between clusters, so as to minimize their communication. The communication cost between two clusters, for a given distribution, is the total number of data dependencies that cross the clusters' border. Initially, the algorithm divides the functional units into two sets of clusters. It swaps all possible pairs, one from each set, storing the smallest cost it has seen so far. After all possible exchanges have been tried, the resulting smallest cost gives the best distribution between the two sets of clusters. The algorithm proceeds recursively into each cluster set, until all FUs are assigned a cluster.

Consider, for example, the reduced communication table and cluster vector in Figure 5. The communication between two FUs is represented by a double-headed solid arrow labeled with the cost from the communication table. The cost of the initial partitioning in Figure 5a is the sum of the communication costs between the FUs placed in different clusters. Two FUs (in gray) are then selected for swapping, resulting in a new configuration with a different cost (Figure 5b). The algorithm proceeds exchanging pairs of FUs from the initial partitioning (Figure 5(c-e)) while computing their costs. After all pairs of FUs have been tested, the minimal communication cost is found. The configuration that results in the smallest inter-cluster communication is obtained by the swap shown in Figure 5c.

6 DATAPATH MAPPING

After the scheduling and partitioning tasks described above are finished, operations are associated to FUs and FUs to clusters. To complete the architectural design, FUs must be assigned to
their corresponding physical datapaths, and bypass lines inserted. We do that using the two-step procedure shown in Figure 6. First, each inner-loop communication table is used, in combination with the result of its cluster vector after partitioning, to compute a partial hardwired communication table. This table is a representation of the number of data dependencies between program operands in a particular loop, given the current architecture. Its goal is basically to map each FU in the communication table to its corresponding datapath (and cluster) in the cluster vector.

Figure 6: Mapping the communication tables to hardwired bypass interconnects (one partial hardwired table per inner loop, accumulated into a single hardwired communication table).

For example, at the center of Figure 6, an FU has been assigned to a given index of the cluster vector; hence, its line in the communication table is mapped to the corresponding column of the partial hardwired table. Notice that one partial table emerges for each inner-loop super-block in the program. In the case of Figure 6, three loops were considered. Since the resulting architecture has to execute all of them, we need to take into account the contribution of each loop to the overall inter-cluster communication. This is done, in a second phase, by adding up the partial communication tables into a single hardwired communication table. Each entry in a given partial table is weighted by 10^NL, where NL is the nesting level of the loop corresponding to that table [13]. A better estimate is possible if the loop trip-count can be determined at compile time. The resulting hardwired communication table is then used to determine the pairs of datapaths which will be interconnected with bypass lines. To do that, the entries in the hardwired table are sorted into a priority
list, such that the most communicating pairs of datapaths have a higher priority. Bypass lines are inserted between datapath pairs, the highest-priority pair first. In Figure 6, for example, the highest entry in the hardwired table corresponds to the communication between two datapaths that were assigned to the same cluster, in order to reduce the cost of

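The second phase just described, accumulating the per-loop partial tables and then picking the most communicating datapath pairs, can be sketched as below. Names and the `budget` parameter are illustrative; the weight grows with the loop nesting level NL (here 10**NL, following the weighting the text attributes to the nesting level).

```python
def hardwired_table(partial_tables, nesting_levels, n_dp):
    """Accumulate per-loop partial hardwired tables into a single
    hardwired communication table, weighting each by 10**NL."""
    hw = [[0] * n_dp for _ in range(n_dp)]
    for table, nl in zip(partial_tables, nesting_levels):
        weight = 10 ** nl
        for i in range(n_dp):
            for j in range(n_dp):
                hw[i][j] += weight * table[i][j]
    return hw

def choose_bypasses(hw, budget):
    """Insert bypasses between the most communicating datapath pairs,
    highest weighted dependency count first, up to `budget` bypasses."""
    pairs = [(hw[i][j] + hw[j][i], i, j)        # symmetrise (i,j)/(j,i)
             for i in range(len(hw)) for j in range(i)]
    pairs.sort(reverse=True)
    return [(i, j) for _, i, j in pairs[:budget]]
```

A usage sketch: with two inner loops at nesting levels 1 and 0, the level-1 loop's dependencies are weighted ten times more heavily, so its hottest datapath pair receives the first bypass line.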
inserting copy operations between them. Some of the data-dependencies between these datapaths will be resolved by the common register file of their cluster, but many short-distance dependencies can be satisfied by adding a bypass line between them.

7 EXPERIMENTAL RESULTS

The approach described above was implemented in the IMPACT infrastructure, and the resulting compiler was used to compile eighteen programs from the SPEC CINT95 (6) and CFP95 (5) benchmarks, IMPACT (5) and Miscellaneous (2), as shown in Table 1. The compiler was executed using superblock formation, maximum loop unrolling, and no predication. In our experiments we estimate the number of copy operations and cycles produced by each program across a large number of architecture configurations. Each configuration corresponds to a different combination of the following parameters: (a) number of FUs (from 1 to 16); (b) number of register file clusters (from 1 to the number of FUs); (c) number of bypass interconnections (from 0 to the number of FUs). For the sake of simplicity we adopted homogeneous clusters, i.e., all clusters (CLs) have the same number of FUs.

The goal of the experimental work was to determine the impact of the techniques described in Sections 4, 5 and 6. The experiments were divided into three parts. First, we evaluated the impact of bypass insertion on the cycle count of the programs. In the second part, we studied how cluster partitioning and bypassing affect the number of copy instructions between clusters. In the last set of experiments we evaluate the impact of the scheduling and mapping algorithms.

CINT95: 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 147.vortex
CFP95: 101.tomcatv, 102.swim, 103.su2cor, 107.mgrid, 125.turb3d
IMPACT: fir, kalman, paraffins, dag, eight
MISC: mpegdec, mpegenc

Table 1: Benchmark Programs

The maximum number of bypass interconnections is given by n(n-1)/2, where n is the number of FUs for that configuration. Figures 7 and 8 show the impact, on programs go and su2cor, of adding from 0 up to n(n-1)/2 bypasses (full bypassing between all FUs). All architecture configurations
considered in the following analysis have 16 FUs, and range from 1 to 8 clusters. For program go (Figure 7), we noticed that most of the speed-up was achieved with 8 bypasses (65% for one cluster and 58% for 8 clusters). Only a small difference was noticed when using 16 or more bypasses (7% for one cluster and 6% for 8 clusters, when 16 bypasses are used). For program su2cor (Figure 8) we faced a more complex trade-off. In the first knee of the curve (left side of the figure), when the first bypasses are used, the speed-up was 6% (one cluster) and 75% (8 clusters). This value increases very slowly: adding further bypasses improves the speed-up by only 6% (1 cluster) and 7% (8 clusters). Thus, since the incremental speed-up decreases almost exponentially with the number of bypasses, we restrict the maximum number of bypasses to the number of FUs.

Figure 7: Impact of adding bypasses for program go (16 FUs).

Figure 8: Impact of adding bypasses for program su2cor (16 FUs).

In the second part of our experimental work, we studied the impact of clustering and bypassing on the number of inter-cluster copy instructions. Consider, for example, the graphs in Figure 9a-b, where the number of inter-cluster copy operations for program su2cor is measured using two architectures with 8 and 16 FUs. In those graphs, the vertical (horizontal) axis represents the number of copy operations (clusters). Curves in the graphs have as parameter the number of bypass lines inserted between FUs. The number of copy operations grows with the number of clusters, as expected, given the increase in inter-cluster communication. Nevertheless, notice that as bypass lines are added to the architecture, many copy operations are wiped out of the program.

Table 2: Number of copies for su2cor and go.

In Table 2, a single bypass line removes more than 5% of all copy operations. If
the same program

Figure 9: Number of copy operations for program su2cor on: (a) an 8-FU architecture; (b) a 16-FU architecture; (c) percentage of copy operations removed from su2cor through bypassing, as a function of the clusters/FUs ratio.

Figure 10: Number of copy operations for program go on: (a) an 8-FU architecture; (b) a 16-FU architecture; (c) percentage of copy operations removed from go through bypassing, as a function of the clusters/FUs ratio.

Figure 11: Cycle count for program su2cor in non-clustered and clustered configurations with no bypasses.

Figure 12: Cycle count for program go in non-clustered and clustered configurations with no bypasses.

Figure 13: Impact on the cycle count for program su2cor due to maximum bypassing.

Figure 14: Impact on the cycle count for program go due to maximum bypassing.

runs on an 8 FUs / 8 CLs configuration (Figure 9a), 8 bypasses are required to reduce the number of copy operations by 8%. As more bypasses are added, the law of diminishing returns settles in and the gain saturates. In Figure 9b, for 16 FUs, a large share of the copy operations was removed using 8 bypasses, and 9% using 16 bypasses. In general, we noticed, across all SPEC programs, that the insertion of a single bypass reduces the total number of copy operations by as much as 7%. Notice, for example, that program go repeats the same behavior (Figure 10a-b) as su2cor.

The bypass effectiveness when all configurations run program su2cor is described in Figure 9c. The vertical axis in that graph shows the percentage of copy operations removed with respect to a bypass-free configuration. The horizontal axis represents the ratio CLs/FUs. Notice that it ranges to a maximum of one, since the number of CLs is at most the number of FUs. As mentioned before, increasing the number of bypasses implies a reduction in the number of copy operations. For example, a large percentage of the copy operations is removed when bypasses are inserted in architectures with few clusters. Interestingly enough, the percentage of copy operations removed by bypasses seems
to saturate when the ratio CLs/FUs is around 0.5, i.e., when each cluster has two FUs. The same pattern was also detected in the majority of the combinations of programs and architecture configurations (e.g., program go in Figure 10c). We believe this might have to do with the way data-dependencies are uniformly partitioned across clusters by the binary recursive algorithm described in Section 5. Further experimental work will be required to clarify this finding.

In the third part of our experiments, we studied the impact of the scheduling and clustering algorithms. We plotted, for all programs, the cycle count as a function of the number of FUs and clusters. In order to filter out the effect of bypassing, we considered only configurations with no bypasses. As shown in Figure 11, for program su2cor, the performance of the clustered architectures follows very closely the performance of the non-clustered architecture (1 cluster, 0 bypassing). In other words, our approach is capable of canceling the negative effects of clustering, namely the overhead of the inter-cluster copy instructions. The same can be seen in Figure 12, for program go. The effects of the bypass lines on the cycle count for programs su2cor and go are shown in Figures 13 and 14. The

best combination of clusters and number of bypass lines has been used for each architecture. Surprisingly, the large gains achieved in reducing the number of copy operations in Figures 9 and 10 did not translate into real performance. In general, the benchmark program speed-up due to bypassing was small, of only a few percent. Although it is not clear why, it might be possible that the combination of the cluster partitioning and scheduling algorithms leaves only a small number of copy operations in the code for bypassing. Yet another explanation can be drawn from this finding: not enough ILP is available in the unrolled loop bodies. In this case, the generated instructions could have enough empty slots to hide the latency of most copy operations. Further experimental work will be required to address this issue.

Notice that the cycle count does not take into consideration the benefits achieved by clustering, i.e., a smaller register file and reduced latency. If the register file determines the cycle time of the processor, the curves representing clustered architectures in Figures 11 and 12 will reveal a performance improvement proportional to the reduction in the register file latency. Otherwise, by using our technique, the same performance level of a non-clustered architecture is achieved at a smaller processor cost.

8 CONCLUSIONS AND FUTURE WORK

This paper presents a scheduling and partitioning algorithm for clustered VLIW architectures aimed at reducing the communication cost between datapaths and clusters. This is achieved by assigning highly communicating datapaths to the same register file, while tailoring bypass interconnections to the application. Preliminary experimental results reveal a substantial reduction in the number of inter-cluster copy operations and a potential performance improvement. As the next steps in this project we are considering: (a) to use the data-dependency distance between scheduled operations to improve the communication cost estimate; (b) to insert delay registers into
bypass lines to resolve long-distance data-dependencies.

9 ACKNOWLEDGMENTS

This work was partially supported by research grants from CNPq/NSF Collaborative Research Project 6859/99-7, CNPq research grant 56/97-9, fellowship research awards from CAPES -P-58/ and FAPESP 97/98-, 99/96-8. We also thank the reviewers for their comments.

REFERENCES

[1] A. Abnous and N. Bagherzadeh. Pipelining and bypassing in a VLIW processor. IEEE Trans. on Parallel and Distributed Systems, 5(6):658-664, June 1994.
[2] A. Abnous and N. Bagherzadeh. Architectural design and analysis of a VLIW processor. International Journal of Computers and Electrical Engineering, 1995.
[3] P. S. Ahuja, D. W. Clark, and A. Rogers. The performance impact of incomplete bypassing in processor pipelines. In MICRO-28, 1995.
[4] A. Capitanio, N. Dutt, and A. Nicolau. Design considerations for limited connectivity VLIW architectures. Technical Report TR-92-59, University of California, Irvine, 1992.
[5] A. Capitanio, N. Dutt, and A. Nicolau. Partitioned register files for VLIWs: A preliminary analysis of tradeoffs. In 25th International Symposium on Microarchitecture (MICRO), 1992.
[6] J. R. Ellis. Bulldog: A Compiler for VLIW Architectures. MIT Press, 1986.
[7] P. Faraboschi, G. Desoli, and J. A. Fisher. Clustered instruction-level parallel processors. Technical Report HPL-98-204, HP Labs, USA, 1998.
[8] M. M. Fernandes, J. Llosa, and N. Topham. Partitioned schedules for clustered VLIW architectures. In IEEE/ACM International Parallel Processing Symposium, 1998.
[9] J. A. Fisher. Trace scheduling: A technique for global microcode compaction. IEEE Trans. on Computers, C-30(7):478-490, July 1981.
[10] W. W. Hwu et al. IMPACT advanced compiler technology.
[11] M. F. Jacome, G. de Veciana, and V. Lapinskii. Exploring performance tradeoffs for clustered VLIW ASIPs. In International Conference on Computer-Aided Design, 2000.
[12] C. Lee, C. Park, and M. Kim. Efficient algorithm for graph-partitioning problem using a problem transformation method. Computer-Aided Design, December 1989.
[13] S. S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.
[14] E. Ozer, S. Banerjia, and T. M. Conte. Unified assign and schedule: A new approach to scheduling for clustered register file microarchitectures. In 31st International Symposium on Microarchitecture (MICRO), 1998.
[15] E. Ozer and T. M. Conte. Optimal cluster scheduling for a VLIW machine. Technical report, Dept. of Elec. and Comp. Eng., North Carolina State University, 1998.
[16] E. Ozer and T. M. Conte. Unified cluster assignment and instruction scheduling for clustered VLIW microarchitectures. Technical report, Dept. of Elec. and Comp. Eng., North Carolina State University, 1998.
[17] V. K. R. Rau and S. Aditya. Machine-description driven compilers for EPIC and VLIW processors. Design Automation for Embedded Systems, 1999.
[18] J. Sanchez and A. Gonzalez. The effectiveness of loop unrolling for modulo scheduling in clustered VLIW architectures. In Intl. Conference on Parallel Processing (ICPP), 2000.
[19] J. Sanchez and A. Gonzalez. Instruction scheduling for clustered VLIW architectures. In Intl. Symposium on System Synthesis (ISSS), 2000.

(Remember that predication is not used.)


More information

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing. Basic Concepts Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition

More information

Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing Chapter 12: Indexing and Hashing Database System Concepts, 5th Ed. See www.db-book.com for conditions on re-use Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree

More information

Four Steps of Speculative Tomasulo cycle 0

Four Steps of Speculative Tomasulo cycle 0 HW support for More ILP Hardware Speculative Execution Speculation: allow an instruction to issue that is dependent on branch, without any consequences (including exceptions) if branch is predicted incorrectly

More information

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 14 Very Long Instruction Word Machines

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 14 Very Long Instruction Word Machines ECE 252 / CPS 220 Advanced Computer Architecture I Lecture 14 Very Long Instruction Word Machines Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html

More information

Administrivia. CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) Control Dependencies

Administrivia. CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) Control Dependencies Administrivia CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) HW #3, on memory hierarchy, due Tuesday Continue reading Chapter 3 of H&P Alan Sussman als@cs.umd.edu

More information

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Zhining Huang, Sharad Malik Electrical Engineering Department

More information

Annotated Memory References: A Mechanism for Informed Cache Management

Annotated Memory References: A Mechanism for Informed Cache Management Annotated Memory References: A Mechanism for Informed Cache Management Alvin R. Lebeck, David R. Raymond, Chia-Lin Yang Mithuna S. Thottethodi Department of Computer Science, Duke University http://www.cs.duke.edu/ari/ice

More information

Mapping Vector Codes to a Stream Processor (Imagine)

Mapping Vector Codes to a Stream Processor (Imagine) Mapping Vector Codes to a Stream Processor (Imagine) Mehdi Baradaran Tahoori and Paul Wang Lee {mtahoori,paulwlee}@stanford.edu Abstract: We examined some basic problems in mapping vector codes to stream

More information

Exploiting Idle Floating-Point Resources For Integer Execution

Exploiting Idle Floating-Point Resources For Integer Execution Exploiting Idle Floating-Point Resources For Integer Execution S.Subramanya Sastry Computer Sciences Dept. University of Wisconsin-Madison sastry@cs.wisc.edu Subbarao Palacharla Computer Sciences Dept.

More information

Software Pipelining by Modulo Scheduling. Philip Sweany University of North Texas

Software Pipelining by Modulo Scheduling. Philip Sweany University of North Texas Software Pipelining by Modulo Scheduling Philip Sweany University of North Texas Overview Instruction-Level Parallelism Instruction Scheduling Opportunities for Loop Optimization Software Pipelining Modulo

More information