Tailoring Pipeline Bypassing and Functional Unit Mapping to Application in Clustered VLIW Architectures


Tailoring Pipeline Bypassing and Functional Unit Mapping to Application in Clustered VLIW Architectures

Marcio Buss, Rodolfo Azevedo, Paulo Centoducatte and Guido Araujo
IC - UNICAMP, Cx. Postal 6176, Campinas, SP, Brazil
{marciobuss, rjazevedo, ducatte, guido}@ic.unicamp.br

ABSTRACT

In this paper we describe a design exploration methodology for clustered VLIW architectures. The central idea of this work is a set of three techniques aimed at reducing the cost of expensive inter-cluster copy operations. Instruction scheduling is performed using a list-scheduling algorithm that stores operand chains into the same register file. Functional units are assigned to clusters based on the application inter-cluster communication pattern. Finally, a careful insertion of pipeline bypasses is used to increase the number of data-dependencies that can be satisfied by pipeline register operands. Experimental results, using the SPEC95 benchmarks and the IMPACT compiler, reveal a substantial reduction in the number of copies between clusters.

1 INTRODUCTION

The problem of instruction partitioning/scheduling for clustered VLIW architectures has earned considerable attention recently, due to the small area and improved register file latency achieved by these architectures [5]. Register file area/latency is proportional to O(n^2)/O(log m), where n is the total number of input/output ports, and m the number of read-ports. Such features of clustered VLIW architectures are particularly relevant in the design of highly constrained embedded systems, where high performance, reduced die size and low power consumption are premium design goals.

In this paper we describe a design exploration methodology for clustered VLIW architectures. Instruction scheduling is performed using a list-scheduling algorithm that stores chains of operands into the same register file. Functional units are assigned to clusters based on the application inter-cluster communication pattern. Finally, pipeline bypasses are inserted to increase the number of
data-dependencies which can be satisfied by pipeline register operands.

This paper is divided as follows. Section 2 shows some prior art. Section 3 describes the architectural model adopted throughout the paper. Section 4 discusses how operations are scheduled and assigned to functional units. The partitioning of functional units into clusters is discussed in Section 5. Finally, Section 6 shows how functional units are assigned to physical clusters. The SPEC CINT95 and CFP95 benchmarks and the IMPACT compiler [10] were used to evaluate the performance of this strategy (Section 7). In Section 8 we conclude the work.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CASES 2001, November 16-17, 2001, Atlanta, Georgia, USA. Copyright 2001 ACM. $5.00.

Figure 1: Clustered VLIW architecture model with inter-cluster bypass (register files connected by an inter-cluster copy-bus and bypass lines).

2 RELATED WORK

Clustered VLIW architectures have been extensively studied in the literature. The assignment of operation traces to clusters was originally studied in the context of the Bulldog [6] and Multiflow Trace [9] compilers. Separate partitioning and scheduling has been proposed by Capitanio et al. [4] using a limited-connectivity architecture. Ozer et al. [14, 16] integrate partitioning and scheduling in a single phase using the Unified Assign and Schedule (UAS) algorithm. A variation of UAS and modulo scheduling has been proposed by Sanchez et al. [18, 19] as a way to assign different loop iterations to separate clusters. Fisher et al. [7] proposed a Partial Component Clustering technique to divide Data-Flow Graph (DFG) components into clusters in order to avoid copy operations along DFG
critical paths. Ozer and Conte [15] introduced an optimal cluster scheduling for a VLIW machine based on integer linear programming. Their approach is suitable to help the search for a schedule lower

bound, and as a way to evaluate the effectiveness of heuristic-based schemes. Fernandes et al. [8] proposed a queue-based register file to pass operands between clusters. Architectural exploration and VLIW customization for a particular application has been studied by Jacome et al. [11] and Rau et al. [17]. Ahuja et al. [3] showed that the number of forwarding paths in a scalar processor could be reduced without a great performance loss. Unfortunately, not much work has been performed on simultaneously tailoring cluster partitioning and pipeline bypass structures to a specific application.

3 ARCHITECTURE MODEL

The architecture model used in this paper (Figure 1) is a pipelined clustered VLIW architecture, where each cluster is formed by a set of one or more homogeneous functional units (FUs), a multi-ported register file and an inter-cluster data transfer bus (called copy-bus). This model is similar to the one described by Capitanio et al. [5]. Contrary to the work in [5], the copy-bus is driven by the output of the functional unit and not by the output of the register file. By doing so, a copy operation can be scheduled to copy the result of an operation at the output of some FU directly to the register file of another cluster through the copy-bus.

Consider, for example, two dependent operations A and B (B depends on A's result) assigned to datapaths 1 and 2 in two distinct clusters (Figure 2). Figure 3 shows the pipeline timing diagram of these operations. Assume that the result of operation A is available in the EX/MEM pipeline register at the end of stage EX in datapath 1. A copy operation following A in datapath 1 can be used to move A's result to the copy-bus, just in time to be written, during the ID stage, into the register file of cluster 2 (solid arrow). This is not possible with the approach used in [5], which requires one extra NOP operation to transfer the data to cluster 2's register file. The presence of the copy-bus affects the final register file design, but its impact is much smaller than the benefits gained by reducing the number
of the read-ports [5]. Using a heuristic from [5], we assume that the width of the copy-bus is equal to half the number of FUs per cluster, i.e., in the best case only half of the FUs in one cluster can simultaneously execute copies to other clusters. One cluster can receive copies from all other clusters, provided the maximum constraint above is met. Homogeneous FUs have been used for the sake of simplicity; the technique applies to heterogeneous units as well.

Abnous and Bagherzadeh [1, 2] studied some of the design issues that arise in the pipeline structure and bypassing mechanism of pipelined clustered VLIW processors. In our work we use a few bypassing lines to forward operation results stored in pipeline registers to other datapaths. Pipeline bypasses can be added between datapaths within the same cluster or between datapaths in distinct clusters. The goal of inserting a bypass interconnection between two datapaths inside the same cluster is to reduce the number of NOP operations required to solve the data-hazard between the dependent instructions in the datapaths. By assigning a bypass interconnection between two datapaths in distinct clusters, we are also reducing the number of copy operations required to use the copy-bus.

Figure 2: Bypass interconnection between the datapaths of two clusters.

Figure 3: Using inter-cluster forwarding to solve inter-cluster dependencies. A bypass from EX/MEM in datapath 1 forwards the result of instruction A to instruction B in the EX stage of datapath 2; instruction B reads the result of instruction A from the copy-bus.

A copy operation must be issued by the compiler if: (a) no bypass exists between the two datapaths and there is (at least one) data-hazard between the operations in these datapaths; (b) a bypass exists between them, but at least one of the uses of the data is so far from its definition that the bypass cannot
satisfy the dependency. Consider again datapaths 1 and 2 and dependent operations A and B in Figure 2. Moreover, assume that B has been scheduled two slots after A. In this case, the result of A can be forwarded from the EX/MEM register in datapath 1 directly to the ID/EX register of datapath 2 (dotted line) through the bypass interconnect in Figure 2. We are assuming that one bypass interconnection between two datapaths has as many lines as those required to exchange operands (in both directions) between the stages of the datapath pipelines.

The area needed by a bypass interconnection between two pipelines is proportional to the number of comparators required to detect the data-hazards between them. This cost becomes very large if bypasses are allowed between all pipeline pairs, in which case it is proportional to dn^2, with n the number of FUs and d the depth of the pipeline. Instead of allowing full bypassing between all datapaths, we insert only a few carefully chosen bypasses between heavily communicating datapaths, aiming at reducing the inter-cluster communication. These interconnections are selected based on the communication pattern of the application. Pipeline bypasses have a reasonably small

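The two conditions above, under which the compiler must issue an explicit inter-cluster copy, can be sketched as a small predicate. This is an illustrative sketch, not the paper's implementation; in particular, representing the bypass structure as a map from a datapath pair to its maximum forwarding distance (`reach`, in scheduling slots) is our assumption.

```python
def needs_copy(def_dp, use_dp, distance, bypasses):
    """True when an explicit copy operation must be issued for a
    data-dependency defined on datapath `def_dp`, used on datapath
    `use_dp`, with the two operations scheduled `distance` slots apart.
    `bypasses` maps frozenset({dp_a, dp_b}) -> max forwardable distance."""
    if def_dp == use_dp:
        return False                  # resolved inside the same pipeline
    reach = bypasses.get(frozenset((def_dp, use_dp)))
    if reach is None:
        return True                   # case (a): no bypass between the datapaths
    # case (b): the result has already left the pipeline registers
    # by the time the consumer could read it through the bypass
    return distance > reach
```

For example, with a bypass of reach 2 between datapaths 0 and 1, a dependency scheduled two slots apart is forwarded, while one scheduled three slots apart still needs a copy operation.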
impact on the processor cycle time, consisting basically of the delay of the routing lines between datapaths. Thus, for the sake of simplicity, we neglect the impact of bypasses on the cycle time.

Figure 4: IMPACT list scheduling using clustering as a second criterion (DDG, intercommunication table and reservation table).

Our approach is based on three phases. Initially, a variation of list scheduling is used to schedule operations into compacted instructions (Section 4). The algorithm tries to cluster dependent operands into the same FU so as to avoid expensive inter-cluster copy operations. In the second phase (Section 5), we use a partitioning algorithm to assign functional units to clusters, such that the majority of the data dependencies are resolved by the cluster register file. Assigning FUs to clusters based on the application is a central feature of our approach which has not been extensively explored before. Finally, functional units are assigned to physical units inside clusters (Section 6), and bypass interconnections are inserted between the most communicating datapaths.

4 SCHEDULING

Our scheduling approach is a simple extension of the IMPACT compiler list-scheduling algorithm. For a given operation op in the candidate list, IMPACT uses the distance from op to a root of the Data-Dependence Graph (DDG) as the first scheduling criterion, followed by the number of children operations that become candidates if op is scheduled. We use exactly the same algorithm, adding only a small modification to determine which FU will be assigned to execute op. For each candidate operation op removed from the priority list, its FU is determined based on the FUs of its parents in the DDG. If the intersection of the FUs assigned to parents(op) is non-empty, op is assigned the same FU as its parents, if that FU is free. Otherwise, op is assigned the first FU available at the current time step. If the intersection of the FUs in parents(op) is empty, op is
assigned the first free FU, giving a higher priority to the FUs assigned to its parents. The central idea here is to keep the result of an operation in the same register cluster as its operands. By doing so, we avoid increasing the number of inter-cluster copy operations as much as possible (provided that only a few bypasses are inserted).

Consider, for example, the DDG of Figure 4. For the sake of simplicity, we assume in this example that all operations have single-cycle latencies. Moreover, consider that the scheduling priority is such that operations are scheduled in alphabetical order. Initially, operations A-D are assigned to FU1-FU4. After A-D are scheduled, E is the next operation in the working list which is ready to be scheduled. The intersection of the FUs assigned to the parents of E is empty, so E is scheduled to the first free FU that was assigned to one of its parents. Next, F is scheduled, and since it has no parents it is assigned the next free FU. Operation G is then scheduled in the same way, since its parents' FUs are different. The next candidates for scheduling are H, I, J and K, assigned by the same rules. Finally, L is scheduled to the same functional unit as its parents I and K.

Notice from Figure 4 that whenever a data-dependency exists between two operations scheduled to different FUs, some action must be taken to assure that this dependency is satisfied. At this point of our solution, FUs have not been assigned to clusters yet. When FUs are assigned a common register file inside the same cluster, the dependency can be satisfied through the register file, or by some intra-cluster bypass if one exists. On the other hand, if the FUs are located in different clusters, a copy operation will be required if there is no inter-cluster bypass between those two FUs. For example, consider operations J and K, scheduled to different functional units. If those units are assigned to the same cluster, no copy operations will be required to communicate the result of J to K. The same is not true if
these operations are scheduled to FUs in different clusters and there is no bypass between their datapaths. (We assume that intra-pipeline data-hazards are always satisfied.)

In order to evaluate the communication pattern between FUs, we measure the dependencies between each pair of FUs using the communication table shown in Figure 4. Each entry in this table corresponds to the number of data dependencies that need to be satisfied between a pair of FUs. For example, one entry in the table is 2, meaning that the two operations scheduled to one FU (B and F) communicate their results to two operations scheduled to another (E and I).

5 CLUSTER PARTITIONING

After the communication table is computed, our algorithm divides the FUs among clusters such that the most intercommunicating FUs are assigned to the same cluster. Initially, the communication table is reduced to a lower-diagonal matrix, in order to accumulate the dependencies (i, j) and (j, i) into a unique value. As said before in Section 3, one bypass is a bi-directional connection between all stages of datapaths i and j. This is not a requirement of our approach, though, and it can be relaxed if required. The table on the top left corner of Figure 5 shows a reduced communication table. Based on this table, we build a cluster vector the size of the number of FUs. Each entry in this vector stores the number of an FU. The indices of the vector correspond to a physical datapath, and are divided according to the number of clusters. In the case of Figure 5, four functional units FU1-FU4 must be assigned to two clusters, each cluster containing two physical

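The FU-selection rule described above (reuse the parents' FU when they agree and it is free; otherwise take the first free FU, preferring FUs already used by a parent) can be sketched as follows. This is a minimal illustration with assumed names; ties are broken by the order of `free_fus`.

```python
def pick_fu(parent_fus, free_fus):
    """Choose the FU for a candidate operation.
    parent_fus: FU index of each DDG parent of the operation.
    free_fus: FUs idle at the current time step, in priority order.
    Returns the chosen FU, or None if no FU is free."""
    if not free_fus:
        return None
    distinct = set(parent_fus)
    if len(distinct) == 1:
        fu = parent_fus[0]            # non-empty intersection of parents' FUs
        if fu in free_fus:
            return fu                 # keep the result next to its operands
        return free_fus[0]            # parents' FU busy: first available FU
    # empty intersection (or no parents): first free FU,
    # giving priority to FUs already assigned to some parent
    for fu in free_fus:
        if fu in distinct:
            return fu
    return free_fus[0]
```

For instance, an operation whose parents both ran on FU 2 is placed on FU 2 if it is free; an operation with parents on FUs 1 and 3 takes the first free FU among those two, falling back to the first free FU overall.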
datapaths. Cluster 1 contains datapaths 1 and 2, and cluster 2 datapaths 3 and 4.

Figure 5: Selecting the most communicating functional units (panels (a)-(e): the initial partitioning and successive swaps, each with its communication cost).

A variation of the LPK algorithm [12] is then used to swap FUs between clusters, so as to minimize their communication. The communication cost between two clusters, for a given distribution, is the total number of data dependencies that cross the clusters' border. Initially, the algorithm divides the functional units into two sets of clusters. It swaps all possible pairs, one from each set, storing the smallest cost it has seen so far. After all possible exchanges have been tried, the resulting smallest cost gives the best distribution between the two sets of clusters. The algorithm proceeds recursively into each cluster set, until all FUs are assigned a cluster.

Consider, for example, the reduced communication table and cluster vector in Figure 5. The communication between two FUs is represented by a double-headed solid arrow labeled with the cost from the communication table. The cost of the initial partitioning in Figure 5a is the sum of the communication costs between the FUs placed in different clusters. Two FUs (in gray) are then selected for swapping, resulting in a new configuration with a different cost (Figure 5b). The algorithm proceeds exchanging pairs of FUs from the initial partitioning (Figure 5(c-e)) while computing their costs. After all pairs of FUs have been tested, the minimal communication cost is found. The configuration that results in the smallest inter-cluster communication is obtained by the swap shown in Figure 5c.

6 DATAPATH MAPPING

After the scheduling and partitioning tasks described above are finished, operations are associated to FUs and FUs to clusters. To complete the architectural design, FUs must be assigned to
their corresponding physical datapaths, and bypass lines inserted. We do that using the two-step procedure shown in Figure 6. First, each inner-loop communication table is used, in combination with the result of its cluster vector after partitioning, to compute a partial hardwired communication table. This table is a representation of the number of data dependencies between program operands in a particular loop, given the current architecture. Its goal is basically to map each FU in the communication table to its corresponding datapath (and cluster) in the cluster vector.

Figure 6: Mapping the communication tables to hardwired bypass interconnects (one partial hardwired table per inner loop, accumulated into a single hardwired communication table).

For example, at the center of Figure 6, an FU has been assigned to a given index of the cluster vector; hence, its line in the communication table is mapped to the corresponding column of the partial hardwired table. Notice that one partial table emerges for each inner-loop super-block in the program. In the case of Figure 6, three loops were considered. Since the resulting architecture has to execute all of them, we need to take into account the contribution of each loop to the overall inter-cluster communication. This is done, in a second phase, by adding up the partial communication tables into a single hardwired communication table. Each entry in a given partial table is weighted by 10^NL, where NL is the nesting level of the loop corresponding to that table [13]. A better estimate is possible if the loop trip-count can be determined at compile time. The resulting hardwired communication table is then used to determine the pairs of datapaths which will be interconnected with bypass lines. To do that, the entries in the hardwired table are sorted into a priority
list, such that the most communicating pairs of datapaths have a higher priority. Bypass lines are inserted between datapath pairs, the highest-priority pair first. In Figure 6, for example, the highest entry in the hardwired table corresponds to the communication between two datapaths that were assigned to the same cluster, in order to reduce the cost of

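The second phase just described, accumulating the per-loop partial tables and then picking the most communicating datapath pairs, can be sketched as below. Names and the `budget` parameter are illustrative; the weight grows with the loop nesting level NL (here 10**NL, following the weighting the text attributes to the nesting level).

```python
def hardwired_table(partial_tables, nesting_levels, n_dp):
    """Accumulate per-loop partial hardwired tables into a single
    hardwired communication table, weighting each by 10**NL."""
    hw = [[0] * n_dp for _ in range(n_dp)]
    for table, nl in zip(partial_tables, nesting_levels):
        weight = 10 ** nl
        for i in range(n_dp):
            for j in range(n_dp):
                hw[i][j] += weight * table[i][j]
    return hw

def choose_bypasses(hw, budget):
    """Insert bypasses between the most communicating datapath pairs,
    highest weighted dependency count first, up to `budget` bypasses."""
    pairs = [(hw[i][j] + hw[j][i], i, j)        # symmetrise (i,j)/(j,i)
             for i in range(len(hw)) for j in range(i)]
    pairs.sort(reverse=True)
    return [(i, j) for _, i, j in pairs[:budget]]
```

A usage sketch: with two inner loops at nesting levels 1 and 0, the level-1 loop's dependencies are weighted ten times more heavily, so its hottest datapath pair receives the first bypass line.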
inserting copy operations between them. Some of the data-dependencies between these datapaths will be resolved by the common register file of their cluster, but many short-distance dependencies can be satisfied by adding a bypass line between them.

7 EXPERIMENTAL RESULTS

The approach described above was implemented in the IMPACT infrastructure, and the resulting compiler was used to compile eighteen programs from the SPEC CINT95 (6) and CFP95 (5) benchmarks, IMPACT (5) and Miscellaneous (2), as shown in Table 1. The compiler was executed using superblock formation, maximum loop unrolling, and no predication. In our experiments we estimate the number of copy operations and cycles produced by each program across a large number of architecture configurations. Each configuration corresponds to a different combination of the following parameters: (a) number of FUs (from 1 to 16); (b) number of register file clusters (from 1 to the number of FUs); (c) number of bypass interconnections (from 0 to the number of FUs). For the sake of simplicity we adopted homogeneous clusters, i.e., all clusters (CLs) have the same number of FUs.

The goal of the experimental work was to determine the impact of the techniques described in Sections 4, 5 and 6. The experiments were divided into three parts. First, we evaluated the impact of bypass insertion on the cycle count of the programs. In the second part, we studied how cluster partitioning and bypassing affect the number of copy instructions between clusters. In the last set of experiments we evaluate the impact of the scheduling and mapping algorithms.

CINT95: 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 147.vortex
CFP95: 101.tomcatv, 102.swim, 103.su2cor, 107.mgrid, 125.turb3d
IMPACT: fir, kalman, paraffins, dag, eight
MISC: mpegdec, mpegenc

Table 1: Benchmark Programs

The maximum number of bypass interconnections is given by n(n-1)/2, where n is the number of FUs for that configuration. Figures 7 and 8 show the impact, on programs go and su2cor, of adding from 0 up to n(n-1)/2 bypasses (full bypassing between all FUs). All architecture configurations
considered in the following analysis have 16 FUs, and range from 1 to 8 clusters. For program go (Figure 7), we noticed that most of the speed-up was achieved with 8 bypasses (65% for one cluster and 58% for 8 clusters). Only a small difference was noticed when using 16 or more bypasses (7% for one cluster and 6% for 8 clusters, when 16 bypasses are used). For program su2cor (Figure 8) we faced a more complex trade-off. In the first knee of the curve (left side of the figure), when the first bypasses are used, the speed-up was 6% (one cluster) and 75% (8 clusters). This value increases very slowly: adding further bypasses improves the speed-up by only 6% (1 cluster) and 7% (8 clusters). Thus, since the incremental speed-up decreases almost exponentially with the number of bypasses, we restrict the maximum number of bypasses to the number of FUs.

Figure 7: Impact of adding bypasses for program go (16 FUs).

Figure 8: Impact of adding bypasses for program su2cor (16 FUs).

In the second part of our experimental work, we studied the impact of clustering and bypassing on the number of inter-cluster copy instructions. Consider, for example, the graphs in Figure 9a-b, where the number of inter-cluster copy operations for program su2cor is measured using two architectures with 8 and 16 FUs. In those graphs, the vertical (horizontal) axis represents the number of copy operations (clusters). Curves in the graphs have as parameter the number of bypass lines inserted between FUs. The number of copy operations grows with the number of clusters, as expected, given the increase in inter-cluster communication. Nevertheless, notice that as bypass lines are added to the architecture, many copy operations are wiped out of the program.

Table 2: Number of copies for su2cor and go.

In Table 2, a single bypass line removes more than 5% of all copy operations. If
the same program

Figure 9: Number of copy operations for program su2cor on: (a) an 8-FU architecture; (b) a 16-FU architecture; (c) percentage of copy operations removed from su2cor through bypassing, as a function of the clusters/FUs ratio.

Figure 10: Number of copy operations for program go on: (a) an 8-FU architecture; (b) a 16-FU architecture; (c) percentage of copy operations removed from go through bypassing, as a function of the clusters/FUs ratio.

Figure 11: Cycle count for program su2cor in non-clustered and clustered configurations with no bypasses.

Figure 12: Cycle count for program go in non-clustered and clustered configurations with no bypasses.

Figure 13: Impact on the cycle count for program su2cor due to maximum bypassing.

Figure 14: Impact on the cycle count for program go due to maximum bypassing.

runs on an 8 FUs / 8 CLs configuration (Figure 9a), 8 bypasses are required to reduce the number of copy operations by 8%. As more bypasses are added, the law of diminishing returns settles in and the gain saturates. In Figure 9b, for 16 FUs, a large share of the copy operations was removed using 8 bypasses, and 9% using 16 bypasses. In general, we noticed, across all SPEC programs, that the insertion of a single bypass reduces the total number of copy operations by as much as 7%. Notice, for example, that program go repeats the same behavior (Figure 10a-b) as su2cor.

The bypass effectiveness when all configurations run program su2cor is described in Figure 9c. The vertical axis in that graph shows the percentage of copy operations removed with respect to a bypass-free configuration. The horizontal axis represents the ratio CLs/FUs. Notice that it ranges to a maximum of one, since the number of CLs is at most the number of FUs. As mentioned before, increasing the number of bypasses implies a reduction in the number of copy operations. For example, a large percentage of the copy operations is removed when bypasses are inserted in architectures with few clusters. Interestingly enough, the percentage of copy operations removed by bypasses seems
to saturate when the ratio CLs/FUs is around 0.5, i.e., when each cluster has two FUs. The same pattern was also detected in the majority of the combinations of programs and architecture configurations (e.g., program go in Figure 10c). We believe this might have to do with the way data-dependencies are uniformly partitioned across clusters by the binary recursive algorithm described in Section 5. Further experimental work will be required to clarify this finding.

In the third part of our experiments, we studied the impact of the scheduling and clustering algorithms. We plotted, for all programs, the cycle count as a function of the number of FUs and clusters. In order to filter out the effect of bypassing, we considered only configurations with no bypasses. As shown in Figure 11, for program su2cor, the performance of the clustered architectures follows very closely the performance of the non-clustered architecture (1 cluster, 0 bypassing). In other words, our approach is capable of canceling the negative effects of clustering, namely the overhead of the inter-cluster copy instructions. The same can be seen in Figure 12, for program go. The effects of the bypass lines on the cycle count for programs su2cor and go are shown in Figures 13 and 14. The

best combination of clusters and number of bypass lines has been used for each architecture. Surprisingly, the large gains achieved in reducing the number of copy operations in Figures 9 and 10 did not translate into real performance. In general, the benchmark program speed-up due to bypassing was small, of only a few percent. Although it is not clear why, it might be possible that the combination of the cluster partitioning and scheduling algorithms leaves only a small number of copy operations in the code for bypassing. Yet another explanation can be drawn from this finding: not enough ILP is available in the unrolled loop bodies. In this case, the generated instructions could have enough empty slots to hide the latency of most copy operations. Further experimental work will be required to address this issue.

Notice that the cycle count does not take into consideration the benefits achieved by clustering, i.e., a smaller register file and reduced latency. If the register file determines the cycle time of the processor, the curves representing clustered architectures in Figures 11 and 12 will reveal a performance improvement proportional to the reduction in the register file latency. Otherwise, by using our technique, the same performance level of a non-clustered architecture is achieved at a smaller processor cost.

8 CONCLUSIONS AND FUTURE WORK

This paper presents a scheduling and partitioning algorithm for clustered VLIW architectures aimed at reducing the communication cost between datapaths and clusters. This is achieved by assigning highly communicating datapaths to the same register file, while tailoring bypass interconnections to the application. Preliminary experimental results reveal a substantial reduction in the number of inter-cluster copy operations and a potential performance improvement. As the next steps in this project we are considering: (a) to use the data-dependency distance between scheduled operations to improve the communication cost estimate; (b) to insert delay registers into
bypass lines to resolve long-distance data-dependencies.

9 ACKNOWLEDGMENTS

This work was partially supported by research grants from CNPq/NSF Collaborative Research Project 6859/99-7, CNPq research grant 56/97-9, fellowship research awards from CAPES -P-58/ and FAPESP 97/98-, 99/96-8. We also thank the reviewers for their comments.

REFERENCES

[1] A. Abnous and N. Bagherzadeh. Pipelining and bypassing in a VLIW processor. IEEE Trans. on Parallel and Distributed Systems, 5(6):658-664, June 1994.
[2] A. Abnous and N. Bagherzadeh. Architectural design and analysis of a VLIW processor. International Journal of Computers and Electrical Engineering, 1995.
[3] P. S. Ahuja, D. W. Clark, and A. Rogers. The performance impact of incomplete bypassing in processor pipelines. In MICRO-28, 1995.
[4] A. Capitanio, N. Dutt, and A. Nicolau. Design considerations for limited connectivity VLIW architectures. Technical Report TR-92-59, University of California, Irvine, 1992.
[5] A. Capitanio, N. Dutt, and A. Nicolau. Partitioned register files for VLIWs: A preliminary analysis of tradeoffs. In 25th International Symposium on Microarchitecture (MICRO), 1992.
[6] J. R. Ellis. Bulldog: A Compiler for VLIW Architectures. MIT Press, 1986.
[7] P. Faraboschi, G. Desoli, and J. A. Fisher. Clustered instruction-level parallel processors. Technical Report HPL-98-204, HP Labs, USA, 1998.
[8] M. M. Fernandes, J. Llosa, and N. Topham. Partitioned schedules for clustered VLIW architectures. In IEEE/ACM International Parallel Processing Symposium, 1998.
[9] J. A. Fisher. Trace scheduling: A technique for global microcode compaction. IEEE Trans. on Computers, C-30(7):478-490, July 1981.
[10] W. W. Hwu et al. IMPACT advanced compiler technology.
[11] M. F. Jacome, G. de Veciana, and V. Lapinskii. Exploring performance tradeoffs for clustered VLIW ASIPs. In International Conference on Computer-Aided Design, 2000.
[12] C. Lee, C. Park, and M. Kim. Efficient algorithm for graph-partitioning problem using a problem transformation method. Computer-Aided Design, December 1989.
[13] S. S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.
[14] E. Ozer, S. Banerjia, and T. M. Conte. Unified assign and schedule: A new approach to scheduling for clustered register file microarchitectures. In 31st International Symposium on Microarchitecture (MICRO), 1998.
[15] E. Ozer and T. M. Conte. Optimal cluster scheduling for a VLIW machine. Technical report, Dept. of Elec. and Comp. Eng., North Carolina State University, 1998.
[16] E. Ozer and T. M. Conte. Unified cluster assignment and instruction scheduling for clustered VLIW microarchitectures. Technical report, Dept. of Elec. and Comp. Eng., North Carolina State University, 1998.
[17] V. K. R. Rau and S. Aditya. Machine-description driven compilers for EPIC and VLIW processors. Design Automation for Embedded Systems, 1999.
[18] J. Sanchez and A. Gonzalez. The effectiveness of loop unrolling for modulo scheduling in clustered VLIW architectures. In Intl. Conference on Parallel Processing (ICPP), 2000.
[19] J. Sanchez and A. Gonzalez. Instruction scheduling for clustered VLIW architectures. In Intl. Symposium on System Synthesis (ISSS), 2000.

(Remember that predication is not used.)


More information

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing. Basic Concepts Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition

More information

Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing Chapter 12: Indexing and Hashing Database System Concepts, 5th Ed. See www.db-book.com for conditions on re-use Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree

More information

Four Steps of Speculative Tomasulo cycle 0

Four Steps of Speculative Tomasulo cycle 0 HW support for More ILP Hardware Speculative Execution Speculation: allow an instruction to issue that is dependent on branch, without any consequences (including exceptions) if branch is predicted incorrectly

More information

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 14 Very Long Instruction Word Machines

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 14 Very Long Instruction Word Machines ECE 252 / CPS 220 Advanced Computer Architecture I Lecture 14 Very Long Instruction Word Machines Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html

More information

Administrivia. CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) Control Dependencies

Administrivia. CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) Control Dependencies Administrivia CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) HW #3, on memory hierarchy, due Tuesday Continue reading Chapter 3 of H&P Alan Sussman als@cs.umd.edu

More information

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Zhining Huang, Sharad Malik Electrical Engineering Department

More information

Annotated Memory References: A Mechanism for Informed Cache Management

Annotated Memory References: A Mechanism for Informed Cache Management Annotated Memory References: A Mechanism for Informed Cache Management Alvin R. Lebeck, David R. Raymond, Chia-Lin Yang Mithuna S. Thottethodi Department of Computer Science, Duke University http://www.cs.duke.edu/ari/ice

More information

Mapping Vector Codes to a Stream Processor (Imagine)

Mapping Vector Codes to a Stream Processor (Imagine) Mapping Vector Codes to a Stream Processor (Imagine) Mehdi Baradaran Tahoori and Paul Wang Lee {mtahoori,paulwlee}@stanford.edu Abstract: We examined some basic problems in mapping vector codes to stream

More information

Exploiting Idle Floating-Point Resources For Integer Execution

Exploiting Idle Floating-Point Resources For Integer Execution Exploiting Idle Floating-Point Resources For Integer Execution S.Subramanya Sastry Computer Sciences Dept. University of Wisconsin-Madison sastry@cs.wisc.edu Subbarao Palacharla Computer Sciences Dept.

More information

Software Pipelining by Modulo Scheduling. Philip Sweany University of North Texas

Software Pipelining by Modulo Scheduling. Philip Sweany University of North Texas Software Pipelining by Modulo Scheduling Philip Sweany University of North Texas Overview Instruction-Level Parallelism Instruction Scheduling Opportunities for Loop Optimization Software Pipelining Modulo

More information