Tailoring Pipeline Bypassing and Functional Unit Mapping to Application in Clustered VLIW Architectures
Marcio Buss, Rodolfo Azevedo, Paulo Centoducatte and Guido Araujo
IC - UNICAMP, Campinas, SP, Brazil
{marciobuss, rjazevedo, ducatte, guido}@ic.unicamp.br

ABSTRACT
In this paper we describe a design exploration methodology for clustered VLIW architectures. The central idea of this work is a set of three techniques aimed at reducing the cost of expensive inter-cluster copy operations. Instruction scheduling is performed using a list-scheduling algorithm that stores operand chains into the same register file. Functional units are assigned to clusters based on the application's inter-cluster communication pattern. Finally, a careful insertion of pipeline bypasses is used to increase the number of data dependencies that can be satisfied by pipeline register operands. Experimental results, using the SPEC95 benchmarks and the IMPACT compiler, reveal a substantial reduction in the number of copies between clusters.

1 INTRODUCTION
The problem of instruction partitioning/scheduling for clustered VLIW architectures has earned considerable attention recently, due to the small area and improved register file latency achieved by these architectures [5]. Register file area/latency is proportional to O(n²)/O(log m), where n is the total number of input/output ports and m the number of read ports. Such features of clustered VLIW architectures are particularly relevant in the design of highly constrained embedded systems, where high performance, reduced die size and low power consumption are premium design goals. In this paper we describe a design exploration methodology for clustered VLIW architectures. Instruction scheduling is performed using a list-scheduling algorithm that stores chains of operands into the same register file. Functional units are assigned to clusters based on the application's inter-cluster communication pattern. Finally, pipeline bypasses are inserted to increase the number of data dependencies which can be satisfied by pipeline register operands.

This paper is divided as follows. Section 2 reviews prior art. Section 3 describes the architectural model adopted throughout the paper. Section 4 discusses how operations are scheduled and assigned to functional units. The partitioning of functional units into clusters is discussed in Section 5, and Section 6 shows how functional units are assigned to physical clusters. The SPEC CINT95 and CFP95 benchmarks and the IMPACT compiler [10] were used to evaluate the performance of this strategy (Section 7). In Section 8 we conclude the work.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CASES '01, November 16-17, 2001, Atlanta, Georgia, USA. Copyright 2001 ACM.

Figure 1: Clustered VLIW architecture model with inter-cluster bypass (per-cluster register files connected by the inter-cluster copy-bus and bypass lines).

2 RELATED WORK
Clustered VLIW architectures have been extensively studied in the literature. The assignment of operation traces to clusters was originally studied in the context of the Bulldog [6] and Multiflow Trace [9] compilers. Separate partitioning and scheduling was proposed by Capitanio et al. [4] using a limited-connectivity architecture. Ozer et al. [14, 16] integrate partitioning and scheduling in a single phase using the Unified Assign and Schedule (UAS) algorithm. A variation of UAS and modulo scheduling was proposed by Sanchez et al. [18, 19] as a way to assign different loop iterations to separate clusters. Faraboschi et al. [7] proposed a Partial Component Clustering technique to divide Data-Flow Graph (DFG) components into clusters in order to avoid copy operations along DFG
critical paths. Ozer and Conte [15] introduced an optimal cluster scheduling for a VLIW machine based on integer linear programming. Their approach is suitable to help the search for a schedule lower bound, and as a way to evaluate the effectiveness of heuristic-based schemes. Fernandes et al. [8] proposed a queue-based register file to pass operands between clusters. Architectural exploration and VLIW customization for a particular application have been studied by Jacome et al. [11] and Rau et al. [17]. Ahuja et al. [3] showed that the number of forwarding paths in a scalar processor could be reduced without a great performance loss. Unfortunately, not much work has been performed on simultaneously tailoring cluster partitioning and pipeline bypass structures to a specific application.

3 ARCHITECTURE MODEL
The architecture model used in this paper (Figure 1) is a pipelined clustered VLIW architecture, where each cluster is formed by a set of one or more homogeneous functional units (FUs), a multi-ported register file and an inter-cluster data transfer bus (called copy-bus). This model is similar to the one described by Capitanio et al. [5]. Contrary to the work in [5], the copy-bus is driven by the output of the functional unit and not by the output of the register file. By doing so, a copy operation can be scheduled to move the result of an operation at the output of some FU directly to the register file of another cluster through the copy-bus. Consider, for example, two dependent operations A and B (B depends on A's result) assigned to datapaths in two distinct clusters (Figure 2). Figure 3 shows the pipeline timing diagram of these operations. Assume that the result of operation A is available in the EX/MEM pipeline register at the end of stage EX. A copy operation following A can be used to move A's result to the copy-bus, just in time to be written, during the ID stage, into the register file of the other cluster (solid arrow). This is not possible with the approach used in [5], which requires one extra NOP operation to transfer the data to the other cluster's register file. The presence of the copy-bus affects the final register file design, but its impact is much smaller than the benefits gained by reducing the number
of the read-ports [5]. Using a heuristic from [5], we assume that the width of the copy-bus is equal to half the number of FUs per cluster, i.e., in the best case only half of the FUs in one cluster can simultaneously execute copies to other clusters. One cluster can receive copies from all other clusters, provided the maximum constraint above is met. (Homogeneous FUs have been used for the sake of simplicity. The technique applies to heterogeneous units as well.)

Abnous and Bagherzadeh [1, 2] studied some of the design issues that arise in the pipeline structure and bypassing mechanism of pipelined clustered VLIW processors. In our work we use a few bypassing lines to forward operation results stored in pipeline registers to other datapaths. Pipeline bypasses can be added between datapaths within the same cluster or between datapaths in distinct clusters. The goal of inserting a bypass interconnection between two datapaths inside the same cluster is to reduce the number of NOP operations required to solve the data hazard between the dependent instructions in the datapaths. By inserting a bypass interconnection between two datapaths in distinct clusters, we are also reducing the number of copy operations required to use the copy-bus. A copy operation must be issued by the compiler if: (a) no bypass exists between the two datapaths and there is (at least one) data hazard between the operations in these datapaths; or (b) a bypass exists between them, but at least one of the uses of the data is so far from its definition that the bypass cannot satisfy the dependency.

Figure 2: Bypass interconnection between datapaths of two clusters (register files, ID/EX/MEM/WB pipeline stages and bypass multiplexers).

Figure 3: Using inter-cluster forwarding to solve inter-cluster dependencies. A bypass from EX/MEM in one datapath forwards the result of instruction A to instruction B in the EX stage of the other datapath; alternatively, instruction B reads the result of instruction A from the copy-bus.

Consider again the two datapaths and dependent operations A and B in Figure 2. Moreover, assume that B has been scheduled two slots after A. In this case, the result of A can be forwarded from the EX/MEM register of one datapath directly to the ID/EX register of the other (dotted line) through the bypass interconnect in Figure 2. We assume that one bypass interconnection between two datapaths has as many lines as those required to exchange operands (in both directions) between the stages of the datapath pipelines. The area needed by a bypass interconnection between two pipelines is proportional to the number of comparators required to detect the data hazards [] between them. This cost becomes very large if bypasses are allowed between all pipeline pairs, in which case it is proportional to dn², where n is the number of FUs and d the depth of the pipeline. Instead of allowing full bypassing between all datapaths, we insert only a few carefully chosen bypasses between highly communicating datapaths, aiming at reducing the inter-cluster communication. These interconnections are selected based on the communication pattern of the application. Pipeline bypasses have a reasonably small impact on the processor cycle time [], consisting basically of the delay of the routing lines between datapaths. Thus, for the sake of simplicity, we neglect the impact of bypasses on the cycle time.
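The dn² growth of full bypassing versus the linear growth of selected bypassing can be made concrete with a small cost model (an illustrative sketch, not taken from the paper; the per-stage comparator counts are assumptions):

```python
def full_bypass_comparators(n_fus: int, depth: int) -> int:
    """Comparators when every datapath forwards to every other one:
    each of the n*(n-1) ordered datapath pairs needs one hazard
    comparator per forwarding stage, i.e. O(d * n^2)."""
    return depth * n_fus * (n_fus - 1)


def selected_bypass_comparators(bypass_pairs: int, depth: int) -> int:
    """Comparators when only a few chosen bidirectional bypass
    interconnections are inserted: O(d * #pairs)."""
    return depth * 2 * bypass_pairs


# With 16 FUs and a 4-deep pipeline, full bypassing needs
# 4 * 16 * 15 = 960 comparators; 16 selected bypass pairs need
# only 4 * 2 * 16 = 128.
print(full_bypass_comparators(16, 4), selected_bypass_comparators(16, 4))
```

Under this model, quadrupling the number of FUs multiplies the full-bypassing cost by roughly sixteen, which is why the paper restricts itself to a few application-selected bypasses.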
Figure 4: IMPACT list scheduling using clustering as the second criterion (example DDG with operations A-L, the resulting intercommunication table, and the reservation table).

Our approach is based on three phases. Initially, a variation of list scheduling is used to schedule operations into compacted instructions (Section 4). The algorithm tries to cluster dependent operands into the same FU so as to avoid expensive inter-cluster copy operations. In the second phase (Section 5), we use a partitioning algorithm to assign functional units to clusters, such that the majority of the data dependencies are resolved by the cluster register file. Assigning FUs to clusters based on the application is a central feature of our approach which has not been extensively explored before. Finally, functional units are assigned to physical units inside clusters (Section 6), and bypass interconnections are inserted between the most communicating datapaths.

4 SCHEDULING
Our scheduling approach is a simple extension of the IMPACT compiler list-scheduling algorithm. For a given operation op in the candidate list, IMPACT uses the distance from op to a root of the Data-Dependence Graph (DDG) as the first scheduling criterion, followed by the number of children operations that become candidates if op is scheduled. We use exactly the same algorithm, adding only a small modification to determine which FU will be assigned to execute op. For each candidate operation op removed from the priority list, its FU is determined based on the FUs of its parents in the DDG. If the intersection of the FUs assigned to parents(op) is non-empty, op is assigned the same FU as its parents, if that FU is free; otherwise, op is assigned the first FU available at the current time step. If the intersection of the FUs in parents(op) is empty, op is assigned the first free FU, giving a higher priority to the FUs assigned to its parents. The central idea here is to keep the result of an operation in the same register cluster as its operands. By doing so, we avoid increasing the number of inter-cluster copy operations as much as possible, provided that only a few are inserted.

Consider, for example, the DDG of Figure 4. For the sake of simplicity, we assume in this example that all operations have single-cycle latencies. Moreover, consider that the scheduling priority is such that operations are scheduled in alphabetical order. Initially, operations A-D are assigned to the first four FUs. After A-D are scheduled, E is the next operation in the working list that is ready to be scheduled. The intersection of the FUs assigned to the parents of E is empty, so E is scheduled to the first free FU that was assigned to one of its parents. Next, F is scheduled, and since it has no parents it is assigned the next free FU. Operation G is then scheduled in the same way, since its parents' FUs are different. The next candidates for scheduling are H, I, J and K. Finally, L is scheduled to the same functional unit as its parents I and K. Notice from Figure 4 that whenever a data dependency exists between two operations scheduled to different FUs, some action must be taken to assure that this dependency is satisfied. At this point of our solution, FUs have not been assigned to clusters yet. When FUs share a common register file inside the same cluster, the dependency can be satisfied through the register file, or by some intra-cluster bypass if one exists. On the other hand, if the FUs are located in different clusters, a copy operation will be required if there is no inter-cluster bypass between those two FUs. For example, consider operations J and K. If their functional units are assigned to the same cluster, no copy operation will be required to communicate the result of J to K. The same is not true if
these operations are scheduled to FUs in different clusters and there is no bypass between their datapaths.

In order to evaluate the communication pattern between FUs, we measure the dependencies between each pair of FUs using the communication table shown in Figure 4. Each entry in this table corresponds to the number of data dependencies that need to be satisfied between a pair of FUs. For example, one entry in the table is 2, meaning that two operations scheduled to one FU (B and F) communicate their results to two operations scheduled to another (E and I).

5 CLUSTER PARTITIONING
After the communication table is computed, our algorithm divides the FUs among clusters such that the most intercommunicating FUs are assigned to the same cluster. Initially, the communication table is reduced to a lower-diagonal matrix, in order to accumulate the dependencies (i, j) and (j, i) into a unique value. As said before in Section 3, one bypass is a bidirectional connection between all stages of datapaths i and j. This is not a requirement of our approach, though, and it can be relaxed if required. The table on the top left corner of Figure 5 shows a reduced communication table. Based on this table, we build a cluster vector the size of the number of FUs. Each entry in this vector stores the number of an FU. The indices of the vector correspond to physical datapaths, and are divided according to the number of clusters. In the case of Figure 5, four functional units must be assigned to two clusters, each cluster containing two physical datapaths. (We assume that intra-pipeline data-hazards are always satisfied.)
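The FU-assignment rule and the communication table described above can be sketched roughly as follows (all names are hypothetical; the real IMPACT pass also applies the list-scheduling priority criteria before this step):

```python
from collections import defaultdict


def assign_fu(parent_fus, free_fus):
    """Pick an FU for a candidate operation: prefer an FU shared by
    all parents, then any parent's FU, then the first free FU
    (a sketch of the rule in Section 4; `parent_fus` is a list of
    sets, one per DDG parent)."""
    shared = set.intersection(*parent_fus) if parent_fus else set()
    any_parent = set().union(*parent_fus) if parent_fus else set()
    for pool in (shared, any_parent):
        for fu in sorted(pool):
            if fu in free_fus:
                return fu
    return min(free_fus)  # first FU available at this time step


def communication_table(ddg_edges, fu_of):
    """Count data dependencies between each ordered pair of FUs,
    skipping dependencies resolved inside a single FU."""
    table = defaultdict(int)
    for src, dst in ddg_edges:
        if fu_of[src] != fu_of[dst]:
            table[(fu_of[src], fu_of[dst])] += 1
    return table
```

For instance, with operations B and F on one FU feeding E and I on another, `communication_table` produces the entry of value 2 discussed above.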
Figure 5: Selecting the most communicating functional units: starting from an initial partitioning (a), pairs of FUs are exchanged between clusters (b-e), and the configuration with the smallest inter-cluster communication cost is kept.

Cluster 1 contains the first two datapaths, and cluster 2 the remaining two. A variation of the LPK algorithm [12] is then used to swap FUs between clusters, so as to minimize their communication. The communication cost between two clusters, for a given distribution, is the total number of data dependencies that cross the cluster border. Initially, the algorithm divides the functional units into two sets of clusters. It swaps all possible pairs, one from each set, storing the smallest cost it has seen so far. After all possible exchanges have been tried, the resulting smallest cost gives the best distribution between the two sets of clusters. The algorithm proceeds recursively into each cluster set, until all FUs are assigned a cluster.

Consider, for example, the reduced communication table and cluster vector in Figure 5. The communication between two FUs is represented by a double-headed solid arrow labeled with the cost from the communication table. The cost of the initial partitioning in Figure 5a is the sum of the communication costs between all FU pairs that sit in different clusters. Two FUs (in gray) are then selected for swapping, resulting in a new configuration with a new cost (Figure 5b). The algorithm proceeds exchanging pairs of FUs from the initial partitioning (Figure 5c-e) while computing their costs. After all pairs of FUs have been tested, the configuration that results in the smallest inter-cluster communication is kept (Figure 5c).

6 DATAPATH MAPPING
After the scheduling and partitioning tasks described above are finished, operations are associated to FUs and FUs to clusters. To complete the architectural design, FUs must be assigned to their corresponding physical datapaths, and bypass lines inserted.
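The pairwise-swap search of the cluster partitioning step (Figure 5) can be sketched as follows (a minimal single-pass sketch with hypothetical names; the paper's LPK variant additionally recurses into each cluster set):

```python
from itertools import product


def cross_cost(table, left, right):
    """Total data dependencies crossing the cluster border.
    `table` maps unordered FU pairs (frozensets) to dependency
    counts, i.e. the reduced lower-diagonal communication table."""
    return sum(table.get(frozenset((a, b)), 0)
               for a, b in product(left, right))


def partition(table, left, right):
    """Try every cross-cluster swap of one FU from each side and
    keep the single exchange that minimizes the crossing cost."""
    best = (cross_cost(table, left, right), left, right)
    for a, b in product(left, right):
        new_left = left - {a} | {b}
        new_right = right - {b} | {a}
        cost = cross_cost(table, new_left, new_right)
        if cost < best[0]:
            best = (cost, new_left, new_right)
    return best
```

With heavily communicating pairs (0,1) and (2,3) initially split across clusters, a single swap brings each pair into the same cluster and the crossing cost collapses to the one remaining light edge.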
We do that using the two-step procedure shown in Figure 6. First, each inner-loop communication table is used, in combination with its cluster vector after partitioning, to compute a partial hardwired communication table. This table is a representation of the number of data dependencies between program operands in a particular loop, given the current architecture. Its goal is basically to map each FU in the communication table to its corresponding datapath (and cluster) in the cluster vector.

Figure 6: Mapping the communication tables to hardwired bypass interconnects: each inner loop's communication table is combined with its cluster vector into a partial hardwired communication table, and the weighted partial tables are summed into a single hardwired communication table.

For example, at the center of Figure 6, one FU has been assigned to a given index of the cluster vector; hence its line in the communication table is mapped to the corresponding column of the partial hardwired table. Notice that one partial table emerges for each inner-loop superblock in the program. In the case of Figure 6, three loops were considered. Since the resulting architecture has to execute all of them, we need to take into account the contribution of each loop to the overall inter-cluster communication. This is done, in a second phase, by adding up the partial communication tables into a single hardwired communication table. Each entry in a given partial table is weighted by NL, where NL is the nesting level of the loop corresponding to that table []. A better estimate is possible if the loop trip count can be determined at compile time. The resulting hardwired communication table is then used to determine the pairs of datapaths which will be interconnected with bypass lines. To do that, the entries in the hardwired table are sorted into a priority list, such that the most communicating pair of datapaths has the highest priority. Bypass lines are inserted between datapath pairs, the highest-priority pair first. In Figure 6, for example, the highest entry in the hardwired table corresponds to the communication between two datapaths that were assigned to the same cluster, in order to reduce the cost of inserting copy operations between them.
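The second phase, summing the weighted partial tables and picking bypass pairs by priority, might look like this (the 10**NL weight is an assumed heuristic, exposed as a parameter because the paper's exact weighting function is not recoverable from the extraction):

```python
from collections import Counter


def hardwired_table(partial_tables, weight=lambda nl: 10 ** nl):
    """Sum the per-loop partial tables into one hardwired table.
    `partial_tables` is a list of (nesting_level, {(dp_i, dp_j): deps})
    tuples; deeper loops contribute more via `weight` (assumed)."""
    total = Counter()
    for nesting_level, table in partial_tables:
        for pair, deps in table.items():
            total[pair] += weight(nesting_level) * deps
    return total


def bypass_priority(table, max_bypasses):
    """Sort datapath pairs by communication volume and keep the
    top `max_bypasses` pairs for bypass insertion."""
    ranked = sorted(table.items(), key=lambda kv: -kv[1])
    return [pair for pair, _ in ranked[:max_bypasses]]
```

A pair that communicates lightly but inside a deeply nested loop can thus outrank a heavier pair in straight-line code, which matches the intent of weighting by nesting level.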
Some of the data dependencies between these datapaths will be resolved by the common register file in their cluster, but many short-distance dependencies can be satisfied by adding a bypass line between them.

7 EXPERIMENTAL RESULTS
The approach described above was implemented in the IMPACT infrastructure, and the resulting compiler was used to compile eighteen programs from the SPEC CINT95 (6) and CFP95 (5) benchmarks, IMPACT (5) and Miscellaneous (2), as shown in Table 1. The compiler was executed using superblock formation, maximum unrolling and no predication. In our experiments we estimate the number of copy operations and cycles produced by each program across a large number of architecture configurations. Each configuration corresponds to a different combination of the following parameters: (a) the number of FUs (from 1 to 16); (b) the number of register file clusters (from 1 to the number of FUs); (c) the number of bypass interconnections (from 0 to the number of FUs). For the sake of simplicity we adopted homogeneous clusters, i.e., all clusters (CLs) have the same number of FUs. The goal of the experimental work was to determine the impact of the techniques described in Sections 4, 5 and 6. The experiments were divided into three parts. First, we evaluated the impact of bypass insertion on the cycle count of the programs. In the second part, we studied how cluster partitioning and bypassing affect the number of copy instructions between clusters. In the last set of experiments we evaluated the impact of the scheduling and mapping algorithms.

Table 1: Benchmark Programs

  CINT95         CFP95         IMPACT      MISC
  099.go         101.tomcatv   fir         mpegdec
  124.m88ksim    102.swim      kalman      mpegenc
  129.compress   104.su2cor    paraffins
  130.li         107.mgrid     dag
  132.ijpeg      145.turb3d    eight
  147.vortex

The maximum number of bypass interconnections is given by n(n-1)/2, where n is the number of FUs for that configuration. Figures 7 and 8 show the impact, on programs go and su2cor, of adding from zero bypasses up to full bypassing between all FUs (120 for 16 FUs). All architecture configurations considered in the following analysis have 16 FUs, and range from 1 to 8 clusters. For program go (Figure 7), we noticed that most of the speed-up was achieved with 8 bypasses (65% for one cluster and 58% for 8 clusters). Only a small additional difference was noticed when using 16 or more bypasses (7% for one cluster and 6% for 8 clusters, when 16 bypasses are used). For program su2cor (Figure 8) we faced a more complex tradeoff. In the first knee of the curve (left side of the figure), the speed-up was 6% (one cluster) and 75% (8 clusters). This value increases very slowly: adding further bypasses improves the speed-up by only 6% (one cluster) and 7% (8 clusters). Thus, since the marginal speed-up decreases almost exponentially with the number of bypasses, we restrict the maximum number of bypasses to the number of FUs.

Figure 7: Impact of adding bypasses for program go (16 FUs): cycle count versus number of bypasses, for 1 and 8 clusters.

Figure 8: Impact of adding bypasses for program su2cor (16 FUs): cycle count versus number of bypasses, for 1 and 8 clusters.

In the second part of our experimental work, we studied the impact of clustering and bypassing on the number of inter-cluster copy instructions. Consider, for example, the graphs in Figures 9a-b, where the number of inter-cluster copy operations for program su2cor is measured using two architectures with 8 and 16 FUs. In those graphs, the vertical (horizontal) axis represents the number of copy operations (clusters). The curves take as parameter the number of bypass lines inserted between FUs. The number of copy operations grows with the number of clusters, as expected, given the increase in inter-cluster communication. Nevertheless, notice that as bypass lines are added to the architecture, many copy operations are wiped out of the program.

Table 2: Number of copies for programs su2cor and go as a function of the number of bypasses.

In Table 2, a single bypass line removes more than 5% of all copy operations. If the same program
Figure 9: Number of copy operations for program su2cor in: (a) an 8-FU architecture; (b) a 16-FU architecture; (c) percentage of copy operations removed from su2cor through bypassing, as a function of the number of clusters per FU.

Figure 10: Number of copy operations for program go in: (a) an 8-FU architecture; (b) a 16-FU architecture; (c) percentage of copy operations removed from go through bypassing, as a function of the number of clusters per FU.
Figure 11: Cycle count for program su2cor in non-clustered and clustered configurations with no bypasses.

Figure 12: Impact on the cycle count for program su2cor due to maximum bypassing.

Figure 13: Cycle count for program go in non-clustered and clustered configurations with no bypasses.

Figure 14: Impact on the cycle count for program go due to maximum bypassing.

runs on an 8 FUs/8 CLs configuration (Figure 9a), 8 bypasses are required to reduce the number of copy operations by 8%. As more bypasses are added, the law of diminishing returns settles in and the gain saturates. In Figure 9b, for 16 FUs, 8 bypasses remove a substantial fraction of the copy operations, and 16 bypasses remove still more. In general, we noticed, across all SPEC programs, that the insertion of a single bypass already removes a significant share of the total copy operations. Notice, for example, that program go repeats the same behavior (Figure 10a-b) as su2cor.

The bypass effectiveness when all configurations run program su2cor is described in Figure 9c. The vertical axis in that graph shows the percentage of copy operations removed with respect to a bypassing-free configuration. The horizontal axis represents the ratio CLs/FUs; notice that it ranges up to a maximum of one, since the number of CLs is at most the number of FUs. As mentioned before, increasing the number of bypasses implies a reduction in the number of copy operations; a large share of the copy operations is removed once bypasses are inserted. Interestingly, the percentage of copy operations removed by bypasses seems to saturate when the ratio CLs/FUs is around 0.5, i.e., when each cluster has two FUs. The same pattern was also detected in the majority of the combinations of programs and architecture configurations (e.g., program go in Figure 10c). We believe this might have to do with the way data dependencies are uniformly partitioned across clusters by the binary recursive algorithm described in Section 5. Further experimental work will be required to clarify this finding.

In the third part of our experiments, we studied the impact of the scheduling and clustering algorithms. We plotted, for all programs, the cycle count as a function of the number of FUs and clusters. In order to filter out the effect of bypassing, we considered only bypassing-free configurations. As shown in Figure 11, for program su2cor, the performance of the clustered architectures follows very closely the performance of the non-clustered architecture (1 cluster, no bypassing). In other words, our approach is capable of canceling the negative effects of clustering, namely the overhead of inter-cluster copy instructions. The same can be seen in Figure 13 for program go. The effects of the bypass lines on the cycle count for programs su2cor and go are shown in Figures 12 and 14. The
best combination of clusters and number of bypass lines was used for each architecture. Surprisingly, the large gains achieved in reducing the number of copy operations in Figures 9 and 10 did not translate into real performance: in general, benchmark program speed-ups due to bypassing remained small, in the single-digit percentage range. Although it is not clear why, it might be that the combination of the cluster partitioning and scheduling algorithms leaves only a small number of copy operations in the code for bypassing. Yet another explanation can be drawn from this finding: not enough ILP is available in the unrolled loop bodies. In this case, the generated instructions could have enough empty slots to hide the latency of most copy operations. Further experimental work will be required to address this issue. Notice that cycle count does not take into consideration the benefits achieved by clustering, i.e., a smaller register file and reduced latency. If the register file determines the cycle time of the processor, the curves representing clustered architectures in Figures 11 and 13 will reveal a performance improvement proportional to the reduction in register file latency. Otherwise, by using our technique, the same performance level of a non-clustered architecture is achieved at a smaller processor cost.

8 CONCLUSIONS AND FUTURE WORK
This paper presents a scheduling and partitioning algorithm for clustered VLIW architectures aimed at reducing the communication cost between datapaths and clusters. This is achieved by assigning highly communicating datapaths to the same register file, while tailoring bypass interconnections to the application. Preliminary experimental results reveal a substantial reduction in the number of inter-cluster copy operations and a potential performance improvement. As the next steps in this project we are considering: (a) using the data-dependency distance between scheduled operations to improve the communication cost estimate; (b) inserting delay registers into bypass lines to resolve long-distance data dependencies.

9 ACKNOWLEDGMENTS
This work was partially supported by research grants from the CNPq/NSF Collaborative Research Project 6859/99-7, CNPq research grant 56/97-9, and fellowship research awards from CAPES and FAPESP. We also thank the reviewers for their comments.

REFERENCES
[1] A. Abnous and N. Bagherzadeh. Pipelining and bypassing in a VLIW processor. IEEE Trans. on Parallel and Distributed Systems, 5(6), June 1994.
[2] A. Abnous and N. Bagherzadeh. Architectural design and analysis of a VLIW processor. International Journal of Computers and Electrical Engineering, 1995.
[3] P. S. Ahuja, D. W. Clark, and A. Rogers. The performance impact of incomplete bypassing in processor pipelines. In MICRO-28, 1995.
[4] A. Capitanio, N. Dutt, and A. Nicolau. Design considerations for limited connectivity VLIW architectures. Technical Report TR-92-59, University of California, Irvine, CA, 1992.
[5] A. Capitanio, N. Dutt, and A. Nicolau. Partitioned register files for VLIWs: A preliminary analysis of tradeoffs. In 25th International Symposium on Microarchitecture (MICRO), 1992.
[6] J. R. Ellis. Bulldog: A Compiler for VLIW Architectures. MIT Press, 1986.
[7] P. Faraboschi, G. Desoli, and J. A. Fisher. Clustered instruction-level parallel processors. Technical Report HPL-98-, HP Labs, USA, 1998.
[8] M. M. Fernandes, J. Llosa, and N. Topham. Partitioned schedules for clustered VLIW architectures. In IEEE/ACM International Parallel Processing Symposium, 1998.
[9] J. A. Fisher. Trace scheduling: A technique for global microcode compaction. IEEE Trans. on Computers, C-30(7), July 1981.
[10] W. W. Hwu et al. IMPACT advanced compiler technology.
[11] M. F. Jacome, G. de Veciana, and V. Lapinskii. Exploring performance tradeoffs for clustered VLIW ASIPs. In International Conference on Computer-Aided Design, 2000.
[12] C. Lee, C. Park, and M. Kim. Efficient algorithm for graph partitioning problem using a problem transformation method. Computer-Aided Design, December 1989.
[13] S. S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.
[14] E. Ozer, S. Banerjia, and T. M. Conte. Unified assign and schedule: A new approach to scheduling for clustered register file microarchitectures. In 31st International Symposium on Microarchitecture (MICRO), 1998.
[15] E. Ozer and T. M. Conte. Optimal cluster scheduling for a VLIW machine. Technical report, Dept. of Elec. and Comp. Eng., North Carolina State University, 1998.
[16] E. Ozer and T. M. Conte. Unified cluster assignment and instruction scheduling for clustered VLIW microarchitectures. Technical report, Dept. of Elec. and Comp. Eng., North Carolina State University, 1998.
[17] V. Kathail, B. R. Rau, and S. Aditya. Machine-description driven compilers for EPIC and VLIW processors. Design Automation for Embedded Systems, 1999.
[18] J. Sanchez and A. Gonzalez. The effectiveness of loop unrolling for modulo scheduling in clustered VLIW architectures. In Intl. Conference on Parallel Processing (ICPP), 2000.
[19] J. Sanchez and A. Gonzalez. Instruction scheduling for clustered VLIW architectures. In Intl. Symposium on System Synthesis (ISSS).

(Remember that predication is not used.)