2D-VLIW: An Architecture Based on the Geometry of Computation


Ricardo Santos (1,2), Rodolfo Azevedo (2), Guido Araujo (2)
(1) Computer Engineering Department, Dom Bosco Catholic University, Campo Grande, MS, Brazil
(2) Institute of Computing, State University of Campinas, Campinas, SP, Brazil
{ricrs, rodolfo, guido}@ic.unicamp.br

Abstract

This work proposes a new architecture and execution model called 2D-VLIW. The architecture adopts an execution model based on large pieces of computation running over a matrix of functional units connected through a set of local registers spread across the matrix. Experiments with the MediaBench and SPECint2000 programs and the Trimaran compiler show performance gains ranging from 5% to 63% when comparing our proposal to an EPIC architecture with the same number of registers and functional units. We also show that the g721_enc program running on a 2D-VLIW 3x3 matrix had a speedup of 3.7 over a 2x2 matrix, while the same program on an EPIC processor with 9 functional units had a speedup of 2 over an EPIC processor with 4 functional units. For some internal procedures of the MediaBench and SPECint2000 programs, the average 2D-VLIW OPC (operations per cycle) was up to 10 times greater than that of the equivalent EPIC processor.

1 Introduction

For many years the constant increase in processor performance came from advances in VLSI process technology. However, recent announcements of technology limits imposed by the thermal barrier have motivated research into innovative ways to sustain the increase in performance. One approach used by modern computer architectures to achieve performance, independent of the implementation technology, is the replication of hardware resources to exploit the parallelism available inside programs. Two architectural design styles commonly used to exploit parallelism are VLIW (Very Long Instruction Word) [4] and superscalar [12]. Despite all their advantages and disadvantages, the major problem with VLIW and superscalar architectures is related to their inability to
capture and scale parallelism. The performance study in [7] suggests that a compiler can locate more parallelism by looking at larger pieces of the application code, though it is essential that the hardware provide a suitable execution model. Moreover, previous work [5, 7, 14] as well as state-of-the-art architectures [2, 8, 13] have pointed toward architectures that adopt new strategies for code optimization and new execution models. Advanced compiler optimizations could look at large code regions to find greater levels of parallelism, and the execution models could provide new resource arrangements to execute those large code regions, thus increasing performance.

In this paper, we propose a new architecture and execution model whose main characteristic is the mapping of large pieces of computation onto a matrix of functional units. This mapping brings attractive performance to applications with high ILP, such as embedded and some general-purpose applications. Our architecture is named 2D-VLIW [9, 10] because large instructions, comprised of single operations, are fetched from memory and executed on a (two-dimensional) matrix of functional units through a pipeline. In order to evaluate the real performance gains of the 2D-VLIW architecture, we compared it to an EPIC-based processor [11] using the MediaBench and SPECint2000 suites compiled with the Trimaran compiler. We assume that an EPIC processor has more flexibility than a pure VLIW processor, in terms of both the operations (dependent and parallel) allowed within an instruction and hardware complexity. The experiments revealed speedups ranging from 5% to 63% over the EPIC processor and, for some programs, an average 2D-VLIW OPC (operations per cycle) up to 10 times greater than that of the EPIC processor.

This paper is structured as follows. Section 2 presents the 2D-VLIW architecture and execution model. Section 3 shows the performance results obtained by comparing 2D-VLIW to an EPIC processor. Section 4 discusses current architectures related to our approach. Conclusions and future work are presented in Section 5.

Application-specific Systems, Architectures and Processors (ASAP'06)

2 The 2D-VLIW Architecture and Execution Model

Like traditional pipelined architectures, the 2D-VLIW datapath has stages to fetch, decode and execute instructions. In 2D-VLIW, execution is performed by a matrix of functional units, where each column represents one execution sub-stage of the pipeline. For example, a 6x6 matrix has 6 sub-stages: EX1, EX2, ..., EX6. A detailed view of the 2D-VLIW datapath is presented next.

2.1 The 2D-VLIW Architecture

The 2D-VLIW architecture exploits instruction-level parallelism through a pipelined datapath and a two-dimensional VLIW execution model. The architecture fetches large instructions from memory, just as VLIW and EPIC architectures do. A 2D-VLIW instruction is comprised of single operations executed by a set of functional units (FUs) organized as a matrix, and the maximum number of operations in one instruction equals the number of functional units in the matrix. Figure 1 shows a simplified overview of a 4x4 2D-VLIW datapath.

Figure 1. A simplified overview of the 2D-VLIW datapath: IF stage, IF/ID pipeline register, control unit, register file bank, ID stage, ID/EX pipeline register, EX pipeline registers and the functional unit matrix.

In Figure 1, four operations are executed by 4 FUs at each execution stage EX1, EX2, EX3, EX4 of the pipeline. Notice that, through the interconnection network, each FU in column n, n < 4, can provide operands to two FUs in the following column (n+1). Figure 2 shows all logic blocks and signals inside an FU. Results from an FU may be written either into the Temp Register File (TRF) or into the Global Register File (GRF). The TRF is a small register bank with 2 local registers dedicated to each FU. The result of an operation is always written into an internal register called the FU Register. By using the TRF and the FU Register, a result from
an operation in cycle i will be available to three other operations in the next cycle (i+1): two operations can use the result through the Temp Register File, while a third uses it from the FU Register. Input operands come from three possible sources: the global register file, the temporary register files (through the interconnection network), or the FU itself (through the FU Register). The SelOpnd1, SelOpnd2 and SelOperation input selection signals come from the pipeline registers.

Figure 2. The 2D-VLIW functional unit: operand inputs, the SelOpnd1/SelOpnd2/SelOperation selection signals, and the connections to the Global Register File and the Temp Register File.

The TRFs in each FU are a viable alternative to minimize the pressure on the global register file. This is achieved by storing temporary values (of the DAGs) in the TRFs: the DAG leaves read values from the global register file, the operations inside a DAG read and write values from/to the temporary registers, and the DAG root nodes write values into the global register file. Figure 3 illustrates this strategy with a DAG of 4 operations. The load operation uses one value from the global register file and needs to store its result in a register that will be used by two other operations. The results of these two operations are the inputs of an add, which stores the result of the addition into the global register R4. The DAG in Figure 3(a) requires two read ports and two write ports to access GRF registers. Figure 3(b) shows how this DAG can be allocated in the 2D-VLIW architecture: the results of the two intermediate operations are stored in TRFs. Figure 3(c) shows the mapping of this DAG onto the FU matrix.

In general, the area upper bound of a standard register file is O(n^2) [1], whereas the latency upper bound is O(log m), where n is the total number of read and write ports and m is the number of read ports. Notice that the number of read ports has an impact on the area and latency of the register

file.

Figure 3. Example of a DAG using TRFs: (a) one DAG using GRF registers; (b) the same DAG using TRFs; (c) the DAG in (b) mapped onto the FU matrix.

Multiple-issue architectures have the number of read ports bounded by O(k) (usually 2k), where k is the number of functional units in the architecture. By adopting TRFs, and assuming that only the leaves/roots of the DAGs need to read/write from/to the global register file, the number of GRF read ports, and hence the latency upper bound, can be asymptotically reduced: the new bound is O(√k) read ports (usually 2√k). For example, a 16-FU EPIC architecture has 16 x 2 = 32 global register file read ports, while the equivalent 2D-VLIW (a 4x4 matrix) has only 8 read ports, which leads to an area 4 times smaller and a latency reduction of 40%, given that the same number of write ports is available in both architectures.

2.2 The 2D-VLIW Execution Model

As mentioned before, program execution over the 2D-VLIW matrix is pipelined. At each clock cycle, one 2D-VLIW instruction is fetched from memory and pushed into the pipeline stages. In the execution stages, the operations of this instruction are executed according to the number of FUs in each column. Assuming the architecture has 16 functional units organized as a 4x4 matrix, a 2D-VLIW instruction can also be represented as a 4x4 operation matrix comprised of 16 operations.

Take for example the DAGs in Figure 4, extracted from the 181.mcf program. This program is part of the SPECint2000 benchmark and was compiled by Trimaran with the hyperblock option on. The operations in these DAGs correspond to a subset of the HPL-PD [6] ISA. Figure 4(a) shows several DAGs from 181.mcf (taken from an unrolled inner loop) and Figure 4(b) shows the organization of these DAGs into two 2D-VLIW instructions, A and B. These instructions are represented as matrices of operations in which operations with Read-After-Write (RAW) dependencies are placed in two different
rows, while independent operations are placed in the same row.

Figure 4. DAGs from the 181.mcf program and their equivalent 2D-VLIW instructions.

Figure 5 shows the execution of instructions A and B from Figure 4(b) on the 2D-VLIW datapath. Since the instruction fetch and decode stages work as in a standard processor, we start at the EX1 execution stage. After the ID/EX pipeline register has been filled in, execution starts over the matrix. Figure 5(a) depicts the first execution cycle on the FU matrix: the first column receives data from the ID/EX pipeline register, and the four functional units in the first column execute the operations of row A1 (instruction A). The dashed arrows indicate which FUs receive the results of these operations; naturally, the consumer FUs are limited by the interconnection network. At the second execution cycle, Figure 5(b), the operations of A2 are executed on the second column while the operations of B1 start on the first column. At the third execution cycle, Figure 5(c), the EX2/EX3 pipeline register carries the information to execute the operations of A3, the FUs in the second column execute the operations of B2, and the first row (C1) of a third 2D-VLIW instruction C starts its execution on the matrix. At the fourth execution cycle, Figure 5(d), the operations of A4 are executed on the fourth column, the operations of B3 on the third column, the operations of C2 on the second column, and a fourth instruction D starts its execution. The last cycle is shown in Figure 5(e), where the four operations of B4 are executed and instruction A has already finished. Following the pipelined execution, at the fourth execution cycle the matrix is completely filled with operations from four different 2D-VLIW instructions, as indicated by the first, second, third and fourth columns (highlighted differently) in Figure 5(d). After the fourth execution cycle, one 2D-VLIW instruction finishes at every cycle.
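The staggered flow just described can be captured in a small illustrative model (a sketch written for this presentation, not the authors' simulator): row r of instruction i executes on column r at cycle i + r, so instruction i completes at cycle i + (columns - 1), and once the matrix fills, one instruction finishes per cycle.

```python
# Illustrative model of the 2D-VLIW execution stages (hypothetical sketch,
# not the authors' simulator). Row r (0-based) of instruction i executes
# on column r of the FU matrix at cycle i + r.

def occupancy(cycle, n_cols, n_instr):
    """Map each busy column to the (instruction, row) it executes at `cycle`."""
    busy = {}
    for col in range(n_cols):
        instr = cycle - col           # instruction whose turn it is on this column
        if 0 <= instr < n_instr:
            busy[col] = (instr, col)  # row index equals the column index
    return busy

N_COLS, N_INSTR = 4, 5                # 4x4 matrix, instructions A, B, C, D, E
for cycle in range(N_COLS + N_INSTR - 1):
    labels = [f"{chr(ord('A') + i)}{r + 1}"
              for i, r in occupancy(cycle, N_COLS, N_INSTR).values()]
    print(f"cycle {cycle}: {labels}")
# cycle 3 prints ['D1', 'C2', 'B3', 'A4']: the matrix is full, matching
# Figure 5(d), and from then on one instruction completes every cycle.
```

In this model the completion cycle of instruction i is i + 3, so consecutive instructions finish on consecutive cycles after the four-cycle fill phase, which is exactly the steady-state throughput claimed above.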

Figure 5. Execution stages on the 2D-VLIW architecture.

Notice also that the operations executed in Figures 5(a) through 5(e) are exactly all the operations of the 2D-VLIW instructions A and B in Figure 4(b).

3 Experiments and Results

In this section we present performance results from a comparison between 2D-VLIW and the HPL-PD [6] processor, which is based on the EPIC architecture. The results were obtained by simulating program execution with the Trimaran simulation tool. We simulated MediaBench and SPECint2000 programs on 2D-VLIW and HPL-PD models described in the HMDES language [3]. All programs were compiled by Trimaran with the hyperblock, pre-pass and post-pass scheduling options on. In the experiments, we adopted 4, 9, 16 and 30 functional units, the same instruction set, the same operation latencies (including load and store operations) and the same number of registers for both architectures. The functional units were designed to execute all operations of the instruction set. The GRF has 64 registers, and each TRF has 2 local registers. The 2D-VLIW matrix was organized as a square matrix when using 4, 9 and 16 FUs; for 30 FUs it was organized as a 3x10 matrix, where the first dimension is the number of rows and the second the number of columns. The speedup was obtained by dividing the number of clock cycles of a base machine by the clock cycles taken by the current configuration.

Figure 6 shows the comparison between 2D-VLIW and HPL-PD using 10 programs: 4 MediaBench programs (epic, g721_dec, g721_enc, gsm_dec) and 6 SPECint2000 programs (175.vpr, 181.mcf, 197.parser, 255.vortex, 256.bzip2, 300.twolf). Each bar represents the speedup of a 2D-VLIW configuration (2x2, 3x3, 4x4 or 3x10) over an HPL-PD machine with the same number of FUs. Notice that all 2D-VLIW configurations attained speedups over their equivalent HPL-PD.

Figure 6. Performance comparison between
2D-VLIW and HPL-PD.

After an analysis of the code generated for both architectures, we found that the 2D-VLIW speedup was mainly due to the arrangement of the architectural elements. The adoption of local registers made it possible to handle Write-After-Read (WAR) dependencies through the matrix. This situation is illustrated in Figure 7. The code fragment follows the ELCOR-IR [3] syntax and was taken from the 255.vortex program after the scheduling and register allocation phases. Notice that the operation s_w in Figure 7(a) is executed before the operation add_w (s_time(7) and s_time(8), respectively). In the 2D-VLIW case, Figure 7(b), these operations are executed simultaneously (s_time(7)): the 2D-VLIW allocator assigns a local register, instead of a global register, to the destination operand of the add_w operation. When

this operation is executed, it does not write to the global register; it therefore executes in parallel with s_w, eliminating one extra execution cycle.

Figure 7. Trimaran's scheduling for a code fragment on HPL-PD and 2D-VLIW:

(a) HPL-PD scheduling:
    s_w   []        [<gpr 1> <gpr 55>]  s_time(7)
    add_w [<gpr 1>] [<sp> < 2520>]      s_time(8)

(b) 2D-VLIW scheduling:
    s_w   []        [<gpr 1> <gpr 55>]  s_time(7)
    add_w [<trf 1>] [<sp> < 2520>]      s_time(7)

The next experiment examines the scalability of 2D-VLIW and HPL-PD. Figure 8(a) depicts the speedup obtained by three 2D-VLIW configurations over a basic model consisting of 4 functional units; Figure 8(b) shows the same information for the HPL-PD configurations. In Figure 8(a), although 2D-VLIW did not achieve any speedup for 175.vpr, an incremental speedup (scalability) was obtained for g721_dec, g721_enc, gsm_dec, 255.vortex, 256.bzip2 and 300.twolf. The results in Figure 8(b) show that HPL-PD achieved scalability for the g721_dec, g721_enc, gsm_dec and 255.vortex programs, but two programs, 175.vpr and 300.twolf, had a performance decrease.

Figure 8. Scalability of the 2D-VLIW and HPL-PD architectures: (a) 2D-VLIW scalability; (b) HPL-PD scalability.

Table 1 presents another result from our experiments. Each row shows one program, one of its procedures, and the OPC for each 2D-VLIW and HPL-PD configuration. These values indicate how well 2D-VLIW scales as the number of functional units increases. Numbers in parentheses are the results obtained by HPL-PD for the same programs and procedures. Looking at the 181.mcf program, we can notice that 2D-VLIW achieved an OPC 10 times greater than HPL-PD.

Table 1. 2D-VLIW and HPL-PD OPC (HPL-PD values in parentheses).

Program (procedure)               3x3 (9 FUs)    4x4 (16 FUs)   3x10 (30 FUs)
181.mcf (primal_artificial)       2.5  (0.3)     5.2  (0.4)     4.23 (0.4)
256.bzip2 (main)                  2.05 (0.73)    2.99 (0.73)    3.04 (0.63)
gsm-decode (Autocorrelation)      1.27 (0.99)    2.22 (1.20)    2.35 (1.2)

4 Related Work

The RAW architecture [13] is a set of interconnected tiles organized as a 4x4 matrix. Each tile contains an 8-stage MIPS-style pipeline, a 4-stage pipelined FPU, a 32 KB data cache, and a switch that supports static and dynamic inter-tile communication. One remarkable difference between 2D-VLIW and RAW concerns the granularity of computation: RAW can map a whole kernel onto one processing tile, whereas 2D-VLIW maps DAGs onto the FU matrix.

The TRIPS architecture [2] is composed of a 4x4 array of functional units (FUs). There are four register file banks along the top of the array, and four instruction and data cache banks along its right side. The compiler builds 128-instruction blocks organized into groups of eight instructions per FU. Experiments showed IPCs of about 1 and 2 for integer and floating-point programs, respectively. Conversely to 2D-VLIW, the TRIPS execution model is based on dataflow ordering, with each operation executed as soon as its operands are available. Moreover, TRIPS uses block-atomic execution, in which all operations of a block are either entirely committed or rolled back.
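As a side note to the results above, the two figures of merit used throughout Section 3 reduce to one-line formulas. The sketch below restates them in Python; the numbers are made-up placeholders for illustration, not measurements from this paper.

```python
# Metrics from Section 3, restated (placeholder numbers, not paper data):
#   speedup = cycles on the base machine / cycles on the current configuration
#   OPC     = operations executed / cycles taken

def speedup(base_cycles, config_cycles):
    return base_cycles / config_cycles

def opc(operations, cycles):
    return operations / cycles

# Example: a program taking 1_000_000 cycles on the 4-FU base machine and
# 400_000 cycles on a hypothetical 4x4 configuration.
print(speedup(1_000_000, 400_000))   # 2.5
print(opc(2_000_000, 400_000))       # 5.0 operations per cycle
```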

ADRES [8] is a coarse-grained, dynamically reconfigurable architecture comprised of a tightly coupled VLIW processor and a matrix of reconfigurable functional units. These functional units have register files to keep intermediate computation values. The compiler identifies computation-intensive kernels inside the program and maps these kernels onto the reconfigurable matrix. Experiments on this architecture achieved IPCs from 1.9 to 4.2. ADRES and 2D-VLIW have different execution models: in ADRES, an instruction is executed either by the VLIW processor or by the reconfigurable matrix, whereas 2D-VLIW has a unified execution model in which every instruction crosses all stages. This unified view improves resource sharing and increases ILP.

5 Conclusions and Future Work

A new multiple-issue architecture called 2D-VLIW was presented in this paper. The architecture aims at providing architectural resources that match the topology of high-performance applications in domains like multimedia, communications and some general-purpose integer computing. 2D-VLIW has a matrix of functional units with local registers spread across the matrix. The execution model consists of large instructions brought from memory and executed on the matrix. The local registers are used to keep temporary values and to reduce the pressure on the global register file. Data transport inside the FU matrix is performed through a narrow interconnection network which provides a fast communication mechanism between functional units in consecutive stages.

The performance results in Section 3 show that 2D-VLIW outperformed the EPIC-based processor (HPL-PD) on all MediaBench and SPECint2000 programs. We defined four FU configurations (4, 9, 16 and 30 FUs) and measured 2D-VLIW and HPL-PD performance in all of them. Our experimental results show that the combination of architectural arrangement, execution model and compiler assistance can achieve appealing performance for embedded applications. The compiler plays an important role in the
2D-VLIW architecture: it is responsible for recognizing large pieces of computation, mapping their operations onto the matrix topology, and using local registers to keep temporary values. We have currently started the implementation of two new scheduling and register allocation algorithms for 2D-VLIW.

References

[1] A. Abnous and N. Bagherzadeh. Architectural Design and Analysis of a VLIW Processor. International Journal of Computers and Electrical Engineering, 21(2), 1995.
[2] D. Burger, S. W. Keckler, K. S. McKinley, M. Dahlin, L. K. John, C. Lin, C. R. Moore, J. Burrill, R. G. McDonald, and W. Yoder. Scaling to the End of Silicon with EDGE Architectures. IEEE Computer, 37(7):44-55, 2004.
[3] L. N. Chakrapani, J. Gyllenhaal, W.-m. W. Hwu, S. A. Mahlke, K. V. Palem, and R. M. Rabbah. Trimaran: An Infrastructure for Research in Instruction-Level Parallelism. Lecture Notes in Computer Science, 3602:32-41, 2004.
[4] J. A. Fisher. Very Long Instruction Word Architectures and the ELI-512. In Proceedings of the 10th International Symposium on Computer Architecture (ISCA). IEEE Computer Society, 1983.
[5] N. P. Jouppi. Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM Press, 1989.
[6] V. Kathail, M. S. Schlansker, and B. R. Rau. HPL-PD Architecture Specification. Technical Report HPL-93-80, Hewlett-Packard Laboratories, Palo Alto, 2000.
[7] M. S. Lam and R. P. Wilson. Limits of Control Flow on Parallelism. In Proceedings of the 19th ISCA. ACM Press, 1992.
[8] B. Mei, A. Lambrechts, J.-Y. Mignolet, D. Verkest, and R. Lauwereins. Architecture Exploration for a Reconfigurable Architecture Template. IEEE Design & Test of Computers, 22(2):90-101, 2005.
[9] R. Santos, R. Azevedo, and G. Araujo. Exploiting Dynamic Reconfiguration Techniques: The 2D-VLIW Approach. In Proceedings of the 13th IEEE International Workshop on Reconfigurable Architectures, Rhodes Island, Greece. IEEE Computer Society, 2006.
[10] R. Santos, R. Azevedo, and G. Araujo. The 2D-VLIW Architecture. Technical Report IC, Institute of Computing, 2006.
[11] M. S. Schlansker and B. R. Rau. EPIC: Explicitly Parallel Instruction Computing. IEEE Computer, 33(2):37-45, 2000.
[12] M. D. Smith, M. Johnson, and M. A. Horowitz. Limits on Multiple Instruction Issue. In Proceedings of the 3rd ASPLOS. ACM Press, 1989.
[13] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal. Baring It All to Software: RAW Machines. IEEE Computer, 30(9):86-93, 1997.
[14] D. W. Wall. Limits of Instruction-Level Parallelism. In Proceedings of the 4th ASPLOS. ACM Press, 1991.


More information

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume VII /Issue 2 / OCT 2016

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume VII /Issue 2 / OCT 2016 NEW VLSI ARCHITECTURE FOR EXPLOITING CARRY- SAVE ARITHMETIC USING VERILOG HDL B.Anusha 1 Ch.Ramesh 2 shivajeehul@gmail.com 1 chintala12271@rediffmail.com 2 1 PG Scholar, Dept of ECE, Ganapathy Engineering

More information

Computer Systems Architecture Spring 2016

Computer Systems Architecture Spring 2016 Computer Systems Architecture Spring 2016 Lecture 01: Introduction Shuai Wang Department of Computer Science and Technology Nanjing University [Adapted from Computer Architecture: A Quantitative Approach,

More information

Software-Only Value Speculation Scheduling

Software-Only Value Speculation Scheduling Software-Only Value Speculation Scheduling Chao-ying Fu Matthew D. Jennings Sergei Y. Larin Thomas M. Conte Abstract Department of Electrical and Computer Engineering North Carolina State University Raleigh,

More information

Full Datapath. Chapter 4 The Processor 2

Full Datapath. Chapter 4 The Processor 2 Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

COPROCESSOR APPROACH TO ACCELERATING MULTIMEDIA APPLICATION [CLAUDIO BRUNELLI, JARI NURMI ] Processor Design

COPROCESSOR APPROACH TO ACCELERATING MULTIMEDIA APPLICATION [CLAUDIO BRUNELLI, JARI NURMI ] Processor Design COPROCESSOR APPROACH TO ACCELERATING MULTIMEDIA APPLICATION [CLAUDIO BRUNELLI, JARI NURMI ] Processor Design Lecture Objectives Background Need for Accelerator Accelerators and different type of parallelizm

More information

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 21: Superscalar Processing Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due November 10 Homework 4 Out today Due November 15

More information

Register-Sensitive Software Pipelining

Register-Sensitive Software Pipelining Regier-Sensitive Software Pipelining Amod K. Dani V. Janaki Ramanan R. Govindarajan Veritas Software India Pvt. Ltd. Supercomputer Edn. & Res. Centre Supercomputer Edn. & Res. Centre 1179/3, Shivajinagar

More information

Full Datapath. Chapter 4 The Processor 2

Full Datapath. Chapter 4 The Processor 2 Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

More information

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering

More information

Banked Multiported Register Files for High-Frequency Superscalar Microprocessors

Banked Multiported Register Files for High-Frequency Superscalar Microprocessors Banked Multiported Register Files for High-Frequency Superscalar Microprocessors Jessica H. T seng and Krste Asanoviü MIT Laboratory for Computer Science, Cambridge, MA 02139, USA ISCA2003 1 Motivation

More information

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017! Advanced Topics on Heterogeneous System Architectures Pipelining! Politecnico di Milano! Seminar Room @ DEIB! 30 November, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2 Outline!

More information

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) 18-447 Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/13/2015 Agenda for Today & Next Few Lectures

More information

Beyond Pipelining. CP-226: Computer Architecture. Lecture 23 (19 April 2013) CADSL

Beyond Pipelining. CP-226: Computer Architecture. Lecture 23 (19 April 2013) CADSL Beyond Pipelining Virendra Singh Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay http://www.ee.iitb.ac.in/~viren/

More information

CS 351 Final Exam Solutions

CS 351 Final Exam Solutions CS 351 Final Exam Solutions Notes: You must explain your answers to receive partial credit. You will lose points for incorrect extraneous information, even if the answer is otherwise correct. Question

More information

A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors

A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors Murali Jayapala 1, Francisco Barat 1, Pieter Op de Beeck 1, Francky Catthoor 2, Geert Deconinck 1 and Henk Corporaal

More information

Dynamic Scheduling. CSE471 Susan Eggers 1

Dynamic Scheduling. CSE471 Susan Eggers 1 Dynamic Scheduling Why go out of style? expensive hardware for the time (actually, still is, relatively) register files grew so less register pressure early RISCs had lower CPIs Why come back? higher chip

More information

5008: Computer Architecture

5008: Computer Architecture 5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage

More information

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011 5-740/8-740 Computer Architecture Lecture 0: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Fall 20, 0/3/20 Review: Solutions to Enable Precise Exceptions Reorder buffer History buffer

More information

CS 152 Computer Architecture and Engineering. Lecture 13 - VLIW Machines and Statically Scheduled ILP

CS 152 Computer Architecture and Engineering. Lecture 13 - VLIW Machines and Statically Scheduled ILP CS 152 Computer Architecture and Engineering Lecture 13 - VLIW Machines and Statically Scheduled ILP Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste!

More information

Creating a Scalable Microprocessor:

Creating a Scalable Microprocessor: Creating a Scalable Microprocessor: A 16-issue Multiple-Program-Counter Microprocessor With Point-to-Point Scalar Operand Network Michael Bedford Taylor J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B.

More information

Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1])

Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1]) EE392C: Advanced Topics in Computer Architecture Lecture #10 Polymorphic Processors Stanford University Thursday, 8 May 2003 Virtual Machines Lecture #10: Thursday, 1 May 2003 Lecturer: Jayanth Gummaraju,

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

Lecture 13 - VLIW Machines and Statically Scheduled ILP

Lecture 13 - VLIW Machines and Statically Scheduled ILP CS 152 Computer Architecture and Engineering Lecture 13 - VLIW Machines and Statically Scheduled ILP John Wawrzynek Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~johnw

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

VLIW/EPIC: Statically Scheduled ILP

VLIW/EPIC: Statically Scheduled ILP 6.823, L21-1 VLIW/EPIC: Statically Scheduled ILP Computer Science & Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind

More information

Getting CPI under 1: Outline

Getting CPI under 1: Outline CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more

More information

IF1/IF2. Dout2[31:0] Data Memory. Addr[31:0] Din[31:0] Zero. Res ALU << 2. CPU Registers. extension. sign. W_add[4:0] Din[31:0] Dout[31:0] PC+4

IF1/IF2. Dout2[31:0] Data Memory. Addr[31:0] Din[31:0] Zero. Res ALU << 2. CPU Registers. extension. sign. W_add[4:0] Din[31:0] Dout[31:0] PC+4 12 1 CMPE110 Fall 2006 A. Di Blas 110 Fall 2006 CMPE pipeline concepts Advanced ffl ILP ffl Deep pipeline ffl Static multiple issue ffl Loop unrolling ffl VLIW ffl Dynamic multiple issue Textbook Edition:

More information

Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture

Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Ramadass Nagarajan Karthikeyan Sankaralingam Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R. Moore Computer

More information

Multi-cycle Instructions in the Pipeline (Floating Point)

Multi-cycle Instructions in the Pipeline (Floating Point) Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining

More information

Processor (II) - pipelining. Hwansoo Han

Processor (II) - pipelining. Hwansoo Han Processor (II) - pipelining Hwansoo Han Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 =2.3 Non-stop: 2n/0.5n + 1.5 4 = number

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information

EITF20: Computer Architecture Part2.2.1: Pipeline-1

EITF20: Computer Architecture Part2.2.1: Pipeline-1 EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

Outline Marquette University

Outline Marquette University COEN-4710 Computer Hardware Lecture 4 Processor Part 2: Pipelining (Ch.4) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations from Mike

More information

Mapping MPEG Video Decoders on the ADRES Reconfigurable Array Processor for Next Generation Multi-Mode Mobile Terminals

Mapping MPEG Video Decoders on the ADRES Reconfigurable Array Processor for Next Generation Multi-Mode Mobile Terminals Mapping MPEG Video Decoders on the ADRES Reconfigurable Array Processor for Next Generation Multi-Mode Mobile Terminals Mladen Berekovic IMEC Kapeldreef 75 B-301 Leuven, Belgium 0032-16-28-8162 Mladen.Berekovic@imec.be

More information

Advanced Computer Architecture

Advanced Computer Architecture ECE 563 Advanced Computer Architecture Fall 2010 Lecture 6: VLIW 563 L06.1 Fall 2010 Little s Law Number of Instructions in the pipeline (parallelism) = Throughput * Latency or N T L Throughput per Cycle

More information

Chapter 13 Reduced Instruction Set Computers

Chapter 13 Reduced Instruction Set Computers Chapter 13 Reduced Instruction Set Computers Contents Instruction execution characteristics Use of a large register file Compiler-based register optimization Reduced instruction set architecture RISC pipelining

More information

Historical Perspective and Further Reading 3.10

Historical Perspective and Further Reading 3.10 3.10 6.13 Historical Perspective and Further Reading 3.10 This section discusses the history of the first pipelined processors, the earliest superscalars, the development of out-of-order and speculative

More information

Exploiting ILP with SW Approaches. Aleksandar Milenković, Electrical and Computer Engineering University of Alabama in Huntsville

Exploiting ILP with SW Approaches. Aleksandar Milenković, Electrical and Computer Engineering University of Alabama in Huntsville Lecture : Exploiting ILP with SW Approaches Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Basic Pipeline Scheduling and Loop

More information

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 15 Very Long Instruction Word Machines

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 15 Very Long Instruction Word Machines ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 15 Very Long Instruction Word Machines Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html

More information

Saving Register-File Leakage Power by Monitoring Instruction Sequence in ROB

Saving Register-File Leakage Power by Monitoring Instruction Sequence in ROB Saving Register-File Leakage Power by Monitoring Instruction Sequence in ROB Wann-Yun Shieh * and Hsin-Dar Chen Department of Computer Science and Information Engineering Chang Gung University, Taiwan

More information

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University Announcements Homework 4 Out today Due November 15 Midterm II November 22 Project

More information

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation TDT4255 Lecture 9: ILP and speculation Donn Morrison Department of Computer Science 2 Outline Textbook: Computer Architecture: A Quantitative Approach, 4th ed Section 2.6: Speculation Section 2.7: Multiple

More information

Architecture. Karthikeyan Sankaralingam Ramadass Nagarajan Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R.

Architecture. Karthikeyan Sankaralingam Ramadass Nagarajan Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R. Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Karthikeyan Sankaralingam Ramadass Nagarajan Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R. Moore The

More information

Energy Efficient Asymmetrically Ported Register Files

Energy Efficient Asymmetrically Ported Register Files Energy Efficient Asymmetrically Ported Register Files Aneesh Aggarwal ECE Department University of Maryland College Park, MD 20742 aneesh@eng.umd.edu Manoj Franklin ECE Department and UMIACS University

More information

LECTURE 3: THE PROCESSOR

LECTURE 3: THE PROCESSOR LECTURE 3: THE PROCESSOR Abridged version of Patterson & Hennessy (2013):Ch.4 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU

More information

ECEC 355: Pipelining

ECEC 355: Pipelining ECEC 355: Pipelining November 8, 2007 What is Pipelining Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. A pipeline is similar in concept to an assembly

More information

Memory-Aware Loop Mapping on Coarse-Grained Reconfigurable Architectures

Memory-Aware Loop Mapping on Coarse-Grained Reconfigurable Architectures Memory-Aware Loop Mapping on Coarse-Grained Reconfigurable Architectures Abstract: The coarse-grained reconfigurable architectures (CGRAs) are a promising class of architectures with the advantages of

More information

E0-243: Computer Architecture

E0-243: Computer Architecture E0-243: Computer Architecture L1 ILP Processors RG:E0243:L1-ILP Processors 1 ILP Architectures Superscalar Architecture VLIW Architecture EPIC, Subword Parallelism, RG:E0243:L1-ILP Processors 2 Motivation

More information

Compiler Controlled Speculation for Power Aware ILP Extraction in Dataflow Architectures

Compiler Controlled Speculation for Power Aware ILP Extraction in Dataflow Architectures Compiler Controlled Speculation for Power Aware ILP Extraction in Dataflow Architectures Muhammad Umar Farooq, Lizy John, and Margarida F. Jacome Department of Electrical and Computer Engineering he University

More information

The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture

The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture The Processor Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut CSE3666: Introduction to Computer Architecture Introduction CPU performance factors Instruction count

More information

Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ...

Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ... CHAPTER 6 1 Pipelining Instruction class Instruction memory ister read ALU Data memory ister write Total (in ps) Load word 200 100 200 200 100 800 Store word 200 100 200 200 700 R-format 200 100 200 100

More information

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Integrating MRPSOC with multigrain parallelism for improvement of performance

Integrating MRPSOC with multigrain parallelism for improvement of performance Integrating MRPSOC with multigrain parallelism for improvement of performance 1 Swathi S T, 2 Kavitha V 1 PG Student [VLSI], Dept. of ECE, CMRIT, Bangalore, Karnataka, India 2 Ph.D Scholar, Jain University,

More information

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 14 Very Long Instruction Word Machines

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 14 Very Long Instruction Word Machines ECE 252 / CPS 220 Advanced Computer Architecture I Lecture 14 Very Long Instruction Word Machines Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html

More information

Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Moore s Law Gordon Moore @ Intel (1965) 2 Computer Architecture Trends (1)

More information

15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 8: Issues in Out-of-order Execution Prof. Onur Mutlu Carnegie Mellon University Readings General introduction and basic concepts Smith and Sohi, The Microarchitecture

More information

Improve performance by increasing instruction throughput

Improve performance by increasing instruction throughput Improve performance by increasing instruction throughput Program execution order Time (in instructions) lw $1, 100($0) fetch 2 4 6 8 10 12 14 16 18 ALU Data access lw $2, 200($0) 8ns fetch ALU Data access

More information

Demand-Only Broadcast: Reducing Register File and Bypass Power in Clustered Execution Cores

Demand-Only Broadcast: Reducing Register File and Bypass Power in Clustered Execution Cores Demand-Only Broadcast: Reducing Register File and Bypass Power in Clustered Execution Cores Mary D. Brown Yale N. Patt Electrical and Computer Engineering The University of Texas at Austin {mbrown,patt}@ece.utexas.edu

More information

Pipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Pipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Pipeline Hazards Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Hazards What are hazards? Situations that prevent starting the next instruction

More information

Chapter 4 The Processor (Part 4)

Chapter 4 The Processor (Part 4) Department of Electr rical Eng ineering, Chapter 4 The Processor (Part 4) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering, Feng-Chia Unive ersity Outline

More information

Vorlesung / Course IN2075: Mikroprozessoren / Microprocessors

Vorlesung / Course IN2075: Mikroprozessoren / Microprocessors Vorlesung / Course IN2075: Mikroprozessoren / Microprocessors Superscalarity 8 Jan 2018 Carsten Trinitis (LRR) Superscalarity Parallel Execution & ILP at at Instruction Level Parallelism Superscalarity:

More information

Exploiting Virtual Registers to Reduce Pressure on Real Registers

Exploiting Virtual Registers to Reduce Pressure on Real Registers Exploiting Virtual Registers to Reduce Pressure on Real Registers JUN YAN and WEI ZHANG Southern Illinois University Carbondale It is well known that a large fraction of variables are short-lived. This

More information

Complex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units

Complex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units 6823, L14--1 Complex Pipelining: Out-of-order Execution & Register Renaming Laboratory for Computer Science MIT http://wwwcsglcsmitedu/6823 Multiple Function Units 6823, L14--2 ALU Mem IF ID Issue WB Fadd

More information

Multi-Level Cache Hierarchy Evaluation for Programmable Media Processors. Overview

Multi-Level Cache Hierarchy Evaluation for Programmable Media Processors. Overview Multi-Level Cache Hierarchy Evaluation for Programmable Media Processors Jason Fritts Assistant Professor Department of Computer Science Co-Author: Prof. Wayne Wolf Overview Why Programmable Media Processors?

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

15-740/ Computer Architecture Lecture 7: Pipelining. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/26/2011

15-740/ Computer Architecture Lecture 7: Pipelining. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/26/2011 15-740/18-740 Computer Architecture Lecture 7: Pipelining Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/26/2011 Review of Last Lecture More ISA Tradeoffs Programmer vs. microarchitect Transactional

More information