2D-VLIW: An Architecture Based on the Geometry of Computation
Ricardo Santos
Dom Bosco Catholic University, Computer Engineering Department, Campo Grande, MS, Brazil

Rodolfo Azevedo, Guido Araujo
State University of Campinas, Institute of Computing, Campinas, SP, Brazil
{ricrs, rodolfo, guido}@ic.unicamp.br

Abstract

This work proposes a new architecture and execution model called 2D-VLIW. The architecture adopts an execution model based on large pieces of computation running over a matrix of functional units connected by a set of local registers spread across the matrix. Experiments using the MediaBench and SPECint2000 programs and the Trimaran compiler show performance gains ranging from 5% to 63% when comparing our proposal to an EPIC architecture with the same number of registers and functional units. We also show that the g721-enc program running on a 2D-VLIW 3x3 matrix had a speedup of 1.37 over a 2x2 matrix, while the same program on an EPIC processor with 9 functional units had a speedup of 1.2 over an EPIC processor with 4 functional units. For some internal procedures of the MediaBench and SPECint programs, the average 2D-VLIW OPC (operations per cycle) was up to 10 times greater than that of the equivalent EPIC processor.

1 Introduction

For many years the constant increase in processor performance came from advances in VLSI process technology. However, recent announcements of technology limits imposed by the thermal barrier have motivated research into innovative ways to sustain this performance growth. One approach used by modern computer architectures to achieve performance, independent of the implementation technology, is the addition of hardware resources to exploit the parallelism available inside programs. Two architectural design styles commonly used to exploit parallelism are VLIW (Very Long Instruction Word) [4] and superscalar [12]. Beyond their respective advantages and disadvantages, the major problem shared by VLIW and superscalar architectures is related to their inability to
capture and scale parallelism. The performance study in [7] suggests that a compiler can locate more parallelism by looking at larger pieces of the application code, though it is essential that the hardware provide a suitable execution model. Moreover, previous work [5, 7, 14] as well as state-of-the-art architectures [2, 8, 13] point toward architectures adopting new strategies for code optimization and new execution models. Advanced compiler optimizations could look at large code regions to find higher levels of parallelism, and execution models could provide new resource arrangements to execute such large regions, thus increasing performance.

In this paper, we propose a new architecture and execution model whose main characteristic is the mapping of large pieces of computation onto a matrix of functional units. This mapping brings attractive performance to applications with high ILP, such as embedded and some general-purpose applications. Our architecture is named 2D-VLIW [9, 10] because large instructions, comprised of single operations, are fetched from memory and executed on a (two-dimensional) matrix of functional units through a pipeline. In order to evaluate the real performance gains of the 2D-VLIW architecture, we have compared it to an EPIC-based processor [11] using the MediaBench and SPECint2000 suites compiled with the Trimaran compiler. We assume that an EPIC processor has more flexibility, in terms of the operations (dependent and parallel) allowed within an instruction and of hardware complexity, than a pure VLIW processor. The experiments revealed speedups ranging from 5% to 63% over the EPIC processor and, for some programs, an average 2D-VLIW OPC (operations per cycle) up to 10 times greater than that of the EPIC processor.

This paper is structured as follows. Section 2 presents the 2D-VLIW architecture and execution model. Section 3 shows the performance results obtained by comparing 2D-VLIW to an EPIC processor. Section 4 discusses current architectures related to our approach. The conclusions and future work are presented in Section 5.

Application-specific Systems, Architectures and Processors (ASAP'06)

2 The 2D-VLIW Architecture and Execution Model

Like traditional pipelined architectures, the 2D-VLIW datapath has stages to fetch, decode and execute instructions. In 2D-VLIW, execution is performed by a matrix of functional units where each column represents one execution sub-stage of the pipeline. For example, a 6x6 matrix has 6 sub-stages: EX1, EX2, ..., EX6. A detailed view of the 2D-VLIW datapath is presented next.

2.1 The 2D-VLIW Architecture

The 2D-VLIW architecture exploits instruction-level parallelism through a pipelined datapath and a two-dimensional VLIW execution model. The architecture fetches large instructions from memory, as VLIW and EPIC architectures do. The 2D-VLIW instructions are comprised of single operations executed by a set of functional units (FUs) organized as a matrix. The maximum number of operations in one instruction is equal to the number of functional units in the matrix. Figure 1 shows a simplified overview of a 4x4 2D-VLIW datapath.

[Figure 1: A simplified overview of the 2D-VLIW datapath (2D-VLIW word, IF, IF/ID pipeline register, control unit, register file bank, ID, ID/EX pipeline register, EX pipeline registers, functional unit matrix)]

In Figure 1, four operations are executed by 4 FUs at each execution stage EX1, EX2, EX3, EX4 of the pipeline. Notice that, through the interconnection network, each FU in column n, n < 4, can provide operands to two FUs in the following column (n + 1). Figure 2 shows all the logic blocks and signals inside an FU. Results from an FU may be written either into the Temp Register File (TRF) or into the Global Register File (GRF). The TRF is a small register bank with 2 local registers dedicated to each FU. The result of an operation is always written into an internal register called the FU Register. By using the TRF and the FU Register, a result from
an operation in cycle i will be available to three other operations in the next cycle (i + 1): two operations can use the result through the Temp Register File, whereas another operation uses it from the FU Register. Operand input data come from three possible sources: the global register file, the temporary register file through the interconnection network, or the FU itself through the FU Register. The SelOpnd1, SelOpnd2 and SelOperation input selection signals come from the pipeline registers.

[Figure 2: The 2D-VLIW functional unit (operand inputs, SelOpnd1/SelOpnd2/SelOperation selection signals, Global Register File, Temp Register File)]

The TRFs in each FU are a viable alternative to minimize the pressure on the global register file. This is achieved by storing temporary values (of the DAGs) in the TRFs: the DAG leaves read values from the global register file, the operations inside the DAGs read and write values from/to the temporary registers, and the DAG root nodes write values into the global register file. Figure 3 shows this strategy using a DAG with 4 operations. The load operation uses one value from the global register file and needs to store its result in a register which will be used by the two following operations. The results of these two operations are the inputs of add, which stores the result of the addition into the R4 global register. The DAG in 3(a) requires two read ports and two write ports to access registers in the GRF. Figure 3(b) shows how this DAG can be allocated in the 2D-VLIW architecture: the results of the intermediate operations are stored in the TRFs. Figure 3(c) shows the mapping of this DAG onto the matrix.

[Figure 3: Example of a DAG using TRFs: (a) one DAG using GRF registers; (b) the same DAG using TRFs; (c) the DAG in (b) mapped onto the FU matrix]

In general, the area upper bound of a standard register file is O(n^2) [1], whereas the latency upper bound is O(log m), where n is the total number of read and write ports and m is the number of read ports. Notice that the number of read ports has an impact on both the area and the latency of the register file. Multiple-issue architectures have the number of read ports bounded by O(k) (usually 2k), where k is the number of functional units in the architecture. By adopting TRFs and assuming that only the leaves/roots of the DAGs need to read/write from/to the global register file, we found that this bound can be asymptotically reduced in our architecture: the new upper bound on GRF read ports is O(√k) (usually 2√k). For example, a 16-FU EPIC architecture has 32 (16 x 2) global register file read ports, while the equivalent 2D-VLIW (4x4 matrix) has only 8 read ports, which leads to an area 4 times smaller and a latency reduction of 40%, assuming the same number of write ports in both architectures.

2.2 The 2D-VLIW Execution Model

As mentioned above, program execution on the 2D-VLIW matrix is pipelined. At each clock cycle, one 2D-VLIW instruction is fetched from memory and pushed into the pipeline stages. In the execution stages, the operations of this instruction are executed according to the number of FUs in each column. Assuming that the architecture has 16 functional units organized as a 4x4 matrix, a 2D-VLIW instruction can also be represented as a 4x4 operation matrix comprised of 16 operations. Take, for example, the DAGs in Figure 4, extracted from the 181.mcf program. This program is part of the SPECint2000 benchmark and was compiled by Trimaran with the hyperblock option on. The operations in these DAGs correspond to a subset of the HPL-PD [6] ISA. Figure 4(a) shows several DAGs from the 181.mcf program (from an unrolled inner loop) and 4(b) shows the organization of these DAGs into two 2D-VLIW instructions, A and B. These instructions are represented as matrices of operations where operations with Read-After-Write (RAW) dependencies are placed in two different rows, while independent operations stay in the same row.
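The row-assignment rule just described (RAW-dependent operations go to different rows, while independent operations may share one) can be sketched as a depth-based assignment over the DAG. This is an illustrative sketch, not the authors' allocator; the operation names and the assign_rows helper are hypothetical.

```python
from collections import Counter

def assign_rows(ops, deps, row_width=4):
    """Assign each operation to an instruction row.

    deps maps an operation to its producers (RAW dependencies).
    The row index is the operation's depth, i.e. the longest
    producer chain below it, so RAW-dependent operations always
    land in different rows.
    """
    rows = {}

    def depth(op):
        if op not in rows:
            producers = deps.get(op, [])
            rows[op] = 0 if not producers else 1 + max(depth(p) for p in producers)
        return rows[op]

    for op in ops:
        depth(op)
    # A row of a 4x4 2D-VLIW instruction holds at most row_width operations.
    assert all(n <= row_width for n in Counter(rows.values()).values())
    return rows

# The four-operation DAG of Figure 3: a load feeding two independent
# operations whose results feed a final add.
deps = {"op1": ["load"], "op2": ["load"], "add": ["op1", "op2"]}
rows = assign_rows(["load", "op1", "op2", "add"], deps)
# load is a leaf (row 0), op1/op2 share row 1, add ends up in row 2.
```

On the Figure 3 DAG this places the two independent middle operations in the same row and their dependent add one row below, mirroring the mapping in Figure 3(c).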
[Figure 4: DAGs from the 181.mcf program and their equivalent 2D-VLIW instructions]

Figure 5 shows the execution of instructions A and B from Figure 4(b) on the 2D-VLIW datapath. Since the instruction fetch and decode stages work as in a standard processor, we start at the EX1 execution stage. After the ID/EX pipeline register has been filled, execution starts over the matrix. Figure 5(a) depicts the first execution cycle on the FU matrix. The first column receives data from the ID/EX pipeline register, and four functional units in this column execute the operations of row A1 (instruction A). The dashed arrows indicate which FUs receive the results of these operations; the consumer FUs are, of course, limited by the interconnection network. At the second execution cycle, 5(b), the operations of A2 are executed on the second column, while the operations of B1 start on the first column. At the third execution cycle, 5(c), the EX2/EX3 pipeline register carries the information needed to execute the operations of A3, the FUs in the second column execute the operations of B2, and the first row (C1) of a third 2D-VLIW instruction, C, starts its execution on the matrix. At the fourth execution cycle, 5(d), the operations of A4 are executed on the fourth column, the operations of B3 on the third column, the operations of C2 on the second column, and a fourth instruction, D, starts its execution. The last cycle is shown in 5(e), where the four operations of B4 are being executed and instruction A has already finished. Following the pipelined execution, at the fourth execution cycle the matrix is fully filled with operations from four different 2D-VLIW instructions, as indicated by the first, second, third and fourth columns (highlighted differently) in 5(d). After the fourth execution cycle, one 2D-VLIW instruction finishes at every cycle.
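The cycle-by-cycle walkthrough above can be condensed into a toy simulation of the column pipeline (our own sketch, not the authors' simulator; simulate is a hypothetical helper): a new instruction enters the first column every cycle, each in-flight instruction advances one column per cycle, and an instruction finishes when it leaves the last column.

```python
def simulate(n_columns, n_instructions):
    """Return the 1-based completion cycle of each instruction."""
    completed = {}
    in_flight = {}            # instruction index -> column currently occupied
    cycle, issued = 0, 0
    while len(completed) < n_instructions:
        cycle += 1
        for instr in in_flight:          # advance one column per cycle
            in_flight[instr] += 1
        if issued < n_instructions:      # issue the next instruction into column 1
            in_flight[issued] = 1
            issued += 1
        for instr, col in list(in_flight.items()):
            if col == n_columns:         # leaving the last column: finished
                completed[instr] = cycle
                del in_flight[instr]
    return [completed[i] for i in range(n_instructions)]

cycles = simulate(n_columns=4, n_instructions=6)
# With 4 columns, the first instruction completes at cycle 4 and one
# instruction completes every cycle thereafter.
```

This reproduces the steady state of Figure 5(d): after a four-cycle fill, throughput settles at one 2D-VLIW instruction per cycle.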
[Figure 5: Execution stages on the 2D-VLIW architecture: panels (a)-(e) show five consecutive execution cycles of instructions A, B, C and D flowing through the FU matrix]

Notice that the operations executed in 5(a), 5(b), 5(c), 5(d) and 5(e) represent exactly all the operations of the 2D-VLIW instructions A and B in Figure 4(b).

3 Experiments and Results

In this section we present performance results from a comparison between 2D-VLIW and the HPL-PD [6] processor, which is based on the EPIC architecture. The results were obtained by simulating program execution with the Trimaran simulation tool. We simulated MediaBench and SPECint2000 programs on 2D-VLIW and HPL-PD models described in the HMDES language [3]. All programs were compiled by Trimaran with the hyperblock, pre-pass and post-pass scheduling options on. In the following experiments, we adopted 4, 9, 16 and 30 functional units, the same instruction set, the same operation latencies (including load and store operations) and the same number of registers for both architectures. The functional units were designed to execute all operations of the instruction set. The GRF has 64 registers, and each TRF has 2 local registers. The 2D-VLIW matrix was organized as a square matrix for 4, 9 and 16 FUs; for 30 FUs it was organized as a 3x10 matrix, where the first dimension is the number of rows and the second the number of columns. The speedup was obtained by dividing the number of clock cycles of a base machine by the number of clock cycles taken by the current configuration.

Figure 6 shows the comparison between 2D-VLIW and HPL-PD using 10 programs (4 MediaBench programs and 6 SPECint2000 programs): epic, g721-dec, g721-enc, gsm-dec, 175.vpr, 181.mcf, 197.parser, 255.vortex, 256.bzip2 and 300.twolf. Each bar represents the speedup of a 2D-VLIW configuration (2x2, 3x3, 4x4 or 3x10) over an HPL-PD machine with the same number of FUs. Notice that all 2D-VLIW configurations attained speedups over their HPL-PD equivalents.

[Figure 6: Performance comparison (% speedup) between 2D-VLIW and HPL-PD]

After an analysis of the code generated for both architectures, we found that the 2D-VLIW speedup was mainly due to the arrangement of the architectural elements. The addition of local registers made it possible to handle Write-After-Read (WAR) dependencies inside the matrix. This situation is illustrated in Figure 7. The code fragment follows the ELCOR-IR [3] syntax and was taken from the 255.vortex program after the scheduling and register allocation phases. Notice that the operation s_w in 7(a) is executed before the operation add_w (s_time(7) and s_time(8), respectively). In the 2D-VLIW case, 7(b), these operations are executed simultaneously (s_time(7)). The 2D-VLIW allocator assigns a local register instead of a global register to the destination operand of the add_w operation. When
this operation is executed, it does not write to the global register file, and is thus executed in parallel with s_w, eliminating one extra execution cycle.

(a) HPL-PD scheduling:
    s_w   []        [<gpr 1> <gpr 55>]  s_time(7)
    add_w [<gpr 1>] [<sp> < 2520>]      s_time(8)

(b) 2D-VLIW scheduling:
    s_w   []        [<gpr 1> <gpr 55>]  s_time(7)
    add_w [<trf 1>] [<sp> < 2520>]      s_time(7)

[Figure 7: Trimaran's scheduling for a code fragment based on HPL-PD and 2D-VLIW]

The next experiment presents the scalability of 2D-VLIW and HPL-PD. Figure 8(a) depicts the speedup obtained by three 2D-VLIW configurations over a basic model consisting of 4 functional units; Figure 8(b) shows the same information for the HPL-PD configurations. In 8(a), although 2D-VLIW did not achieve any speedup for 175.vpr, an incremental speedup (scalability) was obtained for g721-dec, g721-enc, gsm-dec, 255.vortex, 256.bzip2 and 300.twolf. The results in 8(b) show that HPL-PD achieved scalability for the g721-dec, g721-enc, gsm-dec and 255.vortex programs, but two programs, 175.vpr and 300.twolf, had a performance decrease.

[Figure 8: Scalability of the 2D-VLIW and HPL-PD architectures: (a) 2D-VLIW and (b) HPL-PD speedups over the respective 4-FU configurations for the MediaBench and SPECint2000 benchmarks]

Table 1 presents another result of our experiments. Each row shows one program, a procedure of that program, and the OPC achieved by the 2D-VLIW and HPL-PD configurations. These values indicate how well 2D-VLIW scales as the number of functional units increases. The numbers in parentheses are the results obtained by HPL-PD for the same programs and procedures. Looking at the 181.mcf program, we can notice that 2D-VLIW achieved an OPC 10 times greater than HPL-PD.

Table 1: OPC of 2D-VLIW and, in parentheses, HPL-PD

Program (procedure)             3x3 (9 FUs)   4x4 (16 FUs)   3x10 (30 FUs)
181.mcf (primal artificial)     2.5 (0.3)     5.2 (0.4)      4.23 (0.4)
256.bzip2 (main)                2.05 (0.73)   2.99 (0.73)    3.04 (0.63)
gsm-decode (Autocorrelation)    2.7 (0.99)    2.22 (1.20)    2.35 (1.2)

4 Related Work

The RAW architecture [13] is a set of interconnected tiles organized as a 4x4 matrix. Each tile contains an 8-stage MIPS-style pipeline, a 4-stage pipelined FPU, a 32 KB
data cache, and a switch that allows for static and dynamic inter-tile communication. One remarkable difference between 2D-VLIW and RAW concerns the computation granularity: RAW can map a whole kernel onto one processing tile, whereas 2D-VLIW maps DAGs onto the FU matrix.

The TRIPS architecture [2] is composed of a 4x4 array of functional units (FUs). There are four register file banks along the top of the array, and four instruction and data cache banks along its right side. The compiler builds 128-instruction blocks organized into groups of eight instructions per FU. Experiments showed IPCs of 1.0 and 2 for integer and floating-point programs, respectively. Conversely to 2D-VLIW, the TRIPS execution model is based on a dataflow ordering, with each operation being executed as soon as its operands are available. Moreover, TRIPS uses block-atomic execution, where all operations in a block are either entirely committed or rolled back.
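To make the contrast with TRIPS concrete, a dataflow ordering over the same kind of DAG can be sketched as follows (a simplified sketch under our own assumptions, not TRIPS's actual issue logic): every operation fires in the first step after all of its producers have fired, with no compiler-assigned rows.

```python
def dataflow_order(ops, deps):
    """deps maps an operation to its producers. Fire every ready
    operation at each step (assumes an acyclic DAG) and return
    each operation's firing step."""
    step = {}
    current = 1
    while len(step) < len(ops):
        ready = [o for o in ops
                 if o not in step
                 and all(p in step and step[p] < current
                         for p in deps.get(o, []))]
        for o in ready:
            step[o] = current
        current += 1
    return step

# The same illustrative DAG as before: load -> {op1, op2} -> add.
deps = {"op1": ["load"], "op2": ["load"], "add": ["op1", "op2"]}
order = dataflow_order(["load", "op1", "op2", "add"], deps)
# Here the operands-ready firing step coincides with the 2D-VLIW
# row index + 1, but it is discovered dynamically rather than
# fixed by the compiler's row assignment.
```

The difference is where the ordering decision lives: in 2D-VLIW the compiler fixes each operation's row (and hence its execution cycle) statically, while a dataflow machine derives the same ordering at run time from operand availability.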
ADRES [8] is a coarse-grained, dynamically reconfigurable architecture comprised of a tightly coupled VLIW processor and a matrix of reconfigurable functional units. These functional units have register files to keep intermediate computation values. The compiler identifies computation-intensive kernels inside the program and maps these kernels onto the reconfigurable matrix. Experiments on this architecture achieved IPCs from 9 to 42. ADRES and 2D-VLIW have different execution models: in ADRES, an instruction is executed either by the VLIW processor or by the reconfigurable matrix, whereas 2D-VLIW has a unified execution in which every instruction crosses all stages. This unified view improves resource sharing and increases ILP.

5 Conclusions and Future Work

A new multiple-issue architecture called 2D-VLIW was presented in this paper. The architecture aims at providing architectural resources that match the topology of high-performance applications in domains like multimedia, communications and some general-purpose integer computing. 2D-VLIW has a matrix of functional units with local registers spread across the matrix. The execution model consists of large instructions brought from memory and executed on the matrix. The local registers are used to keep temporary values and to reduce the pressure on the global register file. Data transport inside the FU matrix is performed through a narrow interconnection network which provides a fast communication mechanism between functional units in consecutive stages.

The performance results of Section 3 show that 2D-VLIW outperformed the EPIC-based processor (HPL-PD) on all MediaBench and SPECint2000 programs. We defined four FU configurations (4, 9, 16 and 30 FUs) and measured the 2D-VLIW and HPL-PD performance in all of them. Our experimental results show that the combination of architectural arrangement, execution model and compiler assistance can achieve appealing performance for embedded applications.

The compiler plays an important role in the
2D-VLIW architecture. It is responsible for recognizing large pieces of computation, mapping their operations onto the matrix topology, and using local registers to keep temporary values. We have currently started the implementation of two new scheduling and register allocation algorithms for 2D-VLIW.

References

[1] A. Abnous and N. Bagherzadeh. Architectural Design and Analysis of a VLIW Processor. International Journal of Computers and Electrical Engineering, 21(2):119-142, 1995.

[2] D. Burger, S. W. Keckler, K. S. McKinley, M. Dahlin, L. K. John, C. Lin, C. R. Moore, J. Burrill, R. G. McDonald, and W. Yoder. Scaling to the End of Silicon with EDGE Architectures. IEEE Computer, 37(7):44-55, 2004.

[3] L. N. Chakrapani, J. Gyllenhaal, W.-m. W. Hwu, S. A. Mahlke, K. V. Palem, and R. M. Rabbah. Trimaran: An Infrastructure for Research in Instruction-Level Parallelism. Lecture Notes in Computer Science, 3602:32-41, 2004.

[4] J. A. Fisher. Very Long Instruction Word Architectures and the ELI-512. In Proceedings of the 10th International Symposium on Computer Architecture (ISCA). IEEE Computer Society, 1983.

[5] N. P. Jouppi. Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM Press, 1989.

[6] V. Kathail, M. S. Schlansker, and B. R. Rau. HPL-PD Architecture Specification. Technical Report HPL-93-80, Hewlett-Packard Laboratories, Palo Alto, 2000.

[7] M. S. Lam and R. P. Wilson. Limits of Control Flow on Parallelism. In Proceedings of the 19th ISCA. ACM Press, 1992.

[8] B. Mei, A. Lambrechts, J.-Y. Mignolet, D. Verkest, and R. Lauwereins. Architecture Exploration for a Reconfigurable Architecture Template. IEEE Design & Test of Computers, 22(2):90-101, 2005.

[9] R. Santos, R. Azevedo, and G. Araujo. Exploiting Dynamic Reconfiguration Techniques: The 2D-VLIW Approach. In Proceedings of the 13th IEEE International Workshop on Reconfigurable Architectures, Rhodes Island, Greece, 2006. IEEE Computer Society.

[10] R. Santos, R. Azevedo, and G. Araujo. The 2D-VLIW Architecture. Technical Report IC, Institute of Computing, 2006.

[11] M. S. Schlansker and B. R. Rau. EPIC: Explicitly Parallel Instruction Computing. IEEE Computer, 33(2):37-45, 2000.

[12] M. D. Smith, M. Johnson, and M. A. Horowitz. Limits on Multiple Instruction Issue. In Proceedings of the 3rd ASPLOS. ACM Press, 1989.

[13] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal. Baring It All to Software: RAW Machines. IEEE Computer, 30(9):86-93, 1997.

[14] D. W. Wall. Limits of Instruction-Level Parallelism. In Proceedings of the 4th ASPLOS. ACM Press, 1991.
More informationFull Datapath. Chapter 4 The Processor 2
Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory
More informationEN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)
EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering
More informationBanked Multiported Register Files for High-Frequency Superscalar Microprocessors
Banked Multiported Register Files for High-Frequency Superscalar Microprocessors Jessica H. T seng and Krste Asanoviü MIT Laboratory for Computer Science, Cambridge, MA 02139, USA ISCA2003 1 Motivation
More informationPipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!
Advanced Topics on Heterogeneous System Architectures Pipelining! Politecnico di Milano! Seminar Room @ DEIB! 30 November, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2 Outline!
More informationComputer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)
18-447 Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/13/2015 Agenda for Today & Next Few Lectures
More informationBeyond Pipelining. CP-226: Computer Architecture. Lecture 23 (19 April 2013) CADSL
Beyond Pipelining Virendra Singh Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay http://www.ee.iitb.ac.in/~viren/
More informationCS 351 Final Exam Solutions
CS 351 Final Exam Solutions Notes: You must explain your answers to receive partial credit. You will lose points for incorrect extraneous information, even if the answer is otherwise correct. Question
More informationA Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors
A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors Murali Jayapala 1, Francisco Barat 1, Pieter Op de Beeck 1, Francky Catthoor 2, Geert Deconinck 1 and Henk Corporaal
More informationDynamic Scheduling. CSE471 Susan Eggers 1
Dynamic Scheduling Why go out of style? expensive hardware for the time (actually, still is, relatively) register files grew so less register pressure early RISCs had lower CPIs Why come back? higher chip
More information5008: Computer Architecture
5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage
More information15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011
5-740/8-740 Computer Architecture Lecture 0: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Fall 20, 0/3/20 Review: Solutions to Enable Precise Exceptions Reorder buffer History buffer
More informationCS 152 Computer Architecture and Engineering. Lecture 13 - VLIW Machines and Statically Scheduled ILP
CS 152 Computer Architecture and Engineering Lecture 13 - VLIW Machines and Statically Scheduled ILP Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste!
More informationCreating a Scalable Microprocessor:
Creating a Scalable Microprocessor: A 16-issue Multiple-Program-Counter Microprocessor With Point-to-Point Scalar Operand Network Michael Bedford Taylor J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B.
More informationVirtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1])
EE392C: Advanced Topics in Computer Architecture Lecture #10 Polymorphic Processors Stanford University Thursday, 8 May 2003 Virtual Machines Lecture #10: Thursday, 1 May 2003 Lecturer: Jayanth Gummaraju,
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle
More informationCS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS
CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight
More informationNOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline
CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism
More informationLecture 13 - VLIW Machines and Statically Scheduled ILP
CS 152 Computer Architecture and Engineering Lecture 13 - VLIW Machines and Statically Scheduled ILP John Wawrzynek Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~johnw
More informationCSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1
CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level
More informationVLIW/EPIC: Statically Scheduled ILP
6.823, L21-1 VLIW/EPIC: Statically Scheduled ILP Computer Science & Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind
More informationGetting CPI under 1: Outline
CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more
More informationIF1/IF2. Dout2[31:0] Data Memory. Addr[31:0] Din[31:0] Zero. Res ALU << 2. CPU Registers. extension. sign. W_add[4:0] Din[31:0] Dout[31:0] PC+4
12 1 CMPE110 Fall 2006 A. Di Blas 110 Fall 2006 CMPE pipeline concepts Advanced ffl ILP ffl Deep pipeline ffl Static multiple issue ffl Loop unrolling ffl VLIW ffl Dynamic multiple issue Textbook Edition:
More informationExploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture
Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Ramadass Nagarajan Karthikeyan Sankaralingam Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R. Moore Computer
More informationMulti-cycle Instructions in the Pipeline (Floating Point)
Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining
More informationProcessor (II) - pipelining. Hwansoo Han
Processor (II) - pipelining Hwansoo Han Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 =2.3 Non-stop: 2n/0.5n + 1.5 4 = number
More informationCPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor
1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic
More informationEITF20: Computer Architecture Part2.2.1: Pipeline-1
EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationOutline Marquette University
COEN-4710 Computer Hardware Lecture 4 Processor Part 2: Pipelining (Ch.4) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations from Mike
More informationMapping MPEG Video Decoders on the ADRES Reconfigurable Array Processor for Next Generation Multi-Mode Mobile Terminals
Mapping MPEG Video Decoders on the ADRES Reconfigurable Array Processor for Next Generation Multi-Mode Mobile Terminals Mladen Berekovic IMEC Kapeldreef 75 B-301 Leuven, Belgium 0032-16-28-8162 Mladen.Berekovic@imec.be
More informationAdvanced Computer Architecture
ECE 563 Advanced Computer Architecture Fall 2010 Lecture 6: VLIW 563 L06.1 Fall 2010 Little s Law Number of Instructions in the pipeline (parallelism) = Throughput * Latency or N T L Throughput per Cycle
More informationChapter 13 Reduced Instruction Set Computers
Chapter 13 Reduced Instruction Set Computers Contents Instruction execution characteristics Use of a large register file Compiler-based register optimization Reduced instruction set architecture RISC pipelining
More informationHistorical Perspective and Further Reading 3.10
3.10 6.13 Historical Perspective and Further Reading 3.10 This section discusses the history of the first pipelined processors, the earliest superscalars, the development of out-of-order and speculative
More informationExploiting ILP with SW Approaches. Aleksandar Milenković, Electrical and Computer Engineering University of Alabama in Huntsville
Lecture : Exploiting ILP with SW Approaches Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Basic Pipeline Scheduling and Loop
More informationECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 15 Very Long Instruction Word Machines
ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 15 Very Long Instruction Word Machines Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html
More informationSaving Register-File Leakage Power by Monitoring Instruction Sequence in ROB
Saving Register-File Leakage Power by Monitoring Instruction Sequence in ROB Wann-Yun Shieh * and Hsin-Dar Chen Department of Computer Science and Information Engineering Chang Gung University, Taiwan
More information15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University Announcements Homework 4 Out today Due November 15 Midterm II November 22 Project
More informationDonn Morrison Department of Computer Science. TDT4255 ILP and speculation
TDT4255 Lecture 9: ILP and speculation Donn Morrison Department of Computer Science 2 Outline Textbook: Computer Architecture: A Quantitative Approach, 4th ed Section 2.6: Speculation Section 2.7: Multiple
More informationArchitecture. Karthikeyan Sankaralingam Ramadass Nagarajan Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R.
Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Karthikeyan Sankaralingam Ramadass Nagarajan Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R. Moore The
More informationEnergy Efficient Asymmetrically Ported Register Files
Energy Efficient Asymmetrically Ported Register Files Aneesh Aggarwal ECE Department University of Maryland College Park, MD 20742 aneesh@eng.umd.edu Manoj Franklin ECE Department and UMIACS University
More informationLECTURE 3: THE PROCESSOR
LECTURE 3: THE PROCESSOR Abridged version of Patterson & Hennessy (2013):Ch.4 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU
More informationECEC 355: Pipelining
ECEC 355: Pipelining November 8, 2007 What is Pipelining Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. A pipeline is similar in concept to an assembly
More informationMemory-Aware Loop Mapping on Coarse-Grained Reconfigurable Architectures
Memory-Aware Loop Mapping on Coarse-Grained Reconfigurable Architectures Abstract: The coarse-grained reconfigurable architectures (CGRAs) are a promising class of architectures with the advantages of
More informationE0-243: Computer Architecture
E0-243: Computer Architecture L1 ILP Processors RG:E0243:L1-ILP Processors 1 ILP Architectures Superscalar Architecture VLIW Architecture EPIC, Subword Parallelism, RG:E0243:L1-ILP Processors 2 Motivation
More informationCompiler Controlled Speculation for Power Aware ILP Extraction in Dataflow Architectures
Compiler Controlled Speculation for Power Aware ILP Extraction in Dataflow Architectures Muhammad Umar Farooq, Lizy John, and Margarida F. Jacome Department of Electrical and Computer Engineering he University
More informationThe Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture
The Processor Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut CSE3666: Introduction to Computer Architecture Introduction CPU performance factors Instruction count
More informationPipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ...
CHAPTER 6 1 Pipelining Instruction class Instruction memory ister read ALU Data memory ister write Total (in ps) Load word 200 100 200 200 100 800 Store word 200 100 200 200 700 R-format 200 100 200 100
More informationCISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1
CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationIntegrating MRPSOC with multigrain parallelism for improvement of performance
Integrating MRPSOC with multigrain parallelism for improvement of performance 1 Swathi S T, 2 Kavitha V 1 PG Student [VLSI], Dept. of ECE, CMRIT, Bangalore, Karnataka, India 2 Ph.D Scholar, Jain University,
More informationECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 14 Very Long Instruction Word Machines
ECE 252 / CPS 220 Advanced Computer Architecture I Lecture 14 Very Long Instruction Word Machines Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html
More informationProcessor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University
Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Moore s Law Gordon Moore @ Intel (1965) 2 Computer Architecture Trends (1)
More information15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 8: Issues in Out-of-order Execution Prof. Onur Mutlu Carnegie Mellon University Readings General introduction and basic concepts Smith and Sohi, The Microarchitecture
More informationImprove performance by increasing instruction throughput
Improve performance by increasing instruction throughput Program execution order Time (in instructions) lw $1, 100($0) fetch 2 4 6 8 10 12 14 16 18 ALU Data access lw $2, 200($0) 8ns fetch ALU Data access
More informationDemand-Only Broadcast: Reducing Register File and Bypass Power in Clustered Execution Cores
Demand-Only Broadcast: Reducing Register File and Bypass Power in Clustered Execution Cores Mary D. Brown Yale N. Patt Electrical and Computer Engineering The University of Texas at Austin {mbrown,patt}@ece.utexas.edu
More informationPipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University
Pipeline Hazards Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Hazards What are hazards? Situations that prevent starting the next instruction
More informationChapter 4 The Processor (Part 4)
Department of Electr rical Eng ineering, Chapter 4 The Processor (Part 4) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering, Feng-Chia Unive ersity Outline
More informationVorlesung / Course IN2075: Mikroprozessoren / Microprocessors
Vorlesung / Course IN2075: Mikroprozessoren / Microprocessors Superscalarity 8 Jan 2018 Carsten Trinitis (LRR) Superscalarity Parallel Execution & ILP at at Instruction Level Parallelism Superscalarity:
More informationExploiting Virtual Registers to Reduce Pressure on Real Registers
Exploiting Virtual Registers to Reduce Pressure on Real Registers JUN YAN and WEI ZHANG Southern Illinois University Carbondale It is well known that a large fraction of variables are short-lived. This
More informationComplex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units
6823, L14--1 Complex Pipelining: Out-of-order Execution & Register Renaming Laboratory for Computer Science MIT http://wwwcsglcsmitedu/6823 Multiple Function Units 6823, L14--2 ALU Mem IF ID Issue WB Fadd
More informationMulti-Level Cache Hierarchy Evaluation for Programmable Media Processors. Overview
Multi-Level Cache Hierarchy Evaluation for Programmable Media Processors Jason Fritts Assistant Professor Department of Computer Science Co-Author: Prof. Wayne Wolf Overview Why Programmable Media Processors?
More informationECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation
ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating
More information15-740/ Computer Architecture Lecture 7: Pipelining. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/26/2011
15-740/18-740 Computer Architecture Lecture 7: Pipelining Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/26/2011 Review of Last Lecture More ISA Tradeoffs Programmer vs. microarchitect Transactional
More information