A Perfect Branch Prediction Technique for Conditional Loops

Virgil Andronache, Richard P. Simpson, and Nelson L. Passos
Department of Computer Science, Midwestern State University, Wichita Falls, TX 76308, USA

Abstract

One of the most stringent issues in obtaining optimal performance from today's computers is the decay in CPU performance that results from mispredicted branch instructions. As a result, a number of techniques have been proposed that attempt to forecast the decision of control instructions. These techniques, however, while effective, do not achieve perfect accuracy. This paper proposes new prediction instructions which, together with a loop retiming technique, provide 100% accuracy in most instances for single control instructions found in loop structures.

Keywords: Branch prediction, pipeline, conditional loop, retiming, pre-fetching.

1. Introduction

Control instructions may alter the linear execution of programs. If the execution path is thus altered, modern systems that make use of pre-fetching, pipelining and instruction-level parallelism are adversely affected, in that a number of instructions that are already executing have to be discarded and new instructions have to be brought to the CPU. The overall effect of this process is a deterioration in performance. Improving the performance of loop structures has been the focus of a large number of research projects over the last decade, and significant results have been obtained with both software and hardware advancements. An important optimization for loop execution is the ability to predict, with some degree of certainty, the flow of the program when conditional instructions are encountered. That degree of certainty, however, is always somewhat less than 100%.
This paper describes a transformation technique for obtaining perfect branch prediction within loops without loop-carried dependencies. A number of techniques exist that improve the performance of loop execution by working towards achieving the largest possible degree of fine-grain parallelism among loop instructions. Noteworthy among these are software pipelining [1], index shifting [13] and retiming [12]. However, these techniques are only effective if control instructions are not present. A different set of techniques, known as branch prediction techniques, deals exclusively with control instructions. The body of work done within the scope of branch prediction is extensive. There are two main approaches to branch prediction: static and dynamic [3]. In the static approach, each branch is assigned a value (taken or not taken) at compile time, based on such criteria as test run results and branch type. The value thus assigned is then used when pre-fetching instructions. It is easy to see that this approach will yield a large number of false decisions if the probability of the branch being taken is around 50%.

In dynamic branch prediction, several different implementations are possible [6,7,14,15,19,21]. The simplest implementation uses a counter to compute the percentage of loop executions in which the branch was taken; pre-fetching is then done along the side of the branch that appears more likely according to that computation. In this case, a sequence of alternating decisions may yield close to 0% accuracy. A special case, two-level adaptive branch prediction [21], gives added importance to the recent history of results from the branch to be predicted. This approach tends to have better results than the static or probabilistic approaches, but there are still circumstances in which it could have very low accuracy. Yet another approach to optimizing control structures, predicated execution, involves adding a tag to instructions whose execution depends on control structures [15,19]. This tag indicates whether the instruction is valid after the controlling instruction has been executed. All instructions are then executed, but only the valid ones are taken into account. This approach can be very effective, especially when used in conjunction with ordinary branch prediction techniques. However, some penalty is still incurred, for example the use of system resources on results that will be discarded. Using more than one branch prediction technique has also proved effective, at the cost of the additional overhead of choosing the technique to be used in particular cases [4]. Most dynamic branch prediction approaches [6,7,14] make use of a hardware buffer. According to the specific use, this buffer can be designed with different characteristics, such as a branch target buffer [2], elastic history buffer [18] or branch history buffer [16].
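As a minimal sketch of the failure mode just described (illustrative Python, not from the paper), a 1-bit last-outcome predictor collapses on a branch whose decisions strictly alternate:

```python
def one_bit_predictor(outcomes):
    """Predict each branch as equal to the previous outcome; return accuracy."""
    state = True          # assume 'taken' initially
    correct = 0
    for taken in outcomes:
        if state == taken:
            correct += 1
        state = taken     # 1-bit history: remember only the last outcome
    return correct / len(outcomes)

# Alternating taken/not-taken decisions defeat the predictor completely.
alternating = [i % 2 == 0 for i in range(100)]
print(one_bit_predictor(alternating))   # 0.01 -- only the lucky first guess
```

After the first guess, every prediction lags one iteration behind the actual decision, which is exactly the near-0% case the text mentions.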
This paper describes a process in which retiming concepts and new instruction constructs are used to obtain 100% accuracy in branch prediction occurring in loops that have no loop-carried dependencies [5]. In this process, loops are modeled by directed acyclic graphs where the nodes represent instructions, the edges represent dependencies between instructions and the labels on the edges represent delays in execution. Delays are computed as the difference between the iteration where some data value is produced and the iteration when that information is utilized. For example, the code

for i = 0 to 100 do
  if (a[i] = 0)
    b[i] = b[i-2] * 2
    a[i] = a[i-1] 7

can be represented in graph format as shown in figure 1, which is changed to the new form shown in figure 2 after applying the technique presented in this paper.

Figure 1: Graph representation of non-retimed code
Figure 2: Graph representation of retimed code

In figures 1 and 2, instructions B and C make use of the decision made in instruction A. However, in figure 1, that result is used in the same iteration in which it was computed, whereas in figure 2, the result of computing instruction A in the current iteration is used in the next iteration of the loop. This new view of the sequence of instructions allows the program to anticipate the branch decision.

The next section presents a set of basic concepts relevant to this paper. Section 3 consists of the problem model. Section 4 presents the process by which control statements are optimized. Finally, section 5 provides an example of how the technique can be used. A summary concludes the paper.

2. Background

Any program consists of two types of instructions: control and data instructions [8]. With regular constructs, all instructions are executed in sequential order and control instructions determine the

execution of data instructions usually found in close proximity. There are two approaches to instruction execution when control instructions are present. In the first, each instruction is executed only after the control decision that it depends on has been made. In the second, called speculative execution, an instruction is executed before that decision has been reached, often according to an algorithm that determines the most likely execution path. In considering control instruction handling, there are also two possible approaches: single and multiple control flow [10]. In single control flow, an instruction is not executed until all control instructions prior to it in the sequential execution of the program have been resolved or speculated. Multiple control flow allows execution prior to obtaining the information needed to determine whether the instruction would be on the current execution path.

In the ordinary pipeline model, a new iteration starts execution every n cycles, where n is called the initiation interval, and the execution periods of several iterations overlap. All iterations of an acyclic data flow graph can be executed simultaneously after proper transformation; therefore, where there is no resource constraint, the execution rate of an acyclic data flow graph can be made arbitrarily good. When a data flow graph is cyclic, the schedule cannot be overlapped arbitrarily, and it is not trivial to generate a schedule for a loop pipeline even without resource constraints. To resolve this difficulty, instead of scheduling the original iteration, a static schedule is constructed: a new loop body consisting of nodes from different original iterations. In the model presented, a loop pipeline is composed of a prologue, a repeating schedule and an epilogue. A repeating schedule, a schedule that is repeatedly executed, forms the new loop body.
The length of a repeating schedule corresponds to the initiation interval in the ordinary pipeline model. The goal of the process is to construct a repeating schedule of minimal length. The transition from the initial state of the old loop body to the new loop body is done by the prologue. The prologue and epilogue can be easily obtained once a static schedule is found. Under the pipeline model used in this paper, the focus is on the construction of short repeating schedules.

The retiming technique [12] is a convenient and effective tool for the optimization of synchronous systems due to its properties, algorithms and ease of manipulation. This paper uses retiming concepts to achieve perfect branch prediction and additional parallelism. Retiming, initially described in [12], alters loop structures in a manner that parallelizes instructions that were not written specifically for parallelism. For example, the code shown below

for i = 0 to 100
  a[i] = a[i-2] + 3
  b[i] = a[i] * 5

can be transformed to a parallel version such as:

a[0] = a[-2] + 3
for i = 0 to 99
  a[i+1] = a[i-1] + 3
  b[i] = a[i] * 5
b[100] = a[100] * 5

In order to apply retiming techniques, the loop body is modeled as a data flow graph. A data flow graph (DFG) G = (V, E, d) is a graph in which each node in the set of computation nodes V represents an instruction and each edge in E, the set of dependency edges, represents the relation between nodes, such that the destination node depends on the origin node. A function d from E to Z represents the delay between two nodes. Consider again the example:

for i = 0 to 100
  a[i] = a[i-2] + 3  /* node A */
  b[i] = a[i] * 5    /* node B */

The data flow graph for this code is shown in figure 3. From this graph it is clear that A and B must be executed sequentially. The retiming method modifies the delay between instructions in order to improve their parallel execution.
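The transformation above can be checked mechanically. The following sketch (illustrative Python; indices are offset by 2 so that the a[-2] and a[-1] references land inside a zero-initialized array) runs both versions and confirms they compute the same b:

```python
N = 101                      # iterations i = 0..100
OFF = 2                      # offset so a[i-2] is a valid index at i = 0

def original():
    a = [0] * (N + OFF)
    b = [0] * N
    for i in range(N):
        a[i + OFF] = a[i + OFF - 2] + 3   # a[i] = a[i-2] + 3
        b[i] = a[i + OFF] * 5             # b[i] = a[i] * 5
    return b

def retimed():
    a = [0] * (N + OFF)
    b = [0] * N
    a[OFF] = a[OFF - 2] + 3               # prologue: a[0] = a[-2] + 3
    for i in range(N - 1):                # i = 0..99
        a[i + 1 + OFF] = a[i - 1 + OFF] + 3   # next iteration's node A
        b[i] = a[i + OFF] * 5                 # current iteration's node B
    b[N - 1] = a[N - 1 + OFF] * 5         # epilogue: b[100] = a[100] * 5
    return b

assert original() == retimed()
```

Inside the retimed loop the two statements are independent: node B reads a value of a that the prologue or an earlier iteration already produced, so A and B can execute in parallel.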
Let r(n) be the retiming value for a node n in the graph and d(e) the delay of an edge e connecting two nodes u and v. The retiming function then modifies the graph such that:

    d_r(e) = r(u) - r(v) + d(e)

where d_r(e) is the delay of edge e after retiming. In this case, retiming node A by 1 and node B by 0 yields the graph in figure 4.
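As a small illustration (a hypothetical Python helper, not part of the paper), the retiming function can be applied edge by edge; below, the A to B edge of the example, together with the self-dependence of A (a[i] uses a[i-2]), is retimed with r(A) = 1 and r(B) = 0:

```python
def retime(edges, r):
    """Apply d_r(e) = r(u) - r(v) + d(e) to each edge (u, v, d)."""
    return [(u, v, r[u] - r[v] + d) for (u, v, d) in edges]

# A -> B with no delay, plus the self-dependence A -> A with delay 2.
edges = [("A", "B", 0), ("A", "A", 2)]
print(retime(edges, {"A": 1, "B": 0}))
# [('A', 'B', 1), ('A', 'A', 2)] -- the A->B edge now carries one delay,
# so B uses the A result of the previous iteration.
```

Self-loops are unchanged by any retiming, since r(u) - r(u) = 0; only delays between distinct nodes are redistributed.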

Figure 3: Graph representation of non-retimed loop body
Figure 4: Graph representation of retimed loop body

3. Problem model

As mentioned earlier, in order to apply retiming transformations, the loop is modeled as a data flow graph. In this paper, modified data flow graphs called conditional data flow graphs are used in order to represent loops that include conditional instructions.

DEFINITION 3.1 A conditional data flow graph (CDFG) G = (V, E, d, t) is a data flow graph in which nodes are differentiated by a function t according to their type into control nodes and regular ones.

In a conditional data flow graph, dependencies are either data or control dependencies. In a data dependence, the result of the destination instruction depends on the result of the origin instruction. In a control dependence, the execution of the destination instruction depends on the decision result of the origin instruction. In the figures, control instructions are represented by diamond-shaped symbols, while all others are represented by circles. For example, the structure in the loop

for i = 0 to 9
  if (x[i] = 0)
    a[i] = a[i-1] + 3   /* instruction A */
    a[i] = a[i-1] - 5   /* instruction B */

is represented by the graph in figure 5.

Figure 5: Conditional data flow graph

In this paper, only unit time CDFGs are considered. A unit time CDFG is a DFG in which the computation times for all nodes are identical. General time CDFGs will be considered in future research. For the sake of convenience, u -e-> v denotes that e is an edge from u to v. Dealing with loop structures, the following definition is needed:

DEFINITION 3.2 An iteration in a CDFG G = (V, E, d, t) is the execution of all nodes v in V exactly once.

Finally, due to the nature of the retiming process, the schedules produced by the process described in this paper are static schedules.

DEFINITION 3.3 A static schedule is a set of assignments of operations to processors in which every operation is assigned to a specific processor at a specific time unit.
4. Predicted if construct

Regardless of the method used, regular branch prediction is bound to yield less than perfect results. However, it is possible to obtain perfect branch prediction using a new construct, the "predicted if" (pif). The new technique is based on the fact that within regular loops, branches usually determine the path to be taken in close proximity to the branch. If, however, the conditional instruction determined the execution of constructs not immediately following it, then those instructions could be prefetched in the correct order. This has been proposed in techniques such as the delayed branch

[11]. In uniform loops that do not have dependencies across iterations, retiming makes it possible to place an entire loop iteration between an if statement and the instruction whose execution it determines. Thus, by the time the "next" instruction needs to be fetched, based on the outcome of the if, that result is already known. As a result, two sets of instructions need to be moved outside the loop: a prologue and an epilogue. The prologue consists of the control and data instructions that are executed so that all the necessary delays are introduced into the structure of the loop. The epilogue wraps up the execution of the loop.

The breakdown of the loop structure results in the need for three new constructs. The first is the predicted if (pif), whose decision is stored for a number of loop iterations, usually one, and is used throughout the restructured loop. At every occurrence of a pif instruction, the previous decision is retrieved from a branch register and used in fetching the target instructions, while the current decision is stored back in the branch register for a future iteration. The second is the auxiliary prediction instruction (paf), whose decision is stored until the first pif is encountered and which is used only in the prologue. This instruction does not cause a branch to be executed; it initializes the branch register. The ultimate predicted if (puf) performs no real action, but acts as a flag marking the location where the control instruction would have been for the instructions in the epilogue. In other words, puf makes no decision, but is used to retrieve the last stored decision and correctly fetch the target instructions.

Directed acyclic graphs do not require data from previous loop iterations in order to execute the control instruction. Therefore, since no cycles are present in the graph, the root node can be viewed as having an incoming edge with an infinite delay.
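The branch-register protocol of the three constructs can be sketched as follows (illustrative Python stand-ins for paf, pif and puf, not a real ISA; the loop body is the test loop used in section 5). paf initializes the register, each pif retrieves the previous decision while storing the next one, and puf consumes the last stored decision, so every retrieved decision is exact:

```python
def run(x):
    """Execute 'if (x[i] < 0) y[i]=y[i-1]+7 else y[i]=y[i-1]-5' for
    i = 1..len(x)-1, with each branch decision computed one iteration early
    and kept in a single-entry branch register."""
    n = len(x) - 1
    y = [0] * (n + 1)
    hits = 0
    branch_reg = (x[1] < 0)            # paf: store decision for iteration 1
    for i in range(1, n):
        taken = branch_reg             # pif: retrieve the stored decision...
        branch_reg = (x[i + 1] < 0)    # ...and store the next iteration's
        hits += (taken == (x[i] < 0))  # retrieved decision is always exact
        y[i] = y[i - 1] + 7 if taken else y[i - 1] - 5
    taken = branch_reg                 # puf: consume the last stored decision
    y[n] = y[n - 1] + 7 if taken else y[n - 1] - 5
    return y, hits

# Alternating signs: the worst case for history-based predictors.
x = [(-1) ** i for i in range(11)]     # 1, -1, 1, -1, ...
y, hits = run(x)
print(hits)                            # 9: all 9 in-loop decisions correct
```

Because the register holds a decision that was actually computed, not guessed, accuracy does not depend on the branch history at all.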
In the examples shown, the root of the graph is the control node. Retiming that control node by 1, the first control statement is executed prior to the start of the loop (paf) and each subsequent iteration executes the control statement for the next iteration (pif) in parallel with the branch sequence of the current iteration. Thus, the result of the control statement is always known one iteration before it is needed. In the example of the following section, the predicted decision is used in all but the last iteration of the loop.

The process of determining the retiming values that optimize the branch prediction can be summarized in the following algorithm.

Algorithm Predict
Input:  DAG G = (V, E, d, t)
Output: retimed G
Begin
  Q <- empty set
  for all v in V
    level(v) <- 0
  while V is not empty
    for all v in V
      if indegree(v) = 0 then
        Q <- Q + {v}
        V <- V - {v}
      endif
    while Q is not empty
      get(Q, u)
      if t(u) = condition then
        increment <- 1
      else
        increment <- 0
      endif
      for all v in V such that u -e-> v
        level(v) <- level(v) + increment
        indegree(v) <- indegree(v) - 1
    endwhile
  endwhile
  for all v in V
    retiming(v) <- max_i(level(i)) - level(v)
End

At the termination of this algorithm, the nodes whose level is greater than 0 are the nodes that constitute the prologue. The level of each node indicates the number of instances of that node that will be present in the prologue. As indicated above, all the nodes of the prologue are pifs. The number of puf statements at the end of the loop is given, for each conditional statement, by the maximum level minus its own level.

5. Example

In this section, a simple example is discussed in order to show the final form of the code. The example consists of an initial loop, which assigns alternating values to an array that is later tested in a second loop. This alternation is one of the worst situations faced by branch prediction schemes, since the history of the conditions is not useful in the prediction process. The following code segment represents such an example.
/* initialize array */
x[0] = 1
for i = 1 to 10 do
  x[i] = x[i-1] * (-1)

/* test the array values */

for i = 1 to 10 do
  if (x[i] < 0)
    y[i] = y[i-1] + 7
  else
    y[i] = y[i-1] - 5

Using the retiming method proposed in this paper and retiming the conditional loop, the following code is obtained:

/* initialize array */
x[0] = 1
for i = 1 to 10 do
  x[i] = x[i-1] * (-1)

/* test the array values */

/* prologue */
paf (x[1] < 0)

/* modified loop */
for i = 1 to 9 do
  pif (x[i+1] < 0)
    y[i] = y[i-1] + 7
  else
    y[i] = y[i-1] - 5

/* epilogue */
puf (x[10] < 0)
  y[10] = y[9] + 7
else
  y[10] = y[9] - 5

The application of this transformation to the given example results in 100% accuracy for the execution of the if statements. The use of a static branch predictor would result in an accuracy of 50%, while a dynamic branch prediction scheme using a 1-bit history would result in 0% correctness. Predication is not considered a viable scheme because it would require extra resources to be competitive in performance.

6. Summary

One of the major remaining problems in obtaining optimal performance from a processor is the inability to predict with certainty the results of control instructions. To provide a possible solution to this problem, this paper has presented three new instructions: the auxiliary predicted if (paf), the predicted if (pif) and the ultimate predicted if (puf). These instructions, used in conjunction with the technique of retiming, can obtain perfect branch prediction in loops with no dependencies across iterations. These techniques are also viable for use in nested one-dimensional loop structures and multi-dimensional cases.

7. Acknowledgements

This work was supported by the National Science Foundation under Grant No. MIP

References

[1] A. Aiken and A. Nicolau, "Resource-Constrained Software Pipelining," IEEE Transactions on Parallel and Distributed Systems, Vol. 6, No. 12, December 1995, pp.
[2] B. K. Bray and M. J. Flynn, "Strategies for Branch Target Buffers," Proceedings of the 24th Microarchitecture, November 1991, pp.

[3] C. Burch, "PA-8000: A Case Study of Static and Dynamic Branch Prediction," Proceedings of the International Conference on Computer Design, 1997, pp.
[4] P. Chang and U. Banerjee, "Profile-Guided Multi-heuristic Branch Prediction," Proceedings of the 1995 International Conference on Parallel Processing, Vol. 1, 1995, pp.
[5] L.-F. Chao and E. H.-M. Sha, "Static Scheduling of Uniform Nested Loops," Proceedings of the 7th International Parallel Processing Symposium, Newport Beach, CA, April 1993, pp.
[6] I.-C. K. Chen et al., "Design Optimization for High-speed Per-address Two-level Branch Predictors," Proceedings of the International Conference on Computer Design, 1997, pp.
[7] T. M. Conte, B. A. Patel, and J. S. Cox, "Using Branch Handling Hardware to Support Profile-Driven Optimization," Proceedings of the 27th Microarchitecture, November 1994, pp.
[8] H. L. Dershem and M. J. Jipping, Programming Languages: Structures and Models, PWS Publishing Company.
[9] P. K. Dubey and R. Nair, "Profile-Driven Generation of Trace Samples," Proceedings of the International Conference on Computer Design, 1996, pp.
[10] P. K. Dubey and T. J. Watson, "Single-Thread ILP Limits and Compile-time Multithread Speculation," 1995 International Conference on Parallel Processing, Workshop on Challenges for Parallel Processors, 1995, pp.
[11] J. L. Hennessy and D. A. Patterson, Computer Organization & Design: The Hardware/Software Interface, Morgan Kaufmann Publishers Inc.
[12] C. E. Leiserson and J. B. Saxe, "Optimizing Synchronous Systems," Journal of VLSI and Computer Systems, Vol. 1, No. 1, 1983, pp.
[13] L.-S. Liu, C.-W. Ho and J.-P. Sheu, "On the Parallelism of Nested For-Loops Using Index Shift Method," International Conference on Parallel Processing, 1990, pp.
[14] Y. Liu and D. R. Kaeli, "Branch-Directed and Stride-Based Data Cache Prefetching," Proceedings of the International Conference on Computer Design, 1996, pp.
[15] S. A. Mahlke et al., "Characterizing the Impact of Predicated Execution on Branch Prediction," Proceedings of the 27th ACM/IEEE International Symposium on Microarchitecture, November 1994, pp.
[16] M. Sakamoto et al., "Microarchitecture Support for Reducing Branch Penalty in a Superscalar Processor," Proceedings of the International Conference on Computer Design, 1996, pp.
[17] Z. Tang et al., "GPMB: Software Pipelining Branch-Intensive Loops," Proceedings of the 26th Microarchitecture, November 1994, pp.
[18] M. D. Tarlescu, K. B. Theobald and G. R. Gao, "Elastic History Buffer: A Low-Cost Method to Improve Branch Prediction Accuracy," Proceedings of the International Conference on Computer Design, 1997, pp.
[19] G. S. Tyson, "The Effects of Predicated Execution on Branch Prediction," Proceedings of the 27th ACM/IEEE International Symposium on Microarchitecture, November 1994, pp.
[20] J. P. Vogel and B. K. Holmer, "Analysis of the Conditional Skip Instructions of the HP Precision Architecture," Proceedings of the 27th Microarchitecture, November 1994, pp.
[21] T.-Y. Yeh and Y. N. Patt, "Two-Level Adaptive Branch Prediction," Proceedings of the 24th Microarchitecture, 1991, pp.


Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors Portland State University ECE 587/687 The Microarchitecture of Superscalar Processors Copyright by Alaa Alameldeen and Haitham Akkary 2011 Program Representation An application is written as a program,

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2 Lecture 5: Instruction Pipelining Basic concepts Pipeline hazards Branch handling and prediction Zebo Peng, IDA, LiTH Sequential execution of an N-stage task: 3 N Task 3 N Task Production time: N time

More information

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated

More information

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 18 Dynamic Instruction Scheduling with Branch Prediction

More information

Complex Pipelines and Branch Prediction

Complex Pipelines and Branch Prediction Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle

More information

Predicated Software Pipelining Technique for Loops with Conditions

Predicated Software Pipelining Technique for Loops with Conditions Predicated Software Pipelining Technique for Loops with Conditions Dragan Milicev and Zoran Jovanovic University of Belgrade E-mail: emiliced@ubbg.etf.bg.ac.yu Abstract An effort to formalize the process

More information

In embedded systems there is a trade off between performance and power consumption. Using ILP saves power and leads to DECREASING clock frequency.

In embedded systems there is a trade off between performance and power consumption. Using ILP saves power and leads to DECREASING clock frequency. Lesson 1 Course Notes Review of Computer Architecture Embedded Systems ideal: low power, low cost, high performance Overview of VLIW and ILP What is ILP? It can be seen in: Superscalar In Order Processors

More information

Accuracy Enhancement by Selective Use of Branch History in Embedded Processor

Accuracy Enhancement by Selective Use of Branch History in Embedded Processor Accuracy Enhancement by Selective Use of Branch History in Embedded Processor Jong Wook Kwak 1, Seong Tae Jhang 2, and Chu Shik Jhon 1 1 Department of Electrical Engineering and Computer Science, Seoul

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

University of Toronto Faculty of Applied Science and Engineering

University of Toronto Faculty of Applied Science and Engineering Print: First Name:............ Solutions............ Last Name:............................. Student Number:............................................... University of Toronto Faculty of Applied Science

More information

Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction

Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction ISA Support Needed By CPU Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction So far we have dealt with control hazards in instruction pipelines by: 1 2 3 4 Assuming that the branch

More information

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer

More information

Looking for Instruction Level Parallelism (ILP) Branch Prediction. Branch Prediction. Importance of Branch Prediction

Looking for Instruction Level Parallelism (ILP) Branch Prediction. Branch Prediction. Importance of Branch Prediction Looking for Instruction Level Parallelism (ILP) Branch Prediction We want to identify and exploit ILP instructions that can potentially be executed at the same time. Branches are 5-20% of instructions

More information

Practice Problems (Con t) The ALU performs operation x and puts the result in the RR The ALU operand Register B is loaded with the contents of Rx

Practice Problems (Con t) The ALU performs operation x and puts the result in the RR The ALU operand Register B is loaded with the contents of Rx Microprogram Control Practice Problems (Con t) The following microinstructions are supported by each CW in the CS: RR ALU opx RA Rx RB Rx RB IR(adr) Rx RR Rx MDR MDR RR MDR Rx MAR IR(adr) MAR Rx PC IR(adr)

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Objective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers.

Objective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers. CS 612 Software Design for High-performance Architectures 1 computers. CS 412 is desirable but not high-performance essential. Course Organization Lecturer:Paul Stodghill, stodghil@cs.cornell.edu, Rhodes

More information

Instruction Level Parallelism (ILP)

Instruction Level Parallelism (ILP) 1 / 26 Instruction Level Parallelism (ILP) ILP: The simultaneous execution of multiple instructions from a program. While pipelining is a form of ILP, the general application of ILP goes much further into

More information

CS 2410 Mid term (fall 2018)

CS 2410 Mid term (fall 2018) CS 2410 Mid term (fall 2018) Name: Question 1 (6+6+3=15 points): Consider two machines, the first being a 5-stage operating at 1ns clock and the second is a 12-stage operating at 0.7ns clock. Due to data

More information

Multithreaded Value Prediction

Multithreaded Value Prediction Multithreaded Value Prediction N. Tuck and D.M. Tullesn HPCA-11 2005 CMPE 382/510 Review Presentation Peter Giese 30 November 2005 Outline Motivation Multithreaded & Value Prediction Architectures Single

More information

Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions.

Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions. Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions Stage Instruction Fetch Instruction Decode Execution / Effective addr Memory access Write-back Abbreviation

More information

Computer Architecture: Branch Prediction. Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Branch Prediction. Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Branch Prediction Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 11: Branch Prediction

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 14 Very Long Instruction Word Machines

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 14 Very Long Instruction Word Machines ECE 252 / CPS 220 Advanced Computer Architecture I Lecture 14 Very Long Instruction Word Machines Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html

More information

LIMITS OF ILP. B649 Parallel Architectures and Programming

LIMITS OF ILP. B649 Parallel Architectures and Programming LIMITS OF ILP B649 Parallel Architectures and Programming A Perfect Processor Register renaming infinite number of registers hence, avoids all WAW and WAR hazards Branch prediction perfect prediction Jump

More information

(Refer Slide Time: 00:02:04)

(Refer Slide Time: 00:02:04) Computer Architecture Prof. Anshul Kumar Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture - 27 Pipelined Processor Design: Handling Control Hazards We have been

More information

Computer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014

Computer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014 18-447 Computer Architecture Lecture 15: Load/Store Handling and Data Flow Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014 Lab 4 Heads Up Lab 4a out Branch handling and branch predictors

More information

Efficient Polynomial-Time Nested Loop Fusion with Full Parallelism

Efficient Polynomial-Time Nested Loop Fusion with Full Parallelism Efficient Polynomial-Time Nested Loop Fusion with Full Parallelism Edwin H.-M. Sha Timothy W. O Neil Nelson L. Passos Dept. of Computer Science Dept. of Computer Science Dept. of Computer Science Erik

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 26 Cache Optimization Techniques (Contd.) (Refer

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

06-2 EE Lecture Transparency. Formatted 14:50, 4 December 1998 from lsli

06-2 EE Lecture Transparency. Formatted 14:50, 4 December 1998 from lsli 06-1 Vector Processors, Etc. 06-1 Some material from Appendix B of Hennessy and Patterson. Outline Memory Latency Hiding v. Reduction Program Characteristics Vector Processors Data Prefetch Processor /DRAM

More information

CS252 Graduate Computer Architecture Midterm 1 Solutions

CS252 Graduate Computer Architecture Midterm 1 Solutions CS252 Graduate Computer Architecture Midterm 1 Solutions Part A: Branch Prediction (22 Points) Consider a fetch pipeline based on the UltraSparc-III processor (as seen in Lecture 5). In this part, we evaluate

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

Computer Architecture: Out-of-Order Execution II. Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Out-of-Order Execution II. Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Out-of-Order Execution II Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 15 Video

More information

Parallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming University of Evansville

Parallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming University of Evansville Parallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming of Evansville Selection of slides from CIS 410/510 Introduction to Parallel Computing Department of Computer and Information

More information

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 7-1 Announcements Project proposal due on Wed 10/14 2-3 pages submitted

More information

Announcements. ECE4750/CS4420 Computer Architecture L10: Branch Prediction. Edward Suh Computer Systems Laboratory

Announcements. ECE4750/CS4420 Computer Architecture L10: Branch Prediction. Edward Suh Computer Systems Laboratory ECE4750/CS4420 Computer Architecture L10: Branch Prediction Edward Suh Computer Systems Laboratory suh@csl.cornell.edu Announcements Lab2 and prelim grades Back to the regular office hours 2 1 Overview

More information

ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING 16 MARKS CS 2354 ADVANCE COMPUTER ARCHITECTURE 1. Explain the concepts and challenges of Instruction-Level Parallelism. Define

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Hardware Loop Buffering

Hardware Loop Buffering Hardware Loop Buffering Scott DiPasquale, Khaled Elmeleegy, C.J. Ganier, Erik Swanson Abstract Several classes of applications can be characterized by repetition of certain behaviors or the regular distribution

More information

Instruction Level Parallelism (Branch Prediction)

Instruction Level Parallelism (Branch Prediction) Instruction Level Parallelism (Branch Prediction) Branch Types Type Direction at fetch time Number of possible next fetch addresses? When is next fetch address resolved? Conditional Unknown 2 Execution

More information

INSTRUCTION LEVEL PARALLELISM

INSTRUCTION LEVEL PARALLELISM INSTRUCTION LEVEL PARALLELISM Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix H, John L. Hennessy and David A. Patterson,

More information

Lecture-18 (Cache Optimizations) CS422-Spring

Lecture-18 (Cache Optimizations) CS422-Spring Lecture-18 (Cache Optimizations) CS422-Spring 2018 Biswa@CSE-IITK Compiler Optimizations Loop interchange Merging Loop fusion Blocking Refer H&P: You need it for PA3 and PA4 too. CS422: Spring 2018 Biswabandan

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Using Retiming to Minimize Inter-Iteration Dependencies

Using Retiming to Minimize Inter-Iteration Dependencies Using Retiming to Minimize Inter-Iteration Dependencies Timothy W. O Neil Edwin H.-M. Sha Computer Science Dept. Computer Science Dept. University of Akron Univ. of Texas at Dallas Akron, OH 44325-4002

More information

Lecture notes for CS Chapter 2, part 1 10/23/18

Lecture notes for CS Chapter 2, part 1 10/23/18 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Instruction Scheduling. Software Pipelining - 3

Instruction Scheduling. Software Pipelining - 3 Instruction Scheduling and Software Pipelining - 3 Department of Computer Science and Automation Indian Institute of Science Bangalore 560 012 NPTEL Course on Principles of Compiler Design Instruction

More information

Computer System Architecture Final Examination Spring 2002

Computer System Architecture Final Examination Spring 2002 Computer System Architecture 6.823 Final Examination Spring 2002 Name: This is an open book, open notes exam. 180 Minutes 22 Pages Notes: Not all questions are of equal difficulty, so look over the entire

More information

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell

More information

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1] EE482: Advanced Computer Organization Lecture #7 Processor Architecture Stanford University Tuesday, June 6, 2000 Memory Systems and Memory Latency Lecture #7: Wednesday, April 19, 2000 Lecturer: Brian

More information

Computer Performance Evaluation and Benchmarking. EE 382M Dr. Lizy Kurian John

Computer Performance Evaluation and Benchmarking. EE 382M Dr. Lizy Kurian John Computer Performance Evaluation and Benchmarking EE 382M Dr. Lizy Kurian John Desirable features for modeling/evaluation techniques Accurate Not expensive Non-invasive User-friendly Fast Easy to change

More information

15-740/ Computer Architecture Lecture 28: Prefetching III and Control Flow. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 11/28/11

15-740/ Computer Architecture Lecture 28: Prefetching III and Control Flow. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 11/28/11 15-740/18-740 Computer Architecture Lecture 28: Prefetching III and Control Flow Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 11/28/11 Announcements for This Week December 2: Midterm II Comprehensive

More information

Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel

Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel Aalborg Universitet Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel Publication date: 2006 Document Version Early version, also known as pre-print

More information

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year & Semester : III/VI Section : CSE-1 & CSE-2 Subject Code : CS2354 Subject Name : Advanced Computer Architecture Degree & Branch : B.E C.S.E. UNIT-1 1.

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Instruction-Level Parallelism. Instruction Level Parallelism (ILP)

Instruction-Level Parallelism. Instruction Level Parallelism (ILP) Instruction-Level Parallelism CS448 1 Pipelining Instruction Level Parallelism (ILP) Limited form of ILP Overlapping instructions, these instructions can be evaluated in parallel (to some degree) Pipeline

More information

EECS 470. Lecture 15. Prefetching. Fall 2018 Jon Beaumont. History Table. Correlating Prediction Table

EECS 470. Lecture 15. Prefetching. Fall 2018 Jon Beaumont.   History Table. Correlating Prediction Table Lecture 15 History Table Correlating Prediction Table Prefetching Latest A0 A0,A1 A3 11 Fall 2018 Jon Beaumont A1 http://www.eecs.umich.edu/courses/eecs470 Prefetch A3 Slides developed in part by Profs.

More information

A VHDL Design Optimization for Two-Dimensional Filters

A VHDL Design Optimization for Two-Dimensional Filters A VHDL Design Optimization for Two-Dimensional Filters Nelson Luiz Passos Jian Song Robert Light Ranette Halverson Richard Simpson Department of Computer Science Midwestern State University Wichita Falls,

More information

Performance Measures of Superscalar Processor

Performance Measures of Superscalar Processor Performance Measures of Superscalar Processor K.A.Parthasarathy Global Institute of Engineering and Technology Vellore, Tamil Nadu, India ABSTRACT In this paper the author describes about superscalar processor

More information

Announcements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog)

Announcements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog) Announcements EE382A Lecture 6: Register Renaming Project proposal due on Wed 10/14 2-3 pages submitted through email List the group members Describe the topic including why it is important and your thesis

More information

Old formulation of branch paths w/o prediction. bne $2,$3,foo subu $3,.. ld $2,..

Old formulation of branch paths w/o prediction. bne $2,$3,foo subu $3,.. ld $2,.. Old formulation of branch paths w/o prediction bne $2,$3,foo subu $3,.. ld $2,.. Cleaner formulation of Branching Front End Back End Instr Cache Guess PC Instr PC dequeue!= dcache arch. PC branch resolution

More information

A Mechanism for Verifying Data Speculation

A Mechanism for Verifying Data Speculation A Mechanism for Verifying Data Speculation Enric Morancho, José María Llabería, and Àngel Olivé Computer Architecture Department, Universitat Politècnica de Catalunya (Spain), {enricm, llaberia, angel}@ac.upc.es

More information

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version: SISTEMI EMBEDDED Computer Organization Pipelining Federico Baronti Last version: 20160518 Basic Concept of Pipelining Circuit technology and hardware arrangement influence the speed of execution for programs

More information

A Time-Predictable Instruction-Cache Architecture that Uses Prefetching and Cache Locking

A Time-Predictable Instruction-Cache Architecture that Uses Prefetching and Cache Locking A Time-Predictable Instruction-Cache Architecture that Uses Prefetching and Cache Locking Bekim Cilku, Daniel Prokesch, Peter Puschner Institute of Computer Engineering Vienna University of Technology

More information

ait: WORST-CASE EXECUTION TIME PREDICTION BY STATIC PROGRAM ANALYSIS

ait: WORST-CASE EXECUTION TIME PREDICTION BY STATIC PROGRAM ANALYSIS ait: WORST-CASE EXECUTION TIME PREDICTION BY STATIC PROGRAM ANALYSIS Christian Ferdinand and Reinhold Heckmann AbsInt Angewandte Informatik GmbH, Stuhlsatzenhausweg 69, D-66123 Saarbrucken, Germany info@absint.com

More information