UNIT I FUNDAMENTALS OF COMPUTER DESIGN


1.1 INTRODUCTION

Computer technology has made incredible progress over roughly the last 55 years. This rapid rate of improvement has come both from advances in the technology used to build computers and from innovation in computer design. During the first 25 years of electronic computers, both forces made a major contribution; but beginning in about 1970, computer designers became largely dependent upon integrated circuit technology. During the 1970s, performance continued to improve at about 25% to 30% per year for the mainframes and minicomputers that dominated the industry. In the late 1970s, after the invention of the microprocessor, growth rose to roughly 35% per year in performance. This growth rate, combined with the cost advantages of a mass-produced microprocessor, led to microprocessors accounting for an increasing fraction of the computer business.

In addition, two significant changes were observed in the computer industry. First, the virtual elimination of assembly language programming reduced the need for object-code compatibility. Second, the creation of standardized, vendor-independent operating systems, such as UNIX and its clone, Linux, lowered the cost and risk of bringing out a new architecture. These changes made it possible to successfully develop a new set of architectures, called RISC (Reduced Instruction Set Computer) architectures, in the early 1980s. The RISC-based machines focused the attention of designers on two critical performance techniques: the exploitation of instruction-level parallelism and the use of caches. The combination of architectural and organizational enhancements has led to 20 years of sustained growth in performance at an annual rate of over 50%.

The effect of this dramatic growth rate has been twofold. First, it has significantly enhanced the capability available to computer users. For many applications, the highest-performance microprocessors of today outperform the supercomputers of less than 10 years ago. Second, this dramatic rate of improvement has led to the dominance of microprocessor-based computers across the entire range of computer design.

1.2 THE CHANGING FACE OF COMPUTING AND THE TASK OF THE COMPUTER DESIGNER

In the 1960s, the dominant form of computing was on large mainframes, machines costing millions of dollars and stored in computer rooms with multiple operators overseeing their support. Typical applications included business data processing and large-scale scientific computing.

The 1970s saw the birth of the minicomputer, a smaller-sized machine initially focused on applications in scientific laboratories, but rapidly branching out as the technology of timesharing, multiple users sharing a computer interactively through independent terminals, became widespread.

The 1980s saw the rise of the desktop computer based on microprocessors, in the form of both personal computers and workstations. The individually owned desktop computer replaced timesharing and led to the rise of servers, computers that provided larger-scale services such as reliable, long-term file storage and access, larger memory, and more computing power.

The 1990s saw the emergence of the Internet and the World Wide Web, the first successful handheld computing devices (personal digital assistants or PDAs), and the emergence of high-performance digital consumer electronics, varying from video games to set-top boxes.

These changes in computer use have led to three different computing markets, each characterized by different applications, requirements, and computing technologies.

Desktop Computing

Desktop systems often are where the newest, highest-performance microprocessors appear, as well as where recently cost-reduced microprocessors and systems appear first. Desktop computing also tends to be reasonably well characterized in terms of applications and benchmarking, though the increasing use of web-centric, interactive applications poses new challenges in performance evaluation.

Servers

The role of servers grew to provide larger-scale and more reliable file and computing services. The emergence of the World Wide Web accelerated this trend due to the tremendous growth in demand for web servers and the growth in sophistication of web-based services. Such servers have become the backbone of large-scale enterprise computing, replacing the traditional mainframe.

Characteristics of servers:

1) Availability is critical. We use the term availability to mean that the system can reliably and effectively provide a service. This term is to be distinguished from reliability, which says that the system never fails. Parts of large-scale systems unavoidably fail; the challenge in a server is to maintain system availability in the face of component failures, usually through the use of redundancy.

2) Emphasis on scalability. Server systems often grow over their lifetime in response to a growing demand for the services they support or an increase in functional requirements. Thus, the ability to scale up the computing capacity, the memory, the storage, and the I/O bandwidth of a server is crucial.

3) Efficient throughput. That is, the overall performance of the server, in terms of transactions per minute or web pages served per second, is what is crucial. Responsiveness to an individual request remains important, but overall efficiency and cost-effectiveness, as determined by how many requests can be handled in a unit time, are the key metrics for most servers.

Embedded Computers

Embedded computers are computers lodged in other devices where the presence of the computer is not immediately obvious. The range of application of these devices goes from simple embedded microprocessors that might appear in everyday machines, to handheld digital devices (such as palmtops, cell phones, and smart cards), to video games and digital set-top boxes. The application can usually be carefully tuned for the processor and system; this process sometimes includes limited use of assembly language in key loops, although time-to-market pressures and good software engineering practice usually restrict such assembly language coding to a small fraction of the application. This use of assembly language, together with the presence of standardized operating systems and a large code base, has meant that instruction set compatibility has become an important concern in the embedded market. Simply put, like other computing applications, software costs are often a large factor in the total cost of an embedded system.

Embedded computers have the widest range of processing power and cost: from low-end 8-bit and 16-bit processors that may cost less than a dollar, to full 32-bit microprocessors capable of executing 50 million instructions per second that cost under $10, to high-end embedded processors (that can execute a billion instructions per second and cost hundreds of dollars) for the newest video game or for a high-end network switch. Although the range of computing power in the embedded computing market is very large, price is a key factor in the design of computers for this space.

Performance requirements do exist, of course, but the primary goal is often meeting the performance need at a minimum price, rather than achieving higher performance at a higher price. The performance requirement in an embedded application is often a real-time requirement. A real-time performance requirement is one where a segment of the application has an absolute maximum execution time that is allowed. Real-time performance tends to be highly application dependent. It is usually measured using kernels either from the application or from a standardized benchmark. Two other key characteristics exist in many embedded applications: the need to minimize memory and the need to minimize power.

Embedded problems are usually solved by one of three approaches:

1. Using a combined hardware/software solution that includes some custom hardware and typically a standard embedded processor.
2. Using custom software running on an off-the-shelf embedded processor.
3. Using a digital signal processor and custom software.

THE TASK OF THE COMPUTER DESIGNER

The task the computer designer faces is a complex one: determine what attributes are important for a new machine, and then design a machine to maximize performance while staying within cost and power constraints. This task has many aspects, including instruction set design, functional organization, logic design, and implementation. The implementation may encompass integrated circuit design, packaging, power, and cooling. Optimizing the design requires familiarity with a very wide range of technologies, from compilers and operating systems to logic design and packaging.

The term computer architecture often referred only to instruction set design. The architect's or designer's job is much more than instruction set design, and the technical hurdles in the other aspects of the project are certainly as challenging as those encountered in doing instruction set design. This challenge is particularly acute at the present, when the differences among instruction sets are small and when there are three rather distinct application areas.

Other aspects of computer design were called implementation. The implementation of a machine has two components: organization and hardware. The term organization includes the high-level aspects of a computer's design, such as the memory system, the bus structure, and the design of the internal CPU. Hardware is used to refer to the specifics of a machine, including the detailed logic design and the packaging technology of the machine. The word architecture is intended to cover all three aspects of computer design: instruction set architecture, organization, and hardware.

Computer architects must design a computer to meet functional requirements as well as price, power, and performance goals. The requirements may be specific features inspired by the market.

(Figure: summary of some of the most important functional requirements an architect faces.)

1.3 TECHNOLOGY TRENDS

The changes in the computer applications space over the last decade have dramatically changed the metrics. Desktop computers remain focused on optimizing cost-performance as measured by a single user, servers focus on availability, scalability, and throughput cost-performance, and embedded computers are driven by price and often power issues. If an instruction set architecture is to be successful, it must be designed to survive rapid changes in computer technology. An architect must plan for technology changes that can increase the lifetime of a computer. Four implementation technologies changed the computer industry:

1. Integrated circuit logic technology: Transistor density increases by about 35% per year, and die size increases 10% to 20% per year. The combined effect is a growth rate in transistor count on a chip of about 55% per year.

2. Semiconductor DRAM: Density increases by between 40% and 60% per year, and cycle time has improved very slowly, decreasing by about one-third in 10 years. Bandwidth per chip increases about twice as fast as latency decreases. In addition, changes to the DRAM interface have also improved the bandwidth.

3. Magnetic disk technology: Recently, disk density has been improving by more than 100% per year. Prior to 1990, density increased by about 30% per year, doubling in three years. It appears that disk technology will continue the faster density growth rate for some time to come. Access time has improved by one-third in 10 years.

4. Network technology: Network performance depends both on the performance of switches and on the performance of the transmission system; both latency and bandwidth can be improved, though recently bandwidth has been the primary focus. For many years, networking technology appeared to improve slowly: for example, it took about 10 years for Ethernet technology to move from 10 Mb to 100 Mb. The increased importance of networking has led to a faster rate of progress, with 1 Gb Ethernet becoming available about five years after 100 Mb.

These rapidly changing technologies impact the design of a microprocessor that may, with speed and technology enhancements, have a lifetime of five or more years.
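These compound growth rates are easier to compare as doubling times. Here is a minimal sketch (Python, with the growth rates taken from the text; the conversion is the standard compound-growth relation, not a figure from the source):

    import math

    def doubling_time(annual_growth):
        # Years to double at a compound annual growth rate g:
        # (1 + g)^t = 2  =>  t = ln 2 / ln(1 + g)
        return math.log(2) / math.log(1 + annual_growth)

    rates = {
        "transistor count per chip": 0.55,
        "DRAM density (low estimate)": 0.40,
        "DRAM density (high estimate)": 0.60,
        "disk density (recent)": 1.00,
    }
    for name, rate in rates.items():
        print(f"{name}: doubles every {doubling_time(rate):.1f} years")

At 55% per year, transistor count doubles roughly every 1.6 years, which matches the popular statement of Moore's Law.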

Scaling of Transistor Performance, Wires, and Power in Integrated Circuits

Integrated circuit processes are characterized by the feature size, which decreased from 10 microns in 1971 to 0.18 microns in 2001. Since a transistor is a two-dimensional object, the density of transistors increases quadratically with a linear decrease in feature size. The increase in transistor performance is more complicated: the combination of scaling factors leads to a complex interrelationship between transistor performance and process feature size. To a first approximation, transistor performance improves linearly with decreasing feature size. In the early days of microprocessors, the higher rate of improvement in density was used to quickly move from 4-bit, to 8-bit, to 16-bit, to 32-bit microprocessors. More recently, density improvements have supported the introduction of 64-bit microprocessors as well as many of the innovations in pipelining and caches.

The signal delay for a wire increases in proportion to the product of its resistance and capacitance. As feature size shrinks, wires get shorter, but the resistance and capacitance per unit length get worse. Both resistance and capacitance depend on detailed aspects of the process, the geometry of a wire, the loading on a wire, and even the adjacency to other structures. In the past few years, wire delay has become a major design limitation for large integrated circuits and is often more critical than transistor switching delay. Larger and larger fractions of the clock cycle have been consumed by the propagation delay of signals on wires. In 2001, the Pentium 4 broke new ground by allocating two stages of its 20+ stage pipeline just for propagating signals across the chip.

Power also provides challenges as devices are scaled. For modern CMOS microprocessors, the dominant energy consumption is in switching transistors. The energy required per transistor switch is proportional to the product of the load capacitance of the transistor and the square of the voltage; the resulting switching power is proportional to the load capacitance, the frequency of switching, and the square of the voltage. As we move from one process to the next, the increase in the number of transistors switching and the frequency with which they switch dominates the decrease in load capacitance and voltage, leading to an overall growth in power consumption.
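The scaling argument can be made concrete with the standard CMOS dynamic-power relation, P = a x C x V^2 x f. The sketch below uses illustrative numbers of my own choosing, not measured process data:

    def dynamic_power(capacitance, voltage, frequency, activity=1.0):
        # Dynamic switching power of CMOS: P = a * C * V^2 * f
        return activity * capacitance * voltage ** 2 * frequency

    # Hypothetical process shift: capacitance drops 30%, voltage drops 20%,
    # but the clock runs 1.5x faster and twice as many transistors switch.
    old = dynamic_power(1.0, 1.0, 1.0)
    new = 2.0 * dynamic_power(0.7, 0.8, 1.5)
    print(f"power ratio new/old = {new / old:.2f}")  # ~1.34: total power grows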

1.4 COST, PRICE AND THEIR TRENDS

In the past 15 years, the use of technology improvements to achieve lower cost, as well as increased performance, has been a major theme in the computer industry. Price is what you sell a finished good for; cost is the amount spent to produce it, including overhead.

The Impact of Time, Volume, and Commodification

Time: The cost of a manufactured computer component decreases over time even without major improvements in the basic implementation technology. The underlying principle that drives costs down is the learning curve: manufacturing costs decrease over time. As an example of the learning curve in action, the price per megabyte of DRAM drops over the long term by 40% per year. Microprocessor prices also drop over time, but because they are less standardized than DRAMs, the relationship between price and cost is more complex. In a period of significant competition, price tends to track cost closely.

Volume is a second key factor in determining cost. Increasing volumes affect cost in several ways. First, they decrease the time needed to get down the learning curve, which is partly proportional to the number of systems (or chips) manufactured. Second, volume decreases cost, since it increases purchasing and manufacturing efficiency. As a rule of thumb, some designers have estimated that cost decreases about 10% for each doubling of volume.

Commodities are products that are sold by multiple vendors in large volumes and are essentially identical. Virtually all the products sold on the shelves of grocery stores are commodities, as are standard DRAMs, disks, monitors, and keyboards. In the past 10 years, much of the low end of the computer business has become a commodity business focused on building IBM-compatible PCs.

There are a variety of vendors that ship virtually identical products and are highly competitive. Of course, this competition decreases the gap between cost and selling price, but it also decreases cost.

Cost of an Integrated Circuit

The cost of a packaged integrated circuit is

    Cost of integrated circuit = (Cost of die + Cost of testing die + Cost of packaging and final test) / Final test yield

Learning the number of good dies per wafer requires first learning how many dies fit on a wafer and then learning how to predict the percentage of those that will work. From there it is simple to predict cost:

    Cost of die = Cost of wafer / (Dies per wafer x Die yield)

The number of dies per wafer is basically the area of the wafer divided by the area of the die. It can be more accurately estimated by

    Dies per wafer = (pi x (Wafer diameter / 2)^2) / Die area - (pi x Wafer diameter) / sqrt(2 x Die area)

The first term is the ratio of wafer area (pi x r^2) to die area. The second compensates for the "square peg in a round hole" problem: rectangular dies near the periphery of round wafers. Dividing the circumference (pi x d) by the diagonal of a square die is approximately the number of dies along the edge. For example, a wafer 30 cm (about 12 inches) in diameter produces pi x 225 - (pi x 30) / 1.41 = 640 1-cm2 dies. Die costs are proportional to the fifth (or higher) power of the die area:

    Cost of die = f(Die area^5)
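The die-cost arithmetic is easy to check in a few lines; a minimal sketch using the 30 cm wafer and 1 cm2 die from the example (the wafer cost and yield below are hypothetical values, not from the source):

    import math

    def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
        # pi*(d/2)^2 / A, minus the edge correction pi*d / sqrt(2*A)
        wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
        edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
        return wafer_area / die_area_cm2 - edge_loss

    def die_cost(wafer_cost, wafer_diameter_cm, die_area_cm2, die_yield):
        # Cost of die = cost of wafer / (dies per wafer * die yield)
        return wafer_cost / (dies_per_wafer(wafer_diameter_cm, die_area_cm2) * die_yield)

    print(round(dies_per_wafer(30, 1.0)))          # 640, matching the example
    print(f"${die_cost(5000, 30, 1.0, 0.5):.2f}")  # hypothetical $5000 wafer, 50% yield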

Distribution of Cost in a System: An Example

The figure shows the approximate cost breakdown for a $1,000 PC in 2001. Although the costs of some parts of this machine can be expected to drop over time, other components, such as the packaging and power supply, have little room for improvement.

Cost Versus Price: Why They Differ and By How Much

Cost goes through a number of changes before it becomes price, and the computer designer should understand how a design decision will affect the potential selling price. For example, changing cost by $1000 may change price by $3000 to $4000. The relationship between price and volume can increase the impact of changes in cost, especially at the low end of the market. Typically, fewer computers are sold as the price increases. Furthermore, as volume decreases, costs rise, leading to further increases in price.

Direct costs refer to the costs directly related to making a product. These include labor costs, purchasing components, scrap (the leftover from yield), and warranty. Direct cost typically adds 10% to 30% to component cost. The next addition is called the gross margin, the company's overhead that cannot be billed directly to one product. This can be thought of as indirect cost. It includes the company's research and development (R&D), marketing, sales, manufacturing equipment maintenance, building rental, cost of financing, pretax profits, and taxes. When the component costs are added to the direct cost and gross margin, we reach the average selling price, the money that comes directly to the company for each product sold. The gross margin is typically 10% to 45% of the average selling price, depending on the uniqueness of the product.

Manufacturers of low-end PCs have lower gross margins for several reasons. First, their R&D expenses are lower. Second, their cost of sales is lower, since they use indirect distribution (by mail, the Internet, phone order, or retail store) rather than salespeople. Third, because their products are less unique, competition is more intense, thus forcing lower prices and often lower profits, which in turn lead to a lower gross margin.

List price and average selling price are not the same. One reason for this is that companies offer volume discounts, lowering the average selling price. As personal computers became commodity products, retail markups have dropped significantly, so list price and average selling price have moved closer together.

(Figure: the components of price for a $1,000 PC.)

Each increase is shown along the bottom as a tax on the prior price. The percentages of the new price for all elements are shown on the left of each column.

1.5 MEASURING AND REPORTING PERFORMANCE

The computer user is interested in reducing response time (the time between the start and the completion of an event), also referred to as execution time. The manager of a large data processing center may be interested in increasing throughput (the total amount of work done in a given time).

In comparing design alternatives, we often want to relate the performance of two different machines, say X and Y. The phrase "X is faster than Y" is used here to mean that the response time or execution time is lower on X than on Y for the given task. In particular, "X is n times faster than Y" will mean

    Execution time_Y / Execution time_X = n

Since execution time is the reciprocal of performance, the following relationship holds:

    n = Execution time_Y / Execution time_X = Performance_X / Performance_Y

The phrase "the throughput of X is 1.3 times higher than Y" signifies here that the number of tasks completed per unit time on machine X is 1.3 times the number completed on Y.

Even execution time can be defined in different ways depending on what we count. The most straightforward definition of time is called wall-clock time, response time, or elapsed time, which is the latency to complete a task, including disk accesses, memory accesses, input/output activities, and operating system overhead.

Choosing Programs to Evaluate Performance

A computer user who runs the same programs day in and day out would be the perfect candidate to evaluate a new computer. To evaluate a new system the user would simply compare the execution time of her workload: the mixture of programs and operating system commands that users run on a machine. There are five levels of programs used in such circumstances, listed below in decreasing order of accuracy of prediction.

1. Real applications: Although the buyer may not know what fraction of time is spent on these programs, she knows that some users will run them to solve real problems. Examples are compilers for C, text-processing software like Word, and other applications like Photoshop. Real applications have input, output, and options that a user can select when running the program. There is one major downside to using real applications as benchmarks: real applications often encounter portability problems arising from dependences on the operating system or compiler. Enhancing portability often means modifying the source and sometimes eliminating some important activity, such as interactive graphics, which tends to be more system-dependent.

2. Modified (or scripted) applications: In many cases, real applications are used as the building block for a benchmark, either with modifications to the application or with a script that acts as stimulus to the application. Applications are modified for two primary reasons: to enhance portability or to focus on one particular aspect of system performance. For example, to create a CPU-oriented benchmark, I/O may be removed or restructured to minimize its impact on execution time. Scripts are used to reproduce interactive behavior, which might occur on a desktop system, or to simulate complex multiuser interaction, which occurs in a server system.

3. Kernels: Several attempts have been made to extract small, key pieces from real programs and use them to evaluate performance. Livermore Loops and Linpack are the best-known examples. Unlike real programs, no user would run kernel programs, for they exist solely to evaluate performance.

Kernels are best used to isolate the performance of individual features of a machine to explain the reasons for differences in the performance of real programs.

4. Toy benchmarks: Toy benchmarks are typically between 10 and 100 lines of code and produce a result the user already knows before running the toy program. Programs like Sieve of Eratosthenes, Puzzle, and Quicksort are popular because they are small, easy to type, and run on almost any computer. The best use of such programs is beginning programming assignments.

5. Synthetic benchmarks: Similar in philosophy to kernels, synthetic benchmarks try to match the average frequency of operations and operands of a large set of programs. Whetstone and Dhrystone are the most popular synthetic benchmarks.

Benchmark Suites

Recently, it has become popular to put together collections of benchmarks to try to measure the performance of processors with a variety of applications. One of the most successful attempts to create standardized benchmark application suites has been SPEC (the Standard Performance Evaluation Corporation), which had its roots in late-1980s efforts to deliver better benchmarks for workstations. Just as the computer industry has evolved over time, so has the need for different benchmark suites, and there are now SPEC benchmarks to cover different application classes, as well as other suites based on the SPEC model.

Desktop Benchmarks

Desktop benchmarks divide into two broad classes: CPU-intensive benchmarks and graphics-intensive benchmarks (although many graphics benchmarks include intensive CPU activity). SPEC originally created a benchmark set focusing on CPU performance (initially called SPEC89), which has evolved into its fourth generation: SPEC CPU2000, which follows SPEC95 and SPEC92. SPEC CPU2000, summarized in the figure, consists of a set of eleven integer benchmarks (CINT2000) and fourteen floating-point benchmarks (CFP2000). Although SPEC CPU2000 is aimed at CPU performance, two different types of graphics benchmarks were created by SPEC: SPECviewperf is used for benchmarking systems supporting the OpenGL graphics library, while SPECapc consists of applications that make extensive use of graphics. SPECviewperf measures the 3D rendering performance of systems running under OpenGL using a 3D model and a series of OpenGL calls that transform the model. SPECapc consists of runs of three large applications:

1. Pro/Engineer: a solid modeling application that does extensive 3D rendering. The input script is a model of a photocopying machine consisting of 370,000 triangles.

2. SolidWorks 99: a 3D CAD/CAM design tool running a series of five tests varying from I/O intensive to CPU intensive. The large test input is a model of an assembly line consisting of 276,000 triangles.

3. Unigraphics V15: the benchmark is based on an aircraft model and covers a wide spectrum of Unigraphics functionality, including assembly, drafting, numeric control machining, solid modeling, and optimization. The inputs are all part of an aircraft design.

Server Benchmarks

Just as servers have multiple functions, so there are multiple types of benchmarks. The simplest benchmark is perhaps a CPU throughput-oriented benchmark. SPEC CPU2000 uses the SPEC CPU benchmarks to construct a simple throughput benchmark, where the processing rate of a multiprocessor can be measured

by running multiple copies (usually as many as there are CPUs) of each SPEC CPU benchmark and converting the CPU time into a rate. This leads to a measurement called the SPECrate.

Other than SPECrate, most server applications and benchmarks have significant I/O activity arising from either disk or network traffic, including benchmarks for file server systems, for web servers, and for database and transaction processing systems. SPEC offers both a file server benchmark (SPECSFS) and a web server benchmark (SPECWeb). SPECSFS is a benchmark for measuring NFS (Network File System) performance using a script of file server requests; it tests the performance of the I/O system (both disk and network I/O) as well as the CPU. SPECSFS is a throughput-oriented benchmark but with important response time requirements.

Transaction processing benchmarks measure the ability of a system to handle transactions, which consist of database accesses and updates. All the TPC benchmarks measure performance in transactions per second. In addition, they include a response-time requirement, so that throughput performance is measured only when the response time limit is met. To model real-world systems, higher transaction rates are also associated with larger systems, both in terms of users and the database that the transactions are applied to. Finally, the system cost for a benchmark system must also be included, allowing accurate comparisons of cost-performance.

Embedded Benchmarks

Benchmarks for embedded computing systems are in a far more nascent state than those for either desktop or server environments. In fact, many manufacturers quote Dhrystone performance, a benchmark that was criticized and given up by desktop systems more than 10 years ago! As mentioned earlier, the enormous variety in embedded applications, as well as differences in performance requirements (hard real-time, soft real-time, and overall cost-performance), makes the use of a single set of benchmarks unrealistic. In practice, many designers of embedded systems devise benchmarks that reflect their application, either as kernels or as stand-alone versions of the entire application.

For those embedded applications that can be characterized well by kernel performance, the best standardized set of benchmarks appears to be a new benchmark set: the EDN Embedded Microprocessor Benchmark Consortium (or EEMBC, pronounced "embassy"). The EEMBC benchmarks fall into five classes: automotive/industrial, consumer, networking, office automation, and telecommunications. The figure shows the five different application classes, which include 34 benchmarks.

Formulas:

An average of the execution times that tracks total execution time is the arithmetic mean:

    Arithmetic mean = (1/n) x sum over i of Time_i

where Time_i is the execution time for the i-th program of a total of n in the workload.

If each program is weighted by how often it runs, the average is the weighted arithmetic mean:

    Weighted arithmetic mean = sum over i of (Weight_i x Time_i)

where Weight_i is the frequency of the i-th program in the workload and Time_i is the execution time of that program.

Average normalized execution time can be expressed as either an arithmetic or geometric mean. The formula for the geometric mean is

    Geometric mean = (product over i of Execution time ratio_i)^(1/n)

where Execution time ratio_i is the execution time of the i-th program normalized to the reference machine. Geometric means also have a nice property for two samples X_i and Y_i:

    Geometric mean(X_i) / Geometric mean(Y_i) = Geometric mean(X_i / Y_i)
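The three averages follow directly from the definitions; a minimal sketch (Python, with invented example times), including a check of the "nice property":

    import math

    def arithmetic_mean(times):
        return sum(times) / len(times)

    def weighted_mean(times, weights):
        # weights are the frequencies of each program in the workload
        return sum(w * t for w, t in zip(weights, times))

    def geometric_mean(values):
        return math.prod(values) ** (1 / len(values))

    times = [10.0, 40.0]                      # execution times of two programs
    print(arithmetic_mean(times))             # 25.0
    print(weighted_mean(times, [0.9, 0.1]))   # 13.0
    # GM(X)/GM(Y) equals the GM of the element-wise ratios X_i/Y_i:
    x, y = [2.0, 8.0], [4.0, 2.0]
    print(geometric_mean(x) / geometric_mean(y))           # 1.414...
    print(geometric_mean([a / b for a, b in zip(x, y)]))   # same value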

1.6 QUANTITATIVE PRINCIPLES OF COMPUTER DESIGN

The most important and pervasive principle of computer design is to make the common case fast. In applying this simple principle, we have to decide what the frequent case is and how much performance can be improved by making that case faster. A fundamental law, called Amdahl's Law, can be used to quantify this principle.

Amdahl's Law

The performance gain that can be obtained by improving some portion of a computer can be calculated using Amdahl's Law. Amdahl's Law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used. Amdahl's Law defines the speedup that can be gained by using a particular feature. Speedup is the ratio

    Speedup = Performance for entire task using the enhancement / Performance for entire task without the enhancement

or, equivalently,

    Speedup = Execution time for entire task without the enhancement / Execution time for entire task using the enhancement

Speedup tells us how much faster a task will run using the machine with the enhancement as opposed to the original machine. Amdahl's Law gives us a quick way to find the speedup from some enhancement, which depends on two factors:

1. The fraction of the computation time in the original machine that can be converted to take advantage of the enhancement. For example, if 20 seconds of the execution time of a program that takes 60 seconds in total can use an enhancement, the fraction is 20/60. This value, which we will call Fraction_enhanced, is always less than or equal to 1.

2. The improvement gained by the enhanced execution mode; that is, how much faster the task would run if the enhanced mode were used for the entire program. This value is the time of the original mode over the time of the enhanced mode: if the enhanced mode takes 2 seconds for some portion of the program that can completely use the mode, while the original mode took 5 seconds for the same portion, the improvement is 5/2. We will call this value, which is always greater than 1, Speedup_enhanced.

The execution time using the original machine with the enhanced mode will be the time spent using the unenhanced portion of the machine plus the time spent using the enhancement:

    Execution time_new = Execution time_old x ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

The overall speedup is the ratio of the execution times:

    Speedup_overall = Execution time_old / Execution time_new = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

Amdahl's Law can serve as a guide to how much an enhancement will improve performance and how to distribute resources to improve cost/performance.
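A minimal sketch of Amdahl's Law, reusing the numbers from the two factors above (20 of 60 seconds can be enhanced, with a 5/2 speedup on that portion):

    def amdahl_speedup(fraction_enhanced, speedup_enhanced):
        # Speedup_overall = 1 / ((1 - F) + F / S)
        return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

    f = 20 / 60          # Fraction_enhanced
    s = 5 / 2            # Speedup_enhanced
    print(f"overall speedup = {amdahl_speedup(f, s):.2f}")     # 1.25
    print(f"new time = {60 * ((1 - f) + f / s):.1f} seconds")  # 48.0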

The CPU Performance Equation

Essentially all computers are constructed using a clock running at a constant rate. These discrete time events are called ticks, clock ticks, clock periods, clocks, cycles, or clock cycles. Computer designers refer to the time of a clock period by its duration (e.g., 1 ns) or by its rate (e.g., 1 GHz). CPU time for a program can then be expressed two ways:

    CPU time = CPU clock cycles for a program x Clock cycle time

or

    CPU time = CPU clock cycles for a program / Clock rate

In addition to the number of clock cycles needed to execute a program, we can also count the number of instructions executed: the instruction path length or instruction count (IC). If we know the number of clock cycles and the instruction count, we can calculate the average number of clock cycles per instruction (CPI). CPI is computed as

    CPI = CPU clock cycles for a program / Instruction count

By transposing the instruction count in the above formula, clock cycles can be defined as IC x CPI. This allows us to use CPI in the execution time formula:

    CPU time = IC x CPI x Clock cycle time

Expanding the first formula into the units of measurement and inverting the clock rate shows how the pieces fit together:

    (Instructions / Program) x (Clock cycles / Instruction) x (Seconds / Clock cycle) = Seconds / Program = CPU time

CPU performance is dependent upon three characteristics: clock cycle time (or rate), clock cycles per instruction, and instruction count.

CPU time is equally dependent on these three characteristics: a 10% improvement in any one of them leads to a 10% improvement in CPU time.

- Clock cycle time: hardware technology and organization
- CPI: organization and instruction set architecture
- Instruction count: instruction set architecture and compiler technology

It is useful in designing the CPU to calculate the number of total CPU clock cycles as

    CPU clock cycles = sum over i of (IC_i x CPI_i)

where IC_i represents the number of times instruction i is executed in a program and CPI_i represents the average number of clock cycles per instruction for instruction i. This form can be used to express CPU time as

    CPU time = (sum over i of (IC_i x CPI_i)) x Clock cycle time
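The summation form turns into a short calculation; a sketch with a hypothetical instruction mix (the counts and CPIs below are invented for illustration):

    def cpu_time(mix, clock_rate_hz):
        # mix: list of (instruction_count, cpi) pairs, one per instruction class
        total_cycles = sum(ic * cpi for ic, cpi in mix)
        total_ic = sum(ic for ic, _ in mix)
        return total_cycles / clock_rate_hz, total_cycles / total_ic

    # 5e8 ALU ops at CPI 1, 2e8 loads at CPI 2, 1e8 branches at CPI 3, 1 GHz clock
    time_s, cpi = cpu_time([(5e8, 1), (2e8, 2), (1e8, 3)], clock_rate_hz=1e9)
    print(f"overall CPI = {cpi:.2f}, CPU time = {time_s:.2f} s")  # 1.50, 1.20 s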

Measuring and Modeling the Components of the CPU Performance Equation

To use the CPU performance equation as a design tool, we need to be able to measure the various factors. For an existing processor it is easy to obtain the execution time by measurement, and the clock speed is known. The challenge lies in discovering the instruction count or the CPI. Newer processors include counters for both instructions executed and clock cycles. There are three general classes of simulation techniques. The first and simplest technique, and hence the least costly, is profile-based, static modeling. In this technique a dynamic execution profile of the program is obtained by one of three methods:

1. By using hardware counters on the processor, which are periodically saved. This technique often gives an approximate profile, but one that is within a few percent of exact.

2. By using instrumented execution, in which instrumentation code is compiled into the program. This code is used to increment counters, yielding an exact profile. (This technique can also be used to create a trace of the memory addresses that are accessed, which is useful for other simulation techniques.)

3. By interpreting the program at the instruction set level, compiling instruction counts in the process.

Once the profile is obtained, it is used to analyze the program in a static fashion by looking at the code. The total instruction count is easy to obtain, as is a detailed dynamic instruction mix telling what types of instructions were executed with what frequency. For simple processors, it is possible to compute an approximation to the CPI. Although this simple model ignores memory behavior and has severe limits for modeling complex pipelines, it is a reasonable and very fast technique for modeling the performance of short, integer pipelines, ignoring the memory system behavior.

Trace-driven simulation is a more sophisticated technique for modeling performance and is particularly useful for modeling memory system performance. In trace-driven simulation, a trace of the memory references executed is created, usually either by simulation or by instrumented execution. The trace includes what instructions were executed (given by the instruction address), as well as the data addresses accessed. The most common use is to model memory system performance, which can be done by simulating the memory system, including the caches and any memory management hardware, using the address trace.

The third technique, which is the most accurate and most costly, is execution-driven simulation. In execution-driven simulation a detailed simulation of the memory system and the processor pipeline are done simultaneously. This allows the exact modeling of the interaction between the two, which is critical.

Principle of Locality

Locality of reference means that programs tend to reuse data and instructions they have used recently. A widely held rule of thumb is that a program spends 90% of its execution time in only 10% of the code. An implication of locality is that we can predict with reasonable accuracy what instructions and data a program will use in the near future based on its accesses in the recent past. Locality of reference also applies to data accesses, though not as strongly as to code accesses. Two different types of locality have been observed. Temporal locality states that recently accessed items are likely to be accessed in the near future. Spatial locality says that items whose addresses are near one another tend to be referenced close together in time.

Take Advantage of Parallelism

Taking advantage of parallelism is one of the most important methods for improving performance. We give three brief examples, which are expounded on in later chapters.

Our first example is the use of parallelism at the system level. To improve the throughput performance on a typical server benchmark, such as SPECWeb or TPC, multiple processors and multiple disks can be used. The workload of handling requests can then be spread among the CPUs or disks, resulting in improved throughput. This is the reason that scalability is viewed as a valuable asset for server applications.

At the level of an individual processor, taking advantage of parallelism among instructions is critical to achieving high performance. One way to do this is through pipelining. The basic idea behind pipelining is to overlap the execution of instructions, so as to reduce the total time to complete a sequence of instructions. Viewed from the perspective of the CPU performance equation, we can think of pipelining as reducing the CPI by allowing instructions that take multiple cycles to overlap. A key insight that allows pipelining to work is that not every instruction depends on its immediate predecessor, and thus executing the instructions completely or partially in parallel may be possible.

Parallelism can also be exploited at the level of detailed digital design. For example, set-associative caches use multiple banks of memory that are typically searched in parallel to find a desired item. Modern ALUs use carry-lookahead, which uses parallelism to speed the process of computing sums from linear in the number of bits in the operands to logarithmic.

Formula: MIPS, or million instructions per second. For a given program, MIPS is simply

    MIPS = Instruction count / (Execution time x 10^6) = Clock rate / (CPI x 10^6)

Some find the rightmost form convenient, since clock rate is fixed for a machine and CPI is usually a small number, unlike instruction count or execution time. Relating MIPS to time,

    Execution time = Instruction count / (MIPS x 10^6)

The problem with using MIPS as a measure for comparison is threefold:

- MIPS is dependent on the instruction set, making it difficult to compare MIPS of computers with different instruction sets.
- MIPS varies between programs on the same computer.
- Most importantly, MIPS can vary inversely to performance!

Relative MIPS for a machine M was defined based on some reference machine as

    Relative MIPS = (Time_reference / Time_M) x MIPS_reference
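A short sketch of the definition and the third pitfall (the instruction counts and times are invented; the point is the direction of the comparison):

    def mips(instruction_count, exec_time_s):
        return instruction_count / (exec_time_s * 1e6)

    # Suppose an optimizing compiler removes simple instructions from a program.
    # Unoptimized: 1e9 instructions in 2.0 s; optimized: 6e8 instructions in 1.5 s.
    print(mips(1e9, 2.0))   # 500 MIPS
    print(mips(6e8, 1.5))   # 400 MIPS -- a lower rating for the faster program

The second version has the lower MIPS rating yet the better execution time, which is exactly how MIPS can vary inversely to performance.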

INSTRUCTION SET PRINCIPLES AND EXAMPLES

1.7 INTRODUCTION

We consider general-purpose RISC architectures (MIPS, PowerPC, Precision Architecture, SPARC), four embedded RISC processors (ARM, Hitachi SH, MIPS16, Thumb), and three older architectures (80x86, IBM 360/370, and VAX).

Desktop computing emphasizes performance of programs with integer and floating-point data types, with little regard for program size or processor power consumption. For example, code size has never been reported in the four generations of SPEC benchmarks. Servers today are used primarily for database, file server, and web applications, plus some timesharing applications for many users. Hence, floating-point performance is much less important than integer and character string performance, yet virtually every server processor still includes floating-point instructions. Embedded applications value cost and power, so code size is important because less memory is both cheaper and lower power, and some classes of instructions (such as floating point) may be optional to reduce chip costs.

1.8 CLASSIFYING INSTRUCTION SET ARCHITECTURES

The most basic classification is by the type of internal storage in a processor: the major choices are a stack, an accumulator, or a set of registers. Operands may be named explicitly or implicitly: the operands in a stack architecture are implicitly on the top of the stack, and in an accumulator architecture one operand is implicitly the accumulator.

General-purpose register architectures have only explicit operands: either registers or memory locations. A block diagram of such architectures is shown in the figure.

(Figure: operand locations for four instruction set architecture classes.)

In (a), a Top Of Stack register (TOS) points to the top input operand, which is combined with the operand below. The first operand is removed from the stack, the result takes the place of the second operand, and TOS is updated to point to the result. All operands are implicit. In (b), the Accumulator is both an implicit input operand and the result. In (c), one input operand is a register, one is in memory, and the result goes to a register. All operands are registers in (d) and, like the stack architecture, can be transferred to memory only via separate instructions: push or pop for (a), and load or store for (d). The figure below shows how the code sequence C = A + B would typically appear in these four classes of instruction sets.

The explicit operands may be accessed directly from memory or may need to be first loaded into temporary storage, depending on the class of architecture and choice of specific instruction.

(Figure: the code sequence for C = A + B for four classes of instruction sets.)

There are really two classes of register computers. One class can access memory as part of any instruction, called register-memory architecture, and the other can access memory only with load and store instructions, called load-store or register-register architecture. A third class, not found in computers shipping today, keeps all operands in memory and is called memory-memory architecture. Some instruction set architectures have more registers than a single accumulator, but place restrictions on uses of these special registers. Such an architecture is sometimes called an extended accumulator or special-purpose register computer.

Early computers used stack or accumulator-style architectures, but virtually every new architecture designed after 1980 uses a load-store register architecture. The major reasons for the emergence of general-purpose register (GPR) computers are twofold. First, registers, like other forms of storage internal to the processor, are faster than memory. Second, registers are more efficient for a compiler to use than other forms of internal storage. For example, on a register computer the expression (A*B) - (B*C) - (A*D) may be evaluated by doing the multiplications in any order, which may be more efficient because of the location of the operands or because of pipelining concerns.

Two major instruction set characteristics divide GPR architectures. Both characteristics concern the nature of operands for a typical arithmetic or logical instruction (ALU instruction). The first concerns whether an ALU instruction has two or three operands. In the three-operand format, the instruction contains one result operand and two source operands. In the two-operand format, one of the operands is both a source and a result for the operation. The second distinction among GPR architectures concerns how many of the operands may be memory addresses in ALU instructions. The number of memory operands supported by a typical ALU instruction may vary from none to three. The figure shows combinations of these two attributes with examples of computers. Although there are seven possible combinations, three serve to classify nearly all existing computers. As we mentioned earlier, these three are register-register (also called load-store), register-memory, and memory-memory.

(Figure: typical combinations of memory operands and total operands per typical ALU instruction, with examples of computers.)

Computers with no memory reference per ALU instruction are called load-store or register-register computers. Computers with memory operands per typical ALU instruction are called register-memory or memory-memory, according to whether they have one or more than one memory operand. Of course, the following advantages and disadvantages are not absolutes: they are qualitative, and their actual impact depends on the compiler and implementation strategy. A GPR computer with memory-memory operations could easily be ignored by the compiler and used as a register-register computer. One of the most pervasive architectural impacts is on instruction encoding and the number of instructions needed to perform a task.

(Figure: advantages and disadvantages of the three most common types of general-purpose register computers. The notation (m, n) means m memory operands and n total operands.)

1.9 MEMORY ADDRESSING

Memory addressing concerns how memory addresses are interpreted: how an object is accessed as a function of the address and the length. There are two memory addressing issues; the first is the ordering of bytes within words. There are two different conventions for ordering the bytes within a larger object:

Little Endian byte order puts the byte whose address is x...x000 at the least-significant position in the double word (the little end). The bytes are numbered 7, 6, 5, 4, 3, 2, 1, 0 from the most-significant position down to the least-significant.

Big Endian byte order puts the byte whose address is x...x000 at the most-significant position in the double word (the big end). The bytes are numbered 0, 1, 2, 3, 4, 5, 6, 7 from the most-significant position down to the least-significant.

Little Endian ordering also fails to match the normal ordering of words when strings are compared. Strings appear SDRAWKCAB (backwards) in the registers.

A second memory issue is that accesses to objects larger than a byte must be aligned. An access to an object of size s bytes at byte address A is aligned if A mod s = 0. The figure shows the addresses at which an access is aligned or misaligned.

For example, suppose we read a byte from an address whose three low-order bits have the value 4. We will need to shift right 3 bytes to align the byte to the proper place in a 64-bit register. Depending on the instruction, the computer may also need to sign-extend the quantity. Stores are easy: only the addressed bytes in memory may be altered. On some computers a byte, half-word, or word operation does not affect the upper portion of a register. Although all the computers discussed in this book permit byte, half-word, and word accesses to memory, only the IBM 360/370, Intel 80x86, and VAX support ALU operations on register operands narrower than the full width.
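The alignment rule translates directly into code; a minimal sketch:

    def is_aligned(address, size):
        # An access of size s bytes at byte address A is aligned if A mod s == 0
        return address % size == 0

    for addr in (0, 4, 6, 7):
        print(addr, {s: is_aligned(addr, s) for s in (1, 2, 4, 8)})
    # e.g. address 6 is aligned for byte and half-word accesses,
    # but misaligned for word and double-word accesses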

(Figure: aligned and misaligned addresses of byte, half-word, word, and double-word objects for byte-addressed computers.)

Addressing Modes

The way addresses are specified by instructions is called the addressing mode. Addressing modes specify constants and registers in addition to locations in memory. When a memory location is used, the actual memory address specified by the addressing mode is called the effective address. The figure shows the most common names for the addressing modes, though the names differ among architectures. In this figure, only one non-C feature is used: the left arrow (<-) is used for assignment. We also use the array Mem as the name for main memory and the array Regs for registers. Thus, Mem[Regs[R1]] refers to the contents of the memory location whose address is given by the contents of register 1 (R1). Later, we will introduce extensions for accessing and transferring data smaller than a word.

Addressing modes have the ability to significantly reduce instruction counts; they also add to the complexity of building a computer and may increase the average CPI (clock cycles per instruction) of computers that implement those modes.

(Figure: summary of the use of memory addressing modes.)

(Figure: selection of addressing modes with examples, meaning, and usage.)

Displacement Addressing Mode

In displacement-style addressing modes, the field size is important because it directly affects the instruction length. The figure shows measurements taken of data accesses on a load-store architecture using our benchmark programs. Displacement values are widely distributed.

Immediate or Literal Addressing Mode

Immediates can be used in arithmetic operations, in comparisons, and in moves where a constant is wanted in a register. Figure (1) shows the frequency of immediates for the general classes of integer operations in an instruction set. As figure (2) shows, small immediate values are most heavily used. Large immediates are sometimes used, however, most likely in addressing instructions.

(Figure 1: about one-quarter of data transfers and ALU operations have an immediate operand.)

(Figure 2: the distribution of immediate values.)

1.10 ADDRESSING MODES FOR SIGNAL PROCESSING

Because Digital Signal Processors (DSPs) deal with infinite, continuous streams of data, they routinely rely on circular buffers. Hence, as data is added to the buffer, a pointer is checked to see if it is pointing at the end of the buffer. If not, it increments the pointer to the next address; if it is, the pointer is set instead to the start of the buffer. Similar issues arise when emptying a buffer.

Every DSP has a modulo or circular addressing mode to handle this case automatically, our first novel DSP addressing mode. It keeps a start register and an end register with every address register, allowing the auto-increment and auto-decrement addressing modes to reset when they reach the end of the buffer. One variation makes assumptions about the buffer size, starting at an address that ends in xxx...000, and so uses just a single buffer-length register per address register.
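A minimal sketch of the pointer update that modulo addressing performs (the explicit Python function is mine; a real DSP does this in the address-generation unit as a side effect of auto-increment):

    def advance_circular(pointer, start, end, step=1):
        # Auto-increment with wrap-around between the start and end registers
        pointer += step
        if pointer > end:
            pointer = start + (pointer - end - 1)
        return pointer

    p = 6
    for _ in range(4):
        p = advance_circular(p, start=4, end=7)
        print(p)   # 7, 4, 5, 6 -- wraps from the end back to the start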

Even though DSPs are tightly targeted to a small number of algorithms, those algorithms motivate a second novel addressing mode. Consider the Fast Fourier Transform (FFT). FFTs start or end their processing with data shuffled in a particular order. For eight data items in a radix-2 FFT, the transformation is listed below, with addresses in parentheses shown in binary:

    0 (000) -> 0 (000)
    1 (001) -> 4 (100)
    2 (010) -> 2 (010)
    3 (011) -> 6 (110)
    4 (100) -> 1 (001)
    5 (101) -> 5 (101)
    6 (110) -> 3 (011)
    7 (111) -> 7 (111)

Without special support, such an address transformation would take an extra memory access to get the new address, or involve a fair amount of logical instructions to transform the address. The DSP solution is based on the observation that the resulting binary address is simply the reverse of the initial address! For example, address 100 (4) becomes 001 (1). Hence, many DSPs have this second novel addressing mode, bit-reverse addressing, whereby the hardware reverses the lower bits of the address, with the number of bits reversed depending on the step of the FFT algorithm.
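A sketch of the bit-reversal that this addressing mode applies, checked against the radix-2 table above:

    def bit_reverse(index, num_bits):
        # Reverse the low num_bits bits of index, as bit-reverse addressing does
        result = 0
        for _ in range(num_bits):
            result = (result << 1) | (index & 1)
            index >>= 1
        return result

    print([bit_reverse(i, 3) for i in range(8)])
    # [0, 4, 2, 6, 1, 5, 3, 7] -- address 100 (4) becomes 001 (1), as in the table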

The figure shows the static frequency of data addressing modes in a DSP for a set of 54 library routines. This architecture has 17 addressing modes, yet the modes also found in desktop and server computers account for 95% of the DSP addressing. Despite measuring hand-coded routines to derive the figure, the use of the novel addressing modes is sparse.

(Figure: frequency of addressing modes for the TI TMS320C54x DSP.)

1.11 TYPE AND SIZE OF OPERANDS

The type of an operand may be designated by encoding it in the opcode, or the data can be annotated with tags that are interpreted by the hardware. These tags specify the type of the operand, and the operation is chosen accordingly. In desktop and server architectures the type of an operand is usually one of:

- integer
- single-precision floating point
- character, and so on.

The sizes of operands are generally:

- character (8 bits)
- half word (16 bits)
- word (32 bits)
- single-precision floating point (also 1 word)
- double-precision floating point (2 words)

Integers are almost universally represented as two's complement binary numbers. Characters are usually in ASCII, but 16-bit Unicode is gaining popularity with the internationalization of computers. Until the early 1980s, most computer manufacturers chose their own floating-point representation. Almost all computers since that time follow the same standard for floating point, the IEEE standard 754.

Some architectures provide operations on character strings, although such operations are usually quite limited and treat each byte in the string as a single character. Typical operations supported on character strings are comparisons and moves.

For business applications, some architectures support a decimal format, usually called packed decimal or binary-coded decimal: 4 bits are used to encode the values 0-9, and 2 decimal digits are packed into each byte. Numeric character strings are sometimes called unpacked decimal, and operations called packing and unpacking are usually provided for converting back and forth between them.
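Packing and unpacking are simple nibble operations; a minimal sketch of the conversion that packed-decimal instructions perform:

    def pack_decimal(digits):
        # Two decimal digits per byte, 4 bits each (binary-coded decimal)
        if len(digits) % 2:
            digits = [0] + digits        # pad to an even number of digits
        return bytes((hi << 4) | lo for hi, lo in zip(digits[::2], digits[1::2]))

    def unpack_decimal(packed):
        out = []
        for b in packed:
            out += [b >> 4, b & 0x0F]
        return out

    p = pack_decimal([1, 2, 3, 4])
    print(p.hex())             # '1234' -- each nibble holds one decimal digit
    print(unpack_decimal(p))   # [1, 2, 3, 4]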

Our SPEC benchmarks use byte or character, half word (short integer), word (integer), double word (long integer), and floating-point data types. The figure shows the dynamic distribution of the sizes of objects referenced from memory for these programs, using memory references to examine the types of data being accessed. The double-word data type is used for double-precision floating point in floating-point programs and for addresses, since the computer uses 64-bit addresses. On a 32-bit address computer the 64-bit addresses would be replaced by 32-bit addresses, and so almost all double-word accesses in integer programs would become single-word accesses.

(Figure: distribution of data accesses by size for the benchmark programs.)

1.12 OPERANDS FOR MEDIA AND SIGNAL PROCESSING

Graphics applications deal with 2D and 3D images. A common 3D data type is called a vertex, a data structure with an x coordinate, a y coordinate, a z coordinate, and a fourth coordinate (w) to help with color or hidden surfaces. Three vertices specify a graphics primitive such as a triangle. Vertex values are usually 32-bit floating-point values. Assuming a triangle is visible, when it is rendered it is filled with pixels. Pixels are usually 32 bits, consisting of four 8-bit channels: R (red), G (green), B (blue), and A (which denotes the transparency of the surface or transparency of the pixel when the pixel is rendered).

DSPs add fixed point to the data types discussed so far. If you think of integers as having a binary point to the right of the least-significant bit, fixed point has a binary point just to the right of the sign bit. Hence, fixed-point data are fractions between -1 and +1.
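This fixed-point interpretation amounts to dividing the two's-complement integer value by 2^(n-1); a minimal sketch (the three test patterns anticipate the worked example below):

    def fixed_point_value(pattern, n_bits=16):
        # Interpret an n-bit two's-complement pattern as a fraction in [-1, 1)
        if pattern & (1 << (n_bits - 1)):
            pattern -= 1 << n_bits       # apply the two's-complement sign
        return pattern / (1 << (n_bits - 1))

    for p in (0x4000, 0x0800, 0x4808):
        print(f"{p:#06x}: integer {p}, fixed point {fixed_point_value(p)}")
    # 0x4000 -> 0.5, 0x0800 -> 0.0625, 0x4808 -> 0.562744140625 (2305/4096)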

EXAMPLE: Here are three simple 16-bit patterns: 0100 0000 0000 0000, 0000 1000 0000 0000, and 0100 1000 0000 1000. What values do they represent if they are two's complement integers? Fixed-point numbers?

ANSWER: Number representation tells us that the i-th digit to the left of the binary point represents 2^(i-1) and the i-th digit to the right of the binary point represents 2^(-i). First assume these three patterns are integers. Then the binary point is to the far right, so they represent 2^14, 2^11, and (2^14 + 2^11 + 2^3), or 16384, 2048, and 18440.

Fixed point places the binary point just to the right of the sign bit, so as fixed point these patterns represent 2^(-1), 2^(-4), and (2^(-1) + 2^(-4) + 2^(-12)). The fractions are 1/2, 1/16, and (2048 + 256 + 1)/4096 or 2305/4096, which represent about 0.5000, 0.0625, and 0.5627. Alternatively, for an n-bit two's-complement fixed-point number we could just divide the integer representation by 2^(n-1) to derive the same results: 16384/32768 = 1/2, 2048/32768 = 1/16, and 18440/32768 = 2305/4096.

1.13 OPERATIONS IN THE INSTRUCTION SET

The operators supported by most instruction set architectures can be categorized as in the figure. One rule of thumb across all architectures is that the most widely executed instructions are the simple operations of an instruction set. For example, the figure shows 10 simple instructions that account for 96% of instructions executed for a collection of integer programs running on the popular Intel 80x86. Hence, the implementor of these instructions should be sure to make them fast, as they are the common case.

(Figure: the top 10 instructions for the 80x86.)

1.14 OPERATIONS FOR MEDIA AND SIGNAL PROCESSING

Because media processing is judged by human perception, the data for multimedia operations are often much narrower than the 64-bit data word of modern desktop and server processors.

A partitioned add operation on 16-bit data with a 64-bit ALU would perform four 16-bit adds in a single clock cycle. The extra hardware cost is simply to prevent carries between the four 16-bit partitions of the ALU. For example, such instructions might be used for graphical operations on pixels. These operations are commonly called Single-Instruction Multiple Data (SIMD) or vector instructions.

Most graphics multimedia applications use 32-bit floating-point operations. Some computers double the peak performance of single-precision floating-point operations: they allow a single instruction to launch two 32-bit operations on operands found side by side in a double-precision register. Just as in the prior case, the two partitions must be insulated so that operations on one half do not affect the other. Such floating-point operations are called paired-single operations.

DSP operations: DSPs provide the usual classes of operations, but they change the semantics a bit:
Arithmetic and logical: integer arithmetic and logical operations such as add, subtract, and, or, multiply, divide
Data transfer: loads and stores (move instructions on computers with memory addressing)
Control: branch, jump, procedure call and return, traps

First, because they are often used in real-time applications, there is not an option of causing an exception on arithmetic overflow. To support such an unyielding environment, DSP architectures use saturating arithmetic: if the result is too large to be represented, it is set to the largest representable number, depending on the sign of the result. A second issue for DSPs is that there are several modes to round the wider accumulators into the narrower data words. Finally, the targeted kernels for DSPs accumulate a series of products, and hence have a multiply-accumulate or MAC instruction. MACs are key to dot-product operations for vector and matrix multiplies. (A small C sketch of saturating arithmetic and a MAC loop follows.)
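The following is a minimal C sketch, not from the source, of the two DSP ideas just described: a saturating 16-bit add and a multiply-accumulate loop that keeps a wider accumulator. The function names are illustrative.

#include <stdint.h>

/* Saturating 16-bit add: on overflow, clamp to the largest or
   smallest representable value instead of raising an exception. */
int16_t sat_add16(int16_t a, int16_t b) {
    int32_t sum = (int32_t)a + (int32_t)b;   /* do the add in a wider type */
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Multiply-accumulate (MAC), the core of a dot product: a wide
   accumulator holds the running sum of narrower 16-bit products. */
int64_t mac(const int16_t *x, const int16_t *y, int n) {
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)x[i] * y[i];   /* one MAC per element */
    return acc;
}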

1.15 INSTRUCTIONS FOR CONTROL FLOW

There are four different types of control-flow change:
Conditional branches
Jumps
Procedure calls
Procedure returns

We want to know the relative frequency of these events, as each event is different, may use different instructions, and may have different behavior. Figure shows the frequencies of these control-flow instructions for a load-store computer running our benchmarks, with a breakdown into three classes: calls or returns, jumps, and conditional branches.

Addressing modes for control flow instructions: The destination address of a control-flow instruction must always be specified. This destination is specified explicitly in the instruction in the vast majority of cases, procedure return being the major exception, since for a return the target is not known at compile time. The most common way to specify the destination is to supply a displacement that is added to the program counter, or PC.

To implement returns and indirect jumps when the target is not known at compile time, a method other than PC-relative addressing is required. Here, there must be a way to specify the target dynamically, so that it can change at runtime. This dynamic address may be as simple as naming a register that contains the target address; alternatively, the jump may permit any addressing mode to be used to supply the target address. These register indirect jumps are also useful for four other important features:
Case or switch statements found in most programming languages (which select among one of several alternatives)
Virtual functions or methods in object-oriented languages like C++ or Java (which allow different routines to be called depending on the type of the argument)
Higher-order functions or function pointers in languages like C or C++ (which allow functions to be passed as arguments, giving some of the flavor of object-oriented programming)
Dynamically shared libraries (which allow a library to be loaded and linked at runtime only when it is actually invoked by the program, rather than loaded and linked statically before the program is run)

In all four cases the target address is not known at compile time, and hence is usually loaded from memory into a register before the register indirect jump; a small C sketch of such a dispatch follows below. Figure shows the distribution of displacements for PC-relative branches. About 75% of the branches are in the forward direction.
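As a concrete illustration of the function-pointer case above, here is a small, hypothetical C sketch; a compiler typically implements the call through table[op] as a load of the target address into a register followed by a register indirect jump.

#include <stdio.h>

/* Handlers whose addresses are not known until runtime. */
static void on_add(void) { puts("add"); }
static void on_sub(void) { puts("sub"); }

typedef void (*handler_t)(void);

/* A jump table: dispatch loads table[op] and jumps through it. */
static handler_t table[] = { on_add, on_sub };

void dispatch(unsigned op) {
    if (op < sizeof table / sizeof table[0])
        table[op]();   /* register indirect call through a loaded address */
}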

Branch distances in terms of number of instructions between the target and the branch instruction.

Conditional Branch Options

Since most changes in control flow are branches, deciding how to specify the branch condition is important. Figure (1) shows the three primary techniques in use today and their advantages and disadvantages. One of the most noticeable properties of branches is that a large number of the comparisons are simple tests, and a large number are comparisons with zero. Thus, some architectures choose to treat these comparisons as special cases, especially if a compare and branch instruction is being used. Figure (2) shows the frequency of different comparisons used for conditional branching.

(2) Frequency of different types of compares in conditional branches.

Procedure Invocation Options

Procedure calls and returns include control transfer and possibly some state saving; at a minimum the return address must be saved somewhere, sometimes in a special link register or just a GPR. Some older architectures provide a mechanism to save many registers, while newer architectures require the compiler to generate stores and loads for each register saved and restored.

There are two basic conventions in use to save registers: either at the call site or inside the procedure being called. Caller saving means that the calling procedure must save the registers that it wants preserved for access after the call, and thus the called procedure need not worry about registers. Callee saving is the opposite: the called procedure must save the registers it wants to use, leaving the caller unrestrained. There are times when caller save must be used because of access patterns to globally visible variables in two different procedures. For example, suppose we have a procedure P1 that calls procedure P2, and both procedures manipulate the global variable x. If P1 had allocated x to a register, it must be sure to save x to a location known by P2 before the call to P2 (a C sketch of this situation appears below). A compiler's ability to discover when a called procedure may access register-allocated quantities is complicated by the possibility of separate compilation. Suppose P2 may not touch x but can call another procedure, P3, that may access x, yet P2 and P3 are compiled separately. Because of these complications, most compilers will conservatively caller save any variable that may be accessed during a call.

1.16 ENCODING AN INSTRUCTION SET

The instructions are encoded into a binary representation for execution by the processor. This representation affects not only the size of the compiled program; it affects the implementation of the processor, which must decode this representation to quickly find the operation and its operands. The operation is typically specified in one field, called the opcode. Some older computers have one to five operands with 10 addressing modes for each operand. For such a large number of combinations, typically a separate address specifier is needed for each operand: the address specifier tells what addressing mode is used to access the operand. At the other extreme are load-store computers with only one memory operand and only one or two addressing modes; obviously, in this case, the addressing mode can be encoded as part of the opcode.
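Referring back to the caller-save example above, here is a minimal C sketch using the names P1, P2, and x from the text; the register-allocation commentary in the comments is illustrative, describing what a compiler must do, not what this C code forces.

int x;                          /* global variable shared by P1 and P2 */

void P2(void) { x = x + 1; }    /* also manipulates the global x */

void P1(void) {
    x = 5;    /* if the compiler keeps x in a register inside P1 ...   */
    P2();     /* ... it must store that register to memory before this */
              /* call and reload it afterward, or P2 sees a stale x    */
}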

When encoding the instructions, the number of registers and the number of addressing modes both have a significant impact on the size of instructions, as the register field and addressing mode field may appear many times in a single instruction. There are several competing forces when encoding the instruction set:
The desire to have as many registers and addressing modes as possible
The impact of the size of the register and addressing mode fields on the average instruction size, and hence on the average program size
A desire to have instructions encoded into lengths that will be easy to handle in a pipelined implementation

Figure shows three popular choices for encoding the instruction set. The first we call variable, since it allows virtually all addressing modes to be used with all operations. This style is best when there are many addressing modes and operations. The second choice we call fixed, since it combines the operation and the addressing mode into the opcode.

Three basic variations in instruction encoding: variable length, fixed length, and hybrid.

Often fixed encoding will have only a single size for all instructions; it works best when there are few addressing modes and operations. The trade-off between variable encoding and fixed encoding is size of programs versus ease of decoding in the processor. Variable tries to use as few bits as possible to represent the program, but individual instructions can vary widely in both size and the amount of work to be performed.

Reduced code size in RISCs: As RISC computers started being used in embedded applications, the 32-bit fixed format became a liability, since cost, and hence smaller code, is important. In response, several manufacturers offered a new hybrid version of their RISC instruction sets, with both 16-bit and 32-bit instructions. The narrow instructions support fewer operations, smaller address and immediate fields, fewer registers, and a two-address format rather than the classic three-address format of RISC computers.

1.17 CROSSCUTTING ISSUES: THE ROLE OF COMPILERS

Today almost all programming is done in high-level languages for desktop and server applications. This development means that since most instructions executed are the output of a compiler, an instruction set architecture is essentially a compiler target.

The Structure of Recent Compilers

A compiler writer's first goal is correctness: all valid programs must be compiled correctly. The second goal is usually speed of the compiled code. Typically, a whole set of other goals follows these two, including fast compilation, debugging support, and interoperability among languages. Normally, the passes in the compiler transform higher-level, more abstract representations into progressively lower-level representations, eventually reaching the instruction set. This structure helps manage the complexity of the transformations and makes writing a bug-free compiler easier.

The complexity of writing a correct compiler is a major limitation on the amount of optimization that can be done. Although the multiple-pass structure helps reduce compiler complexity, it also means that the compiler must order and perform some transformations before others. As the diagram of the optimizing compiler in Figure shows, compilers typically consist of two to four passes, with more highly optimizing compilers having more passes. The compiler consists of four phases:

1. Front end per language: This phase transforms the high-level language into a common intermediate form. It is language dependent and machine independent.

2. High-level optimization: This phase performs loop transformations and procedure integration. It is somewhat language dependent and largely machine independent.

3. Global optimizer: This phase performs local and global optimizations and register allocation. It has small language dependencies and slight machine dependencies.

4. Code generator: This phase performs machine-dependent optimizations. It is highly machine dependent and language independent.

Optimizations performed by modern compilers can be classified by the style of the transformation, as follows:
High-level optimizations are often done on the source with output fed to later optimization passes.
Local optimizations optimize code only within a straight-line code fragment (called a basic block by compiler people).
Global optimizations extend the local optimizations across branches and introduce a set of transformations aimed at optimizing loops.
Register allocation.
Processor-dependent optimizations attempt to take advantage of specific architectural knowledge.

Register Allocation

Register allocation algorithms today are based on a technique called graph coloring. The basic idea behind graph coloring is to construct a graph representing the possible candidates for allocation to a register and then to use the graph to allocate registers. Roughly speaking, the problem is how to use a limited set of colors so that no two adjacent nodes in the interference graph have the same color. The emphasis in the approach is to achieve 100% register allocation of active variables. The problem of coloring a graph in general can take exponential time as a function of the size of the graph (it is NP-complete). There are heuristic algorithms, however, that work well in practice, yielding close allocations that run in near-linear time; a toy greedy-coloring sketch follows below. Graph coloring works best when there are at least 16 (and preferably more) general-purpose registers available for global allocation for integer variables and additional registers for floating point.
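Below is a toy greedy graph-coloring sketch in C that illustrates, rather than reproduces, the heuristic idea: visit nodes in order, give each the lowest color not used by an already-colored neighbor, and spill when no color is free. The interference graph and the sizes N and K are made up for the example.

#include <stdio.h>

#define N 5   /* allocation candidates (interference-graph nodes) */
#define K 3   /* available registers (colors) */

/* interfere[i][j] = 1 if candidates i and j are live at the same
   time and therefore cannot share a register. */
static const int interfere[N][N] = {
    {0,1,1,0,0},
    {1,0,1,1,0},
    {1,1,0,1,0},
    {0,1,1,0,1},
    {0,0,0,1,0},
};

int main(void) {
    int color[N];
    for (int i = 0; i < N; i++) {
        int used[K] = {0};
        /* mark colors already taken by colored neighbors */
        for (int j = 0; j < i; j++)
            if (interfere[i][j] && color[j] >= 0)
                used[color[j]] = 1;
        color[i] = -1;
        for (int c = 0; c < K; c++)
            if (!used[c]) { color[i] = c; break; }
        if (color[i] < 0)
            printf("node %d spilled to memory\n", i);  /* heuristic failed */
        else
            printf("node %d -> register R%d\n", i, color[i]);
    }
    return 0;
}

With few registers (small K) the spill branch fires often, which is exactly the failure mode the text describes next.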

Unfortunately, graph coloring does not work very well when the number of registers is small, because the heuristic algorithms for coloring the graph are likely to fail.

Impact of Optimizations on Performance

The typical optimizations are given in Figure. The last column of Figure indicates the frequency with which the listed optimizing transforms were applied to the source program.

Major types of optimizations and examples in each class.

The Impact of Compiler Technology on the Architect's Decisions

The interaction of compilers and high-level languages significantly affects how programs use an instruction set architecture. The stack is used to allocate local variables; it is grown and shrunk on procedure call and return, respectively. The global data area is used to allocate statically declared objects, such as global variables and constants; a large percentage of these objects are arrays or other aggregate data structures. The heap is used to allocate dynamic objects that do not adhere to a stack discipline.

Register allocation is much more effective for stack-allocated objects than for global variables, and register allocation is essentially impossible for heap-allocated objects because they are accessed with pointers. For example, given int a; int *p = &a;, the variable a could not be register allocated across the assignment to *p without generating incorrect code, since the store through p changes a. Aliasing causes a substantial problem because it is often difficult or impossible to decide what objects a pointer may refer to. A compiler must be conservative; some compilers will not allocate any local variables of a procedure in a register when there is a pointer that may refer to one of the local variables.

1.18 THE MIPS ARCHITECTURE

MIPS is a simple 64-bit load-store architecture. MIPS emphasizes
A simple load-store instruction set
Design for pipelining efficiency (discussed in Appendix A), including a fixed instruction set encoding
Efficiency as a compiler target

MIPS provides a good architectural model for study, not only because of the popularity of this type of processor, but also because it is an easy architecture to understand.

In the 15 years since the first MIPS processor, there have been many versions of MIPS. We will use a subset of what is now called MIPS64, which we will often abbreviate to just MIPS.

Registers for MIPS

MIPS64 has 32 64-bit general-purpose registers (GPRs), named R0, R1, ..., R31. GPRs are also sometimes known as integer registers. Additionally, there is a set of 32 floating-point registers (FPRs), named F0, F1, ..., F31, which can hold 32 single-precision (32-bit) values or 32 double-precision (64-bit) values. (When holding one single-precision number, the other half of the FPR is unused.) Both single- and double-precision floating-point operations (32-bit and 64-bit) are provided. MIPS also includes instructions that operate on two single-precision operands in a single 64-bit floating-point register. The value of R0 is always 0. A few special registers can be transferred to and from the general-purpose registers. An example is the floating-point status register, used to hold information about the results of floating-point operations. There are also instructions for moving between an FPR and a GPR.

Data types for MIPS

The data types are 8-bit bytes, 16-bit half words, 32-bit words, and 64-bit double words for integer data, and 32-bit single precision and 64-bit double precision for floating point. Half words were added because they are found in languages like C and are popular in some programs, such as operating systems, concerned about the size of data structures. Single-precision floating-point operands were added for similar reasons. The MIPS64 operations work on 64-bit integers and 32- or 64-bit floating point. Bytes, half words, and words are loaded into the general-purpose registers with either zeros or the sign bit replicated to fill the 64 bits of the GPRs. Once loaded, they are operated on with the 64-bit integer operations.

Addressing modes for MIPS data transfers

The only data addressing modes are immediate and displacement, both with 16-bit fields. Register indirect is accomplished simply by placing 0 in the 16-bit displacement field, and absolute addressing with a 16-bit field is accomplished by using register 0 as the base register. Together these give four effective modes, although only two are supported in the architecture. MIPS memory is byte addressable in Big Endian mode with a 64-bit address. As it is a load-store architecture, all references between memory and either GPRs or FPRs are through loads or stores. Memory accesses involving GPRs can be to a byte, half word, word, or double word. The FPRs may be loaded and stored with single-precision or double-precision numbers. All memory accesses must be aligned.

MIPS Instruction Format

Since MIPS has just two addressing modes, these can be encoded into the opcode. All instructions are 32 bits with a 6-bit primary opcode. The formats are simple while providing 16-bit fields for displacement addressing, immediate constants, or PC-relative branch addresses.

MIPS Operations

MIPS supports a list of simple operations. There are four broad classes of instructions: loads and stores, ALU operations, branches and jumps, and floating-point operations. Any of the general-purpose or floating-point registers may be loaded or stored, except that loading R0 has no effect. Single-precision floating-point numbers occupy half a floating-point register. Conversions between single and double precision must be done explicitly.

Instruction layout for MIPS.

Example: Assuming that R8 and R10 are 64-bit registers, a byte load such as LB R10, 0(R8) means that the byte at the memory location addressed by the contents of register R8 is sign-extended to form a 64-bit quantity that is stored into register R10.

All ALU instructions are register-register instructions. The operations include simple arithmetic and logical operations: add, subtract, AND, OR, XOR, and shifts. Immediate forms of all these instructions are provided using a 16-bit sign-extended immediate. The operation LUI (load upper immediate) loads the 16-bit immediate into bits 16 through 31 of a register, while setting the rest of the register to 0. LUI allows a 32-bit constant to be built in two instructions, or a data transfer using any constant 32-bit address in one extra instruction; a small C sketch of this two-step constant building follows below.

MIPS Control Flow Instructions

MIPS provides compare instructions, which compare two registers to see if the first is less than the second. If the condition is true, these instructions place a 1 in the destination register (to represent true); otherwise they place the value 0. Because these operations set a register, they are called set-equal, set-not-equal, set-less-than, and so on. There are also immediate forms of these compares. Control is handled through a set of jumps and a set of branches. The four jump instructions are differentiated by the two ways to specify the destination address and by whether or not a link is made. Two jumps use a 26-bit offset shifted left 2 bits, which then replaces the lower 28 bits of the program counter (of the instruction sequentially following the jump) to determine the destination address. The other two jump instructions specify a register that contains the destination address. There are two flavors of jumps: plain jump, and jump and link.
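To make the two-instruction constant building concrete, here is a small C sketch of the effect of an LUI followed by an OR-immediate; the function name is illustrative, and the code mimics the bit placement rather than exact MIPS semantics.

#include <stdint.h>

/* Build a 32-bit constant from two 16-bit immediates, mirroring the
   LUI + OR-immediate instruction pair: hi and lo are the two halves
   of the desired constant. */
uint32_t build_const32(uint16_t hi, uint16_t lo) {
    uint32_t r = (uint32_t)hi << 16;  /* LUI: place upper 16 bits, zero the rest */
    r |= lo;                          /* OR immediate: fill in the lower 16 bits */
    return r;
}

For example, build_const32(0x1234, 0x5678) yields 0x12345678.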

All branches are conditional. The branch condition is specified by the instruction, which may test the register source for zero or nonzero; the register may contain a data value or the result of a compare. The branch target address is specified with a 16-bit signed offset that is added to the program counter, which is pointing to the next sequential instruction. There is also a branch that tests the floating-point status register for floating-point conditional branches.

MIPS Floating-Point Operations

Floating-point instructions manipulate the floating-point registers and indicate whether the operation to be performed is single or double precision. The operations MOV.S and MOV.D copy a single-precision (MOV.S) or double-precision (MOV.D) floating-point register to another register of the same type. The operations MFC1 and MTC1 move data between a single floating-point register and an integer register; moving a double-precision value to two integer registers requires two instructions. Conversions from integer to floating point are also provided, and vice versa. The floating-point operations are add, subtract, multiply, and divide; a suffix D is used for double precision and a suffix S for single precision (e.g., ADD.D, ADD.S, SUB.D, SUB.S, MUL.D, MUL.S, DIV.D, DIV.S). Floating-point compares set a bit in the special floating-point status register that can be tested with a pair of branches: BC1T and BC1F, branch floating-point true and branch floating-point false.

To get greater performance for graphics routines, MIPS64 has instructions that perform two 32-bit floating-point operations on each half of the 64-bit floating-point register. These paired-single operations include ADD.PS, SUB.PS, MUL.PS, and DIV.PS. (They are loaded and stored using double-precision loads and stores.) Giving a nod toward the importance of DSP applications, MIPS64 also includes both integer and floating-point multiply-add instructions: MADD, MADD.S, MADD.D, and MADD.PS.
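As an illustration of where MADD-style instructions pay off, here is a plain C dot-product loop; each iteration is one multiply-add that a compiler could map to a MADD.D. This is a sketch of the pattern, not MIPS-specific code.

/* Double-precision dot product: each loop iteration is one
   multiply-add, the pattern a fused multiply-add instruction
   such as MADD.D accelerates. */
double dot(const double *x, const double *y, int n) {
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += x[i] * y[i];   /* candidate for a multiply-add */
    return acc;
}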

1.19 THE TRIMEDIA TM32 CPU

Media processor is the name given to a class of embedded processors that are dedicated to multimedia processing, typically being cost sensitive like embedded processors but following the compiler orientation of desktop and server computing. Like DSPs, they operate on narrower data types than the desktop, and must often deal with infinite, continuous streams of data. The Trimedia TM32 CPU is a representative of this class; it is intended for products like set-top boxes and advanced televisions. As multimedia applications have considerable parallelism in the processing of these data streams, the instruction set architectures often look different from the desktop.

First, there are many more registers: 128 32-bit registers, which contain either integer or floating-point data. Second, and not surprisingly, it offers partitioned ALU or SIMD instructions to allow computations on multiple instances of narrower data. Third, showing its heritage, for integers it offers both the two's complement arithmetic favored by desktop processors and the saturating arithmetic favored by DSPs.

Each TM32 instruction holds five operation slots. If there are not five independent operations available for the compiler to schedule together (that is, the rest are dependent), then NOPs are placed in the leftover slots. This instruction coding technique is called, naturally enough, Very Long Instruction Word (VLIW), and it predates the Trimedia processors. Because the Trimedia TM32 CPU has longer instruction words and they often contain NOPs, Trimedia compacts its instructions in memory, decoding them to the full size when loaded into the cache.

List of operations and number of variations in the Trimedia TM32 CPU.

Figure shows the TM32 CPU instruction mix for the EEMBC benchmarks. Using the unmodified source code, the instruction mix is similar to others, although there are more byte data transfers. If the C code is hand-tuned, it can extensively use SIMD instructions; note the large number of pack and merge instructions to align the data for the SIMD instructions. The cost in code size of these VLIW instructions is still a factor of two to three larger than MIPS after compaction.

***END***

UNIT II INSTRUCTION LEVEL PARALLELISM

2.1 PIPELINING AND HAZARDS

Pipelining is an implementation technique whereby multiple instructions are overlapped in execution; it takes advantage of parallelism that exists among the actions needed to execute an instruction. A pipeline is like an assembly line. In a computer pipeline, each step in the pipeline completes a part of an instruction. Like the assembly line, different steps are completing different parts of different instructions in parallel. Each of these steps is called a pipe stage or a pipe segment. The stages are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end.

The throughput of an instruction pipeline is determined by how often an instruction exits the pipeline. The time required between moving an instruction one step down the pipeline is a processor cycle. Because all stages proceed at the same time, the length of a processor cycle is determined by the time required for the slowest pipe stage. The pipeline designer's goal is to balance the length of each pipeline stage. If the stages are perfectly balanced, then the time per instruction on the pipelined processor, assuming ideal conditions, is equal to

Time per instruction on unpipelined machine / Number of pipe stages

Under these conditions, the speedup from pipelining equals the number of pipe stages. The stages will not be perfectly balanced; furthermore, pipelining does involve some overhead. Thus, the time per instruction on the pipelined processor will not have its minimum possible value. Pipelining yields a reduction in the average execution time per instruction, which can be viewed as decreasing the number of clock cycles per instruction (CPI).

Pipelining is an implementation technique that exploits parallelism among the instructions in a sequential instruction stream. It has the substantial advantage that, unlike some speedup techniques, it is not visible to the programmer.

The Basics of a RISC Instruction Set

RISC (Reduced Instruction Set Computer) architecture is also called load-store architecture. RISC architectures are characterized by a few key properties, which dramatically simplify their implementation:
All operations on data apply to data in registers and typically change the entire register (32 or 64 bits per register).
The only operations that affect memory are load and store operations that move data from memory to a register or to memory from a register, respectively. Load and store operations that load or store less than a full register (e.g., a byte, 16 bits, or 32 bits) are often available.
The instruction formats are small in number, with all instructions typically being one size.

Like other RISC architectures, the MIPS instruction set provides 32 registers, although register 0 always has the value 0. Most RISC architectures, like MIPS, have three classes of instructions:

1. ALU instructions: These instructions take either two registers or a register and a sign-extended immediate (called ALU immediate instructions; they have a 16-bit offset in MIPS), operate on them, and store the result into a third register. Typical operations include add (ADDD), subtract (SUBD), and logical operations (such as AND or OR), which do not differentiate between 32-bit and 64-bit versions. Immediate versions of these instructions use the same mnemonics with a suffix of I.

In MIPS, there are both signed and unsigned forms of the arithmetic instructions; the unsigned forms, which do not generate overflow exceptions and thus are the same in 32-bit and 64-bit mode, have a U at the end (e.g., ADDU, SUBU, ADDUI).

2. Load and store instructions: These instructions take a register source, called the base register, and an immediate field (16-bit in MIPS), called the offset, as operands. The sum of the contents of the base register and the sign-extended offset, called the effective address, is used as the memory address. In the case of a load instruction, a second register operand acts as the destination for the data loaded from memory. In the case of a store, the second register operand is the source of the data that is stored into memory. The instructions load double word (LD) and store double word (SD) load or store the entire 64-bit register contents.

3. Branches and jumps: Branches are conditional transfers of control. There are usually two ways of specifying the branch condition in RISC architectures: with a set of condition bits (sometimes called a condition code) or by a limited set of comparisons between a pair of registers or between a register and zero. In all RISC architectures, the branch destination is obtained by adding a sign-extended offset (16 bits in MIPS) to the current PC. Unconditional jumps are provided in many RISC architectures.

A Simple Implementation of a RISC Instruction Set

Every instruction in this RISC subset can be implemented in at most five clock cycles. The five clock cycles are as follows.

1. Instruction fetch cycle (IF): Send the program counter (PC) to memory and fetch the current instruction from memory. Update the PC to the next sequential PC by adding 4 (since each instruction is 4 bytes) to the PC.

2. Instruction decode/register fetch cycle (ID): Decode the instruction and read the registers corresponding to the register source specifiers from the register file. Do the equality test on the registers as they are read, for a possible branch. Sign-extend the offset field of the instruction in case it is needed. Compute the possible branch target address by adding the sign-extended offset to the incremented PC. Decoding is done in parallel with reading registers, which is possible because the register specifiers are at a fixed location in a RISC architecture. This technique is known as fixed-field decoding.

3. Execution/effective address cycle (EX): The ALU operates on the operands prepared in the prior cycle, performing one of three functions depending on the instruction type.
Memory reference: The ALU adds the base register and the offset to form the effective address.
Register-register ALU instruction: The ALU performs the operation specified by the ALU opcode on the values read from the register file.
Register-immediate ALU instruction: The ALU performs the operation specified by the ALU opcode on the first value read from the register file and the sign-extended immediate.

4. Memory access (MEM): If the instruction is a load, then read from memory using the effective address computed in the previous cycle. If it is a store, then the data from the second register read from the register file is written into memory using the effective address.

5. Write-back cycle (WB): For a register-register ALU instruction or a load instruction, write the result into the register file, whether it comes from the memory system (for a load) or from the ALU (for an ALU instruction).

The Classic Five-Stage Pipeline for a RISC Processor

We can pipeline the execution, starting a new instruction on each clock cycle. Each of the clock cycles from the previous section becomes a pipe stage: a cycle in the pipeline. Each instruction takes five clock cycles to complete; during each clock cycle the hardware will initiate a new instruction and will be executing some part of five different instructions.

Simple RISC pipeline.

First, we use separate instruction and data memories, which we would typically implement with separate instruction and data caches. The use of separate caches eliminates a conflict for a single memory that would arise between instruction fetch and data memory access. Notice that if our pipelined processor has a clock cycle that is equal to that of the unpipelined version, the memory system must deliver five times the bandwidth. This increased demand is one cost of higher performance.

Second, the register file is used in two stages: one for reading in ID and one for writing in WB. We need to perform two reads and one write every clock cycle. To handle reads and a write to the same register (and for another reason, which will become obvious shortly), we perform the register write in the first half of the clock cycle and the read in the second half.

The pipeline can be thought of as a series of datapaths shifted in time.

Third, Figure does not deal with the PC. To start a new instruction every clock, we must increment and store the PC every clock, and this must be done during the IF stage in preparation for the next instruction. One further problem is that a branch does not change the PC until the ID stage.

Although it is critical to ensure that instructions in the pipeline do not attempt to use the hardware resources at the same time, we must also ensure that instructions in different stages of the pipeline do not interfere with one another. This separation is done by introducing pipeline registers between successive stages of the pipeline, so that at the end of a clock cycle all the results from a given stage are stored into a register that is used as the input to the next stage on the next clock cycle.

In the case of a pipelined processor, the pipeline registers also play the key role of carrying intermediate results from one stage to another where the source and destination may not be directly adjacent. It is sometimes useful to name the pipeline registers; they are called IF/ID, ID/EX, EX/MEM, and MEM/WB.

A pipeline showing the pipeline registers between successive pipeline stages.

Basic Performance Issues in Pipelining

Pipelining increases the CPU instruction throughput (the number of instructions completed per unit of time), but it does not reduce the execution time of an individual instruction. The increase in instruction throughput means that a program runs faster and has lower total execution time, even though no single instruction runs faster!

The fact that the execution time of each instruction does not decrease puts limits on the practical depth of a pipeline. In addition to limitations arising from pipeline latency, limits arise from imbalance among the pipe stages and from pipelining overhead. Imbalance among the pipe stages reduces performance, since the clock can run no faster than the time needed for the slowest pipeline stage. Pipeline overhead arises from the combination of pipeline register delay and clock skew. The pipeline registers add setup time, which is the time that a register input must be stable before the clock signal that triggers a write occurs, plus propagation delay to the clock cycle. Clock skew, which is the maximum delay between when the clock arrives at any two registers, also contributes to the lower limit on the clock cycle. Once the clock cycle is as small as the sum of the clock skew and latch overhead, no further pipelining is useful, since there is no time left in the cycle for useful work.

Average instruction execution time = Clock cycle × Average CPI

The Major Hurdle of Pipelining: Pipeline Hazards

Hazards prevent the next instruction in the instruction stream from executing during its designated clock cycle. Hazards reduce the performance from the ideal speedup gained by pipelining. There are three classes of hazards:
1. Structural hazards arise from resource conflicts when the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution.
2. Data hazards arise when an instruction depends on the results of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
3. Control hazards arise from the pipelining of branches and other instructions that change the PC.

Hazards in pipelines can make it necessary to stall the pipeline. Avoiding a hazard often requires that some instructions in the pipeline be allowed to proceed while others are delayed.

Performance of Pipelines with Stalls

The speedup from pipelining is

Speedup = Average instruction time unpipelined / Average instruction time pipelined

Pipelining can be thought of as decreasing either the CPI or the clock cycle time; it is traditional to use the CPI to compare pipelines. The ideal CPI on a pipelined processor is almost always 1. Hence, we can compute the pipelined CPI:

CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
             = 1 + Pipeline stall clock cycles per instruction

If we ignore the cycle time overhead of pipelining and assume the stages are perfectly balanced, then the cycle time of the two processors can be equal, leading to

Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction)

The unpipelined CPI is equal to the depth of the pipeline, leading to

Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)

Alternatively, if we think of pipelining as improving the clock cycle time, then we can assume that the CPI of the unpipelined processor, as well as that of the pipelined processor, is 1.

This leads to

Speedup = Clock cycle unpipelined / Clock cycle pipelined

In cases where the pipe stages are perfectly balanced and there is no overhead, the clock cycle on the pipelined processor is smaller than the clock cycle of the unpipelined processor by a factor equal to the pipeline depth:

Clock cycle pipelined = Clock cycle unpipelined / Pipeline depth

This leads to the following:

Speedup = (1 / (1 + Pipeline stall cycles per instruction)) × (Clock cycle unpipelined / Clock cycle pipelined)
        = Pipeline depth / (1 + Pipeline stall cycles per instruction)

Thus, if there are no stalls, the speedup is equal to the number of pipeline stages, matching our intuition for the ideal case. For example, a 5-stage pipeline averaging 0.25 stall cycles per instruction achieves a speedup of 5 / 1.25 = 4.

Structural Hazards

When a processor is pipelined, the overlapped execution of instructions requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline. If some combination of instructions cannot be accommodated because of resource conflicts, the processor is said to have a structural hazard. The most common instances of structural hazards arise when some functional unit is not fully pipelined. Then a sequence of instructions using that unpipelined unit cannot proceed at the rate of one per clock cycle. Another common way that structural hazards appear is when some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute.

For example, a processor may have only one register-file write port, but under certain circumstances, the pipeline might want to perform two writes in a clock cycle. This will generate a structural hazard. When a sequence of instructions encounters this hazard, the pipeline will stall one of the instructions until the required unit is available. Such stalls will increase the CPI from its usual ideal value of 1. A stall is commonly called a pipeline bubble, or just bubble, since it floats through the pipeline taking space but carrying no useful work. Similarly, a processor with only one memory port will generate a conflict whenever a data memory reference and an instruction fetch occur in the same clock cycle.

A pipeline stalled for a structural hazard: a load with one memory port.

Data Hazards

A major effect of pipelining is to change the relative timing of instructions by overlapping their execution. This overlap introduces data and control hazards. Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on an unpipelined processor. Consider the pipelined execution of these instructions:

ADDD R1,R2,R3
SUBD R4,R1,R5
AND R6,R1,R7
OR R8,R1,R9
XOR R10,R1,R11

All the instructions after the ADDD use the result of the ADDD instruction. As shown in Figure, the ADDD instruction writes the value of R1 in the WB pipe stage, but the SUBD instruction reads the value during its ID stage. This problem is called a data hazard. The AND instruction is also affected by this hazard. As we can see from Figure, the write of R1 does not complete until the end of clock cycle 5. Thus, the AND instruction that reads the registers during clock cycle 4 will receive the wrong results.

The XOR instruction operates properly, because its register read occurs in clock cycle 6, after the register write. The OR instruction also operates without incurring a hazard because we perform the register file reads in the second half of the cycle and the writes in the first half.

Minimizing Data Hazard Stalls by Forwarding

The problem posed in Figure can be solved with a simple hardware technique called forwarding (also called bypassing and sometimes short-circuiting). If the result can be moved from the pipeline register where the ADDD stores it to where the SUBD needs it, then the need for a stall can be avoided. Using this observation, forwarding works as follows:
1. The ALU result from both the EX/MEM and MEM/WB pipeline registers is always fed back to the ALU inputs.
2. If the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file. (A small C sketch of this selection follows below.)

Forwarding can be generalized to include passing a result directly to the functional unit that requires it: a result is forwarded from the pipeline register corresponding to the output of one unit to the input of another, rather than just from the result of a unit to the input of the same unit. Take, for example, the following sequence:

ADDD R1,R2,R3
LD R4,0(R1)
SD 12(R1),R4

To prevent a stall in this sequence, we would need to forward the values of the ALU output and memory unit output from the pipeline registers to the ALU and data memory inputs. Figure shows all the forwarding paths for this example.
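The following is a minimal C sketch (with hypothetical structure and field names) of the forwarding decision described in steps 1 and 2 above: choose the newest pending result whose destination matches the source register being read, otherwise fall back to the register file.

#include <stdint.h>

/* Simplified pipeline-register state relevant to forwarding. */
typedef struct { int regwrite; int rd; uint64_t value; } FwdSrc;

/* Select one ALU input for the instruction in EX: prefer the newest
   result (EX/MEM), then MEM/WB, else the register-file value. */
uint64_t forward(int rs, uint64_t regfile_val,
                 FwdSrc exmem, FwdSrc memwb) {
    if (exmem.regwrite && exmem.rd != 0 && exmem.rd == rs)
        return exmem.value;          /* forwarded from EX/MEM */
    if (memwb.regwrite && memwb.rd != 0 && memwb.rd == rs)
        return memwb.value;          /* forwarded from MEM/WB */
    return regfile_val;              /* no hazard: use register file */
}

The rd != 0 tests reflect the convention that R0 is always 0 and is never a forwarding source.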

The use of the result of the ADDD instruction in the next three instructions causes a hazard, since the register is not written until after those instructions read it.

A set of instructions that depend on the ADDD result use forwarding paths to avoid the data hazard.

Data Hazards Requiring Stalls

Unfortunately, not all potential data hazards can be handled by bypassing. Consider the following sequence of instructions:

LD R1,0(R2)
SUBD R4,R1,R5
AND R6,R1,R7
OR R8,R1,R9

The pipelined datapath with the bypass paths for this example is shown in Figure. (Stores require an operand during MEM, and forwarding of that operand is shown there as well.) The LD instruction does not have the data until the end of clock cycle 4 (its MEM cycle), while the SUBD instruction needs to have the data by the beginning of that clock cycle. The load instruction has a delay or latency that cannot be eliminated by forwarding alone. Instead, we need to add hardware, called a pipeline interlock, to preserve the correct execution pattern. In general, a pipeline interlock detects a hazard and stalls the pipeline until the hazard is cleared.

In this case, the interlock stalls the pipeline, beginning with the instruction that wants to use the data, until the source instruction produces it. This pipeline interlock introduces a stall or bubble, just as it did for the structural hazard. The CPI for the stalled instruction increases by the length of the stall. The load instruction can bypass its results to the AND and OR instructions, but not to the SUBD, since that would mean forwarding the result in negative time. In the top half of the figure, we can see why a stall is needed: the MEM cycle of the load produces a value that is needed in the EX cycle of the SUBD, which occurs at the same time.
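The load-interlock check itself is a small piece of combinational logic. The following hedged C sketch (hypothetical field names) captures the comparison for this five-stage pipeline: stall if the instruction in EX is a load whose destination matches either source of the instruction in ID.

/* Fields of the IF/ID and ID/EX pipeline registers needed for the
   load-interlock check done in the ID stage. */
typedef struct { int rs, rt; } IfId;
typedef struct { int memread; int rt; } IdEx;

/* Returns 1 if the instruction now in ID must stall one cycle because
   the instruction in EX is a load whose result it needs. */
int load_interlock(IfId ifid, IdEx idex) {
    return idex.memread &&
           (idex.rt == ifid.rs || idex.rt == ifid.rt);
}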

Branch Hazards

Control hazards can cause a greater performance loss for our MIPS pipeline than do data hazards. When a branch is executed, it may or may not change the PC to something other than its current value plus 4. Recall that if a branch changes the PC to its target address, it is a taken branch; if it falls through, it is not taken, or untaken. If instruction i is a taken branch, then the PC is normally not changed until the end of ID, after the completion of the address calculation and comparison.

Figure shows that the simplest method of dealing with branches is to redo the fetch of the instruction following a branch, once we detect the branch during ID (when instructions are decoded). The first IF cycle is essentially a stall, because it never performs useful work. You may have noticed that if the branch is untaken, then the repetition of the IF stage is unnecessary, since the correct instruction was indeed fetched. We will develop several schemes to take advantage of this fact shortly.

A branch causes a one-cycle stall in the five-stage pipeline.

Reducing Pipeline Branch Penalties

The software can try to minimize the branch penalty using knowledge of the hardware scheme and of branch behavior. The simplest scheme to handle branches is to freeze or flush the pipeline, holding or deleting any instructions after the branch until the branch destination is known. The attractiveness of this solution lies primarily in its simplicity, both for hardware and software. A higher-performance, and only slightly more complex, scheme is to treat every branch as not taken, simply allowing the hardware to continue as if the branch were not executed. In the simple five-stage pipeline, this predict-not-taken or predict-untaken scheme is implemented by continuing to fetch instructions as if the branch were a normal instruction.

An alternative scheme is to treat every branch as taken. As soon as the branch is decoded and the target address is computed, we assume the branch to be taken and begin fetching and executing at the target. Because in our five-stage pipeline we don't know the target address any earlier than we know the branch outcome, there is no advantage in this approach for this pipeline. In either a predict-taken or predict-not-taken scheme, the compiler can improve performance by organizing the code so that the most frequent path matches the hardware's choice.

The predict-not-taken scheme and the pipeline sequence when the branch is untaken (top) and taken (bottom).

A fourth scheme in use in some processors is called delayed branch. This technique was heavily used in early RISC processors and works reasonably well in the five-stage pipeline. In a delayed branch, the execution cycle with a branch delay of one is

branch instruction
sequential successor
branch target if taken

The sequential successor is in the branch-delay slot. This instruction is executed whether or not the branch is taken.

The behavior of a delayed branch is the same whether or not the branch is taken.

The limitations on delayed-branch scheduling arise from (1) the restrictions on the instructions that are scheduled into the delay slots and (2) our ability to predict at compile time whether a branch is likely to be taken or not. To improve the ability of the compiler to fill branch delay slots, most processors with conditional branches have introduced a cancelling or nullifying branch. In a cancelling branch, the instruction includes the direction that the branch was predicted. When the branch behaves as predicted, the instruction in the branch-delay slot is simply executed as it would normally be with a delayed branch. When the branch is incorrectly predicted, the instruction in the branch-delay slot is simply turned into a no-op.

Performance of Branch Schemes

What is the effective performance of each of these schemes? The effective pipeline speedup with branch penalties, assuming an ideal CPI of 1, is

Pipeline speedup = Pipeline depth / (1 + Pipeline stall cycles from branches)

because

Pipeline stall cycles from branches = Branch frequency × Branch penalty

We obtain

Pipeline speedup = Pipeline depth / (1 + Branch frequency × Branch penalty)

The branch frequency and branch penalty can have a component from both unconditional and conditional branches. However, the latter dominate, since they are more frequent.

How is Pipelining Implemented?

A Simple Implementation of MIPS

Every MIPS instruction can be implemented in at most five clock cycles. The five clock cycles are as follows.

1. Instruction fetch cycle (IF):

IR <-- Mem[PC];
NPC <-- PC + 4;

Operation: Send out the PC and fetch the instruction from memory into the instruction register (IR); increment the PC by 4 to address the next sequential instruction. The IR is used to hold the instruction that will be needed on subsequent clock cycles; likewise, the register NPC is used to hold the next sequential PC.

2. Instruction decode/register fetch cycle (ID):

A <-- Regs[rs];
B <-- Regs[rt];
Imm <-- sign-extended immediate field of IR;

Operation: Decode the instruction and access the register file to read the registers (rs and rt are the register specifiers). The outputs of the general-purpose registers are read into two temporary registers (A and B) for use in later clock cycles. The lower 16 bits of the IR are also sign-extended and stored into the temporary register Imm, for use in the next cycle. Decoding is done in parallel with reading registers, which is possible because these fields are at a fixed location in the MIPS instruction format.

Because the immediate portion of an instruction is located in an identical place in every MIPS format, the sign-extended immediate is also calculated during this cycle in case it is needed in the next cycle.

3. Execution/effective address cycle (EX): The ALU operates on the operands prepared in the prior cycle, performing one of four functions depending on the MIPS instruction type.

Memory reference:
ALUOutput <-- A + Imm;
Operation: The ALU adds the operands to form the effective address and places the result into the register ALUOutput.

Register-register ALU instruction:
ALUOutput <-- A func B;
Operation: The ALU performs the operation specified by the function code on the value in register A and on the value in register B. The result is placed in the temporary register ALUOutput.

Register-immediate ALU instruction:
ALUOutput <-- A op Imm;
Operation: The ALU performs the operation specified by the opcode on the value in register A and on the value in register Imm. The result is placed in the temporary register ALUOutput.

Branch:
ALUOutput <-- NPC + Imm;
Cond <-- (A == 0)
Operation: The ALU adds the NPC to the sign-extended immediate value in Imm to compute the address of the branch target.

Register A, which has been read in the prior cycle, is checked to determine whether the branch is taken. Since we are considering only one form of branch (BEQZ), the comparison is against 0. The load-store architecture of MIPS means that the effective address and execution cycles can be combined into a single clock cycle, since no instruction needs to simultaneously calculate a data address, calculate an instruction target address, and perform an operation on the data. The other integer instructions not included above are jumps of various forms, which are similar to branches.

4. Memory access/branch completion cycle (MEM): The PC is updated for all instructions:
PC <-- NPC;

Memory reference:
LMD <-- Mem[ALUOutput] or Mem[ALUOutput] <-- B;
Operation: Access memory if needed. If the instruction is a load, data return from memory and are placed in the LMD (load memory data) register; if it is a store, then the data from the B register are written into memory. In either case the address used is the one computed during the prior cycle and stored in the register ALUOutput.

Branch:
if (cond) PC <-- ALUOutput
Operation: If the instruction branches, the PC is replaced with the branch destination address in the register ALUOutput.

5. Write-back cycle (WB):
Register-register ALU instruction: Regs[rd] <-- ALUOutput;
Register-immediate ALU instruction: Regs[rt] <-- ALUOutput;

Load instruction: Regs[rt] <-- LMD;
Operation: Write the result into the register file, whether it comes from the memory system (which is in LMD) or from the ALU (which is in ALUOutput); the register destination field is also in one of two positions (rd or rt) depending on the effective opcode. Figure shows how an instruction flows through the datapath. At the end of each clock cycle, every value computed during that clock cycle and required on a later clock cycle (whether for this instruction or the next) is written into a storage device, which may be memory, a general-purpose register, the PC, or a temporary register (i.e., LMD, Imm, A, B, IR, NPC, ALUOutput, or Cond). The temporary registers hold values between clock cycles for one instruction, while the other storage elements are visible parts of the state and hold values between successive instructions.

The implementation of the MIPS datapath allows every instruction to be executed in four or five clock cycles.
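To summarize the temporaries just listed, here is a hedged C sketch of a state structure and one register-transfer step; the names follow the text (IR, NPC, A, B, Imm, ALUOutput, LMD, Cond), but the code is illustrative, not an actual simulator.

#include <stdint.h>

/* The temporary registers named above, holding values between the
   clock cycles of a single instruction. */
typedef struct {
    uint32_t IR;        /* instruction register      */
    uint64_t NPC;       /* next sequential PC        */
    uint64_t A, B;      /* register-file read values */
    int64_t  Imm;       /* sign-extended immediate   */
    uint64_t ALUOutput; /* EX-stage result           */
    uint64_t LMD;       /* load memory data          */
    int      Cond;      /* branch condition          */
} Temps;

/* One EX-stage transfer for a memory reference: the effective-address
   add ALUOutput <-- A + Imm from the description above. */
void ex_memref(Temps *t) {
    t->ALUOutput = t->A + (uint64_t)t->Imm;
}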

A Basic Pipeline for MIPS

Figure shows the MIPS pipeline with the appropriate registers, called pipeline registers or pipeline latches, between each pipeline stage. The registers are labeled with the names of the stages they connect. Figure is drawn so that connections through the pipeline registers from one stage to another are clear. All of the registers needed to hold values temporarily between clock cycles within one instruction are subsumed into these pipeline registers. The fields of the instruction register (IR), which is part of the IF/ID register, are labeled when they are used to supply register names. The pipeline registers carry both data and control from one pipeline stage to the next.

The datapath is pipelined by adding a set of registers, one between each pair of pipe stages.

To control this simple pipeline we need only determine how to set the control for the four multiplexers in the datapath of Figure. The two multiplexers in the ALU stage are set depending on the instruction type, which is dictated by the IR field of the ID/EX register.

The top ALU input multiplexer is set by whether the instruction is a branch or not, and the bottom multiplexer is set by whether the instruction is a register-register ALU operation or any other type of operation. The multiplexer in the IF stage chooses whether to use the value of the incremented PC or the value of the EX/MEM.ALUOutput (the branch target) to write into the PC. This multiplexer is controlled by the field EX/MEM.cond. The fourth multiplexer is controlled by whether the instruction in the WB stage is a load or an ALU operation.

Implementing the Control for the MIPS Pipeline

The process of letting an instruction move from the instruction decode stage (ID) into the execution stage (EX) of this pipeline is usually called instruction issue; an instruction that has made this step is said to have issued. For the MIPS integer pipeline, all the data hazards can be checked during the ID phase of the pipeline. If a data hazard exists, the instruction is stalled before it is issued. Likewise, we can determine what forwarding will be needed during ID and set the appropriate controls then. Detecting interlocks early in the pipeline reduces the hardware complexity because the hardware never has to suspend an instruction that has updated the state of the processor, unless the entire processor is stalled. Alternatively, we can detect the hazard or forwarding at the beginning of a clock cycle that uses an operand (EX and MEM for this pipeline). To show the differences in these two approaches, we will show how the interlock for a RAW hazard with the source coming from a load instruction (called a load interlock) can be implemented by a check in ID, while the implementation of forwarding paths to the ALU inputs can be done during EX. Figure lists the variety of circumstances that we must handle. Once a hazard has been detected, the control unit must insert the pipeline stall and prevent the instructions in the IF and ID stages from advancing. Figure shows the comparisons and possible forwarding operations where the destination of the forwarded result is an ALU input for the instruction currently in EX. Figure shows the relevant segments of the pipelined datapath with the additional multiplexers and connections in place.

Situations that the pipeline hazard detection hardware can see by comparing the destination and sources of adjacent instructions.

The logic to detect the need for load interlocks during the ID stage of an instruction requires three comparisons.

Dealing with Branches in the Pipeline

In MIPS, the branches (BEQZ and BNEZ) require testing a register for equality to zero. Thus, it is possible to complete this decision by the end of the ID cycle by moving the zero test into that cycle. To take advantage of an early decision on whether the branch is taken, both PCs (taken and untaken) must be computed early.

Computing the branch target address during ID requires an additional adder, because the main ALU, which has been used for this function so far, is not usable until EX. Figure shows the revised pipelined datapath. Forwarding of data to the two ALU inputs (for the instruction in EX) can occur from the ALU result (in EX/MEM or in MEM/WB) or from the load result in MEM/WB.
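The following C fragment sketches how the forwarding sources for one ALU input might be prioritized; the latch fields and names are illustrative assumptions, with the newest result (EX/MEM) taking precedence over the older one (MEM/WB).

    #include <stdint.h>

    /* Hypothetical pipeline-latch result fields (illustrative only). */
    typedef struct {
        int      writes_reg;   /* does this instruction write a register? */
        uint8_t  rd;           /* destination register number             */
        uint32_t value;        /* ALU or load result carried in the latch */
    } LatchResult;

    /* Select the value for one ALU source operand in EX.
     * Priority: EX/MEM (instruction one ahead) over MEM/WB
     * (instruction two ahead), else the register file value read in ID. */
    uint32_t forward_operand(uint8_t src_reg, uint32_t regfile_value,
                             LatchResult ex_mem, LatchResult mem_wb)
    {
        if (ex_mem.writes_reg && ex_mem.rd != 0 && ex_mem.rd == src_reg)
            return ex_mem.value;   /* forwarded from EX/MEM */
        if (mem_wb.writes_reg && mem_wb.rd != 0 && mem_wb.rd == src_reg)
            return mem_wb.value;   /* forwarded from MEM/WB */
        return regfile_value;      /* no hazard: use the register file */
    }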

Forwarding of results to the ALU requires the addition of three extra inputs on each ALU multiplexer and the addition of three paths to the new inputs. The stall from branch hazards can be reduced by moving the zero test and branch target calculation into the ID phase of the pipeline.

Figure: revised pipeline structure.

In some processors, branch hazards are even more expensive in clock cycles than in our example, since the time to evaluate the branch condition and compute the destination can be even longer.

What Makes Pipelining Hard to Implement?

The first part considers the challenges of exceptional situations, where the instruction execution order is changed in unexpected ways. In the second part, we discuss some of the challenges raised by different instruction sets.

Dealing with Exceptions

Exceptional situations are harder to handle in a pipelined CPU because the overlapping of instructions makes it more difficult to know whether an instruction can safely change the state of the CPU. In a pipelined CPU, an instruction is executed piece by piece and is not completed for several clock cycles. Unfortunately, other instructions in the pipeline can raise exceptions that may force the CPU to abort the instructions in the pipeline before they complete.

Types of Exceptions and Requirements

The terminology used to describe exceptional situations, where the normal execution order of instructions is changed, varies among CPUs.

The terms interrupt, fault, and exception are used, though not in a consistent fashion. We use the term exception to cover all these mechanisms, including the following:

I/O device request
Invoking an operating system service from a user program
Tracing instruction execution
Breakpoint (programmer-requested interrupt)
Integer arithmetic overflow
FP arithmetic anomaly
Page fault (not in main memory)
Misaligned memory accesses (if alignment is required)
Memory-protection violation
Using an undefined or unimplemented instruction
Hardware malfunctions
Power failure

The requirements on exceptions can be characterized on five semi-independent axes:

1. Synchronous versus asynchronous: If the event occurs at the same place every time the program is executed with the same data and memory allocation, the event is synchronous. With the exception of hardware malfunctions, asynchronous events are caused by devices external to the CPU and memory. Asynchronous events usually can be handled after the completion of the current instruction, which makes them easier to handle.

2. User requested versus coerced: If the user task directly asks for it, it is a user-requested event. In some sense, user-requested exceptions are not really exceptions, since they are predictable. They are treated as exceptions, however, because the same mechanisms that are used to save and restore the state are used for these user-requested events. Because the only function of an instruction that triggers this exception is to cause the exception, user-requested exceptions can always be handled after the instruction has completed.

Coerced exceptions are caused by some hardware event that is not under the control of the user program. Coerced exceptions are harder to implement because they are not predictable.

3. User maskable versus user nonmaskable: If an event can be masked or disabled by a user task, it is user maskable. This mask simply controls whether the hardware responds to the exception or not.

4. Within versus between instructions: This classification depends on whether the event prevents instruction completion by occurring in the middle of execution, no matter how short, or whether it is recognized between instructions. Exceptions that occur within instructions are usually synchronous, since the instruction triggers the exception. It's harder to implement exceptions that occur within instructions than those between instructions, since the instruction must be stopped and restarted. Asynchronous exceptions that occur within instructions arise from catastrophic situations (e.g., hardware malfunction) and always cause program termination.

5. Resume versus terminate: If the program's execution always stops after the interrupt, it is a terminating event. If the program's execution continues after the interrupt, it is a resuming event. It is easier to implement exceptions that terminate execution, since the CPU need not be able to restart execution of the same program after handling the exception.

If a pipeline provides the ability for the processor to handle the exception, save the state, and restart without affecting the execution of the program, the pipeline or processor is said to be restartable.

Stopping and Restarting Execution

As in unpipelined implementations, the most difficult exceptions have two properties: (1) they occur within instructions (that is, in the middle of the instruction execution, corresponding to the EX or MEM pipe stages), and (2) they must be restartable.

When an exception occurs, the pipeline control can take the following steps to save the pipeline state safely:

1. Force a trap instruction into the pipeline on the next IF.

2. Until the trap is taken, turn off all writes for the faulting instruction and for all instructions that follow in the pipeline; this can be done by placing zeros into the pipeline latches of all instructions in the pipeline, starting with the instruction that generates the exception, but not those that precede that instruction. This prevents any state changes for instructions that will not be completed before the exception is handled.

3. After the exception-handling routine in the operating system receives control, it immediately saves the PC of the faulting instruction. This value will be used to return from the exception later.

When we use delayed branches, as mentioned in the last section, it is no longer possible to re-create the state of the processor with a single PC, because the instructions in the pipeline may not be sequentially related. So we need to save and restore as many PCs as the length of the branch delay plus one.

If the pipeline can be stopped so that the instructions just before the faulting instruction are completed and those after it can be restarted from scratch, the pipeline is said to have precise exceptions.

Exceptions in MIPS

Figure shows the MIPS pipeline stages and which exceptions might occur in each stage. With pipelining, multiple exceptions may occur in the same clock cycle because there are multiple instructions in execution. For example, consider a sequence in which a load (LD) is immediately followed by a floating-point add (ADDD). This pair of instructions can cause a data page fault and an arithmetic exception at the same time, since the LD is in the MEM stage while the ADDD is in the EX stage.
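As a minimal sketch of step 2 above, the following C fragment squashes the faulting instruction and every younger instruction behind it by zeroing the control bits in their pipeline latches. The stage layout and field names are hypothetical, chosen only to mirror the five-stage pipeline discussed here.

    /* Pipeline latches indexed by stage: IF holds the youngest
     * instruction (stage 0), WB the oldest (stage 4). */
    enum { IF_STAGE = 0, ID_STAGE, EX_STAGE, MEM_STAGE, WB_STAGE, NSTAGES };

    typedef struct {
        int reg_write;   /* control bit: will write the register file */
        int mem_write;   /* control bit: will write memory            */
        /* ... data fields omitted ... */
    } PipelineLatch;

    /* Turn off all writes for the faulting instruction and for the
     * instructions that follow it in program order (which sit in the
     * earlier stages); older instructions are allowed to complete. */
    void squash_from(PipelineLatch latch[NSTAGES], int faulting_stage)
    {
        for (int s = faulting_stage; s >= IF_STAGE; s--) {
            latch[s].reg_write = 0;   /* zeroed control acts as a bubble */
            latch[s].mem_write = 0;
        }
    }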

Figure: exceptions that may occur in the MIPS pipeline. Exceptions raised from instruction or data-memory access account for six out of eight cases.

Instruction Set Complications

No MIPS instruction has more than one result, and our MIPS pipeline writes that result only at the end of an instruction's execution. When an instruction is guaranteed to complete, it is called committed. In the MIPS integer pipeline, all instructions are committed when they reach the end of the MEM stage (or the beginning of WB), and no instruction updates the state before that stage.

A related source of difficulties arises from instructions that update memory state during execution, such as the string copy operations on the VAX or IBM 360. To make it possible to interrupt and restart these instructions, the instructions are defined to use the general-purpose registers as working registers.

A different set of difficulties arises from odd bits of state that may create additional pipeline hazards or may require extra hardware to save and restore. Condition codes are a good example of this.

2.2 INSTRUCTION-LEVEL PARALLELISM

Concepts and Challenges

Instruction-level parallelism (ILP) is the potential to overlap the execution of instructions, using pipelining to improve the performance of the system. The various techniques that are used to increase the amount of parallelism reduce the impact of data and control hazards and increase the processor's ability to exploit parallelism.

There are two approaches to exploiting ILP:

1. Static Technique - Software Dependent
2. Dynamic Technique - Hardware Dependent

The static technique is a compiler-intensive approach, which has broader adoption in the embedded market than in the desktop or server markets; the new IA-64 architecture and Intel's Itanium use this more static approach. The dynamic technique is a hardware-intensive approach, which dominates the desktop and server markets and is used in a wide range of processors, including the Pentium III and 4, the Athlon, the MIPS R10000/12000, the Sun UltraSPARC III, the PowerPC 603, G3, and G4, and the Alpha 21264. The major techniques can be described together with the component of the CPI equation that each technique affects.

The simplest and most common way to increase the amount of parallelism is loop-level parallelism. Here is a simple example of a loop, which adds two 1000-element arrays, that is completely parallel:

    for (i = 1; i <= 1000; i = i + 1)
        x[i] = x[i] + y[i];
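As an illustration of how such loop-level parallelism can be exposed to the pipeline, here is a hedged C sketch of the same loop unrolled by a factor of four, a transformation a compiler might apply (the unroll factor is an arbitrary choice for the example):

    /* The same array sum as above, unrolled by 4: the four statements
     * in the body are independent of each other, so a pipelined (or
     * multiple-issue) processor can overlap their execution. Assumes
     * the trip count is divisible by 4, so no cleanup loop is needed. */
    for (int i = 1; i <= 1000; i = i + 4) {
        x[i]     = x[i]     + y[i];
        x[i + 1] = x[i + 1] + y[i + 1];
        x[i + 2] = x[i + 2] + y[i + 2];
        x[i + 3] = x[i + 3] + y[i + 3];
    }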

Every iteration of the loop can overlap with any other iteration, although within each loop iteration there is little or no opportunity for overlap. There are a number of techniques for converting such loop-level parallelism into instruction-level parallelism; they basically work by unrolling the loop, either statically by the compiler or dynamically by the hardware. An important alternative method for exploiting loop-level parallelism is the use of vector instructions.

Pipeline CPI

The CPI (cycles per instruction) for a pipelined processor is the sum of the base CPI and all contributions from stalls:

Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls

The ideal pipeline CPI is a measure of the maximum performance attainable by the implementation. By reducing each of the terms of the right-hand side, we minimize the overall pipeline CPI and thus increase the IPC (instructions per clock).

Data Dependence and Hazards

To exploit instruction-level parallelism, we must determine which instructions can be executed in parallel. If two instructions are parallel, they can execute simultaneously in a pipeline without causing any stalls. If two instructions are dependent, they are not parallel and must be executed in order. There are three different types of dependences: data dependences (also called true data dependences), name dependences, and control dependences.

Data Dependences

An instruction j is data dependent on instruction i if either of the following holds:

Instruction i produces a result that may be used by instruction j, or
Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.

The second condition simply states that one instruction is dependent on another if there exists a chain of dependences of the first type between the two instructions. This dependence chain can be as long as the entire program.

For example, consider the following code sequence that increments a vector of values in memory (starting at 0(R1) and with the last element at 8(R2)) by a scalar in register F2:

    Loop: L.D    F0,0(R1)    ; F0 = array element
          ADD.D  F4,F0,F2    ; add scalar in F2
          S.D    F4,0(R1)    ; store result
          DADDUI R1,R1,#-8   ; decrement pointer by 8 bytes
          BNE    R1,R2,Loop  ; branch if R1 != R2

The data dependences in this code sequence involve both floating-point data:

    L.D   F0,0(R1)   ; F0 = array element
    ADD.D F4,F0,F2   ; add scalar in F2
    S.D   F4,0(R1)   ; store result

and integer data:

    DADDUI R1,R1,#-8   ; decrement pointer by 8 bytes (per doubleword)
    BNE    R1,R2,Loop  ; branch if R1 != R2

In both of these sequences each instruction depends on the previous one; in the original figures this is shown with arrows, which indicate the order that must be preserved for correct execution. Each arrow points from an instruction that must precede to the instruction that must follow. If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped. The dependence implies that there would be a chain of one or more data hazards between the two instructions. Executing the instructions simultaneously will cause a processor with pipeline interlocks to detect a hazard and stall, thereby reducing or eliminating the overlap.

Dependences are a property of programs. Whether a given dependence results in an actual hazard being detected, and whether that hazard actually causes a stall, are properties of the pipeline organization. This difference is critical to understanding how instruction-level parallelism can be exploited. The presence of the dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline. The importance of the data dependences is that a dependence (1) indicates the possibility of a hazard, (2) determines the order in which results must be calculated, and (3) sets an upper bound on how much parallelism can possibly be exploited.

Name Dependences

A name dependence occurs when two instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name. There are two types of name dependences between an instruction i that precedes instruction j in program order:

An antidependence between instruction i and instruction j occurs when instruction j writes a register or memory location that instruction i reads. The original ordering must be preserved to ensure that i reads the correct value.

An output dependence occurs when instruction i and instruction j write the same register or memory location. The ordering between the instructions must be preserved to ensure that the value finally written corresponds to instruction j.

Both antidependences and output dependences are name dependences, as opposed to true data dependences, since there is no value being transmitted between the instructions.

Since a name dependence is not a true dependence, instructions involved in a name dependence can execute simultaneously or be reordered, if the name (register number or memory location) used in the instructions is changed so the instructions do not conflict. This renaming can be more easily done for register operands, where it is called register renaming. Register renaming can be done either statically by a compiler or dynamically by the hardware. Before describing dependences arising from branches, let's examine the relationship between dependences and pipeline data hazards.

Control Dependences

A control dependence determines the ordering of an instruction, i, with respect to a branch instruction, so that instruction i is executed in correct program order. Every instruction, except for those in the first basic block of the program, is control dependent on some set of branches, and, in general, these control dependences must be preserved to preserve program order. One of the simplest examples of a control dependence is the dependence of the statements in the "then" part of an if statement on the branch. For example, in the code segment:

    if p1 {
        S1;
    }
    if p2 {
        S2;
    }

S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1. In general, there are two constraints imposed by control dependences:

1. An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch. For example, we cannot take an instruction from the then-portion of an if-statement and move it before the if-statement.

2. An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution is controlled by the branch. For example, we cannot take a statement from before the if-statement and move it into the then-portion.

Control dependence is preserved by two properties in a simple pipeline. First, instructions execute in program order; this ordering ensures that an instruction that occurs before a branch is executed before the branch. Second, the detection of control or branch hazards ensures that an instruction that is control dependent on a branch is not executed until the branch direction is known.

Data Hazards

A hazard is created whenever there is a dependence between instructions, and they are close enough that the overlap caused by pipelining, or other reordering of instructions, would change the order of access to the operand involved in the dependence. Because of the dependence, we must preserve the order in which the instructions would execute if executed sequentially, one at a time, as determined by the original source program. The goal of both our software and hardware techniques is to exploit parallelism by preserving program order only where it affects the outcome of the program. Detecting and avoiding hazards ensures that the necessary program order is preserved.

Data hazards may be classified as one of three types, depending on the order of read and write accesses in the instructions. Consider two instructions i and j, with i occurring before j in program order. The possible data hazards are:

RAW (read after write): j tries to read a source before i writes it, so j incorrectly gets the old value. This hazard is the most common type and corresponds to a true data dependence. Program order must be preserved to ensure that j receives the value from i. In the simple common five-stage static pipeline, a load instruction followed by an integer ALU instruction that directly uses the load result will lead to a RAW hazard.

WAW (write after write): j tries to write an operand before it is written by i. The writes end up being performed in the wrong order, leaving the value written by i rather than the value written by j in the destination. This hazard corresponds to an output dependence.

WAW hazards are present only in pipelines that write in more than one pipe stage or allow an instruction to proceed even when a previous instruction is stalled. The classic five-stage integer pipeline writes a register only in the WB stage and avoids this class of hazards.

WAR (write after read): j tries to write a destination before it is read by i, so i incorrectly gets the new value. This hazard arises from an antidependence. WAR hazards cannot occur in most static-issue pipelines, even deeper pipelines or floating-point pipelines, because all reads are early (in ID) and all writes are late (in WB). A WAR hazard occurs either when there are some instructions that write results early in the instruction pipeline and other instructions that read a source late in the pipeline, or when instructions are reordered.

Control dependence itself is not the critical property that must be preserved. Instead, the two properties critical to program correctness, normally preserved by maintaining both data and control dependences, are the exception behavior and the data flow. The property of whether a value will be used by an upcoming instruction is called liveness. Code scheduling that moves an instruction across a branch in this way is sometimes called speculation, since the compiler is betting on the branch outcome; in this case, the bet is that the branch is usually not taken.

2.3 OVERCOMING DATA HAZARDS WITH DYNAMIC SCHEDULING

Dynamic scheduling is used to handle cases where dependences are unknown at compile time. The hardware rearranges the instruction execution to reduce the stalls while maintaining data flow and exception behavior. It also allows code that was compiled with one pipeline in mind to run efficiently on a different pipeline. Although a dynamically scheduled processor cannot change the data flow, it tries to avoid stalling when dependences, which could generate hazards, are present.

Dynamic Scheduling: The Idea

A major limitation of the simple pipelining techniques is that they all use in-order instruction issue and execution: instructions are issued in program order, and if an instruction is stalled in the pipeline, no later instructions can proceed. Thus, if there is a dependence between two closely spaced instructions in the pipeline, this will lead to a hazard and a stall. If there are multiple functional units, these units could lie idle. If instruction j depends on a long-running instruction i, currently in execution in the pipeline, then all instructions after j must be stalled until i is finished and j can execute. For example, consider this code:

    DIV.D F0,F2,F4
    ADD.D F10,F0,F8
    SUB.D F12,F8,F14

The SUB.D instruction cannot execute because the dependence of ADD.D on DIV.D causes the pipeline to stall; yet SUB.D is not data dependent on anything in the pipeline. This hazard creates a performance limitation that can be eliminated by not requiring instructions to execute in program order.

In the classic five-stage pipeline, both structural and data hazards could be checked during instruction decode (ID): when an instruction could execute without hazards, it was issued from ID, knowing that all data hazards had been resolved. To allow us to begin executing the SUB.D in the above example, we must separate the issue process into two parts: checking for any structural hazards and waiting for the absence of a data hazard. We can still check for structural hazards when we issue the instruction; thus, we still use in-order instruction issue (i.e., instructions issue in program order), but we want an instruction to begin execution as soon as its data operands are available. Thus, this pipeline does out-of-order execution, which implies out-of-order completion.

Out-of-order execution introduces the possibility of WAR and WAW hazards, which do not exist in the five-stage integer pipeline and its logical extension to an in-order floating-point pipeline. Out-of-order completion also creates major complications in handling exceptions. Dynamic scheduling with out-of-order completion must preserve exception behavior in the sense that exactly those exceptions that would arise if the program were executed in strict program order actually do arise. Imprecise exceptions can occur because of two possibilities:

1. The pipeline may have already completed instructions that are later in program order than the instruction causing the exception, and
2. The pipeline may have not yet completed some instructions that are earlier in program order than the instruction causing the exception.

To allow out-of-order execution, we essentially split the ID pipe stage of our simple five-stage pipeline into two stages:

1. Issue: Decode instructions, check for structural hazards.
2. Read operands: Wait until no data hazards, then read operands.

In a dynamically scheduled pipeline, all instructions pass through the issue stage in order (in-order issue); however, they can be stalled or bypass each other in the second stage (read operands) and thus enter execution out of order. Scoreboarding is a technique for allowing instructions to execute out of order when there are sufficient resources and no data dependences; it is named after the CDC 6600 scoreboard, which first implemented this capability. We focus on a more sophisticated technique, called Tomasulo's algorithm, that has several major enhancements over scoreboarding.

Dynamic Scheduling Using Tomasulo's Approach

This scheme was invented by Robert Tomasulo and was first used in the IBM 360/91. It uses register renaming to eliminate output dependences and antidependences, i.e., WAW and WAR hazards. Output dependences and antidependences are just name dependences; there is no actual data dependence. Tomasulo's algorithm implements register renaming through the use of reservation stations. Reservation stations are buffers that fetch and store instruction operands as soon as they are available. In addition, pending instructions designate the reservation station that will provide their input. Finally, when successive writes to a register overlap in execution, only the last one is actually used to update the register.

As instructions are issued, the register specifiers for pending operands are renamed to the names of the reservation stations, which provides register renaming. Since there can be more reservation stations than real registers, the technique can even eliminate hazards arising from name dependences that could not be eliminated by a compiler.

The use of reservation stations, rather than a centralized register file, leads to two other important properties. First, hazard detection and execution control are distributed: the information held in the reservation stations at each functional unit determines when an instruction can begin execution at that unit. Second, results are passed directly to functional units from the reservation stations where they are buffered, rather than going through the registers. This bypassing is done with a common result bus that allows all units waiting for an operand to be loaded simultaneously (on the 360/91 this is called the common data bus, or CDB). In pipelines with multiple execution units that issue multiple instructions per clock, more than one result bus will be needed.

Figure: the basic structure of a MIPS floating-point unit using Tomasulo's algorithm, including both the floating-point unit and the load/store unit.

Instructions are sent from the instruction unit into the instruction queue, from which they are issued in FIFO order. The reservation stations include the operation and the actual operands, as well as information used for detecting and resolving hazards. Load buffers have three functions: hold the components of the effective address until it is computed, track outstanding loads that are waiting on the memory, and hold the results of completed loads that are waiting for the CDB. Similarly, store buffers have three functions: hold the components of the effective address until it is computed, hold the destination memory addresses of outstanding stores that are waiting for the data value to store, and hold the address and value to store until the memory unit is available.

All results from either the FP units or the load unit are put on the CDB, which goes to the FP register file as well as to the reservation stations and store buffers. The FP adders implement addition and subtraction, and the FP multipliers do multiplication and division.

There are only three steps in Tomasulo's approach:

1. Issue: Get the next instruction from the head of the instruction queue. If there is a matching reservation station that is empty, issue the instruction to the station with the operand values (this renames the registers).

2. Execute (EX): When all the operands are available, execute the operation in the corresponding functional unit. If operands are not yet available, monitor the common data bus (CDB) while waiting for them to be computed.

3. Write result (WB): When the result is available, write it on the CDB and from there into the registers and into any reservation stations (including store buffers) waiting for this result. Stores also write data to memory during this step: when both the address and data value are available, they are sent to the memory unit and the store completes.

The tag field describes which reservation station contains the instruction that will produce a result needed as a source operand. Once an instruction has issued and is waiting for a source operand, it refers to the operand by the reservation station number where the instruction that will write the register has been assigned. Unused values, such as zero, indicate that the operand is already available in the registers. The register names are discarded when an instruction issues to a reservation station. Each reservation station has the following fields:

Op: The operation to perform on source operands S1 and S2.

Qj, Qk: The reservation stations that will produce the corresponding source operand; a value of zero indicates that the source operand is already available in Vj or Vk, or is unnecessary.

Vj, Vk: The values of the source operands. Note that only one of the V field or the Q field is valid for each operand. For loads, the Vk field is used to hold the offset from the instruction.

A: Used to hold information for the memory address calculation for a load or store.

Busy: Indicates that this reservation station and its accompanying functional unit are occupied.

The register file has a field, Qi:

Qi: The number of the reservation station that contains the operation whose result should be stored into this register. If the value of Qi is blank (or 0), no currently active instruction is computing a result destined for this register, meaning that the value is simply the register contents.

2.4 DYNAMIC SCHEDULING: EXAMPLES AND THE ALGORITHM

Tomasulo's scheme offers two major advantages over earlier and simpler schemes: (1) the distribution of the hazard detection logic, and (2) the elimination of stalls for WAW and WAR hazards.

The first advantage arises from the distributed reservation stations and the use of the CDB. If multiple instructions are waiting on a single result, and each instruction already has its other operand, then the instructions can be released simultaneously by the broadcast on the CDB. If a centralized register file were used, the units would have to read their results from the registers when register buses are available.

The second advantage, the elimination of WAW and WAR hazards, is accomplished by renaming registers using the reservation stations, and by the process of storing operands into the reservation station as soon as they are available.

Tomasulo's Algorithm: The Details

Loads and stores go through a functional unit for effective address computation before proceeding to independent load or store buffers.
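To make the bookkeeping concrete, here is a minimal C sketch of the reservation-station and register-status fields just described, together with the CDB broadcast that releases waiting instructions. The types and sizes are illustrative assumptions, not a description of the 360/91 hardware.

    #include <stdbool.h>
    #include <stdint.h>

    /* One reservation station, with the fields described above.
     * A Q value of 0 means the operand is already available in V. */
    typedef struct {
        uint8_t  op;       /* Op: operation to perform on S1 and S2        */
        uint16_t qj, qk;   /* Qj, Qk: stations producing the source values */
        uint64_t vj, vk;   /* Vj, Vk: the source operand values            */
        int32_t  a;        /* A: address information for a load or store   */
        bool     busy;     /* Busy: station and functional unit occupied   */
    } ReservationStation;

    /* Register status: one Qi field per architectural register. */
    typedef struct {
        uint16_t qi;       /* 0 => value is simply the register contents   */
    } RegStatus;

    /* CDB broadcast (Write Result): every waiting station that named
     * this producer captures the value and clears its Q field, so all
     * waiting instructions can be released simultaneously. */
    void cdb_broadcast(ReservationStation rs[], int n,
                       uint16_t producer, uint64_t result)
    {
        for (int i = 0; i < n; i++) {
            if (!rs[i].busy) continue;
            if (rs[i].qj == producer) { rs[i].vj = result; rs[i].qj = 0; }
            if (rs[i].qk == producer) { rs[i].vk = result; rs[i].qk = 0; }
        }
    }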

Loads take a second execution step to access memory and then go to Write Result to send the value from memory to the register file and/or any waiting reservation stations. Stores complete their execution in the Write Result stage, which writes the result to memory.

Tomasulo's Algorithm: A Loop-Based Example

To understand the full power of eliminating WAW and WAR hazards through dynamic renaming of registers, we must look at a loop. Consider the following simple sequence for multiplying the elements of an array by a scalar in F2:

    Loop: L.D    F0,0(R1)
          MUL.D  F4,F0,F2
          S.D    F4,0(R1)
          DADDUI R1,R1,#-8
          BNE    R1,R2,Loop   ; branch if R1 != R2

A load and a store can safely be done in a different order, provided they access different addresses. If a load and a store access the same address, then either:

the load is before the store in program order, and interchanging them results in a WAR hazard, or
the store is before the load in program order, and interchanging them results in a RAW hazard.

Similarly, interchanging two stores to the same address results in a WAW hazard. Hence, to determine if a load can be executed at a given time, the processor can check whether any uncompleted store that precedes the load in program order shares the same data memory address as the load. Similarly, a store must wait until there are no unexecuted loads or stores that are earlier in program order and share the same data memory address. A sketch of this check follows; the detailed discussion continues after it.
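The following C fragment is a rough sketch of the load-side disambiguation check, assuming hypothetical store-buffer entries carrying the A (address) field described earlier; real implementations differ in how and when addresses become known.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical store-buffer entry: illustrative fields only. */
    typedef struct {
        bool     busy;        /* entry holds an uncompleted store          */
        bool     addr_known;  /* effective address (A field) computed yet? */
        uint64_t a;           /* A: the store's data memory address        */
    } StoreBufEntry;

    /* A load may proceed only if no earlier, still-active store could
     * alias it. Conservatively, a store whose address is not yet known
     * must also block the load. */
    bool load_may_proceed(uint64_t load_addr,
                          const StoreBufEntry sb[], int n_earlier)
    {
        for (int i = 0; i < n_earlier; i++) {    /* earlier in program order */
            if (!sb[i].busy) continue;
            if (!sb[i].addr_known) return false;  /* conflict still possible */
            if (sb[i].a == load_addr) return false; /* same address: wait    */
        }
        return true;
    }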

Let's consider the situation of a load first. If we perform effective address calculation in program order, then when a load has completed effective address calculation, we can check whether there is an address conflict by examining the A field of all active store buffers. If the load address matches the address of any active entry in the store buffer, the load instruction is not sent to the load buffer until the conflicting store completes. Stores operate similarly, except that the processor must check for conflicts in both the load buffers and the store buffers, since conflicting stores cannot be reordered with respect to either a load or a store.

In Tomasulo's scheme two different techniques are combined: the renaming of the architectural registers to a larger set of registers, and the buffering of source operands from the register file. Source operand buffering resolves WAR hazards that arise when the operand is available in the registers. Tomasulo's scheme is particularly appealing if the designer is forced to pipeline an architecture for which it is difficult to schedule code, that has a shortage of registers, or for which the designer wishes to obtain high performance without pipeline-specific compilation. On the other hand, the advantages of the Tomasulo approach versus compiler scheduling for an efficient single-issue pipeline are probably fewer than the costs of implementation.

The key components for enhancing ILP in Tomasulo's algorithm are dynamic scheduling, register renaming, and dynamic memory disambiguation. The branch-prediction techniques discussed next are used for two purposes: to predict whether a branch will be taken and to find the target more quickly.

2.5 REDUCING BRANCH COSTS WITH DYNAMIC HARDWARE PREDICTION

As the amount of ILP we attempt to exploit grows, control dependences rapidly become the limiting factor. Although the schemes in this section are helpful in processors that try to maintain one instruction issue per clock, they are crucial, for two reasons, to any processor that tries to issue more than one instruction per clock.

First, branches will arrive up to n times faster in an n-issue processor, and providing an instruction stream to the processor will probably require that we predict the outcome of branches. Second, Amdahl's Law reminds us that the relative impact of the control stalls will be larger with the lower potential CPI in such machines. The goal of all these mechanisms is to allow the processor to resolve the outcome of a branch early, thus preventing control dependences from causing stalls. The effectiveness of a branch prediction scheme depends not only on its accuracy, but also on the cost of a branch when the prediction is correct and when it is incorrect.

Basic Branch Prediction and Branch-Prediction Buffers

The simplest dynamic branch-prediction scheme is a branch-prediction buffer or branch history table. A branch-prediction buffer is a small memory indexed by the lower portion of the address of the branch instruction. The memory contains a bit that says whether the branch was recently taken or not. We do not know whether the prediction is correct; it may have been put there by another branch that has the same low-order address bits. The prediction is a hint that is assumed to be correct, and fetching begins in the predicted direction. If the hint turns out to be wrong, the prediction bit is inverted and stored back. The performance of the buffer depends on both how often the prediction is for the branch of interest and how accurate the prediction is when it matches.

This simple one-bit prediction scheme has a performance shortcoming: even if a branch is almost always taken, we will likely predict incorrectly twice, rather than once, when it is not taken. To remedy this, two-bit prediction schemes are often used. In a two-bit scheme, a prediction must miss twice before it is changed. Figure shows the finite-state diagram for a two-bit prediction scheme.

Figure: the states in a two-bit prediction scheme. The two bits are used to encode the four states in the system. In a counter implementation, the counters are incremented when a branch is taken and decremented when it is not taken; the counters saturate at 00 or 11.

One complication of the two-bit scheme is that it updates the prediction bits more often than a one-bit predictor, which only updates the prediction bit on a mispredict. Since we typically read the prediction bits on every cycle, a two-bit predictor will typically need both a read and a write access port.

The two-bit scheme is actually a specialization of a more general scheme that has an n-bit saturating counter for each entry in the prediction buffer. With an n-bit counter, the counter can take on values between 0 and 2^n - 1; when the counter is greater than or equal to one-half of its maximum value (2^(n-1)), the branch is predicted as taken; otherwise, it is predicted untaken. As in the two-bit scheme, the counter is incremented on a taken branch and decremented on an untaken branch. Studies of n-bit predictors have shown that two-bit predictors do almost as well, and thus most systems rely on two-bit branch predictors rather than the more general n-bit predictors.
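A minimal C sketch of the two-bit saturating counter just described (counter values 0 to 3, predicting taken in the upper half); the table size and the indexing by low-order PC bits are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define BHT_ENTRIES 4096                 /* illustrative table size */
    static uint8_t bht[BHT_ENTRIES];         /* 2-bit counters, values 0..3 */

    /* Index by the low-order bits of the (word-aligned) branch address. */
    static unsigned bht_index(uint64_t pc) { return (pc >> 2) % BHT_ENTRIES; }

    /* Predict taken when the counter is in the upper half (2 or 3). */
    bool predict_taken(uint64_t pc)
    {
        return bht[bht_index(pc)] >= 2;
    }

    /* Saturating update: increment on taken, decrement on not taken.
     * A prediction must miss twice before the predicted direction flips. */
    void train(uint64_t pc, bool taken)
    {
        uint8_t *ctr = &bht[bht_index(pc)];
        if (taken) { if (*ctr < 3) (*ctr)++; }
        else       { if (*ctr > 0) (*ctr)--; }
    }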

If the instruction is decoded as a branch and if the branch is predicted as taken, fetching begins from the target as soon as the PC is known. Otherwise, sequential fetching and executing continue. If the prediction turns out to be wrong, the prediction bits are changed as shown in the figure. Although this scheme is useful for most pipelines, the classic five-stage pipeline finds out both whether the branch is taken and what the target of the branch is at roughly the same time, assuming no hazard in accessing the register specified in the conditional branch. To exploit more ILP, the accuracy of our branch prediction becomes critical. We can attack this problem in two ways: by increasing the size of the buffer and by increasing the accuracy of the scheme we use for each prediction.

Correlating Branch Predictors

These two-bit predictor schemes use only the recent behavior of a single branch to predict the future behavior of that branch. It may be possible to improve the prediction accuracy if we also look at the recent behavior of other branches, rather than just the branch we are trying to predict. Consider a small code fragment from the SPEC92 benchmarks:

    if (aa==2) aa=0;
    if (bb==2) bb=0;
    if (aa!=bb) {

Here is the MIPS code that we would typically generate for this code fragment, assuming that aa and bb are assigned to registers R1 and R2:

        DSUBUI R3,R1,#2
        BNEZ   R3,L1       ; branch b1 (aa!=2)
        DADD   R1,R0,R0    ; aa=0
    L1: DSUBUI R3,R2,#2

        BNEZ   R3,L2       ; branch b2 (bb!=2)
        DADD   R2,R0,R0    ; bb=0
    L2: DSUBU  R3,R1,R2    ; R3=aa-bb
        BEQZ   R3,L3       ; branch b3 (aa==bb)

Let's label these branches b1, b2, and b3. The key observation is that the behavior of branch b3 is correlated with the behavior of branches b1 and b2. Clearly, if branches b1 and b2 are both not taken (i.e., the if conditions both evaluate to true and aa and bb are both assigned 0), then b3 will be taken, since aa and bb are clearly equal. A predictor that uses only the behavior of a single branch to predict the outcome of that branch can never capture this behavior. Branch predictors that use the behavior of other branches to make a prediction are called correlating predictors or two-level predictors.

Tournament Predictors: Adaptively Combining Local and Global Predictors

The primary motivation for correlating branch predictors came from the observation that the standard 2-bit predictor, using only local information, failed on some important branches, and that by adding global information, the performance could be improved. Tournament predictors take this insight to the next level by using multiple predictors, usually one based on global information and one based on local information, and combining them with a selector. Tournament predictors can achieve both better accuracy at medium sizes (8 Kb to 32 Kb) and effective use of very large numbers of prediction bits. Tournament predictors are the most popular form of multilevel branch predictors. A multilevel branch predictor uses several levels of branch-prediction tables together with an algorithm for choosing among the multiple predictors. Existing tournament predictors use a 2-bit saturating counter per branch to choose among two different predictors; the four states of the counter dictate whether to use predictor 1 or predictor 2. The state transition diagram is shown in the figure.
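Here is a hedged C sketch of the per-branch tournament selection just described, reusing a 2-bit saturating counter (as in the earlier sketch) as the chooser. The two component predictors are left abstract, and all names and sizes are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define CHOOSER_ENTRIES 4096
    static uint8_t chooser[CHOOSER_ENTRIES];  /* 2-bit counters: 0..1 favor
                                                 predictor 1, 2..3 predictor 2 */

    /* Abstract component predictors (e.g., a local 2-bit predictor and
     * a global-history predictor); definitions are assumed elsewhere. */
    extern bool local_predict(uint64_t pc);
    extern bool global_predict(uint64_t pc);

    bool tournament_predict(uint64_t pc)
    {
        unsigned i = (pc >> 2) % CHOOSER_ENTRIES;
        return (chooser[i] >= 2) ? global_predict(pc) : local_predict(pc);
    }

    /* Train the chooser only when the two predictors disagree: move it
     * toward whichever one was right for this branch. */
    void tournament_train(uint64_t pc, bool taken)
    {
        unsigned i = (pc >> 2) % CHOOSER_ENTRIES;
        bool p1 = local_predict(pc), p2 = global_predict(pc);
        if (p1 != p2) {
            if (p2 == taken) { if (chooser[i] < 3) chooser[i]++; }
            else             { if (chooser[i] > 0) chooser[i]--; }
        }
        /* (each component predictor would also be trained here) */
    }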

Figure: the state transition diagram for a tournament predictor has four states, corresponding to which predictor to use.

The advantage of a tournament predictor is its ability to select the right predictor for the right branch. Figure shows how the tournament predictor selects between a local and a global predictor depending on the benchmark, as well as on the branch. The ability to choose between a prediction based on strictly local information and one incorporating global information, on a per-branch basis, is particularly critical in the integer benchmarks.

Figure: the fraction of predictions coming from the local predictor for a tournament predictor using the SPEC89 benchmarks.

Figure looks at the performance of three different predictors (a local 2-bit predictor, a correlating predictor, and a tournament predictor) for different numbers of bits, using SPEC89 as the benchmark. As we saw earlier, the prediction capability of the local predictor does not improve beyond a certain size. The correlating predictor shows a significant improvement, and the tournament predictor generates slightly better performance still.

Figure: the misprediction rate for three different predictors on SPEC89 as the total number of bits is increased.

2.6 HIGH PERFORMANCE INSTRUCTION DELIVERY

Branch Target Buffers

A branch-prediction cache that stores the predicted address for the next instruction after a branch is called a branch-target buffer or branch-target cache. For the classic five-stage pipeline, a branch-prediction buffer is accessed during the ID cycle, so that at the end of ID we know the branch-target address (since it is computed during ID), the fall-through address (computed during IF), and the prediction. Thus, by the end of ID we know enough to fetch the next predicted instruction.

For a branch-target buffer, we access the buffer during the IF stage, using the instruction address of the fetched instruction, a possible branch, to index the buffer. If we get a hit, then we know the predicted instruction address at the end of the IF cycle, which is one cycle earlier than for a branch-prediction buffer. Because we are predicting the next instruction address and will send it out before decoding the instruction, we must know whether the fetched instruction is predicted as a taken branch. Figure shows what the branch-target buffer looks like: if the PC of the fetched instruction matches a PC in the buffer, then the corresponding predicted PC is used as the next PC.

If a matching entry is found in the branch-target buffer, fetching begins immediately at the predicted PC. Note that the entry must be for this instruction, because the predicted PC will be sent out before it is known whether this instruction is even a branch. If we did not check whether the entry matched this PC, then the wrong PC would be sent out for instructions that were not branches, resulting in a slower processor. We only need to store the predicted-taken branches in the branch-target buffer, since an untaken branch follows the same strategy as a non-branch.
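A minimal C sketch of the IF-stage lookup just described; the direct-mapped organization, table size, and field names are assumptions chosen for illustration.

    #include <stdbool.h>
    #include <stdint.h>

    #define BTB_ENTRIES 1024                  /* illustrative size */

    typedef struct {
        bool     valid;
        uint64_t branch_pc;     /* full PC of a predicted-taken branch */
        uint64_t predicted_pc;  /* target to fetch from next           */
    } BTBEntry;

    static BTBEntry btb[BTB_ENTRIES];

    /* During IF: look up the fetched instruction's own PC. The full-PC
     * match matters: without it we would redirect fetch for instructions
     * that are not branches at all. Returns the next PC to fetch. */
    uint64_t next_fetch_pc(uint64_t pc)
    {
        BTBEntry *e = &btb[(pc >> 2) % BTB_ENTRIES];
        if (e->valid && e->branch_pc == pc)
            return e->predicted_pc;   /* hit: predicted-taken branch     */
        return pc + 4;                /* miss: fall through sequentially */
    }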

Figure shows the steps followed when using a branch-target buffer and where these steps occur in the pipeline. From this we can see that there will be no branch delay if a branch-prediction entry is found in the buffer and is correct.

Figure: the steps involved in handling an instruction with a branch-target buffer.


More information

Introduction to Computer Architecture II

Introduction to Computer Architecture II Introduction to Computer Architecture II ECE 154B Dmitri Strukov Computer systems overview 1 Outline Course information Trends Computing classes Quantitative Principles of Design Dependability 2 Course

More information

EITF20: Computer Architecture Part2.1.1: Instruction Set Architecture

EITF20: Computer Architecture Part2.1.1: Instruction Set Architecture EITF20: Computer Architecture Part2.1.1: Instruction Set Architecture Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Instruction Set Principles The Role of Compilers MIPS 2 Main Content Computer

More information

Computer Organization and Assembly Language

Computer Organization and Assembly Language Computer Organization and Assembly Language Week 01 Nouman M Durrani COMPUTER ORGANISATION AND ARCHITECTURE Computer Organization describes the function and design of the various units of digital computers

More information

CPU Architecture and Instruction Sets Chapter 1

CPU Architecture and Instruction Sets Chapter 1 CPU Architecture and Instruction Sets Chapter 1 1 Is CPU Architecture Relevant for DBMS? CPU design focuses on speed resulting in a 55%/year improvement since 1987: If CPU performance in database code

More information

Von Neumann architecture. The first computers used a single fixed program (like a numeric calculator).

Von Neumann architecture. The first computers used a single fixed program (like a numeric calculator). Microprocessors Von Neumann architecture The first computers used a single fixed program (like a numeric calculator). To change the program, one has to re-wire, re-structure, or re-design the computer.

More information

CO Computer Architecture and Programming Languages CAPL. Lecture 15

CO Computer Architecture and Programming Languages CAPL. Lecture 15 CO20-320241 Computer Architecture and Programming Languages CAPL Lecture 15 Dr. Kinga Lipskoch Fall 2017 How to Compute a Binary Float Decimal fraction: 8.703125 Integral part: 8 1000 Fraction part: 0.703125

More information

Response Time and Throughput

Response Time and Throughput Response Time and Throughput Response time How long it takes to do a task Throughput Total work done per unit time e.g., tasks/transactions/ per hour How are response time and throughput affected by Replacing

More information

Chapter 14 Performance and Processor Design

Chapter 14 Performance and Processor Design Chapter 14 Performance and Processor Design Outline 14.1 Introduction 14.2 Important Trends Affecting Performance Issues 14.3 Why Performance Monitoring and Evaluation are Needed 14.4 Performance Measures

More information

Fundamentals of Computers Design

Fundamentals of Computers Design Computer Architecture J. Daniel Garcia Computer Architecture Group. Universidad Carlos III de Madrid Last update: September 8, 2014 Computer Architecture ARCOS Group. 1/45 Introduction 1 Introduction 2

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Fundamentals of Computer Design

Fundamentals of Computer Design CS359: Computer Architecture Fundamentals of Computer Design Yanyan Shen Department of Computer Science and Engineering 1 Defining Computer Architecture Agenda Introduction Classes of Computers 1.3 Defining

More information

LECTURE 1. Introduction

LECTURE 1. Introduction LECTURE 1 Introduction CLASSES OF COMPUTERS When we think of a computer, most of us might first think of our laptop or maybe one of the desktop machines frequently used in the Majors Lab. Computers, however,

More information

Unit 9 : Fundamentals of Parallel Processing

Unit 9 : Fundamentals of Parallel Processing Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing

More information

TDT 4260 lecture 2 spring semester 2015

TDT 4260 lecture 2 spring semester 2015 1 TDT 4260 lecture 2 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Chapter 1: Fundamentals of Quantitative Design and Analysis, continued

More information

CSE 502 Graduate Computer Architecture

CSE 502 Graduate Computer Architecture Computer Architecture A Quantitative Approach, Fifth Edition CAQA5 Chapter 1 CSE 502 Graduate Computer Architecture Lec 1-3 - Introduction Fundamentals of Quantitative Design and Analysis Larry Wittie

More information

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example 1 Which is the best? 2 Lecture 05 Performance Metrics and Benchmarking 3 Measuring & Improving Performance (if planes were computers...) Plane People Range (miles) Speed (mph) Avg. Cost (millions) Passenger*Miles

More information

CS3350B Computer Architecture CPU Performance and Profiling

CS3350B Computer Architecture CPU Performance and Profiling CS3350B Computer Architecture CPU Performance and Profiling Marc Moreno Maza http://www.csd.uwo.ca/~moreno/cs3350_moreno/index.html Department of Computer Science University of Western Ontario, Canada

More information

Performance of computer systems

Performance of computer systems Performance of computer systems Many different factors among which: Technology Raw speed of the circuits (clock, switching time) Process technology (how many transistors on a chip) Organization What type

More information

CpE 442 Introduction to Computer Architecture. The Role of Performance

CpE 442 Introduction to Computer Architecture. The Role of Performance CpE 442 Introduction to Computer Architecture The Role of Performance Instructor: H. H. Ammar CpE442 Lec2.1 Overview of Today s Lecture: The Role of Performance Review from Last Lecture Definition and

More information

Computer Performance. Reread Chapter Quiz on Friday. Study Session Wed Night FB 009, 5pm-6:30pm

Computer Performance. Reread Chapter Quiz on Friday. Study Session Wed Night FB 009, 5pm-6:30pm Computer Performance He said, to speed things up we need to squeeze the clock Reread Chapter 1.4-1.9 Quiz on Friday. Study Session Wed Night FB 009, 5pm-6:30pm L15 Computer Performance 1 Why Study Performance?

More information

Chapter 7 The Potential of Special-Purpose Hardware

Chapter 7 The Potential of Special-Purpose Hardware Chapter 7 The Potential of Special-Purpose Hardware The preceding chapters have described various implementation methods and performance data for TIGRE. This chapter uses those data points to propose architecture

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

In examining performance Interested in several things Exact times if computable Bounded times if exact not computable Can be measured

In examining performance Interested in several things Exact times if computable Bounded times if exact not computable Can be measured System Performance Analysis Introduction Performance Means many things to many people Important in any design Critical in real time systems 1 ns can mean the difference between system Doing job expected

More information

Performance measurement. SMD149 - Operating Systems - Performance and processor design. Introduction. Important trends affecting performance issues

Performance measurement. SMD149 - Operating Systems - Performance and processor design. Introduction. Important trends affecting performance issues Performance measurement SMD149 - Operating Systems - Performance and processor design Roland Parviainen November 28, 2005 Performance measurement Motivation Techniques Common metrics Processor architectural

More information

CSCI 402: Computer Architectures. Computer Abstractions and Technology (4) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Computer Abstractions and Technology (4) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Computer Abstractions and Technology (4) Fengguang Song Department of Computer & Information Science IUPUI Contents 1.7 - End of Chapter 1 Power wall The multicore era

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 1. Computer Abstractions and Technology

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 1. Computer Abstractions and Technology COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 1 Computer Abstractions and Technology The Computer Revolution Progress in computer technology Underpinned by Moore

More information

CHAPTER 5 A Closer Look at Instruction Set Architectures

CHAPTER 5 A Closer Look at Instruction Set Architectures CHAPTER 5 A Closer Look at Instruction Set Architectures 5.1 Introduction 199 5.2 Instruction Formats 199 5.2.1 Design Decisions for Instruction Sets 200 5.2.2 Little versus Big Endian 201 5.2.3 Internal

More information

1.3 Data processing; data storage; data movement; and control.

1.3 Data processing; data storage; data movement; and control. CHAPTER 1 OVERVIEW ANSWERS TO QUESTIONS 1.1 Computer architecture refers to those attributes of a system visible to a programmer or, put another way, those attributes that have a direct impact on the logical

More information

Chapter 1. Introduction To Computer Systems

Chapter 1. Introduction To Computer Systems Chapter 1 Introduction To Computer Systems 1.1 Historical Background The first program-controlled computer ever built was the Z1 (1938). This was followed in 1939 by the Z2 as the first operational program-controlled

More information

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah PERFORMANCE METRICS Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Sept. 5 th : Homework 1 release (due on Sept.

More information

New Advances in Micro-Processors and computer architectures

New Advances in Micro-Processors and computer architectures New Advances in Micro-Processors and computer architectures Prof. (Dr.) K.R. Chowdhary, Director SETG Email: kr.chowdhary@jietjodhpur.com Jodhpur Institute of Engineering and Technology, SETG August 27,

More information

Review: latency vs. throughput

Review: latency vs. throughput Lecture : Performance measurement and Instruction Set Architectures Last Time Introduction to performance Computer benchmarks Amdahl s law Today Take QUIZ 1 today over Chapter 1 Turn in your homework on

More information

Lecture 2: Performance

Lecture 2: Performance Lecture 2: Performance Today s topics: Technology wrap-up Performance trends and equations Reminders: YouTube videos, canvas, and class webpage: http://www.cs.utah.edu/~rajeev/cs3810/ 1 Important Trends

More information

Chapter 1. The Computer Revolution

Chapter 1. The Computer Revolution Chapter 1 Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu The Computer Revolution Progress in computer technology Underpinned by Moore s Law Makes novel applications feasible Computers

More information

Designing for Performance. Patrick Happ Raul Feitosa

Designing for Performance. Patrick Happ Raul Feitosa Designing for Performance Patrick Happ Raul Feitosa Objective In this section we examine the most common approach to assessing processor and computer system performance W. Stallings Designing for Performance

More information

Processing Unit CS206T

Processing Unit CS206T Processing Unit CS206T Microprocessors The density of elements on processor chips continued to rise More and more elements were placed on each chip so that fewer and fewer chips were needed to construct

More information

Instruction Set Architecture

Instruction Set Architecture Instruction Set Architecture Instructor: Preetam Ghosh Preetam.ghosh@usm.edu CSC 626/726 Preetam Ghosh Language HLL : High Level Language Program written by Programming language like C, C++, Java. Sentence

More information

Lecture 4: RISC Computers

Lecture 4: RISC Computers Lecture 4: RISC Computers Introduction Program execution features RISC characteristics RISC vs. CICS Zebo Peng, IDA, LiTH 1 Introduction Reduced Instruction Set Computer (RISC) represents an important

More information

Team 1. Common Questions to all Teams. Team 2. Team 3. CO200-Computer Organization and Architecture - Assignment One

Team 1. Common Questions to all Teams. Team 2. Team 3. CO200-Computer Organization and Architecture - Assignment One CO200-Computer Organization and Architecture - Assignment One Note: A team may contain not more than 2 members. Format the assignment solutions in a L A TEX document. E-mail the assignment solutions PDF

More information

EE282 Computer Architecture. Lecture 1: What is Computer Architecture?

EE282 Computer Architecture. Lecture 1: What is Computer Architecture? EE282 Computer Architecture Lecture : What is Computer Architecture? September 27, 200 Marc Tremblay Computer Systems Laboratory Stanford University marctrem@csl.stanford.edu Goals Understand how computer

More information

Engineering 9859 CoE Fundamentals Computer Architecture

Engineering 9859 CoE Fundamentals Computer Architecture Engineering 9859 CoE Fundamentals Computer Architecture Introduction Dennis Peters 1 Fall 2007 1 Based on notes from Dr. R. Venkatesan Course Details Classes Monday, Wednesday, Friday 9 10 EN-4033 Course

More information

From CISC to RISC. CISC Creates the Anti CISC Revolution. RISC "Philosophy" CISC Limitations

From CISC to RISC. CISC Creates the Anti CISC Revolution. RISC Philosophy CISC Limitations 1 CISC Creates the Anti CISC Revolution Digital Equipment Company (DEC) introduces VAX (1977) Commercially successful 32-bit CISC minicomputer From CISC to RISC In 1970s and 1980s CISC minicomputers became

More information

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms The levels of a memory hierarchy CPU registers C A C H E Memory bus Main Memory I/O bus External memory 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms 1 1 Some useful definitions When the CPU finds a requested

More information

CHAPTER 5 A Closer Look at Instruction Set Architectures

CHAPTER 5 A Closer Look at Instruction Set Architectures CHAPTER 5 A Closer Look at Instruction Set Architectures 5.1 Introduction 293 5.2 Instruction Formats 293 5.2.1 Design Decisions for Instruction Sets 294 5.2.2 Little versus Big Endian 295 5.2.3 Internal

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Figure 1-1. A multilevel machine.

Figure 1-1. A multilevel machine. 1 INTRODUCTION 1 Level n Level 3 Level 2 Level 1 Virtual machine Mn, with machine language Ln Virtual machine M3, with machine language L3 Virtual machine M2, with machine language L2 Virtual machine M1,

More information

Cycles Per Instruction For This Microprocessor

Cycles Per Instruction For This Microprocessor What Is The Average Number Of Machine Cycles Per Instruction For This Microprocessor Wikipedia's Instructions per second page says that an i7 3630QM deliver ~110,000 It does reduce the number of "wasted"

More information

Computer Organization. 8 th Edition. Chapter 2 p Computer Evolution and Performance

Computer Organization. 8 th Edition. Chapter 2 p Computer Evolution and Performance William Stallings Computer Organization and Architecture 8 th Edition Chapter 2 p Computer Evolution and Performance ENIAC - background Electronic Numerical Integrator And Computer Eckert and Mauchly University

More information

We briefly explain an instruction cycle now, before proceeding with the details of addressing modes.

We briefly explain an instruction cycle now, before proceeding with the details of addressing modes. Addressing Modes This is an important feature of computers. We start with the known fact that many instructions have to include addresses; the instructions should be short, but addresses tend to be long.

More information

Review: Performance Latency vs. Throughput. Time (seconds/program) is performance measure Instructions Clock cycles Seconds.

Review: Performance Latency vs. Throughput. Time (seconds/program) is performance measure Instructions Clock cycles Seconds. Performance 980 98 982 983 984 985 986 987 988 989 990 99 992 993 994 995 996 997 998 999 2000 7/4/20 CS 6C: Great Ideas in Computer Architecture (Machine Structures) Caches Instructor: Michael Greenbaum

More information

CSE : Introduction to Computer Architecture

CSE : Introduction to Computer Architecture Computer Architecture 9/21/2005 CSE 675.02: Introduction to Computer Architecture Instructor: Roger Crawfis (based on slides from Gojko Babic A modern meaning of the term computer architecture covers three

More information

Parallel Computer Architecture

Parallel Computer Architecture Parallel Computer Architecture What is Parallel Architecture? A parallel computer is a collection of processing elements that cooperate to solve large problems fast Some broad issues: Resource Allocation:»

More information

,e-pg PATHSHALA- Computer Science Computer Architecture Module 25 Memory Hierarchy Design - Basics

,e-pg PATHSHALA- Computer Science Computer Architecture Module 25 Memory Hierarchy Design - Basics ,e-pg PATHSHALA- Computer Science Computer Architecture Module 25 Memory Hierarchy Design - Basics The objectives of this module are to discuss about the need for a hierarchical memory system and also

More information