Hyperthreading 3/25/2008

ftp://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01.pdf

Hyperthreading is a design that makes everyone concerned believe they are using a dual-processor system, except for the licensing software, which recognizes it as a single processor. How? It starts with the P4 engine, called the NetBurst architecture, and then adds the hardware to provide two processor environments on the one chip.

[Figure: panels labeled Hyperthreading, SMP, Single-Threaded, and Simultaneously Multithreaded Processor]

[Figures: The Pentium III pipeline; The Pentium 4 pipeline. http://www.anandtech.com/cpuchipsets/showdoc.aspx?]

The NetBurst architecture starts simply as a pipeline of twenty or more stages, grouped into three major functions: a fetch and deliver engine, an execution engine, and a reorder and retire block.

Front End

The fetch and deliver engine occupies roughly the first five stages of the pipeline (figure 5). The Pentium 4 has an L1 instruction cache, called the Execution Trace Cache, which keeps a decomposed version of recent instructions: each instruction has already been converted into one or more micro-operations (μops) that the hardware can process easily. The next instruction to be fetched is first looked for in the Trace Cache. If it is found, its μops are entered into the μop Queue and the fetch is finished. If it is not found, it is requested from the L2 cache and, failing that, from further down the memory hierarchy.

The bytes retrieved from the L2 cache are queued, decoded, and then queued again, which in effect decodes a number of instructions concurrently. I think the instructions are actually decoded one after another along the queue, but the effect is about the same. The results are put into the μop Queue, ready for scheduling.

Remember that prefetching is an integral part of the process. Coordinated by the branch prediction logic, instructions are pumped into the pipeline continually, and the operands identified by the decode logic are set up and, if necessary, fetched from the memory hierarchy. In short, on a Trace Cache hit the μop Queue is updated immediately; on a miss, the next instructions are brought in from the memory hierarchy and pass through the queued decode process that turns them into μops.
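The hit and miss paths described above can be sketched as a toy model. This is only an illustration under stated assumptions: the addresses, the instruction strings, the `decode` rule and the μop names are all made up for the example and do not reflect how the Pentium 4 actually encodes anything.

```python
from collections import deque

# Hypothetical toy data: addresses map to instruction strings, and
# decode() splits an instruction into made-up micro-op names.
L2_CACHE = {
    0x00: "add [mem], eax",   # read-modify-write: several uops
    0x04: "mov ebx, ecx",     # simple: one uop
}

def decode(instruction):
    """Decompose one instruction into a list of uops (illustrative only)."""
    if "[mem]" in instruction:
        return ["load", "alu", "store"]
    return ["alu"]

def fetch(pc, trace_cache, uop_queue):
    """Deliver the uops for `pc`; return True on a trace cache hit."""
    hit = pc in trace_cache
    if not hit:
        # Miss: fetch from L2, decode, and remember the decoded form.
        trace_cache[pc] = decode(L2_CACHE[pc])
    uop_queue.extend(trace_cache[pc])
    return hit

trace_cache, uop_queue = {}, deque()
hits = [fetch(pc, trace_cache, uop_queue) for pc in (0x00, 0x04, 0x00)]
```

The third fetch revisits address `0x00`, so it is served from the trace cache without going through `decode` again, which is the point of caching the decoded form.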

The Rapid Execution Engine

Now the execution engine; use figure 6 as a reference. The allocate and register rename logic sets up the operands for buffer usage and charts the relationships among the data to permit out-of-order execution. The scheduler is actually five or so schedulers, each picking off μops that are ready to go (that is, whose operands are ready) from the ready queues. There are two ready queues, one for memory operations and one for general instructions, and each scheduler also has a small queue of eight to twelve entries from which it selects operations for the execution units. The ready queues send μops to the scheduler queues as fast as they can. The schedulers choose on the basis of instruction readiness, then functional unit readiness, and lastly first-in/first-out order, so very significant out-of-order execution can and does occur. After execution the μops are placed in the re-order buffer.

The Back End

The re-order and retire block puts everything back into the right order, as if all instructions had been executed in the sequential order of the program. One important thing to consider in this section is the store buffer and the L1 cache: these are generally not visible to other processors, but they are fully visible to both logical processors of a Pentium 4 with HT enabled.
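The scheduler selection order described in the execution engine section (operands ready, functional unit free, then first-in/first-out) can be sketched as follows. The dictionary layout of a μop and the names `pick_next`, `ready_regs` and `free_units` are assumptions made for this example, not anything taken from Intel's documentation.

```python
from collections import deque

def pick_next(queue, ready_regs, free_units):
    """Return (and remove) the oldest uop that can issue this cycle, else None.

    queue      : deque of dicts in arrival (FIFO) order
    ready_regs : set of register names whose values are available
    free_units : set of functional-unit names that are idle
    """
    for uop in list(queue):                 # oldest first gives the FIFO tie-break
        operands_ready = all(src in ready_regs for src in uop["srcs"])
        unit_free = uop["unit"] in free_units
        if operands_ready and unit_free:
            queue.remove(uop)
            return uop
    return None

queue = deque([
    {"name": "a", "srcs": ["r1"], "unit": "alu"},   # older, but r1 is not ready
    {"name": "b", "srcs": ["r2"], "unit": "alu"},   # younger, operands ready
])
issued = pick_next(queue, ready_regs={"r2"}, free_units={"alu"})
```

Here the younger μop `b` issues ahead of `a`, which is the out-of-order behavior the text describes; `a` stays in the queue until `r1` becomes available.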

Hyper-threading

Intel's hyperthreading is an attempt to get two logical processors working on one processor chip, and in this it has largely succeeded. Each logical processor has its own architectural state maintained inside the chip, and the functional units of the old processor are now shared by both. The extra logic required for each logical processor is minimal, consisting mainly of its own program counter and its own integer and floating-point register files. After that it is just a matter of keeping track of which process each instruction belongs to, plus the standard pipelining issues of dependences and data flow. Here is a simplified picture (see also figure 4 of reference 5, "A Detailed Look..."):

[Diagram: process 0 and process 1 each keep their own architectural state (PC & registers) and feed a shared fetch queue. Instructions that are ready pass to the scheduler, which issues them to the shared functional units (adder, addr, loader, loader, FP), and results go on to the retirement section.]
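To make the split between private state and shared hardware concrete, here is a minimal sketch. It assumes a hypothetical `ArchState` record holding only a program counter and a register file, and a single shared adder; the `lp` tag on each μop stands in for "which process the instruction belongs to".

```python
from dataclasses import dataclass, field

@dataclass
class ArchState:
    """Per-logical-processor state: a program counter and a register file."""
    pc: int = 0
    regs: dict = field(default_factory=dict)

def execute_add(states, uop):
    """Shared adder: runs a uop for whichever logical processor owns it."""
    state = states[uop["lp"]]          # the tag selects the architectural state
    a, b = (state.regs[r] for r in uop["srcs"])
    state.regs[uop["dst"]] = a + b
    state.pc += 1

states = [
    ArchState(regs={"r1": 1, "r2": 2}),      # logical processor 0
    ArchState(regs={"r1": 10, "r2": 20}),    # logical processor 1
]
execute_add(states, {"lp": 0, "srcs": ("r1", "r2"), "dst": "r3"})
execute_add(states, {"lp": 1, "srcs": ("r1", "r2"), "dst": "r3"})
```

Both μops name the same registers and use the same adder, yet each result lands in its own register file because only the tag differs.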

The heart of the hyper-threading concept is that both processes share the resources of the chip concurrently. The execution engine is a collection of resources that take data in and push results out the other side. It does not much care which process or which data it is working on, as long as the operands are ready and the functional units are not full. So the scheduler works along the queue picking instructions whose operands are all ready. Out-of-order execution applies here: any instruction may execute once none of its operands is waiting on an earlier instruction in the stream. The only hint that two threads exist among all the streams is the restriction that neither process can be locked out completely: there are limits on how much of the resources one thread can use. The allocator and scheduler logic is expected to keep track of which registers in the buffers belong to which architectural state.

The fetch and deliver engine uses the two program counters to prefetch instructions from the memory hierarchy or the Trace Cache, alternating between the logical processors unless one side has no instruction outstanding, in which case the other side moves up. The queues in this section are shared, with limits on how much each thread can consume on its own. The resulting μops are put into the μop Queue for the next section to process.

The reorder and retirement block works similarly to the first section, alternating between the logical processors to retire instructions in program order. The results are put back into the appropriate registers or into the store buffer according to the sequence of each program. What can be useful here is that the data addresses or operands in each thread can be linked across the two logical processors to provide signaling and pre-processing of the mutual environment. Note also that the internal and external caches, the memory, and the queues in the chip are all shared by both logical processors. Everything is kept organized by the links between instructions, the links between data, and the unique architectural state of each thread.

Hyperthreading's main advantage is in using up the bubbles in the pipeline. With a single thread it is difficult to keep the pipeline fully productive, simply because an average section of code has dependences between instructions that stall the flow of the program. These stalls end up as wasted clock cycles. If the original program is restructured into two threads whose instructions do not depend on nearby earlier instructions, hyperthreading lets those lost bubbles do useful work and shortens the overall execution time of the project.
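The bubble-filling argument can be checked with a small cycle count. This is a deliberately simplified model and not the Pentium 4's real behavior: it assumes one issue slot per cycle, represents each thread as a hypothetical list of stall counts, and alternates between threads whenever both are ready.

```python
def run(threads):
    """Cycles needed to issue every instruction, one issue slot per cycle.

    Each thread is a list of stall counts: after issuing instruction i the
    thread must wait stalls[i] cycles (a dependence bubble) before its next.
    """
    next_idx = [0] * len(threads)      # next instruction per thread
    ready_at = [0] * len(threads)      # first cycle each thread may issue
    cycle, last = 0, -1
    while any(i < len(t) for i, t in zip(next_idx, threads)):
        # Prefer the thread that did not issue last (alternate fairly).
        order = sorted(range(len(threads)), key=lambda t: t == last)
        for t in order:
            if next_idx[t] < len(threads[t]) and ready_at[t] <= cycle:
                ready_at[t] = cycle + 1 + threads[t][next_idx[t]]
                next_idx[t] += 1
                last = t
                break
        cycle += 1
    return cycle

a = [1, 1, 1, 0]                       # four instructions, a bubble after each of the first three
b = [1, 1, 1, 0]
back_to_back = run([a]) + run([b])     # one thread after the other
together = run([a, b])                 # both threads sharing the issue slot
```

In this model each thread alone takes 7 cycles, so running them one after the other takes 14, while running them together takes 8 because each thread's bubbles are filled by the other thread's instructions.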

My opinion and observation of hyperthreading as implemented by Intel

The first thing that comes to mind is a buddy system, like the one used in sports with an element of danger. In scuba diving, for example, you watch out for a partner and the partner watches out for you. The type of project that would best use the feature seems to be one where the two threads are involved with each other enough that the sharing is worthwhile, but not so much that one thread has to wait for the other all the time. Where the two threads have nothing in common there would still be a speedup over a single processor, but structural hazards could also occur, so threads that are entirely unconnected would probably be better run on two distinct processors. The only advantage of a hyper-threaded processor in that case could be the price break over doing the same thing on two separate computers.

The next disadvantage I see is that most multi-threaded operating systems and programs have more than two threads, and the operating system itself gets in there at times as well. There will still be a fair bit of slicing and scheduling of processes, which usually costs some time in overhead.

In a server environment the computers are grouped into clusters with 2 or 4 processors on each motherboard. This is interesting because there is now a hierarchy of memory accessibility. The hyperthreaded logical processor is intimately involved with its buddy. The other processors on the motherboard can share certain cache and memory that is external to a CPU chip. Other motherboards have similar setups and communicate with one another over a network. Even more remote is the communication between cluster racks. The best way to deal with this is to group the threads so that they match the structure of the hierarchy, minimizing the non-uniform aspects of memory access. This cluster structure reminds me of the DASH prototype of reference 6, where 4 closely knit processors are connected in a larger network. The scheduling of threads can get complicated but is not necessarily impossible; the challenge is how to arrange the program into little independent clumps. Where the threads have very little involvement with each other, it is probably better to put them on separate CPUs and not bother with the buddy logical processor.

Another interesting approach is to use the processor pair as in reference 7. One logical processor runs the main program thread, and the buddy thread is set up just to prefetch data and instructions so that they are already in the instruction cache and the data cache when the program thread needs them. This could be expanded to include a monitoring feature in the buddy thread that communicates with other buddy threads to optimize the global project. Another thought is to use the buddy process as a kind of kernel-mode server for the main thread, since many kernel-type calls are closely related to the program thread being executed.

Another problem in the hyper-threaded environment is the kernel's idle wait loop. If a program thread shares the resources of the "Pentium 4 with HT" with an idle loop, it does not get the full benefit of those resources, and the idle loop is just another way of saying pipeline bubble.
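The prefetching buddy thread mentioned above can be illustrated with a toy cache model. This is not the method of reference 7, only a sketch of the general idea: it assumes a small LRU cache shared by both logical processors, a hypothetical helper that touches the address the main thread will need `lookahead` steps later, and it counts only the misses seen by the main thread.

```python
from collections import OrderedDict

def main_thread_misses(addresses, cache_lines=8, lookahead=0):
    """Count main-thread cache misses in a toy shared LRU cache.

    With lookahead > 0, a hypothetical helper thread touches the address the
    main thread will need `lookahead` steps later, loading it ahead of time.
    """
    cache = OrderedDict()              # keys are cached addresses, LRU first

    def touch(addr):
        hit = addr in cache
        if hit:
            cache.move_to_end(addr)
        else:
            cache[addr] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)   # evict least recently used
        return hit

    # The helper starts early and warms the first `lookahead` addresses.
    for addr in addresses[:lookahead]:
        touch(addr)

    misses = 0
    for i, addr in enumerate(addresses):
        if lookahead and i + lookahead < len(addresses):
            touch(addresses[i + lookahead])    # helper runs ahead
        if not touch(addr):                    # main thread access
            misses += 1
    return misses

stream = list(range(16))               # 16 distinct addresses, no reuse
without_helper = main_thread_misses(stream)
with_helper = main_thread_misses(stream, lookahead=2)
```

The helper still pays for every miss it takes, so the total memory traffic is unchanged; what the model shows is that the main thread no longer waits on those misses. If the cache is too small relative to the lookahead distance, the helper can evict lines before the main thread reaches them, which is one reason the two threads need to be closely matched.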

Summary

The NetBurst architecture seems to have been developed as a precursor to hyperthreading, setting up the environment for the add-on. Its main value is the well-developed out-of-order execution engine and the branch prediction that compensates for a fairly long pipeline. The hyper-threading addition puts the architecture at a whole new level of technology: it takes up minimal die space and allows excellent throughput for two instruction streams.

The main point is that it is the start of a larger scale of computing, but too small a step to do anything really fantastic. It is a definite improvement over what was offered before for most uses, but it does not do better than two distinct processors could, except in a few specialized situations such as where two threads are intertwined in just the right ways. The best implementation seems to be to use the second logical processor as an overseer of the first, being a good buddy.

The future is encouraging. Supporting more threads in the architectural states and putting more processors on the same motherboard will make splitting programs into small related threads more feasible as a way to get significant throughput. The supply lines then become more important, though. It is not going to be a solution for running massively parallel applications in the blink of an eye, but it does and will significantly advance the general computing market, and that is all it is meant to do anyway.

Raymond Bruton

Bibliography

1.) Introduction to Multithreading, Superthreading and Hyperthreading
2.) Intel's NetBurst Architecture - The Pentium 4's innards get a name
3.) Hyper-Threading Technology Architecture and Microarchitecture bstract.htm
4.) Hyper-Threading Technology for Servers
5.) A Detailed Look Inside the Intel NetBurst(TM) Micro-Architecture of the Intel Pentium 4 Processor
6.) The DASH prototype: implementation and performance
7.) Speculative Precomputation: Exploring the Use of Multithreading for Latency bstract.htm
