Multi-Threading: Hyper-, Multi-, and Simultaneous Thread Execution

Performance To Date
Increasing processor performance:
- Pipelining.
- Branch prediction.
- Super-scalar execution.
- Out-of-order execution.
- Caches.

Hyper-Threading
- Intel's implementation of Simultaneous Multithreading Technology (SMT).
- Introduced in the Foster MP-based Xeon and the 3.06 GHz Northwood-based Pentium 4 in 2002.
- Multiple threads.

Threading
- A thread is the smallest sequence of programmed instructions that can be managed independently. A thread is composed of multiple instructions.
- Threads within the same process can share that process's resources, whereas different processes do not share resources.
- Multi-threading is an execution model that allows multiple threads to exist within the context of a process such that they execute independently but share their process resources.
- A thread maintains a list of information relevant to its execution, including the priority schedule, exception handlers, a set of CPU registers, and stack state in the address space of its hosting process.

Multithreading on a Chip
- Multithreading helps hide true data dependency stalls, cache miss stalls, and branch stalls by finding instructions (from other process threads) that are independent of the stalling instructions.
- Hardware multithreading increases the utilization of resources on a chip by allowing multiple processes (threads) to share the functional units of a single processor.
- The processor must duplicate the state hardware for each thread: a separate register file, PC, instruction buffer, and store buffer.
- The caches, TLBs, BHT, and BTB can be shared (although the miss rates may increase if they are not sized accordingly).
- The memory can be shared through virtual memory mechanisms.
- Hardware must support efficient thread context switching.
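
The split between shared process resources and per-thread state can be sketched in software. This is a minimal illustration, not part of the original slides: the names `shared_results` and `worker` are made up for the example. Each thread keeps its running sum in a local variable on its own stack, then writes the result into a list owned by the process. In CPython the global interpreter lock serializes bytecode execution, so the sketch shows sharing rather than a speedup.

```python
import threading

NUM_THREADS = 4

# Lives in the process address space: every thread sees the same list.
shared_results = [None] * NUM_THREADS


def worker(thread_id: int, count: int) -> None:
    # local_sum is private to this thread (it lives on the thread's own stack).
    local_sum = 0
    for i in range(count):
        local_sum += i
    # Each thread writes only to its own slot, so no lock is needed here.
    shared_results[thread_id] = local_sum


threads = [
    threading.Thread(target=worker, args=(tid, 1000))
    for tid in range(NUM_THREADS)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared_results)
```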

Overview
Hyper-threading:
- Uses processor resources in a highly efficient way.
- Consists of two logical processors for every physical core.
- Separate threads can be executed on each logical processor simultaneously.
Multi-threading is enhanced by multi-core technology, which allows the parallel execution of software threads across multiple processor cores.

Simultaneous Multithreading (SMT)
- A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit both program ILP and thread-level parallelism (TLP).
- Most superscalar processors have more machine-level parallelism than most programs can effectively use (i.e., more than the programs have ILP).
- With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to dependencies among them.
- Need separate rename tables (RUUs) for each thread, or need to be able to indicate which thread each entry belongs to.
- Need the capability to commit from multiple threads in one cycle.
- Intel's Pentium 4 SMT is called hyper-threading. It supports just two threads (doubles the architecture state).

Advantages and Disadvantages
Advantages of Hyper-threading:
- Performance boost of as much as 30% compared to an identical processing core without hyper-threading.
Disadvantages of Hyper-threading:
- Possible increase in core size of about 5 percent, caused by the duplication of certain sections of the CPU core (Intel's claim).
- Overall power consumption is higher.
- Increases cache thrashing (the repeated displacing and loading of cache blocks), which ARM claims could be as much as 42%.

Hyper-Threading Architecture
- Makes a single physical processor appear as multiple logical processors.
- Each logical processor has a copy of the architecture state.
- Logical processors share a single set of physical execution resources.
- From an architecture perspective, we have to worry about the logical processors using shared resources: caches, execution units, branch predictors, control logic, and buses.
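
Because each logical processor looks like a separate CPU, the operating system reports logical processors, not physical cores. A small sketch of what that looks like from software: `os.cpu_count()` is the standard-library call that returns the number of logical CPUs (or `None` if it cannot be determined). The helper `estimated_physical_cores` is hypothetical and assumes every core exposes exactly two logical processors, which holds only when two-way hyper-threading is present and enabled.

```python
import os

# Number of logical CPUs visible to the OS, or None if undetermined.
logical_cpus = os.cpu_count()


def estimated_physical_cores(logical: int, threads_per_core: int = 2) -> int:
    """Hypothetical estimate: assumes `threads_per_core` logical CPUs per core."""
    return max(1, logical // threads_per_core)


if logical_cpus is None:
    print("logical processor count could not be determined")
else:
    print("logical processors:", logical_cpus)
    print("physical cores if 2-way SMT is enabled (assumption):",
          estimated_physical_cores(logical_cpus))
```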

HT vs. Dual Processors
Dual Processor:
- System resources are duplicated.
Hyper-Threading:
- Architectural state is duplicated: data, segment, control, and debug registers.
- Resources shared by the logical CPUs: execution engine, caches, system-bus interface, and firmware.
- This can increase the performance of the processor by up to 30%.

Implementing Multithreading
Two main questions when multithreading:
- Thread scheduling policy: when to switch threads?
- Pipeline partitioning: how do threads share the pipeline?
The choices depend on:
- How much single-thread performance you are willing to sacrifice.
- What kind of latencies you are willing to tolerate.

OS Support
- The OS must support HT since it manages the threads.

Types of Multithreading
Fine-Grain: switches threads on every instruction issue.
- Round-robin thread interleaving (skipping stalled threads).
- Processor must be able to switch threads on every clock cycle.
- Advantage: can hide throughput losses that come from both short and long stalls.
- Disadvantage: slows down the execution of an individual thread, since a thread that is ready to execute without stalls is delayed by instructions from other threads.
Coarse-Grain: switches threads only on costly stalls (e.g., L2 cache misses).
- Advantages: thread switching doesn't have to be essentially free, and it is much less likely to slow down the execution of an individual thread.
- Disadvantage: limited, due to pipeline start-up costs, in its ability to overcome throughput loss.
- Pipeline must be flushed and refilled on thread switches.
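
The trade-off between the two policies can be made concrete with a toy cycle-count model. Everything in it is an assumption made for illustration, not a description of real hardware: one instruction issues per cycle, each thread is a list of stall lengths (the number of cycles the thread must wait after that instruction), a stall of 10 or more cycles counts as "costly", and a coarse-grain switch costs 3 cycles of flush and refill. The function names `fine_grain` and `coarse_grain` are hypothetical.

```python
def fine_grain(threads):
    """Toy fine-grain model: round-robin issue each cycle, skipping stalled threads."""
    n = len(threads)
    pc = [0] * n          # index of the next instruction in each thread
    ready_at = [0] * n    # first cycle at which each thread may issue again
    cycle, last = 0, -1
    while any(pc[t] < len(threads[t]) for t in range(n)):
        for step in range(1, n + 1):
            t = (last + step) % n
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                ready_at[t] = cycle + 1 + threads[t][pc[t]]
                pc[t] += 1
                last = t
                break
        cycle += 1        # advances even when no thread was ready (idle slot)
    return max(ready_at, default=0)


def coarse_grain(threads, threshold=10, switch_penalty=3):
    """Toy coarse-grain model: stay on one thread, switch only on stalls >= threshold."""
    n = len(threads)
    pc = [0] * n
    ready_at = [0] * n
    cycle, current = 0, 0

    def next_with_work(start):
        # Next *other* thread after `start`, in round-robin order, with work left.
        for step in range(1, n):
            t = (start + step) % n
            if pc[t] < len(threads[t]):
                return t
        return None

    while any(pc[t] < len(threads[t]) for t in range(n)):
        if pc[current] >= len(threads[current]):
            current = next_with_work(current)   # current thread has finished
            cycle += switch_penalty             # refill the pipeline
        cycle = max(cycle, ready_at[current])   # pipeline waits out a stall
        stall = threads[current][pc[current]]
        pc[current] += 1
        cycle += 1
        ready_at[current] = cycle + stall
        if stall >= threshold:                  # costly stall: switch threads
            other = next_with_work(current)
            if other is not None:
                current = other
                cycle += switch_penalty         # flush and refill
    return max([cycle] + ready_at)


def no_switching(threads):
    """Baseline: run each thread to completion, waiting out every stall."""
    return sum(len(t) + sum(t) for t in threads)


long_stalls = [[0, 20, 0, 0], [0, 20, 0, 0]]   # one costly stall per thread
short_stalls = [[2, 2, 2], [2, 2, 2]]          # only short stalls

for name, work in (("long", long_stalls), ("short", short_stalls)):
    print(name, "baseline:", no_switching(work),
          "fine:", fine_grain(work), "coarse:", coarse_grain(work))
```

Under these assumed numbers, both policies overlap the long stalls with work from the other thread, but only the fine-grain model recovers the cycles lost to short stalls. The coarse-grain model never switches on them and still pays the refill cost when it moves to the second thread.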

Types of Multithreading
- Coarse-grain multithreading (CGMT).
- Fine-grain multithreading (FGMT).
- Simultaneous multithreading (SMT).

Fine-Grain (FGMT)
- Sacrifices significant single-thread performance.
- Tolerates latencies (L2 misses, mispredicted branches).
- Thread scheduling: round-robin thread switching after each cycle.
- Pipeline partitioning: no flushing, dynamic.
- Multiple threads in the pipeline at once.
- Lots of threads needed.
- Finding a home in GPUs.

Software Implementation
- From a software perspective, synchronization objects are often required when implementing multi-threaded applications.
- These objects are used to protect memory from being modified by multiple threads at the same time.
- A mutex is one such object. It is a lock that can be locked by a thread; any subsequent attempt to lock it, by another thread or by the same thread, blocks until the mutex is unlocked, thus keeping an item from being accessed by more than one thread at the same time.
- Computer scientists refer to pieces of code protected by objects like mutexes as Critical Sections.

Software Implementation
- It is best to keep Critical Sections as short as possible so the application stays as parallel as possible: the larger the Critical Section, the greater the chance that multiple threads will try to enter it at the same time, thus causing stalls.
- Intel provides several helpful templates to make implementation easier.
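
A minimal sketch of a mutex guarding a short Critical Section, using Python's standard `threading.Lock` (which, like the mutex described above, also blocks a second acquire from the same thread). The names `counter`, `counter_lock`, and `expensive_computation` are made up for the example, and the squaring stands in for real per-item work. The point is the placement of the lock: the computation happens outside the Critical Section and only the shared update happens inside it.

```python
import threading

counter = 0                       # shared state, modified by every thread
counter_lock = threading.Lock()   # the mutex protecting `counter`


def expensive_computation(value: int) -> int:
    # Stand-in for real work; it touches no shared state, so it needs no lock.
    return value * value


def worker(values) -> None:
    global counter
    for value in values:
        result = expensive_computation(value)   # outside the critical section
        with counter_lock:                      # critical section begins
            counter += result                   # only the shared update is locked
        # the lock is released here, as soon as the `with` block ends


threads = [threading.Thread(target=worker, args=(range(1000),)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("final counter:", counter)
```

Holding the lock around the whole loop body would also be correct, but it would serialize the computation as well as the update, which is the situation the slide warns against.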

Conclusions
- Hardware support for multi-, hyper-, and simultaneous threading exists in all major processors.
- The operating system must support multi-threading.
- Types of multi-threading: coarse-grain, fine-grain, and simultaneous.
- Individual latencies are sacrificed for the good of the entire program (or process).