Computer Architecture ELEC2401 & ELEC3441




Lecture 15: Multithreading & Multi-core Processors
Dr. Hayden Kwok-Hay So
Department of Electrical and Electronic Engineering
2nd sem. '15-16

End of an Era
[Figure: processor performance relative to the VAX-11/780, plotted on a logarithmic scale, for machines ranging from the VAX-11/780 (5 MHz) to a 6-core Intel Xeon at 3.3 GHz (boost to 3.6 GHz). Annotated growth rates: 25%/year, then 52%/year, then a final segment labelled "Limited by power, ILP, speed".]

Ways to Achieve Parallelism
Instruction Level Parallelism (ILP)
  Parallel operations come from instructions that execute in parallel
  Dynamic: superscalar processors, out-of-order (OOO) execution
  Static: VLIW
Data Level Parallelism (DLP)
  Parallel operations come from concurrent operations on independent data
  Vector machines, SIMD extensions
Thread Level Parallelism


Connecting Cores

Multiprocessor Systems on a Chip
  Machines with more than one processor were popular among servers and supercomputers in the 80s and 90s
  Uniprocessor speed came to a halt due to the power wall
  All major processor vendors moved to multi-core designs
  [Figure labels: board-level multi-processor (direct connections); chip multi-processor (on-chip network, shared memory).]

Direct Network
  Usually in the form of a low-latency, high-throughput, point-to-point network between processors
  Bypasses the I/O subsystems
  Allows low-latency communication between neighboring processors
  Sometimes with dedicated machine instructions
  Multi-hop routing for processors farther away
  Often tied to the distributed memory system
  Often a proprietary design
  Commercial examples:
    AMD: HyperTransport
    Intel: QuickPath Interconnect

Network Topology
  The topology of the network plays an important role, e.g. ring, torus, mesh
  [Figure: mesh, ring, and torus topologies.]


On-chip Network
  The study of building networks in a system-on-chip
  A complete computer system on a chip
    Including graphics, peripheral and memory controllers, accelerators
  MPSoC: multi-processor system on a chip
    Multiple compute cores in the system
  Mostly proprietary
  Some examples of on-chip networks:
    Advanced Microcontroller Bus Architecture (AMBA): on-chip interconnect developed by ARM
    Wishbone: OpenCores standard

Shared memory cores
  Common topology for commercial multi-core processors
  Various combinations of shared and private cache/memory
  [Figure: two cores, each with a private I$ and D$, sharing an L2$ in front of main memory, e.g. Intel Core, Core 2.]
  [Figure: two cores, each with a private I$, D$ and L2$, sharing an L3$ in front of main memory, e.g. Intel Nehalem, Sandy Bridge, Ivy Bridge.]

Symmetric Multi-processors
  Symmetric:
    All memory is equally far away from all processors
    Any processor can do any I/O (set up a DMA transfer)
  [Figure: processors and memory on a CPU-Memory bus; a bridge connects it to an I/O bus with I/O controllers for graphics output and networks.]

Synchronization
  The need for synchronization arises whenever there are concurrent processes in a system (even in a uniprocessor system)
  Two classes of synchronization:
    Producer-Consumer: a consumer process must wait until the producer process has produced data
    Mutual Exclusion: ensure that only one process uses a resource at a given time
  [Figure: producer P1 and consumer P2; processes sharing a resource.]

A Producer-Consumer Example
[Figure: a producer and a consumer communicate through a queue in memory, indexed by the tail and head pointers. Producer register: R_tail. Consumer registers: R_tail, R_head, R.]

Producer posting item x:
    Load R_tail, (tail)
    Store (R_tail), x
    R_tail = R_tail + 1
    Store (tail), R_tail

Consumer:
    Load R_head, (head)
spin:
    Load R_tail, (tail)
    if R_head == R_tail goto spin
    Load R, (R_head)
    R_head = R_head + 1
    Store (head), R_head
    process(R)

The program is written assuming instructions are executed in order. Problems?

A Producer-Consumer Example (continued)
Label four of the memory operations above:
    1: Store (R_tail), x       (producer stores the item)
    2: Store (tail), R_tail    (producer publishes the new tail)
    3: Load R_tail, (tail)     (consumer reads the tail)
    4: Load R, (R_head)        (consumer reads the item)
Can the tail pointer get updated before the item x is stored?
The programmer assumes that if 3 happens after 2, then 4 happens after 1.
Problem sequences are:
    2, 3, 4, 1
    4, 1, 2, 3

Sequential Consistency: A Memory Model
[Figure: several processors P connected to one shared memory.]
"A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program" (Leslie Lamport)
Sequential Consistency = arbitrary order-preserving interleaving of memory references of sequential programs

Sequential Consistency
Sequential concurrent tasks: T1, T2
Shared variables: X, Y (initially X = 0, Y = 10)

T1:
    Store (X), 1       (X = 1)
    Store (Y), 11      (Y = 11)
T2:
    Load R1, (Y)
    Store (Y'), R1     (Y' = Y)
    Load R2, (X)
    Store (X'), R2     (X' = X)

What are the legitimate answers for X' and Y'?
(X', Y') ∈ {(1,11), (0,10), (1,10), (0,11)}?
If Y' is 11 then X' cannot be 0
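The question about legitimate (X', Y') answers can be checked by brute force. The following is a minimal sketch, not part of the original slides: it enumerates every order-preserving interleaving of T1 and T2 against a single shared memory, which is how sequential consistency is defined in the quote above. The helper names `interleavings` and `run` are made up for illustration, and the initial value X = 0 is an assumption consistent with the candidate answers.

```python
from itertools import combinations

# Each operation is (kind, address, operand). A string operand names a register.
T1 = [("store", "X", 1), ("store", "Y", 11)]
T2 = [("load", "Y", "R1"), ("store", "Y'", "R1"),
      ("load", "X", "R2"), ("store", "X'", "R2")]

def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's own order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]

def run(schedule):
    """Execute one interleaving against a single shared memory."""
    mem = {"X": 0, "Y": 10, "X'": 0, "Y'": 0}  # assumption: X starts at 0
    regs = {}
    for kind, addr, operand in schedule:
        if kind == "load":
            regs[operand] = mem[addr]
        else:
            mem[addr] = regs[operand] if isinstance(operand, str) else operand
    return mem["X'"], mem["Y'"]

outcomes = sorted({run(s) for s in interleavings(T1, T2)})
print(outcomes)
```

Under these assumptions the set of outcomes never contains (0, 11): for T2 to read Y = 11 it must run after T1's second store, and by then X is already 1.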

Sequential Consistency
Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies.
What are these in our example?

T1:
    Store (X), 1       (X = 1)
    Store (Y), 11      (Y = 11)
T2:
    Load R1, (Y)
    Store (Y'), R1     (Y' = Y)
    Load R2, (X)
    Store (X'), R2     (X' = X)

[Figure: the example annotated with arrows for uniprocessor program dependencies and for the additional SC requirements.]
Does (can) a system with caches or out-of-order execution capability provide a sequentially consistent view of the memory? (More on this later.)

Issues in Implementing Sequential Consistency
Implementation of SC is complicated by two issues:
  Out-of-order execution capability (may the two operations be reordered?)
      Load(a); Load(b)      yes
      Load(a); Store(b)     yes if a ≠ b
      Store(a); Load(b)     yes if a ≠ b
      Store(a); Store(b)    yes if a ≠ b
  Caches
      Caches can prevent the effect of a store from being seen by other processors
No common commercial architecture has a sequentially consistent memory model!

Memory Fences: Instructions to serialize memory accesses
Processors with relaxed or weak memory models (i.e., that permit Loads and Stores to different addresses to be reordered) need to provide memory fence instructions to force the serialization of memory accesses.
Examples of processors with relaxed memory models:
    Sparc V8 (TSO, PSO): Membar
    Sparc V9 (RMO): Membar #LoadLoad, Membar #LoadStore, Membar #StoreLoad, Membar #StoreStore
    PowerPC (WO): Sync, EIEIO
    ARM: DMB (Data Memory Barrier)
    x86/64: mfence (Global Memory Barrier)
Memory fences are expensive operations; however, one pays the cost of serialization only when it is required.

Cache Coherence in SMPs
[Figure: CPU-1 with cache-1 and CPU-2 with cache-2 on a CPU-Memory bus; both caches and memory hold A = 100.]
Suppose CPU-1 updates A to 200.
    write-back: memory and cache-2 have stale values
    write-through: cache-2 has a stale value
Do these stale values matter?
What is the view of shared memory for programming?

Write-back Caches & SC
[Table: contents of cache-1, memory, and cache-2 while prog T1 (ST X, 1; ST Y, 11) and prog T2 (LD Y, R1; ST Y', R1; LD X, R2; ST X', R2) run on write-back caches. Steps shown: T1 is executed; cache-1 writes back Y; T2 executed; cache-1 writes back X; cache-2 writes back X' & Y'. Memory starts with Y = 10.]

Write-through Caches & SC
[Table: the same programs on write-through caches. Steps shown: T1 executed; T2 executed. Cache-1 starts with X = 0, Y = 10.]
Write-through caches don't preserve sequential consistency either.

Maintaining Cache Coherence
Hardware support is required such that
    only one processor at a time has write permission for a location
    no processor can load a stale copy of the location after a write
⇒ cache coherence protocols

Cache Coherence vs. Memory Consistency
A cache coherence protocol ensures that all writes by one processor are eventually visible to other processors, for one memory address
    i.e., updates are not lost
A memory consistency model gives the rules on when a write by one processor can be observed by a read on another, across different addresses
    Equivalently, what values can be seen by a load
A cache coherence protocol is not enough to ensure sequential consistency
    But if sequentially consistent, then caches must be coherent
The combination of a cache coherence protocol and the processor's memory reorder buffer is used to implement a given architecture's memory consistency model
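The write-back scenario earlier on this page can be replayed in a few lines. This is a toy sketch under stated assumptions, not the slide's own table: `WriteBackCache` is a hypothetical class modelling a private write-back cache with no coherence protocol at all, and the initial values X = 0, X' = 0, Y' = 0 are assumed (only Y = 10 is legible in the source). The step order follows the labels on the slide.

```python
class WriteBackCache:
    """Toy private write-back cache with no coherence protocol."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}  # address -> cached value

    def load(self, addr):
        if addr not in self.lines:       # miss: fetch the line from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value         # the write stays in the cache

    def write_back(self, addr):
        self.memory[addr] = self.lines[addr]


memory = {"X": 0, "Y": 10, "X'": 0, "Y'": 0}  # assumed initial values
cache1 = WriteBackCache(memory)
cache2 = WriteBackCache(memory)

# T1 is executed
cache1.store("X", 1)
cache1.store("Y", 11)
# cache-1 writes back Y
cache1.write_back("Y")
# T2 executed
r1 = cache2.load("Y")
cache2.store("Y'", r1)
r2 = cache2.load("X")
cache2.store("X'", r2)
# cache-1 writes back X; cache-2 writes back X' & Y'
cache1.write_back("X")
cache2.write_back("X'")
cache2.write_back("Y'")

print(memory)
```

In this model memory ends with X' = 0 and Y' = 11, the one combination that no sequentially consistent execution of T1 and T2 can produce: cache-2 saw the new Y but a stale X.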

Warmup: Parallel I/O
[Figure: processor and cache connected over a memory bus (Address (A), Data (D), R/W) to physical memory; a DMA engine connects the disk to the same bus.]
Either Cache or DMA can be the Bus Master and effect transfers.
Page transfers occur while the Processor is running.
(DMA stands for Direct Memory Access; it means the I/O device can read/write memory autonomously from the CPU.)

Problems with Parallel I/O
[Figure: the cache holds cached portions of a page while DMA transfers move the page between physical memory and disk.]
Memory → Disk: physical memory may be stale if the cache copy is dirty
Disk → Memory: the cache may hold stale data and not see memory writes

Snoopy Cache, Goodman 1983
Idea: have the cache watch (or snoop upon) DMA transfers, and then do the right thing
Snoopy cache tags are dual-ported
    One port (A, R/W, D) is used to drive the Memory Bus when the Cache is Bus Master
    A snoopy read port (A, R/W) is attached to the Memory Bus
[Figure: processor, cache tags and state, cache data (lines), and the two ports.]

Snoopy Cache Actions for DMA

Observed Bus Cycle           Cache State            Cache Action
DMA Read (Memory → Disk)     Address not cached     No action
                             Cached, unmodified     No action
                             Cached, modified       Cache intervenes
DMA Write (Disk → Memory)    Address not cached     No action
                             Cached, unmodified     Cache purges its copy
                             Cached, modified       ???
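The snoopy action table above can be written down as a small lookup. This is an illustrative sketch only: the names `Cycle`, `Line`, `ACTIONS`, and `snoop_action` are hypothetical, and the case the slide leaves open ("???") is deliberately kept as `None` rather than filled in.

```python
from enum import Enum

class Cycle(Enum):
    DMA_READ = "memory -> disk"
    DMA_WRITE = "disk -> memory"

class Line(Enum):
    NOT_CACHED = "address not cached"
    CLEAN = "cached, unmodified"
    DIRTY = "cached, modified"

# One entry per row of the table.
ACTIONS = {
    (Cycle.DMA_READ, Line.NOT_CACHED): "no action",
    (Cycle.DMA_READ, Line.CLEAN): "no action",
    (Cycle.DMA_READ, Line.DIRTY): "cache intervenes",
    (Cycle.DMA_WRITE, Line.NOT_CACHED): "no action",
    (Cycle.DMA_WRITE, Line.CLEAN): "cache purges its copy",
    (Cycle.DMA_WRITE, Line.DIRTY): None,  # left as "???" on the slide
}

def snoop_action(cycle, line):
    """Return the cache's reaction to an observed DMA bus cycle."""
    return ACTIONS[(cycle, line)]

for (cycle, line), action in ACTIONS.items():
    print(f"{cycle.name:9s} {line.value:20s} -> {action}")
```

Keeping the table as data rather than nested conditionals makes the open case visible: any caller has to decide what to do when the lookup returns `None`.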

Shared Memory Multi-processor
[Figure: processors M1, M2, M3, each behind a snoopy cache, on a memory bus with physical memory and a DMA engine connected to disks.]
Use the snoopy mechanism to keep all processors' view of memory coherent

Snoopy Cache Coherence Protocols
write miss: the address is invalidated in all other caches before the write is performed
read miss: if a dirty copy is found in some cache, a write-back is performed before the memory is read

Cache State Transition Diagram: The MSI protocol
Each cache line has state bits alongside its address tag
    M: Modified
    S: Shared
    I: Invalid
Transitions (cache state in processor P1):
    I → S    Read miss (P1 gets line from memory)
    I → M    Write miss (P1 gets line from memory)
    S → S    Read by any processor
    S → M    P1 intent to write
    S → I    Other processor intent to write
    M → M    P1 reads or writes
    M → S    Other processor reads (P1 writes back)
    M → I    Other processor intent to write (P1 writes back)

Two Processor Example (reading and writing the same cache line)
Access sequence: P1 reads, P1 writes, P2 reads, P2 writes, P1 reads, P1 writes, P2 writes, P1 writes
[Figure: the MSI diagram drawn once for P1 and once for P2, with each transition labelled by the event that triggers it, e.g. "P2 reads, P1 writes back" takes P1 from M to S, and "P1 reads, P2 writes back" takes P2 from M to S.]
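The two-processor access sequence above can be stepped through with a small state machine. This is a minimal sketch of the MSI transitions as described on these slides, assuming a single cache line and ignoring the data itself; `msi_step` is a hypothetical helper name, and write-backs are only noted in comments.

```python
def msi_step(states, who, op):
    """Return the new per-processor states after `who` performs `op`."""
    new = dict(states)
    for p in states:
        if p == who:
            continue
        if op == "write":
            new[p] = "I"   # intent to write: other copies invalidated (M writes back first)
        elif states[p] == "M":
            new[p] = "S"   # other processor reads: owner writes back, keeps a shared copy
    if op == "write":
        new[who] = "M"
    elif states[who] == "I":
        new[who] = "S"     # read miss: line fetched in shared state
    return new

states = {"P1": "I", "P2": "I"}
sequence = [("P1", "read"), ("P1", "write"), ("P2", "read"), ("P2", "write"),
            ("P1", "read"), ("P1", "write"), ("P2", "write"), ("P1", "write")]

trace = []
for who, op in sequence:
    states = msi_step(states, who, op)
    trace.append((states["P1"], states["P2"]))
    print(f"{who} {op:5s} -> P1={states['P1']} P2={states['P2']}")
```

At every step of the trace, a line that is M in one cache is I in the other, so two differing copies never coexist.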

Observation
[Figure: the MSI state transition diagram, repeated.]
If a line is in the M state then no other cache can have a copy of the line!
    Memory stays coherent; multiple differing copies cannot exist

MESI: An Enhanced MSI protocol
Increased performance for private data
Each cache line has a tag and state bits
    M: Modified Exclusive
    E: Exclusive but unmodified
    S: Shared
    I: Invalid
Transitions (cache state in processor P1):
    I → M    Write miss
    I → S    Read miss, shared
    I → E    Read miss, not shared
    E → E    P1 read
    E → M    P1 write
    E → S    Other processor reads
    E → I    Other processor intent to write
    S → S    Read by any processor
    S → M    P1 intent to write
    S → I    Other processor intent to write
    M → M    P1 write or read
    M → S    Other processor reads, P1 writes back
    M → I    Other processor intent to write, P1 writes back

Optimized Snoop with Level-2 Caches
[Figure: four CPUs, each with an L1 $ backed by an L2 $ whose snooper watches the bus.]
Processors often have two-level caches
    small L1, large L2 (usually both on chip now)
Inclusion property: entries in L1 must be in L2
    invalidation in L2 ⇒ invalidation in L1
Snooping on L2 does not affect CPU-L1 bandwidth
What problem could occur?

Intervention
[Figure: CPU-1 with cache-1 holding A = 200 and CPU-2 with cache-2 on a CPU-Memory bus; memory holds A = 100 (stale data).]
When a read miss for A occurs in cache-2, a read request for A is placed on the bus
    Cache-1 needs to supply the data and change its state to shared
    The memory may respond to the request also!
Does memory know it has stale data?
Cache-1 needs to intervene through the memory controller to supply correct data to cache-2
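The benefit MESI claims for private data can be made concrete by counting bus transactions. The sketch below is an illustration under simplifying assumptions (one cache line, no data, every miss or intent-to-write counted as one bus transaction); `mesi_access` is a hypothetical helper, not an interface from the slides.

```python
def mesi_access(states, who, op):
    """Apply one access; return (new_states, used_bus)."""
    new = dict(states)
    mine = states[who]
    others = [p for p in states if p != who]
    if op == "read":
        if mine != "I":
            return new, False              # read hit in M, E or S
        shared = any(states[p] != "I" for p in others)
        for p in others:
            if states[p] in ("M", "E"):
                new[p] = "S"               # M writes back first; holder drops to shared
        new[who] = "S" if shared else "E"  # read miss, shared / not shared
        return new, True
    # op == "write"
    if mine in ("M", "E"):
        new[who] = "M"                     # E -> M is silent: no other copy exists
        return new, False
    for p in others:
        new[p] = "I"                       # intent to write invalidates other copies
    new[who] = "M"
    return new, True

states = {"P1": "I", "P2": "I"}
log = []
for who, op in [("P1", "read"), ("P1", "write"), ("P2", "read"), ("P2", "write")]:
    states, used_bus = mesi_access(states, who, op)
    log.append((states["P1"], states["P2"], used_bus))
    print(f"{who} {op:5s} -> P1={states['P1']} P2={states['P2']} bus={used_bus}")
```

The first two accesses show the private-data case: P1's read miss brings the line in as E, and the following write upgrades it to M without touching the bus, whereas MSI would need a second transaction to announce the intent to write.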

False Sharing
[Figure: a cache line laid out as state | line addr | data0 | data1 | ... | dataN.]
A cache line contains more than one word
Cache coherence is done at the line level and not the word level
Suppose M1 writes word i and M2 writes word k, and both words have the same line address.
What can happen?

Out-of-Order Loads/Stores & CC
[Figure: CPU with load/store buffers in front of a cache (line states I/S/E) and its snooper (Wb-req, Inv-req, Inv-rep); the CPU/Memory interface carries pushouts (Wb-rep), replies (S-rep, E-rep) and requests (S-req, E-req) to memory.]
Blocking caches
    One request at a time + CC ⇒ SC
Non-blocking caches
    Multiple requests (different addresses) concurrently + CC ⇒ relaxed memory models
CC ensures that all processors observe the same order of loads and stores to an address

Acknowledgements
These slides contain material developed and copyright by:
    Arvind (MIT)
    Krste Asanovic (MIT/UCB)
    Joel Emer (Intel/MIT)
    James Hoe (CMU)
    John Kubiatowicz (UCB)
    David Patterson (UCB)
    John Lazzaro (UCB)
MIT material derived from course [...]
UCB material derived from courses CS152, CS252
