Uniprocessors
HPC
Prof. Robert van Engelen
Overview
PART I: Uniprocessors and Compiler Optimizations
PART II: Multiprocessors and Parallel Programming Models
Uniprocessors: processor architectures; instruction set architectures; instruction scheduling and execution; data storage; memory hierarchy; caches, TLB; compiler optimizations
Multiprocessors: pipeline and vector machines; shared memory; distributed memory; message passing; parallel programming models; shared vs distributed memory; hybrid; BSP
Levels of Parallelism
(figure: stacked levels of parallelism, from fine grain in superscalar and multicore processors to coarse grain)
Processor Families

Family  Instruction Set Architecture (ISA)            Processors
CISC    DEC VAX; Intel 80x86 (IA-32)                  VAX-11/780; Intel Pentium Pro
Vector  CRAY; Convex                                  CRAY T90; Convex C-4
RISC    HP PA-RISC; SGI MIPS; DEC Alpha;              PA-8600; MIPS R10000; Alpha 21264;
        Sun Sparc; IBM PowerPC                        Sun UltraSparc-III; IBM Power3
VLIW    Multiflow; Cydrome; Intel IA-64               Multiflow Trace; Cydrome Cydra-5; Intel Itanium
CISC, RISC, VLIW, and Vector Processor Families
CISC (complex instruction set computer)
  CISC ISAs offer specialized instructions of various (and variable) length
  Instructions typically executed in microcode
RISC (reduced instruction set computer)
  No microcode and relatively few instructions
  Many RISC processors are superscalar (for instruction-level parallelism)
  Only load and store instructions access memory
  Single common instruction word length
  More registers than CISC
VLIW (very long instruction word)
  Bundles multiple instructions for parallel execution
  Dependences between instructions in a bundle are prohibited
  More registers than CISC and RISC
Vector machines
  Single instructions operate on vectors of data (not necessarily parallel)
  Multiple vector instructions can be chained
Superscalar Architectures
Instruction Pipeline
An instruction pipeline increases the instruction bandwidth
Classic 5-stage pipeline:
  IF: instruction fetch
  ID: instruction decode
  EX: execute (functional units)
  MEM: load/store
  WB: write back to registers and forward results into the pipeline when needed by another instruction
Five instructions are in the pipeline at different stages
Instruction Pipeline Example
Example 4-stage pipeline
Four instructions (green, purple, blue, red) are processed in this order:
  Cycle 0: instructions are waiting
  Cycle 1: green is fetched
  Cycle 2: purple is fetched, green decoded
  Cycle 3: blue is fetched, purple decoded, green executed
  Cycle 4: red is fetched, blue decoded, purple executed, green in write-back
  Etc.
Instruction Pipeline Hazards
Pipeline hazards arise:
  From data dependences (forwarding in the WB stage can eliminate some data dependence hazards)
  From instruction fetch latencies (e.g. I-cache miss)
  From memory load latencies (e.g. D-cache miss)
A hazard is resolved by stalling the pipeline, which causes a bubble of one or more cycles
Example: suppose a stall of one cycle occurs in the IF stage of the purple instruction
  Cycle 3: purple cannot be decoded and a no-operation (NOP) is inserted
N-way Superscalar RISC
2-way superscalar RISC pipeline
RISC instructions have the same word size, so the processor can fetch multiple instructions without having to know the instruction content of each
N-way superscalar RISC processors fetch N instructions each cycle, which increases the instruction-level parallelism
A CISC Instruction Set in a Superscalar Architecture
Pentium processors translate CISC instructions to RISC-like µops
Advantages:
  Higher instruction bandwidth
  Maintains instruction set architecture (ISA) compatibility
Pentium 4 has a 31-stage pipeline divided into three main stages:
  Fetch and decode
  Execution
  Retirement
(Simplified block diagram of the Intel Pentium 4)
Instruction Fetch and Decode
(Simplified block diagram of the Intel Pentium 4)
Pentium 4 decodes instructions into µops and deposits the µops in a trace cache
  Allows the processor to fetch the µop trace of an instruction that is executed again (e.g. in a loop)
Instructions are fetched:
  Normally in the same order as stored in memory
  Or from branch targets predicted by the branch prediction unit
Pentium 4 only decodes one instruction per cycle, but can deliver up to three µops per cycle to the execution stage
RISC architectures typically fetch multiple instructions per cycle
Instruction Execution Stage
(Simplified block diagram of the Intel Pentium 4)
Executes multiple µops in parallel: instruction-level parallelism (ILP)
The scheduler marks a µop for execution when all operands of the µop are ready
  The µops on which a µop depends must be executed first
  A µop can be executed out of order with respect to the order in which it appeared
  Pentium 4: a µop is re-executed when its operands were not ready
On Pentium 4 there are 4 ports to send a µop into; each port has one or more fixed execution units:
  Port 0: ALU0, FPMOV
  Port 1: ALU1, INT, FPEXE
  Port 2: LOAD
  Port 3: STORE
Retirement
Looks for instructions to mark completed:
  Are all µops of the instruction executed?
  Are all µops of the preceding instruction retired? (putting instructions back in order)
Notifies the branch prediction unit when a branch was incorrectly predicted
  The processor stops executing the wrongly predicted instructions and discards them (takes about 10 cycles)
Pentium 4 retires up to 3 instructions per clock cycle
(Simplified block diagram of the Intel Pentium 4)
Software Optimization to Increase CPU Throughput
Processors run at maximum speed (high instructions-per-cycle rate (IPC)) when:
  1. There is a good mix of instructions (with low latencies) to keep the functional units busy
  2. Operands are available quickly from registers or D-cache
  3. The FP-to-memory operation ratio is high (FP : MEM > 1)
  4. The number of data dependences is low
  5. Branches are easy to predict
The processor can only improve #1 to a certain level with out-of-order scheduling, and partly #2 with hardware prefetching
Compiler optimizations effectively target #1-3
The programmer can help improve #1-5
Instruction Latency and Throughput
a = u*v; b = w*x; c = y*z;
Latency: the number of clocks to complete an instruction when all of its inputs are ready
Throughput: the number of clocks to wait before starting an identical instruction
  Identical instructions are those that use the same execution unit
The example shows three multiply operations, assuming there is only one multiply execution unit
Typical actual latencies (in cycles):
  Integer add: 1
  FP add: 3
  FP multiplication: 3
  FP division: 31
Instruction Latency Case Study
Consider two versions of Euclid's algorithm:
  1. Modulo version
  2. Repetitive subtraction version
Which is faster?

Modulo version:
int find_gcf1(int a, int b)
{ while (1)
  { a = a % b;
    if (a == 0) return b;
    if (a == 1) return 1;
    b = b % a;
    if (b == 0) return a;
    if (b == 1) return 1;
  }
}

Repetitive subtraction version:
int find_gcf2(int a, int b)
{ while (1)
  { if (a > b)
      a = a - b;
    else if (a < b)
      b = b - a;
    else
      return a;
  }
}
Instruction Latency Case Study
Cycle estimates for the case a = 48 and b = 40:

Modulo version:                    Repetitive subtraction version:
Instruction  #   Latency  Cycles   Instruction  #   Latency  Cycles
Modulo       2   68       136      Subtract     5   1        5
Compare      3   1        3        Compare      5   1        5
Branch       3   1        3        Branch       14  1        14
Other        6   1        6        Other        0            0
Total        14           148      Total        24           24

Execution time for all values of a and b in [1..9999]:
  Modulo version: 18.55 sec
  Repetitive subtraction version: 14.56 sec
  Blended version: 12.14 sec
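The slides quote a blended version but do not list it. One plausible blend (our sketch, not necessarily the version timed above) uses the cheap 1-cycle subtraction when the operands are close and pays for the 68-cycle modulo only when one operand is much larger:

```c
#include <assert.h>

/* Hypothetical blended GCD: subtract when the operands are close,
   fall back to modulo only when it would save many subtractions
   (the factor 4 is an arbitrary illustrative threshold). */
int find_gcf3(int a, int b) {
    while (1) {
        if (a > b) {
            if (a > 4*b) a = a % b; else a = a - b;
            if (a == 0) return b;
        } else if (a < b) {
            if (b > 4*a) b = b % a; else b = b - a;
            if (b == 0) return a;
        } else {
            return a;
        }
    }
}
```

Its results match the other two versions, e.g. find_gcf3(48, 40) yields 8 in one subtraction and one modulo.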
Data Dependences
(w*x)*(y*z)
Instruction-level parallelism is limited by data dependences
Types of dependences:
  RAW: read-after-write, also called flow dependence
  WAR: write-after-read, also called anti dependence
  WAW: write-after-write, also called output dependence
The example shows a RAW dependence
WAR and WAW dependences exist because of storage location reuse (overwrite with a new value)
  WAR and WAW are sometimes called false dependences
  RAW is a true dependence
Data Dependence Case Study
Removing redundant operations by (re)using (temporary) space may increase the number of dependences
Example: two versions to initialize a finite difference matrix
  1. Recurrent version with a lower FP operation count
  2. Non-recurrent version with fewer dependences
Which is fastest depends on the effectiveness of loop optimization and instruction scheduling by the compiler (and processor) to hide latencies, and on the number of distinct memory loads

With recurrence (WAR and RAW; cross-iteration dependences not shown):
      dxi = 1.0/h(1)
      do i = 1, n
        dxo = dxi
        dxi = 1.0/h(i+1)
        diag(i) = dxo + dxi
        offdiag(i) = -dxi
      enddo

Without recurrence (cross-iteration dependences not shown):
      do i = 1, n
        dxo = 1.0/h(i)
        dxi = 1.0/h(i+1)
        diag(i) = dxo + dxi
        offdiag(i) = -dxi
      enddo
Case Study (1)
Intel Core 2 Duo 2.33 GHz

One cross-iteration flow dependence, but fewer memory loads:
dxi = 1.0/h[0];
for (i = 1; i < n; i++)
{ dxo = dxi;
  dxi = 1.0/h[i+1];
  diag[i] = dxo + dxi;
  offdiag[i] = -dxi;
}
gcc -O3 fdinit.c
time ./a.out
0.135 sec

One more memory load, but fewer dependences (the dependence spans multiple instructions, so it has no big impact):
for (i = 1; i < n; i++)
{ dxo = 1.0/h[i];
  dxi = 1.0/h[i+1];
  diag[i] = dxo + dxi;
  offdiag[i] = -dxi;
}
gcc -O3 fdinit.c
time ./a.out
0.270 sec
Case Study (2)
UltraSparc IIIi 1.2 GHz
Timing comparison of non-optimized kernels:
  recurrent version: 1.856 sec
  non-recurrent version: 7.315 sec
  f77 -g fdinit.f -o fdinit
  collect -o fdinit.er ./fdinit
Timing comparison of compiler-optimized kernels:
  recurrent version: 0.841 sec
  non-recurrent version: 0.841 sec
  f77 -fast -g fdinit.f -o fdinit
  collect -o fdinit.er ./fdinit
Instruction Scheduling
z = (w*x)*(y*z); k = k+2; m = k+n; n = n+4;
Instruction scheduling can hide data dependence latencies
With static scheduling the compiler moves independent instructions up/down to fixed positions
With dynamic scheduling the processor executes instructions out of order when they are ready
The example shows an optimal instruction schedule, assuming there is one multiply and one add execution unit
Note that m = k+n uses the old value of n (WAR dependence)
Advanced processors remove WAR dependences from the execution pipeline using register renaming
Hardware Register Renaming
Sequential:
  r0 = r1 + r2
  r2 = r7 + r4
Parallel (renamed):
  r0 = r1 + r2; r2' = r7 + r4
  r2 = r2'
Register renaming performed by a processor removes unnecessary WAR dependences caused by register reuse
  A fixed number of registers is assigned by the compiler to hold temporary values
  A processor's register file includes a set of hidden registers
  A hidden register is used in place of an actual register to eliminate a WAR hazard
  The WB stage stores the result in the destination register
Pitfall: it does not help to manually remove WAR dependences in program source code by introducing extra scalar variables, because the compiler's register allocator reuses registers assigned to these variables and thereby reintroduces the WAR dependences at the register level
Data Speculation
double *p, a[10];
*p = 0;
s += a[i];

Without data speculation:
  r0 = 0
  r1 = p
  r2 = a+i
  store r0 in M[r1]
  load M[r2] into r3

With an advanced load:
  r2 = a+i
  adv load M[r2] into r3
  r0 = 0
  r1 = p
  store r0 in M[r1]
  check adv load address: reload r3

A load should be initiated as far in advance as possible to hide memory latency
When a store to address A1 is followed by a load from address A2, there is a RAW dependence when A1 = A2
A compiler assumes there is a RAW dependence if it cannot disprove A1 = A2
The advanced load instruction allows the processor to ignore a potential RAW dependence and sort out the conflict at run time, when the store address is known: if it is the same, the check reloads r3
Control Speculation
if (i < n)
  x = a[i]

Without control speculation:
  if (i >= n) jump to skip
  load a[i] into r0
  skip: ...

With control speculation:
  speculative load a[i] into r0
  if (i >= n) jump to skip
  check spec load exceptions
  skip: ...

Control speculation allows conditional instructions to be executed before the conditional branch in which the instruction occurs
  Hides memory latencies
A speculative load instruction performs a load operation
  Enables loading early
  Exceptions are ignored
A check operation verifies whether the load triggered an exception (e.g. bus error)
  If so, the exception is reraised
Hardware Branch Prediction
Simple branch pattern that is predicted correctly by the processor:
for (a = 0; a < 100; a++)
{ if (a % 2 == 0)
    do_even();
  else
    do_odd();
}

Random branch pattern that is difficult to predict:
for (a = 0; a < 100; a++)
{ if (flip_coin() == HEADS)
    do_heads();
  else
    do_tails();
}

Branch prediction is an architectural feature that enables a processor to fetch instructions of a target branch
  When predicted correctly there is no branch penalty
  When not predicted correctly, the penalty is typically >10 cycles
Branch prediction uses a history mechanism per branch instruction by storing a compressed form of the past branching pattern
Improving Branch Prediction
In a complicated branch test in C/C++, move the simplest-to-predict condition to the front of the conjunction:
  if (i == 0 && a[i] > 0)
This example also has the added benefit of testing the more costly a[i] > 0 less frequently
Rewrite conjunctions to logical expressions:
  if (t1 == 0 && t2 == 0 && t3 == 0)  =>  if ((t1 | t2 | t3) == 0)
Use max/min or arithmetic to avoid branches:
  if (a >= 255) a = 255;  =>  a = min(a, 255);
Note that in C/C++ the cond?then:else operator and the && and || operators result in branches!
Data Storage
  Memory hierarchy
  Performance of storage
  CPU and memory
  Virtual memory, TLB, and paging
  Cache
Memory Hierarchy
Storage systems are organized in a hierarchy by:
  Speed (faster toward the top)
  Cost (cheaper per byte toward the bottom)
  Volatility (volatile toward the top)
Performance of Storage
Registers are fast: typically one clock cycle to access data
Cache access takes tens of cycles
Memory access takes hundreds of cycles
Movement between levels of the storage hierarchy can be explicit or implicit
CPU and Memory
(block diagram: CPU with TLB, registers, and on-chip L1 I-cache and D-cache; unified on/off-chip L2 cache; memory bus; main memory; I/O bus; disk)
Memory Access
A logical address is translated into a physical address in virtual memory using a page table
  The translation lookaside buffer (TLB) is an efficient on-chip address translation cache
  Memory is divided into pages
  Virtual memory systems store pages in memory (RAM) and on disk
  Page fault: the page is fetched from disk
L1 caches (on chip):
  I-cache stores instructions
  D-cache stores data
L2 cache (E-cache, on/off chip):
  Typically unified: stores instructions and data
Locality in a Memory Reference Pattern
The working set model:
  A process can run if and only if all of the pages it is currently using (the most recently used pages) can be held in physical memory
  Page thrashing occurs if pages are moved excessively between physical memory and disk
  The operating system picks and adjusts the working set size, attempting to minimize page faults
Translation Lookaside Buffer
Logical-to-physical address translation with TLB lookup to limit page table reads (the page table is stored in physical memory)
Logical pages are mapped to physical pages in memory (typical page size is 4 KB)
Finding TLB Misses
collect -o testprog.er -h DTLB_miss,on ./testprog
Caches
Direct mapped cache: each location in main memory can be cached by just one cache location
N-way associative cache: each location in main memory can be cached by one of N cache locations
Fully associative cache: each location in main memory can be cached by any cache location
The experiment shows SPEC CPU2000 benchmark cache miss rates
Cache Details
An N-way associative cache has a set of N cache lines per row
A cache line can be 8 to 512 bytes
  Longer lines increase memory bandwidth performance, but space can be wasted when applications access data in a random order
A typical 8-way L1 cache (on-chip) has 64 rows with 64-byte cache lines
  Cache size = 64 rows x 8 ways x 64 bytes = 32768 bytes
A typical 8-way L2 cache (on/off chip) has 1024 rows with 128-byte cache lines
Cache Misses
for (i = 0; i < 100000; i++)
  X[i] = some_static_data[i];   // compulsory miss: reading some_static_data[0]
for (i = 0; i < 100000; i++)
  X[i] = X[i] + Y[i];           // capacity miss: X[i] no longer in cache
                                // conflict miss: X[i] and Y[i] are mapped to the same
                                // cache line (e.g. when the cache is direct mapped)
Compulsory misses
  Caused by the first reference to a datum
  Misses are effectively reduced by prefetching
Capacity misses
  Cache size is finite
  Misses are reduced by limiting the working set size of the application
Conflict misses
  Replacement misses are caused by the choice of victim cache line to evict by the replacement policy
  Mapping misses are caused by the level of associativity
False Sharing
On a multi-core processor each core has an L1 cache and shares the L2 cache with other cores
False sharing occurs when two caches of processors (or cores) cache two different non-shared data items that reside on the same cache line
The cache coherency protocol marks the cache line (on all cores) dirty to force a reload
To avoid false sharing:
  Allocate non-shared data on different cache lines (using malloc)
  Limit the use of global variables
Finding Cache Misses
collect -o test.memory.1.er -S off -p on -h ecstall,on,cycles,on ./cachetest -g memory.erg
collect -o test.memory.2.er -S off -h icstall,on,dcstall,on ./cachetest -g memory.erg
Further Reading
[HPC] pages 7-56
Optional: [OPT] pages 51-141