ECE5917 SoC Architecture: MP SoC Part 1
Tae Hee Han (tha@skku.edu), Semiconductor Systems Engineering, Sungkyunkwan University

Outline
- Overview
- Parallelism: Data-Level Parallelism, Instruction-Level Parallelism, Thread-Level Parallelism, Processor-Level Parallelism
- Multi-core

Overview

Where Are We Headed?
- Note: the time frame is popularity based (not based on first appearance).
[Chart: performance in MIPS (log scale, 0.01 to 1,000,000) vs. year (1970-2010). Pipelining (8086, 286, 386, 486) and then speculative, OOO superscalar mark the era of instruction-level parallelism; multithreading, SIMD extensions, multi-core, CPU-GPU fusion, and special-purpose HW mark the era of thread & processor level parallelism. The single-chip CPU era runs to ~2004.]

Where Are We Headed? (Intel and AMD Architecture Transition)
[Chart: roadmaps 2002-2012. Intel desktop & server: NetBurst (130nm Northwood/Gallatin 1-core, 90nm Prescott/Smithfield, 65nm Cedar Mill/Presler 2-core MCM), then Core (65nm Merom 2-core, 45nm Penryn 4-core MCM), Nehalem (45nm Nehalem 4-core, 32nm Westmere >= 6-core), Sandy Bridge (32nm Sandy Bridge, 22nm Ivy Bridge). Intel mobile: P6 (Pentium III, 180nm K7-era, 130nm Tualatin), then P6 (Pentium M: 130nm Banias, 90nm Dothan), 65nm Yonah 2-core. AMD desktop & server: K7 (130nm), K8 (130nm, 90nm, 65nm; 1- and 2-core), K10/K8L (65nm, 45nm; 4-core), 32nm Bulldozer (>= 6-core). Annotations: the "single-core crisis" and "CELL shock" around 2005. Eras: Single-Core Era, Multi-Core Era, System-level Integration Era.]

Processor Architectures: Flynn's Classification
- SISD (Single Instruction, Single Data stream): uniprocessor.
- SIMD (Single Instruction, Multiple Data streams): the same instruction is executed by multiple processing units; e.g., multimedia processors, vector architectures.
- MISD (Multiple Instruction, Single Data stream): successive functional units operate on the same stream of data; rarely found in general-purpose commercial designs.
- MIMD (Multiple Instruction, Multiple Data streams): each processor has its own instruction and data streams; the most popular form of parallel processing. Single-user: high performance for one application. Multiprogrammed: running many tasks simultaneously (e.g., servers).
[Diagrams: instruction pool × data pool with one or more processing units (PU) for each class.]

System-level Integration (Chuck Moore, AMD, at MICRO 2008)
- Single-chip CPU Era (1986-2004): extreme focus on single-threaded performance; multi-issue, out-of-order execution plus a moderate cache hierarchy.
- Chip Multiprocessor (CMP) Era (2004-2010): early: hasty integration of multiple cores into the same chip/package; mid-life: address some of the HW scalability and (memory) interference issues; current: homogeneous CPUs plus moderate system-level functionality.
- System-level Integration Era (~2010 onward): integration of substantial system-level functionality; heterogeneous processors and accelerators; introspective control systems for managing on-chip resources & events.

Challenges! (Chuck Moore, AMD, 2011)
[Charts, each plotted against time and annotated "we are here": integration (log scale) still tracks Moore's Law but faces DFM, variability, reliability, and wire delay; power budget (TDP) has hit the power wall (server: power = $$; desktop: eliminate fans; mobile: battery); frequency has hit the frequency wall; single-thread performance (IPC) has hit the ILP complexity wall and locality limits; single-thread performance vs. issue width and cache size shows flattening curves.]

Three Walls to Serial Performance
- Memory Wall
- Instruction-Level Parallelism (ILP) Wall
- Power Wall
Source: the excellent article "The Many-Core Inflection Point for Mass Market Computer Systems" by John L. Manferdelli, Microsoft Corporation. http://www.ctwatch.org/quarterly/articles/2007/02/the-many-core-inflection-point-for-mass-market-computer-systems/

Recall: Memory Wall
- Processor-memory (DRAM) performance gap!
- DRAM: a 1-cycle access in 1980 takes 100s of cycles in 2010.
- Registers: fast but small and expensive.
- We want: fast, large, and cheap memory.

Recall: Typical Memory Hierarchy
(smaller, faster, costlier toward registers; larger, slower, cheaper toward storage)

Type | Random-access (read) latency | Capacity | Managed by
Register (F/F) | 1 cycle | ~500-1,000 B | Compiler
L1 cache (SRAM) | ~3-4 cycles | ~64 KB | HW
L2 cache (SRAM) | ~10-30 cycles | ~256 KB | HW
L3 cache (SRAM) | ~30-60 cycles | ~2-8 MB | HW
Main memory (DRAM) | ~100-300 cycles | 512 MB-4 GB (mobile) / 4-16 GB (PC) | OS
Flash cache / SSD | ~5K-10K cycles | 8-32 GB (mobile) / 128-512 GB (PC) | OS/operator
Data storage (HDD) | ~10M-20M cycles | > 1 TB (PC) | OS/operator

Note the ~10^5 performance gap between main memory and storage; below the HDD sits external secondary storage (external HDD, tape, CD/DVD, cloud servers).

Recall: How to Alleviate the Memory Wall Problem
- Hiding/reducing the memory access latency: a holistic approach using caches, local memory, DRAM stacking, HW/SW prefetching, data-locality optimization (sketched below), memory controllers, SMT.
- Increasing the bandwidth: latency helps BW, but not vice versa.
- Reducing the number of memory accesses: keep as much reusable data in caches and local memory as possible.
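
To make "data-locality optimization" concrete, here is a minimal loop-tiling sketch in C (not from the slides; the array size and tile size are illustrative assumptions):

    #include <stddef.h>

    #define N 1024
    #define TILE 64  /* chosen so a TILE x TILE working set fits in cache */

    /* Naive version: strides through b column-wise, so once N is large
     * almost every access to b misses in cache. */
    void add_transpose_naive(const double a[N][N], const double b[N][N],
                             double c[N][N]) {
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                c[i][j] = a[i][j] + b[j][i];
    }

    /* Tiled version: each TILE x TILE block of b is reused while it is
     * still resident in cache, reducing the number of DRAM accesses. */
    void add_transpose_tiled(const double a[N][N], const double b[N][N],
                             double c[N][N]) {
        for (size_t ii = 0; ii < N; ii += TILE)
            for (size_t jj = 0; jj < N; jj += TILE)
                for (size_t i = ii; i < ii + TILE; i++)
                    for (size_t j = jj; j < jj + TILE; j++)
                        c[i][j] = a[i][j] + b[j][i];
    }

Both functions compute the same result; only the traversal order, and hence the cache behavior, differs.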

ILP Wall
- Duplicate hardware speculatively executes future instructions before the results of current instructions are known, while providing hardware safeguards to prevent the errors that might be caused by out-of-order execution:
  1. e = a + b
  2. f = c + d
  3. g = e * f
- Branches must be guessed to decide which instructions to execute simultaneously; if you guess wrong, you throw away that part of the result.
- Data dependencies may prevent successive instructions from executing in parallel, even if there are no branches (above, instruction 3 must wait for 1 and 2).

Power Wall
- Moore's Law: transistor density increases every 18-24 months.
- CMOS power = active power + standby power: P_total = V^2 × f × C_a + V × I_leakage (a small worked example follows below).
- A drastic increase in leakage current and a decrease in noise margin prevent voltage scaling below around 1 V.
- Limitations in processor performance: memory wall, ILP wall, power wall.
- Not only battery, but also heat! The Intel 80386 consumed ~2 W; a 3.3 GHz Intel Core i7 consumes 130 W. Heat must be dissipated from a 1.5 × 1.5 cm chip; this is the limit of what can be cooled by air.
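
A tiny worked example of the active-power term P = V^2 × f × C from the slide (the capacitance and operating points are invented for illustration):

    #include <stdio.h>

    /* Dynamic power P = V^2 * f * C, as on the slide (leakage ignored). */
    static double dynamic_power(double v, double f_hz, double c_farads) {
        return v * v * f_hz * c_farads;
    }

    int main(void) {
        double c  = 1e-9;  /* illustrative switched capacitance */
        double p1 = dynamic_power(1.2, 3.0e9, c);
        /* Scaling V and f together by 0.8 cuts power by ~0.8^3 = 0.512 */
        double p2 = dynamic_power(1.2 * 0.8, 3.0e9 * 0.8, c);
        printf("P1 = %.2f W, P2 = %.2f W (ratio %.3f)\n", p1, p2, p2 / p1);
        return 0;
    }

This cubic effect of joint voltage/frequency scaling is the same one cited later in the "Design for Low Power" slide.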

Power Wall (cont'd)
- Power dissipation in clocked digital devices is proportional to the clock frequency, imposing a natural limit on clock rates.
- A significant increase in clock speed without heroic (and expensive) cooling is not possible → chips would simply melt.
- Clock speed increased by a factor of 1,000 during the last two decades, but the ability of manufacturers to dissipate heat is limited; look back at the last five years and the clock rates are pretty much flat.
- You could bank on Materials Science (MS) breakthroughs; the MS engineers have usually delivered, but can they keep doing it?

Pollack's Rule: Trade-offs
[Chart: improvement (×) vs. CMOS process technology (1.5, 1.0, 0.7, 0.5, 0.35, 0.18 um), comparing area growth and performance growth (lead / compaction).]
- Pollack's Rule: "performance increase due to microarchitecture advances is roughly proportional to [the] square root of [the] increase in complexity."
- Implication (in the same technology): a new microarchitecture consumes about 2-3x the die area of the last one, but provides only 1.5-1.7x the performance.
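
A quick numeric check of the rule, taking performance ≈ sqrt(complexity) at face value:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* Pollack's Rule: perf ratio ~ sqrt(area ratio). */
        for (double area = 1.0; area <= 4.0; area += 1.0)
            printf("area x%.0f -> perf ~x%.2f\n", area, sqrt(area));
        /* 2-3x the die area yields only ~1.4-1.7x performance,
         * matching the 1.5-1.7x figure on the slide. */
        return 0;
    }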

Multi-core
- Put multiple CPUs on the same die. Why is this better than multiple dies? Smaller and cheaper; closer, so lower inter-processor latency; they can share an L2 cache (complicated); less power.
- Cost of multi-core: complexity, and slower single-thread execution.

Creating Parallel Processing Programs
- It is difficult to write SW that uses multiple processors to complete one task faster, and the problem gets worse as the number of processors increases.
- The first reason is that you must actually get better performance and efficiency from a parallel processing program on a multiprocessor; otherwise there is no point over a sequential program on a uniprocessor.
- Think of an analogy: eight reporters trying to write a single story in hopes of doing the work eight times faster.

But (Fortunately)
- With the rise of the Internet and rich multimedia applications, the need for handling independent tasks and huge data increased dramatically → task-level parallelism and data-level parallelism.
- The user computing environment is changing to include many background tasks.
- Multiprocessors can speed up these types of applications with the help of tighter integration of cores and multithreading.

Multi-core vs. Manycore
- Multi-core (current trajectory): stay with the current fastest core design; replicate every 18 months (2, 4, 8, ...). Advantage: does not alienate serial workloads. Examples: AMD X2 (2 cores), Intel Core2 Quad (4 cores), AMD Barcelona (4 cores).
- Manycore (converging in this direction): simplify cores (shorter pipelines, lower clock frequencies, in-order processing); start at 100s of cores and replicate every 18 months. Advantages: easier verification, defect tolerance, highest compute per surface area, best power efficiency. Examples: Cell SPE (8 cores), Nvidia G80 (128 cores), Intel Polaris (80 cores), Cisco/Tensilica Metro (188 cores).
- Convergence: ultimately toward manycore, if we can figure out how to program it! Hedge: heterogeneous multi-core.

Manycore System: CPU or GPU?
- CPU: a large cache and sophisticated flow control minimize latency for arbitrary memory accesses in a serial process.
- GPU: simple flow control and limited cache; more transistors go to computing in parallel; high arithmetic intensity hides memory latency.
[Diagram: CPU die dominated by control logic and cache alongside a few ALUs and DRAM; GPU die dominated by many small ALUs. Source: NVIDIA]

How Small is Small?
- Power5 (server): 389 mm^2, 120 W @ 1900 MHz
- Intel Core2 sc (laptop): 130 mm^2, 15 W @ 1000 MHz
- ARM Cortex A8 (automobiles): 5 mm^2, 0.8 W @ 800 MHz
- Tensilica DP (cell phones / printers): 0.8 mm^2, 0.09 W @ 600 MHz
- Tensilica Xtensa (Cisco router): 0.32 mm^2 for 3 cores! 0.05 W @ 600 MHz
Each small core operates at 1/3 to 1/10th the efficiency of the largest chip, but you can pack 100x more cores onto a chip and consume 1/20th the power.

More Concurrency: Design for Low Power
[Die-size comparison: Intel Core2, Power5, Tensilica DP, ARM, Xtensa × 3.]
- Cubic power improvement with lower clock rate, due to the V^2 × f term.
- Slower clock rates enable the use of simpler cores.
- Simpler cores use less area (lower leakage) and reduce cost.
- Tailor the design to the application to reduce waste.
- This is how iPhones and MP3 players are designed to maximize battery life and minimize cost.

Tension between Concurrency and Power Efficiency
- Highly concurrent systems can be more power efficient: dynamic power is proportional to V^2 × f × C. So build systems with even higher concurrency?
- However, many algorithms are unable to exploit massive concurrency yet. If higher concurrency cannot deliver faster time to solution, the power-efficiency benefit is wasted. So should we build fewer, faster processors instead?

Path to Power Efficiency: Reducing Waste in Computing
- Examine the methodology of the low-power embedded computing market: optimized for low power, low cost, and high computational efficiency.
- "Years of research in low-power embedded computing have shown only one design technique to reduce power: reduce waste." (Mark Horowitz, Stanford University & Rambus Inc.)
- Sources of waste: wasted transistors (surface area); wasted computation (useless work, speculation, stalls); wasted bandwidth (data movement); designing for serial performance.

What's Next?
- Candidate directions: all large cores; mixed large and small cores; many small cores; all small cores; many floating-point cores + 3D stacked memory.
- Different classes of chips: home, games/graphics, business, scientific.
- "The question is not whether this will happen but whether we are ready." (Source: Jack Dongarra, Intl. Supercomputing Conf. (ISC) 2008)

Intel Single-chip Cloud Computer (Dec. 2009)

Parallelism: Introduction

Little's Law
- Throughput (T) = Number-in-flight (N) / Latency (L)
- Example: 4 floating-point registers, 8 cycles per floating-point op. Little's Law → at most 1/2 issue per cycle.
[Pipeline diagram: Issue → Execution → WB]
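
The slide's example, worked in a few lines of C (4 in-flight FP ops, 8-cycle latency):

    #include <stdio.h>

    /* Little's Law: throughput = number-in-flight / latency. */
    static double throughput(double in_flight, double latency_cycles) {
        return in_flight / latency_cycles;
    }

    int main(void) {
        /* 4 FP registers bound the number of ops in flight;
         * each FP op takes 8 cycles -> at most 0.5 issues/cycle. */
        printf("max issue rate = %.2f ops/cycle\n", throughput(4, 8));
        return 0;
    }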

Basic Performance Quantities
- Latency: every operation requires time to execute (e.g., instruction, memory, or network latency).
- Bandwidth: number of (parallel) operations completed per cycle (e.g., number of FPUs, DRAM channels, network links); each channel has a maximum (limited) throughput.
- Concurrency: total number of operations in flight.
- Little's Law relates these three: Concurrency = Latency × Bandwidth, or equivalently, Effective Throughput = Expressed Concurrency / Latency.
- Concurrency must be filled with parallel operations; you can't exceed peak throughput with superfluous concurrency.

Performance Optimization: Contending Forces
- Contending forces of device efficiency and usage/traffic: improve throughput (restructure to satisfy Little's Law) vs. reduce the volume of data (implementation & algorithmic optimization).
- Often boils down to several key challenges: management of data/task locality; management of data dependencies; management of communication; management of variable and dynamic parallelism.

Classes of Parallelism and Parallel Architectures (1/2)
Basically two kinds of parallelism in applications:
- Data-level parallelism (DLP): there are many data items that can be operated on at the same time.
- Task-level parallelism (TLP): tasks of work are created that can operate independently and largely in parallel.
Source: Computer Architecture: A Quantitative Approach, 5th ed. (Morgan Kaufmann, Hennessy & Patterson, 2011)

Classes of Parallelism and Parallel Architectures (2/2)
Computer HW in turn can exploit these two kinds of application parallelism in four major ways:
- Instruction-level parallelism: exploits DLP at modest levels with compiler help using ideas like pipelining, and at medium levels using ideas like speculative execution.
- Vector architectures and GPUs: exploit DLP by applying a single instruction to a collection of data in parallel (SIMD).
- Thread-level parallelism: exploits either DLP or TLP in a tightly coupled hardware model that allows for interaction among parallel threads.
- Request-level parallelism: exploits parallelism among largely decoupled tasks specified by the programmer or the OS.
Source: Computer Architecture: A Quantitative Approach, 5th ed. (Morgan Kaufmann, Hennessy & Patterson, 2011)

Uses of Parallelism
- Horizontal parallelism for throughput: more units working in parallel.
- Vertical parallelism for latency hiding: pipelining keeps units busy while waiting on resource, data, and control dependencies.
[Diagram: units A-D operating side by side (throughput) vs. staggered in time as pipeline stages (latency hiding).]

Program Execution Time
- Latency metric: program execution time in seconds.
  CPU time = Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle) = IC × CPI × CCT
- Your system architecture can affect all of them: CPI (cycles per instruction): memory latency, IO latency, ...; CCT (clock frequency): cache organization, power budget, ...; IC (instruction count): OS overhead, compiler choice, ...
- Are they independent?
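
A minimal sketch of the product IC × CPI × CCT (the instruction count, CPI, and clock values are made-up illustrative numbers):

    #include <stdio.h>

    /* Iron law: CPU time = IC * CPI * CCT (seconds per program). */
    static double cpu_time(double ic, double cpi, double cct_sec) {
        return ic * cpi * cct_sec;
    }

    int main(void) {
        double ic  = 1e9;        /* 10^9 dynamic instructions  */
        double cpi = 1.5;        /* average cycles/instruction */
        double cct = 1.0 / 3e9;  /* 3 GHz clock -> ~0.33 ns    */
        printf("CPU time = %.3f s\n", cpu_time(ic, cpi, cct));
        return 0;
    }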

Architecture Methods for Performance Enhancement
- Powerful instructions, MD-technique: multiple data operands per operation, i.e., SIMD (vector, sub-word SIMD extensions).
- Powerful instructions, MO-technique: multiple operations per instruction, i.e., a sophisticated ISA (e.g., CISC-like) or VLIW.
- Pipelining.
- Multiple instruction issue: single stream: superscalar; multiple streams: multithreading, multi-core.

Powerful Instructions: MD Technique
- MD-technique: multiple data operands per operation. SIMD: Single Instruction, Multiple Data.
- Vector instruction example:
  C: for (i = 0; i < 64; i++) c[i] = a[i] + 5*b[i];  (or simply: c = a + 5*b)
  Assembly:
    Set    vl, 64       ; vector length = 64
    Ldv    v1, 0(r2)    ; load vector b
    Mulvi  v2, v1, 5    ; v2 = 5 * b
    Ldv    v1, 0(r1)    ; load vector a
    Addv   v3, v1, v2   ; v3 = a + 5*b
    Stv    v3, 0(r3)    ; store vector c

Powerful Instructions: MD Technique, SIMD Computing
- All PEs (processing elements) execute the same operation.
- Typical mesh or hypercube connectivity.
- Exploits the data locality of, e.g., image-processing applications.
- Dense encoding (few instruction bits needed).
[Diagram: SIMD execution method; instructions 1..n broadcast over time to PE1..PEn.]

Powerful Instructions: MD Technique, Sub-word Parallelism
- SIMD on a restricted scale, for multimedia instructions.
- Short vectors added to existing ISAs for microprocessors.
- Examples: Intel MMX/SSE/AVX, ARM NEON, AMD 3DNow!

Powerful Instructions: MO Technique
- MO-technique: multiple operations per instruction. Two options: CISC (Complex Instruction Set Computer) and VLIW (Very Long Instruction Word).
- VLIW instruction example (one field per functional unit):
  FU 1: sub r8, r5, 3 | FU 2: and r1, r5, 12 | FU 3: mul r6, r5, r2 | FU 4: ld r3, 0(r5) | FU 5: bnez r5, 13

Parallelism: Data-Level Parallelism

Recall: Flynn's Classification of Processor Architecture
- SISD (Single Instruction, Single Data stream): uniprocessor.
- SIMD (Single Instruction, Multiple Data streams): the same instruction is executed by multiple processing units; e.g., multimedia processors, vector architectures.
- MISD (Multiple Instruction, Single Data stream): successive functional units operate on the same stream of data; rarely found in general-purpose commercial designs.
- MIMD (Multiple Instruction, Multiple Data streams): each processor has its own instruction and data streams; the most popular form of parallel processing. Single-user: high performance for one application. Multiprogrammed: running many tasks simultaneously (e.g., servers).
[Diagrams: instruction pool × data pool with one or more processing units (PU) for each class.]

Data-level Parallelism
Data parallelism focuses on distributing the data across different parallel computing nodes. It is usually found in:
- Multimedia computing: identical ops on streams or arrays of sound samples, pixels, or video frames.
- Scientific computing: weather forecasting, car-crash simulation, biological modeling.

DLP Kernels Dominate Many Computational Workloads

DLP and Throughput Computing (Source: Chuck Moore, AMD, 2011)

Data Parallelism & Loop-Level Parallelism (LLP)
- Data parallelism: similar independent/parallel computations on different elements of arrays, which usually result in independent (or parallel) loop iterations.
- A common way to increase parallelism among instructions is to exploit data parallelism among independent iterations of a loop, i.e., loop-level parallelism (LLP): unroll the loop either statically by the compiler or dynamically by hardware, which increases the size of the basic block. The resulting larger basic block provides more instructions that can be scheduled or reordered by the compiler/hardware to eliminate more stall cycles (a compiler-side sketch follows below).
- Example:
  for (i = 1; i <= 1000; i = i + 1)
      x[i] = x[i] + y[i];
  becomes 4 vector instructions:
    LV    ; load vector X
    LV    ; load vector Y
    ADDV  ; add vectors X, X, Y
    SV    ; store vector X
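
As a compiler-side sketch of the unrolling idea, here is a 4-way hand-unrolled loop in C (the unroll factor and function names are illustrative assumptions):

    /* Original loop: one add per iteration; each iteration's branch
     * and index update limit how much the hardware can overlap. */
    void vadd_rolled(float *x, const float *y, int n) {
        for (int i = 0; i < n; i++)
            x[i] = x[i] + y[i];
    }

    /* Unrolled by 4: one larger basic block with four independent adds
     * that the compiler/hardware can schedule in parallel. */
    void vadd_unrolled(float *x, const float *y, int n) {
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            x[i]     = x[i]     + y[i];
            x[i + 1] = x[i + 1] + y[i + 1];
            x[i + 2] = x[i + 2] + y[i + 2];
            x[i + 3] = x[i + 3] + y[i + 3];
        }
        for (; i < n; i++)   /* cleanup when n is not a multiple of 4 */
            x[i] = x[i] + y[i];
    }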

Resurgence of DLP
- The convergence of application demands and technology constraints drives architecture choice.
- New applications, such as graphics, machine vision, speech recognition, and machine learning, all require large numerical computations that are often trivially data parallel.
- SIMD-based architectures (vector SIMD, sub-word SIMD, SIMT/GPUs) are the most efficient way to execute these algorithms.

SIMD Classifications
- Vector architectures
- SIMD extensions (sub-word SIMD), e.g., Intel MMX: Multimedia Extensions (1996), SSE: Streaming SIMD Extensions (1999), AVX: Advanced Vector Extensions (2010)
- Graphics Processing Units (GPUs)

Vector Architectures
- Basic idea: read sets of data elements into vector registers, operate on those registers, and disperse the results back into memory.
- Registers are controlled by the compiler: register files act as compiler-controlled buffers, used to hide memory latency and leverage memory bandwidth.
- Vector loads/stores are deeply pipelined: pay for memory latency once per vector load/store, whereas regular loads/stores pay the memory latency for each vector element.
- SCALAR (1 operation): add r3, r1, r2 computes r1 + r2 into r3. VECTOR (N operations): vadd.vv v3, v1, v2 adds v1 and v2 element-wise into v3, over the vector length.

Vector Programming Model
- Scalar registers r0-r15; vector registers v0-v15, each holding elements [0] through [VLRMAX-1]; a vector length register (VLR).
- Vector arithmetic instructions, e.g., ADDV v3, v1, v2: element-wise v3[i] = v1[i] + v2[i] for i in [0, VLR-1].
- Vector load & store instructions, e.g., LV v1, r1, r2: load vector register v1 from memory starting at base address r1 with stride r2.

Multiple Datapaths
- Vector elements are interleaved across lanes. Example: V[0, 4, 8, ...] on lane 1, V[1, 5, 9, ...] on lane 2, etc.
- Compute on multiple elements per cycle. Example: lane 1 computes on V[0] and V[4] in one cycle.
- Modular, scalable design: no inter-lane communication is needed for most vector instructions.

Vector Processors (I)
- A vector is a one-dimensional array of numbers. Many scientific/commercial programs use vectors:
  for (i = 0; i <= 49; i++)
      C[i] = (A[i] + B[i]) / 2;
- A vector processor is one whose instructions operate on vectors rather than scalar (single-data) values.
- Basic requirements: need to load/store vectors → vector registers (contain vectors); need to operate on vectors of different lengths → vector length register (VLEN); elements of a vector might be stored apart from each other in memory → vector stride register (VSTR). Stride: the distance between two elements of a vector (a stride example follows below).
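
A small sketch of where a non-unit stride comes from (the row-major matrix layout below is an assumed example, not from the slides):

    #define ROWS 50
    #define COLS 100

    /* Row-major storage: m[i][j] lives at offset i*COLS + j.
     * Walking down a column touches every COLS-th element, i.e. a
     * vector load with stride COLS (what VSTR would hold). */
    float column_sum(const float m[ROWS][COLS], int j) {
        float s = 0.0f;
        for (int i = 0; i < ROWS; i++)
            s += m[i][j];   /* consecutive i -> addresses COLS floats apart */
        return s;
    }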

Vector Processors (II)
- A vector instruction performs an operation on each element in consecutive cycles: vector functional units are pipelined, and each pipeline stage operates on a different data element.
- Vector instructions allow deeper pipelines: no intra-vector dependencies → no hardware interlocking within a vector; no control flow within a vector; a known stride allows prefetching of vectors into cache/memory.

Vector Processor Pros
- No dependencies within a vector: pipelining and parallelization work well; can have very deep pipelines with no dependencies!
- Each instruction generates a lot of work: reduces instruction fetch bandwidth.
- Highly regular memory access pattern: interleaving across multiple banks for higher memory bandwidth; prefetching.
- No need to explicitly code loops: fewer branches in the instruction sequence.

Vector Processor Cons
- Still requires a traditional scalar unit (integer and FP) for the non-vector operations.
- Difficult to maintain precise interrupts (can't roll back all the individual operations already completed).
- The compiler or programmer has to vectorize programs.
- Not very efficient for small vector sizes.
- Not suitable/efficient for many different classes of applications.
- Requires a specialized, high-bandwidth memory system, usually built around heavily banked memory with data interleaving.

Vector Processor Limitations
- Performance of a vector instruction depends on the length of the operand vectors.
- Initiation rate: the rate at which individual operations can start in a functional unit; for fully pipelined units this is one operation per cycle.
- Start-up time (latency): the time it takes to produce the first element of the result; depends on how deep the pipelines of the functional units are; especially large for the load/store unit.

Multimedia SIMD Extensions
Key ideas:
- Media applications operate on data types narrower than the native word size: video & graphics systems use 8 bits per primary color; audio samples use 8-16 bits.
- No memories associated with the ALUs, but a pool of relatively wide (64- to 256-bit) registers that store several narrower operands. E.g., a 256-bit adder performs 16 simultaneous operations on 16-bit data, or 32 simultaneous operations on 8-bit data.
- No direct communication between ALUs; data moves via registers and special shuffling/permutation instructions.
- Not co-processors or supercomputers, but tightly integrated into the CPU pipeline (a short intrinsics sketch follows below).
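
As an illustration of sub-word SIMD from the programmer's side, a minimal SSE2 sketch in C; it assumes an x86 compiler/CPU with SSE2 and uses the standard <emmintrin.h> intrinsics:

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdio.h>

    int main(void) {
        /* Eight 16-bit operands packed into one 128-bit register. */
        __m128i a = _mm_set_epi16(8, 7, 6, 5, 4, 3, 2, 1);
        __m128i b = _mm_set_epi16(80, 70, 60, 50, 40, 30, 20, 10);

        /* One instruction performs eight 16-bit adds in parallel. */
        __m128i c = _mm_add_epi16(a, b);

        short out[8];
        _mm_storeu_si128((__m128i *)out, c);
        for (int i = 0; i < 8; i++)
            printf("%d ", out[i]);   /* prints: 11 22 33 44 55 66 77 88 */
        printf("\n");
        return 0;
    }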

Multimedia SIMD Extensions (cont'd)
- Meant for programmers to utilize, not for compilers to generate (though recent x86 compilers are capable of it for FP-intensive apps).
- Why is it popular? It costs little to add to the standard arithmetic unit; it is easy to implement; it needs smaller memory bandwidth than a vector machine; data transfers are separate and aligned in memory (a vector machine's single instruction issuing 64 memory accesses risks a page fault in the middle of the vector!); it uses much smaller register space; fewer operands; no need for the sophisticated mechanisms of vector architectures.

Multimedia Extensions (aka SIMD Extensions)
[Register view: one 64-bit register treated as 2 × 32b, 4 × 16b, or 8 × 8b fields.]
- Very short vectors added to existing ISAs for microprocessors.
- Use an existing wide register split into smaller-bit registers; the Lincoln Labs TX-2 from 1957 had a 36b datapath split into 2 × 18b or 4 × 9b.
- Newer designs have wider registers: 128b for PowerPC AltiVec and Intel SSE2/3/4; 256b for Intel AVX.
- A single instruction operates on all elements within a register, e.g., four 16-bit adds in parallel.

SIMD Multimedia Extensions like SSE-4
- At the core of multimedia extensions: SIMD parallelism over variable-sized data fields, where vector length = register width / type size.
- E.g., one wide register in V0..V31 holds sixteen 8-bit operands, eight 16-bit operands, or four 32-bit operands (one wide unit).

Multimedia Extensions versus Vectors
- Limited instruction set: no vector length control; no strided load/store or scatter/gather; unit-stride loads must be aligned to a 64/128-bit boundary.
- Limited vector register length: requires superscalar dispatch to keep multiply/add/load units busy; loop unrolling to hide latencies increases register pressure.
- Trend toward fuller vector support in microprocessors: better support for misaligned memory accesses; support for double precision (64-bit floating point); the new Intel AVX spec (announced April 2008) with 256b vector registers (expandable up to 1024b).

Parallelism: Instruction-Level Parallelism

ILP?
- Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously; the potential overlap among instructions is called instruction-level parallelism.
- There are two approaches to ILP: the dynamic approach, where mainly hardware locates the parallelism → superscalar; and the static approach, which largely relies on software to locate parallelism → VLIW (Very Long Instruction Word).
- How much ILP exists in programs is very application specific. In certain fields, such as graphics and scientific computing, the amount can be very large; however, workloads such as cryptography exhibit much less parallelism.

ILP vs. PLP
- ILP (instruction-level parallelism): overlap individual machine operations (add, mul, load, ...) so that they execute in parallel.
- PLP (processor-level parallelism): have separate processors get separate chunks of the program (processors programmed to do so).

Micro-architectural Techniques for ILP
- Instruction pipelining.
- Superscalar or VLIW: multiple execution units are used to execute multiple instructions in parallel.
- Out-of-order execution: note that this technique is independent of both pipelining and superscalar; register renaming is used to enable out-of-order execution.
- Speculative execution: execution of complete instructions or parts of instructions before being certain whether this execution should take place; branch prediction is used with speculative execution.

Micro-architectural Techniques for ILP (pipeline view)
[Diagram: in-order front end (branch prediction, I-cache, fetch unit, instruction (fetch) buffer, decode/rename, dispatch) feeding out-of-order execution (reservation stations, Int/Int/Float/Float/LS/LS units), then in-order retirement (reorder buffer, retire, write buffer, D-cache).]
- Modern processor techniques: deep pipelines; superscalar issue; out-of-order, speculative execution; branch prediction; register renaming, dataflow order.
- Execution flow: in-order speculative fetch; out-of-order execute; in-order commit, using the reorder buffer for precise exceptions.

ILP (Parallel Instruction Execution) Constraints
[Taxonomy: ILP constraints split into structural dependences (resource contention) and code dependences (the sequential semantics of the program). Code dependences split into control dependences and data dependences; data dependences comprise true dependences (RAW) and storage conflicts (not present in in-order processors): anti-dependences (WAR) and output dependences (WAW).]

Types of Dependencies
- Structural dependence (structural hazard): the HW perspective.
- Code dependence: the SW (program) perspective. Data dependence (data hazard): true dependence, plus name dependencies (output dependence, anti-dependence). Control dependence (control hazard).
- Note: "hazards" is the H/W terminology; "dependencies" is the S/W terminology.

Visualizing Pipelining
[Diagram: four instructions in program order flowing through Ifetch, Reg, ALU, DMem, Reg across clock cycles 1-7, each instruction starting one cycle after the previous one.]

Pipelining
- Overlaps the execution of instructions by exploiting instruction-level parallelism. Pipelining became a universal technique in 1985.
- Recall that CPU time (latency) = Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle) = IC × CPI × CCT.
- Performance enhancement options: reduce the number of instructions per program (IC): given the ISA, this fully depends on SW (compiler, programmer); reduce the number of cycles per instruction (CPI) and the number of seconds per cycle (CCT): these mostly depend on HW organization & implementation technology under system requirements.
- Pipelining can reduce CCT & (effective) CPI.

Pipelining Is Not Quite That Easy!
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
- Structural hazards: HW cannot support this combination of instructions (a single person to fold and put clothes away).
- Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (missing sock).
- Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).
Note: "hazards" is the H/W terminology; "dependencies" is the S/W terminology.

Structural Hazards
- Occur when two or more different instructions want to use the same hardware resource in the same cycle; e.g., MEM uses the same memory port as IF, as shown in this slide.
[Diagram: Load followed by Instr 1-4 in a 5-stage pipeline; Load's DMem access collides with a later instruction's Ifetch on a single memory port.]

Structural Hazards (cont'd)
- Structural hazards are reduced with these rules: each instruction uses a resource at most once; always use the resource in the same pipeline stage; use the resource for one cycle only.
- ISAs are designed with this in mind, but it is sometimes very complex to do; it heavily depends on programs and hardware resources.
- Some common structural hazards: memory access conflicts; floating point (since many floating-point instructions require many cycles, it's easy for them to interfere with each other); starting up more of one type of instruction than there are resources for.

Data Hazards
[Diagram: add r1,r2,r3 followed by sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 in a 5-stage (IF, ID/RF, EX, MEM, WB) pipeline.]
- The use of the result of the ADD instruction in the next three instructions causes a hazard, since the register is not written until after those instructions read it.

Data Hazards: RAW, WAW, WAR
- Read After Write (RAW): caused by a true dependence, a need for communication. Instr J tries to read an operand before Instr I writes it. Happens in concurrent execution or OoO.
  I: add r1, r2, r3
  J: sub r4, r1, 43
- Write After Write (WAW): caused by an output dependence and the re-use of the name r1. Instr J tries to write an operand (r1) before Instr I writes it.
  I: sub r1, r4, r3
  J: add r1, r2, r3
  K: mul r6, r1, r7
- Write After Read (WAR): caused by an anti-dependence and the re-use of the name r1. Instr J tries to write an operand (r1) before Instr I reads it.
  I: add r4, r1, r3
  J: add r1, r2, r3
  K: mul r6, r1, r7
- Solutions for data hazards: stalling; forwarding (connect the new value directly to the next stage); speculation (w/ HW) or reordering (w/ compiler and/or HW). See the renaming sketch below.
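
Name dependences (WAR/WAW) vanish once the reused name gets a fresh register, which is what hardware register renaming does. A C-level analogue (purely illustrative; real renaming happens on physical registers in hardware):

    /* WAR/WAW at the C level: reusing the name t serializes its uses. */
    int false_dependence(int a, int b, int c) {
        int t = a + b;   /* I: write t                                */
        int u = t * c;   /* J: read t (RAW: a true dependence)        */
        t = b + c;       /* K: write t again (WAW/WAR on the name t)  */
        return u + t;
    }

    /* "Renamed" version: a fresh name t2 removes the false dependence,
     * so the two sums can execute in parallel. Only the true (RAW)
     * dependence t1 -> u remains. */
    int renamed(int a, int b, int c) {
        int t1 = a + b;
        int u  = t1 * c;
        int t2 = b + c;
        return u + t2;
    }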

Control Hazards
A control hazard arises when we need to find the destination of a branch and can't fetch any new instructions until we know that destination.
[Diagram, 5-stage pipeline:
  10: beq r1,r3,36
  14: and r2,r3,r5
  18: or r6,r1,r7
  22: add r8,r1,r9
  36: xor r10,r1,r11]

Five Branch Hazard Alternatives
- #1: Stall until the branch direction is clear.
- #2: Predict branch not taken: execute successor instructions in sequence; squash instructions in the pipeline if the branch is actually taken; takes advantage of the late pipeline state update. 47% of MIPS branches are not taken on average; PC+4 is already calculated, so use it to get the next instruction.
- #3: Predict branch taken: 53% of MIPS branches are taken on average, but MIPS hasn't calculated the branch target address yet, so it still incurs a 1-cycle branch penalty. On other machines the branch target may be known before the outcome.
- #4: Execute both paths.
- #5: Delayed branch: define the branch to take place AFTER a following instruction:
    branch instruction
    sequential successor 1
    sequential successor 2
    ...
    sequential successor n
    branch target if taken
  A 1-slot delay allows a proper decision and branch target address in a 5-stage pipeline.

Pipelining (datapath view)
- Pipelined design, one stage per cycle: instruction fetch (PC, I-cache); decode & read operands (decoder, register file); execute (ALU); memory access (D-cache); writeback. Overlap instructions. Cost: pipeline registers.
- To reduce stalls: forwarding paths for data dependencies; predict-not-taken branches for control dependencies; instruction & data caches to reduce memory stalls.

Pipelining and ILP
- Higher clock frequency (lower CCT): deeper pipelines; decompose pipeline stages into smaller stages and overlap more instructions.
- Lower CPI_base: wider pipelines; insert multiple instructions in parallel into the pipeline.
- Lower CPI_stall: diversified pipelines for different functional units; out-of-order execution.
- Balance conflicting goals: deeper & wider pipelines mean more control hazards → branch prediction (speculation).

Deep Pipelining
- Idea: break up an instruction into N stages (e.g., Fetch 1, Fetch 2, Decode, Read Registers, ALU, Memory 1, Memory 2, Write Registers). Ideal CCT = 1/N compared to non-pipelined, so let's use a large N.
- Other motivations for deep pipelines: not all basic operations have the same latency (integer ALU, FP ALU, cache access), and it is difficult to fit them into one pipeline stage, since the CCT must be large enough to fit the longest one. Break some of them into multiple pipeline stages, e.g., data cache access in 2 stages, FP add in 2 stages, FP mul in 3 stages.

Limits to Pipeline Depth
- Each pipeline stage introduces some overhead (O): the delay of pipeline registers; inequalities in work per stage (work cannot be broken into stages at arbitrary points); clock skew (clocks to different registers may not be perfectly aligned).
- If the original CCT was T, with N stages the CCT is T/N + O. As N → ∞, speedup = T / (T/N + O) → T/O, assuming that IC and CPI stay constant.
- Eventually the overhead dominates and leads to diminishing returns.
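
A quick numeric sketch of the speedup curve T / (T/N + O) (the 10 ns logic delay and 0.2 ns overhead are invented for illustration):

    #include <stdio.h>

    /* Pipeline speedup with per-stage overhead O: T / (T/N + O). */
    static double speedup(double t, int n, double o) {
        return t / (t / n + o);
    }

    int main(void) {
        double t = 10.0, o = 0.2;  /* logic delay 10 ns, overhead 0.2 ns */
        for (int n = 1; n <= 64; n *= 2)
            printf("N=%2d  speedup=%5.2f\n", n, speedup(t, n, o));
        /* Saturates toward T/O = 50, with ever-smaller gains per doubling. */
        return 0;
    }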

Pipelining Limits
[Chart: Pentium 3 vs. Pentium 4 pipeline depth and frequency; Grochowski, Intel, 1997.]
- High clock frequency, but modest performance gains, due to memory latency and branch delays.
- Power consumption increases dangerously!

Wide or Superscalar Pipelines
- Idea: operate on N instructions each cycle (parallelism at the instruction level): CPI_base = 1/N.
- Options (from simpler to harder): one integer and one floating-point instruction; any N=2 instructions; any N=4 instructions; any N=? instructions. What are the limits here?

Diversified Pipelines
- Idea: decouple the execution portion of the pipeline for different instructions: separate pipelines for simple integer, integer multiply, FP, and load/store, with different depths per unit.
- Advantage: avoids unnecessary stalls; e.g., a slow FP instruction does not block independent integer instructions.
- Disadvantages: WAW hazards; imprecise (out-of-order) exceptions.

ILP Architectures
- A computer architecture is a contract (the instruction format and the interpretation of the bits that constitute an instruction) between the class of programs that are written for the architecture and the set of processor implementations of that architecture.
- In ILP architectures, the contract additionally covers information embedded in the program pertaining to the available parallelism between instructions and operations in the program.

Sequential Architecture and Superscalar Processors
- The program contains no explicit information regarding dependencies that exist between instructions.
- Dependencies between instructions must be determined by the hardware; it is only necessary to determine dependencies with sequentially preceding instructions that have been issued but not yet completed.
- The compiler may re-order instructions to facilitate the hardware's task of extracting parallelism.

Scalar, Superscalar, Deep Pipeline
- Scalar processor: one instruction passes through in each cycle.
- Superscalar processor: more than one instruction passes through in each cycle; for an m-way superscalar, the effective CPI is 1/m that of the scalar pipeline.
[Diagram: 3-way pipelined superscalar.]

Superscalar Performance
- What does the performance spectrum look like? If all instructions were dependent: no speedup, superscalarity buys us nothing. If all instructions were independent: speedup = N, where N is the superscalarity.
- Again, the key is typical program behavior: some parallelism exists.

Simplified View of an OoO Superscalar Processor
[Diagram: in-order front end (branch prediction, I-cache, fetch unit at the issue width, instruction (fetch) buffer, decode/rename: read registers or assign register tags, dispatch: advance instructions to reservation stations); out-of-order execution (reservation stations feed Int/Int/Float/Float/LS/LS units, which monitor register tags, receive forwarded data, and issue when all operands are ready); in-order commit (reorder buffer, write buffer, retire, D-cache). Instruction numbers in the figure show program order preserved at fetch and retire but not at execute.]

Independence Architecture and VLIW Processors
- By knowing which operations are independent, the hardware needs no further checking to determine which instructions can be issued in the same cycle.
- Since the set of independent operations is far larger than the set of dependent operations, only a subset of the independent operations is specified.
- The compiler may additionally specify on which functional unit and in which cycle an operation is executed, so the hardware needs to make no run-time decisions.

VLIW Processors
- Operation vs. instruction: an operation is a unit of computation (add, load, branch; what a sequential architecture calls an instruction); an instruction is a set of operations that are intended to be issued simultaneously.
- The compiler decides which operation goes into each instruction (scheduling); all operations that are supposed to begin at the same time are packaged into a single VLIW instruction.
[Pipeline diagram: two VLIW instructions, each an IF/ID followed by three parallel EX-M-WB lanes.]

VLIW: Very Long Instruction Word
[Instruction format: Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2; two integer units (single-cycle latency), two load/store units (three-cycle latency), two floating-point units (four-cycle latency).]
- Multiple parallel operations are packed into one long instruction word.
- The compiler schedules parallel execution and must avoid data hazards (no interlocks).

VLIW Strengths
- In hardware it is very simple: a collection of functional units (adders, multipliers, branch units, etc.) connected by a bus, plus some registers and caches.
- More silicon goes to the actual processing (rather than being spent on branch prediction, for example), and it should run fast, as the only limit is the latency of the functional units themselves.
- Programming a VLIW chip is very much like writing microcode.

VLIW Limitations
- The need for a powerful compiler.
- Increased code size arising from aggressive scheduling policies.
- Larger memory bandwidth and register-file bandwidth.
- Limitations due to binary compatibility across implementations.

VLIW Past & Future
- Decline of VLIWs for general-purpose systems: they could not be integrated in a single chip, and binary compatibility between implementations was required.
- Rediscovery of VLIW in embedded systems: no more integrability issues, and binary incompatibility is not relevant (for a DSP, unlike a CPU).
- Advantages of VLIW: simplified hardware; the architecture can be optimized ad hoc to achieve ILP.

Summary: Superscalar vs. VLIW

Aspect | Superscalar | VLIW
Additional info required in the program | None | Minimally, a partial list of independences; a complete specification of when and where each operation is to be executed
Dependence analysis | Performed by HW | Performed by compiler
Independence analysis | Performed by HW | Performed by compiler
Scheduling | Performed by HW | Performed by compiler
Role of compiler | Rearranges the code to make the analysis and scheduling HW more successful | Replaces virtually all the analysis and scheduling HW

ILP Open Problems
- Pipelined scheduling: optimized scheduling of pipelined behavioral descriptions; two simple types of pipelining (structural and functional).
- Controller cost: most scheduling algorithms do not consider the controller cost, which is directly dependent on the controller style used during scheduling.
- Area constraints: the resource-constrained algorithms could have better interaction between scheduling and floorplanning.
- Realism: scheduling realistic design descriptions that contain several special language constructs; using more realistic libraries and cost functions; scheduling algorithms must also be expanded to incorporate different target architectures.

Summary: Limits to ILP
- Doubling issue rates above today's 3-6 instructions per clock probably requires a processor to: issue 3-4 data-memory accesses per cycle; resolve 2-3 branches per cycle; rename and access over 20 registers per cycle; and fetch 12-24 instructions per cycle.
- The complexity of implementing these capabilities is likely to mean sacrifices in maximum clock rate: the widest-issue processor tends to be the slowest in terms of clock rate. Also consider ROI in terms of area and power.

Summary: Limits to ILP (cont'd)
- Most ways to increase performance also boost power consumption. The key question is energy efficiency: does a method increase power consumption faster than it boosts performance?
- Multiple-issue techniques are energy inefficient: they incur logic overhead that grows faster than the issue rate, and there is a growing gap between peak issue rates and sustained performance. The number of transistors switching = f(peak issue rate), while performance = f(sustained rate); the growing gap between peak and sustained performance means increasing energy per unit of performance.

Evolved Solutions or Alternatives
- MT (multithreaded) approach: more tightly coupled than MP.
  - Decentralized multithreaded architectures: hardware for inter-thread synchronization and communication; Multiscalar (U. of Wisconsin), Superthreading (U. of Minnesota).
  - Centralized multithreaded architectures: share pipelines among multiple threads; TERA, SMT (throughput-oriented); Trace Processor, DMT (performance-oriented).
- MP (multiprocessor) approach: decentralize all resources; multiprocessing on a single chip; communicate through shared memory (Stanford Hydra) or through messages (MIT RAW).