COMPUTER ORGANIZATION AND DESIGN

Heather Alexander

The Hardware/Software Interface, ARM Edition

Chapter 6: Parallel Processors from Client to Cloud

6.1 Introduction

Introduction
  Goal: connecting multiple computers to get higher performance
    Multiprocessors
    Scalability, availability, power efficiency
  Task-level (process-level) parallelism
    High throughput for independent jobs
  Parallel processing program
    Single program run on multiple processors
  Multicore microprocessors
    Chips with multiple processors (cores)

Hardware and Software
  Hardware
    Serial: e.g., Pentium 4
    Parallel: e.g., quad-core Xeon e5345
  Software
    Sequential: e.g., matrix multiplication
    Concurrent: e.g., operating system
  Sequential/concurrent software can run on serial/parallel hardware
    Challenge: making effective use of parallel hardware

What We've Already Covered
  2.11: Parallelism and Instructions
    Synchronization
  3.6: Parallelism and Computer Arithmetic
    Subword Parallelism
  4.10: Parallelism and Advanced Instruction-Level Parallelism
  5.10: Parallelism and Memory Hierarchies
    Cache Coherence

6.2 The Difficulty of Creating Parallel Processing Programs

Parallel Programming
  Parallel software is the problem
  Need to get significant performance improvement
    Otherwise, just use a faster uniprocessor, since it's easier!
  Difficulties
    Partitioning
    Coordination
    Communications overhead

Amdahl's Law
  Sequential part can limit speedup
  Example: 100 processors, 90× speedup?
    $T_{new} = T_{parallelizable}/100 + T_{sequential}$
    $Speedup = \frac{1}{(1 - F_{parallelizable}) + F_{parallelizable}/100} = 90$
    Solving: $F_{parallelizable} = 0.999$
  Need sequential part to be 0.1% of original time
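The Amdahl's Law example above can be checked numerically. The following is a minimal sketch, not part of the original slides; the function names `amdahl_speedup` and `amdahl_required_fraction` are chosen here for illustration. The second function simply solves the speedup formula for the parallelizable fraction.

```c
#include <assert.h>

/* Speedup predicted by Amdahl's Law.
   f: fraction of the original execution time that is parallelizable (0..1)
   p: number of processors */
double amdahl_speedup(double f, int p)
{
    assert(f >= 0.0 && f <= 1.0 && p >= 1);
    return 1.0 / ((1.0 - f) + f / (double)p);
}

/* Parallelizable fraction needed to reach target_speedup on p processors.
   From 1/S = 1 - f * (1 - 1/p), so f = (1 - 1/S) / (1 - 1/p). */
double amdahl_required_fraction(double target_speedup, int p)
{
    assert(p > 1 && target_speedup >= 1.0 && target_speedup <= (double)p);
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / (double)p);
}
```

Calling `amdahl_required_fraction(90.0, 100)` gives a value just under 0.999, which matches the slide's conclusion that only about 0.1% of the original time may remain sequential.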

Scaling Example
  Workload: sum of 10 scalars, and 10 × 10 matrix sum
    Speed up from 10 to 100 processors
  Single processor: Time = (10 + 100) × t_add
  10 processors
    Time = 10 × t_add + 100/10 × t_add = 20 × t_add
    Speedup = 110/20 = 5.5 (55% of potential)
  100 processors
    Time = 10 × t_add + 100/100 × t_add = 11 × t_add
    Speedup = 110/11 = 10 (10% of potential)
  Assumes load can be balanced across processors

Scaling Example (cont)
  What if matrix size is 100 × 100?
  Single processor: Time = (10 + 10000) × t_add
  10 processors
    Time = 10 × t_add + 10000/10 × t_add = 1010 × t_add
    Speedup = 10010/1010 = 9.9 (99% of potential)
  100 processors
    Time = 10 × t_add + 10000/100 × t_add = 110 × t_add
    Speedup = 10010/110 = 91 (91% of potential)
  Assuming load balanced
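The arithmetic in the scaling example can be reproduced with a few lines of C. This is an illustrative sketch under the slide's own assumptions (the scalar sum is sequential, the matrix sum is perfectly load balanced); the names `scaled_time` and `scaled_speedup` are hypothetical.

```c
#include <assert.h>

/* Time, in units of t_add, to sum `scalars` values sequentially plus an
   n x n matrix whose additions are split evenly across p processors. */
double scaled_time(int scalars, int n, int p)
{
    assert(scalars >= 0 && n >= 0 && p >= 1);
    return (double)scalars + ((double)n * (double)n) / (double)p;
}

/* Speedup of p processors over a single processor for the same workload. */
double scaled_speedup(int scalars, int n, int p)
{
    return scaled_time(scalars, n, 1) / scaled_time(scalars, n, p);
}
```

For example, `scaled_speedup(10, 10, 100)` evaluates 110/11 and `scaled_speedup(10, 100, 100)` evaluates 10010/110, the two 100-processor cases worked above.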

Strong vs Weak Scaling
  Strong scaling: problem size fixed
    As in example
  Weak scaling: problem size proportional to number of processors
    10 processors, 10 × 10 matrix
      Time = 20 × t_add
    100 processors, 32 × 32 matrix
      Time = 10 × t_add + 1000/100 × t_add = 20 × t_add
    Constant performance in this example

6.3 SISD, MIMD, SIMD, SPMD, and Vector

Instruction and Data Streams
  An alternate classification

                                     Data Streams
                                     Single                     Multiple
    Instruction Streams   Single     SISD: Intel Pentium 4      SIMD: SSE instructions of x86
                          Multiple   MISD: No examples today    MIMD: Intel Xeon e5345

  SPMD: Single Program Multiple Data
    A parallel program on a MIMD computer
    Conditional code for different processors

Example: DAXPY (Y = a × X + Y)
  Conventional LEGv8 code:
          LDURD   D0,[X28,a]     //load scalar a
          ADDI    X0,X19,#512    //upper bound of what to load
    loop: LDURD   D2,[X19,#0]    //load x(i)
          FMULD   D2,D2,D0       //a × x(i)
          LDURD   D4,[X20,#0]    //load y(i)
          FADDD   D4,D4,D2       //a × x(i) + y(i)
          STURD   D4,[X20,#0]    //store into y(i)
          ADDI    X19,X19,#8     //increment index to x
          ADDI    X20,X20,#8     //increment index to y
          CMPB    X0,X19         //compute bound
          B.NE    loop           //check if done
  Vector LEGv8 code:
          LDURD   D0,[X28,a]     //load scalar a
          LDURDV  V1,[X19,#0]    //load vector x
          FMULDVS V2,V1,D0       //vector-scalar multiply
          LDURDV  V3,[X20,#0]    //load vector y
          FADDDV  V4,V2,V3       //add y to product
          STURDV  V4,[X20,#0]    //store the result

Vector Processors
  Highly pipelined function units
  Stream data from/to vector registers to units
    Data collected from memory into registers
    Results stored from registers to memory
  Example: Vector extension to LEGv8
    32 × 64-element registers (64-bit elements)
    Vector instructions
      lv, sv: load/store vector
      addv.d: add vectors of double
      addvs.d: add scalar to each element of vector of double
  Significantly reduces instruction-fetch bandwidth
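For readers less familiar with LEGv8, the DAXPY loop shown earlier corresponds to the following C routine. This is a sketch for orientation only; the function name and signature are assumptions, not taken from the slides. The 512-byte bound in the assembly corresponds to n = 64 double-precision elements.

```c
#include <assert.h>
#include <stddef.h>

/* Y = a * X + Y over n double-precision elements.
   Each iteration is independent of the others (no loop-carried dependence),
   which is what lets a vector unit process many elements per instruction. */
void daxpy(size_t n, double a, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The scalar loop executes roughly nine instructions per element, while the vector version issues six instructions for the whole 64-element vector, which is the instruction-fetch saving the slide refers to.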

Vector vs. Scalar
  Vector architectures and compilers
    Simplify data-parallel programming
    Explicit statement of absence of loop-carried dependences
      Reduced checking in hardware
    Regular access patterns benefit from interleaved and burst memory
    Avoid control hazards by avoiding loops
  More general than ad-hoc media extensions (such as MMX, SSE)
    Better match with compiler technology

SIMD
  Operate elementwise on vectors of data
    E.g., MMX and SSE instructions in x86
      Multiple data elements in 128-bit wide registers
  All processors execute the same instruction at the same time
    Each with different data address, etc.
  Simplifies synchronization
  Reduced instruction control hardware
  Works best for highly data-parallel applications

Vector vs. Multimedia Extensions
  Vector instructions have a variable vector width; multimedia extensions have a fixed width
  Vector instructions support strided access; multimedia extensions do not
  Vector units can be a combination of pipelined and arrayed functional units

6.4 Hardware Multithreading

Multithreading
  Performing multiple threads of execution in parallel
    Replicate registers, PC, etc.
    Fast switching between threads
  Fine-grain multithreading
    Switch threads after each cycle
    Interleave instruction execution
    If one thread stalls, others are executed
  Coarse-grain multithreading
    Only switch on long stall (e.g., L2-cache miss)
    Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)

Simultaneous Multithreading
  In multiple-issue dynamically scheduled processor
    Schedule instructions from multiple threads
    Instructions from independent threads execute when function units are available
    Within threads, dependencies handled by scheduling and register renaming
  Example: Intel Pentium-4 HT
    Two threads: duplicated registers, shared function units and caches

Multithreading Example
  [Figure]

Future of Multithreading
  Will it survive? In what form?
  Power considerations => simplified microarchitectures
    Simpler forms of multithreading
  Tolerating cache-miss latency
    Thread switch may be most effective
  Multiple simple cores might share resources more effectively

6.5 Multicore and Other Shared Memory Multiprocessors

Shared Memory
  SMP: shared memory multiprocessor
    Hardware provides single physical address space for all processors
    Synchronize shared variables using locks
    Memory access time
      UMA (uniform) vs. NUMA (nonuniform)

Example: Sum Reduction
  Sum 100,000 numbers on 100 processor UMA
    Each processor has ID: 0 ≤ Pn ≤ 99
    Partition 1000 numbers per processor
    Initial summation on each processor

      sum[Pn] = 0;
      for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
        sum[Pn] = sum[Pn] + A[i];

  Now need to add these partial sums
    Reduction: divide and conquer
    Half the processors add pairs, then quarter, ...
    Need to synchronize between reduction steps

Example: Sum Reduction

      half = 100;
      repeat
        synch();
        if (half%2 != 0 && Pn == 0)
          sum[0] = sum[0] + sum[half-1];
          /* Conditional sum needed when half is odd;
             Processor0 gets missing element */
        half = half/2; /* dividing line on who sums */
        if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
      until (half == 1);
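The reduction pseudocode above can be traced with a sequential simulation. The sketch below is an illustration, not the slides' code: it runs on one thread, each iteration of a `for Pn` loop stands in for one processor, and the end of that loop stands in for the `synch()` barrier. The macro and function names are assumptions.

```c
#include <assert.h>

#define NPROC 100        /* number of simulated processors (must be >= 2) */
#define PER_PROC 1000    /* numbers summed by each processor */

/* Sequentially simulates the shared-memory sum reduction over
   A[0 .. NPROC*PER_PROC - 1] and returns the total. */
double sum_reduction(const double *A)
{
    double sum[NPROC];
    assert(A != 0);

    /* Initial summation: each processor sums its own partition. */
    for (int Pn = 0; Pn < NPROC; Pn++) {
        sum[Pn] = 0.0;
        for (int i = PER_PROC * Pn; i < PER_PROC * (Pn + 1); i++)
            sum[Pn] = sum[Pn] + A[i];
    }

    /* Tree reduction: halve the number of active processors each step. */
    int half = NPROC;
    do {
        if (half % 2 != 0)                  /* odd count: processor 0 */
            sum[0] = sum[0] + sum[half - 1]; /* picks up the extra element */
        half = half / 2;                    /* dividing line on who sums */
        for (int Pn = 0; Pn < half; Pn++)
            sum[Pn] = sum[Pn] + sum[Pn + half];
    } while (half != 1);

    return sum[0];
}
```

With 100 processors, `half` takes the values 50, 25, 12, 6, 3, 1, so the odd-count branch fires twice (at 25 and at 3). On real hardware the inner `for Pn` loops would run concurrently, with a barrier between steps.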

6.6 Introduction to Graphics Processing Units

History of GPUs
  Early video cards
    Frame buffer memory with address generation for video output
  3D graphics processing
    Originally high-end computers (e.g., SGI)
    Moore's Law => lower cost, higher density
    3D graphics cards for PCs and game consoles
  Graphics Processing Units
    Processors oriented to 3D graphics tasks
    Vertex/pixel processing, shading, texture mapping, rasterization

Graphics in the System
  [Figure]

GPU Architectures
  Processing is highly data-parallel
    GPUs are highly multithreaded
    Use thread switching to hide memory latency
      Less reliance on multi-level caches
    Graphics memory is wide and high-bandwidth
  Trend toward general purpose GPUs
    Heterogeneous CPU/GPU systems
    CPU for sequential code, GPU for parallel code
  Programming languages/APIs
    DirectX, OpenGL
    C for Graphics (Cg), High Level Shader Language (HLSL)
    Compute Unified Device Architecture (CUDA)

Example: NVIDIA Tesla
  [Figure: streaming multiprocessor with 8 × streaming processors]

Example: NVIDIA Tesla
  Streaming Processors
    Single-precision FP and integer units
    Each SP is fine-grained multithreaded
  Warp: group of 32 threads
    Executed in parallel, SIMD style
      8 SPs × 4 clock cycles
    Hardware contexts for 24 warps
      Registers, PCs, ...

Classifying GPUs
  Don't fit nicely into SIMD/MIMD model
    Conditional execution in a thread allows an illusion of MIMD
      But with performance degradation
      Need to write general purpose code with care

                                      Static: Discovered at Compile Time   Dynamic: Discovered at Runtime
    Instruction-Level Parallelism     VLIW                                 Superscalar
    Data-Level Parallelism            SIMD or Vector                       Tesla Multiprocessor

GPU Memory Structures
  [Figure]

Putting GPUs into Perspective

  Feature                                                             Multicore with SIMD   GPU
  SIMD processors                                                     4 to 8                8 to 16
  SIMD lanes/processor                                                2 to 4                8 to 16
  Multithreading hardware support for SIMD threads                    2 to 4                16 to 32
  Typical ratio of single-precision to double-precision performance   2:1                   2:1
  Largest cache size                                                  8 MB                  0.75 MB
  Size of memory address                                              64-bit                64-bit
  Size of main memory                                                 8 GB to 256 GB        4 GB to 6 GB
  Memory protection at level of page                                  Yes                   Yes
  Demand paging                                                       Yes                   No
  Integrated scalar processor/SIMD processor                          Yes                   No
  Cache coherent                                                      Yes                   No

Guide to GPU Terms
  [Figure]

6.7 Clusters, WSC, and Other Message-Passing MPs

Message Passing
  Each processor has private physical address space
  Hardware sends/receives messages between processors

Loosely Coupled Clusters
  Network of independent computers
    Each has private memory and OS
    Connected using I/O system
      E.g., Ethernet/switch, Internet
  Suitable for applications with independent tasks
    Web servers, databases, simulations, ...
  High availability, scalable, affordable
  Problems
    Administration cost (prefer virtual machines)
    Low interconnect bandwidth
      cf. processor/memory bandwidth on an SMP

Sum Reduction (Again)
  Sum 100,000 on 100 processors
  First distribute 1000 numbers to each
    Then do partial sums

      sum = 0;
      for (i = 0; i<1000; i = i + 1)
        sum = sum + AN[i];

  Reduction
    Half the processors send, other half receive and add
    Then quarter send, quarter receive and add, ...

Sum Reduction (Again)
  Given send() and receive() operations

      limit = 100; half = 100; /* 100 processors */
      repeat
        half = (half+1)/2; /* send vs. receive dividing line */
        if (Pn >= half && Pn < limit)
          send(Pn - half, sum);
        if (Pn < (limit/2))
          sum = sum + receive();
        limit = half; /* upper limit of senders */
      until (half == 1); /* exit with final sum */

  Send/receive also provide synchronization
  Assumes send/receive take similar time to addition

Grid Computing
  Separate computers interconnected by long-haul networks
    E.g., Internet connections
    Work units farmed out, results sent back
  Can make use of idle time on PCs
    E.g., SETI@home, World Community Grid
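The send/receive reduction above can also be traced on a single thread. In this hypothetical sketch, `sum[Pn]` stands for the private `sum` variable on processor Pn and `mailbox[Pn]` for the one message delivered to Pn in the current step; real `send()` and `receive()` calls are replaced by array writes and reads. None of these names come from the slides.

```c
#include <assert.h>

#define NPROC 100   /* number of simulated processors */

/* Sequentially simulates the message-passing reduction.
   On return sum[0] holds the total; the return value is the number of
   send/receive steps that were needed. */
int message_passing_reduction(double sum[NPROC])
{
    double mailbox[NPROC];
    int limit = NPROC, half = NPROC, steps = 0;
    assert(sum != 0);

    do {
        half = (half + 1) / 2;               /* send vs. receive dividing line */
        for (int Pn = half; Pn < limit; Pn++)
            mailbox[Pn - half] = sum[Pn];    /* send(Pn - half, sum) */
        for (int Pn = 0; Pn < limit / 2; Pn++)
            sum[Pn] = sum[Pn] + mailbox[Pn]; /* sum = sum + receive() */
        limit = half;                        /* upper limit of senders */
        steps++;
    } while (half != 1);

    return steps;
}
```

For 100 processors the dividing line moves through 50, 25, 13, 7, 4, 2, 1, so the reduction finishes in seven steps. When `limit` is odd, the middle processor neither sends nor receives in that step and keeps its value for the next one.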

6.8 Introduction to Multiprocessor Network Topologies

Interconnection Networks
  Network topologies
    Arrangements of processors, switches, and links
  [Figure: Bus, Ring, N-cube (N = 3), 2D Mesh, Fully connected]

Multistage Networks
  [Figure]

Network Characteristics
  Performance
    Latency per message (unloaded network)
    Throughput
      Link bandwidth
      Total network bandwidth
      Bisection bandwidth
    Congestion delays (depending on traffic)
  Cost
  Power
  Routability in silicon

6.10 Multiprocessor Benchmarks and Performance Models

Parallel Benchmarks
  Linpack: matrix linear algebra
  SPECrate: parallel run of SPEC CPU programs
    Job-level parallelism
  SPLASH: Stanford Parallel Applications for Shared Memory
    Mix of kernels and applications, strong scaling
  NAS (NASA Advanced Supercomputing) suite
    Computational fluid dynamics kernels
  PARSEC (Princeton Application Repository for Shared Memory Computers) suite
    Multithreaded applications using Pthreads and OpenMP

Code or Applications?
  Traditional benchmarks
    Fixed code and data sets
  Parallel programming is evolving
    Should algorithms, programming languages, and tools be part of the system?
    Compare systems, provided they implement a given application
    E.g., Linpack, Berkeley Design Patterns
  Would foster innovation in approaches to parallelism

Modeling Performance
  Assume performance metric of interest is achievable GFLOPs/sec
    Measured using computational kernels from Berkeley Design Patterns
  Arithmetic intensity of a kernel
    FLOPs per byte of memory accessed
  For a given computer, determine
    Peak GFLOPS (from data sheet)
    Peak memory bytes/sec (using Stream benchmark)

Roofline Diagram
  Attainable GFLOPs/sec
    = Min ( Peak Memory BW × Arithmetic Intensity, Peak FP Performance )
  [Figure]

Comparing Systems
  Example: Opteron X2 vs. Opteron X4
    2-core vs. 4-core, 2× FP performance/core, 2.2GHz vs. 2.3GHz
    Same memory system
  To get higher performance on X4 than X2
    Need high arithmetic intensity
    Or working set must fit in X4's 2MB L3 cache
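The roofline bound is simple enough to state as code. In the roofline model the attainable rate is the smaller of two limits: the memory-bound rate (peak bandwidth times arithmetic intensity) and the peak floating-point rate. The sketch below is illustrative; the function and parameter names are assumptions, and no machine figures from the slides are built in.

```c
#include <assert.h>

/* Attainable GFLOPs/sec under the roofline model.
   peak_bw_gbs: peak memory bandwidth in GB/sec
   intensity:   arithmetic intensity in FLOPs per byte
   peak_gflops: peak floating-point performance in GFLOPs/sec */
double roofline_gflops(double peak_bw_gbs, double intensity, double peak_gflops)
{
    assert(peak_bw_gbs > 0.0 && intensity > 0.0 && peak_gflops > 0.0);
    double memory_bound = peak_bw_gbs * intensity;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

/* Arithmetic intensity at which a kernel stops being memory bound
   (the "ridge point" where the two limits meet). */
double ridge_point(double peak_bw_gbs, double peak_gflops)
{
    assert(peak_bw_gbs > 0.0 && peak_gflops > 0.0);
    return peak_gflops / peak_bw_gbs;
}
```

As a made-up illustration, a machine with 16 GB/sec of bandwidth and a 32 GFLOPs/sec peak has its ridge point at 2 FLOPs/byte: a kernel at 0.5 FLOPs/byte is limited to 8 GFLOPs/sec, and doubling the peak FP rate alone would not help it. This is why two systems that share a memory system only separate at high arithmetic intensity.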

Optimizing Performance
  Optimize FP performance
    Balance adds & multiplies
    Improve superscalar ILP and use of SIMD instructions
  Optimize memory usage
    Software prefetch
      Avoid load stalls
    Memory affinity
      Avoid non-local data accesses

Optimizing Performance
  Choice of optimization depends on arithmetic intensity of code
  Arithmetic intensity is not always fixed
    May scale with problem size
    Caching reduces memory accesses
      Increases arithmetic intensity

6.11 Real Stuff: Benchmarking and Rooflines i7 vs. Tesla

i7-960 vs. NVIDIA Tesla 280/480
  [Figure]

Rooflines
  [Figure]

Benchmarks
  [Figure]

Performance Summary
  GPU (480) has 4.4× the memory bandwidth
    Benefits memory bound kernels
  GPU has 13.1× the single precision throughput, 2.5× the double precision throughput
    Benefits FP compute bound kernels
  CPU cache prevents some kernels from becoming memory bound when they otherwise would on GPU
  GPUs offer scatter-gather, which assists with kernels with strided data
  Lack of synchronization and memory consistency support on GPU limits performance for some kernels

6.12 Going Faster: Multiple Processors and Matrix Multiply

Multi-threading DGEMM
  Use OpenMP:

      void dgemm (int n, double* A, double* B, double* C)
      {
      #pragma omp parallel for
        for ( int sj = 0; sj < n; sj += BLOCKSIZE )
          for ( int si = 0; si < n; si += BLOCKSIZE )
            for ( int sk = 0; sk < n; sk += BLOCKSIZE )
              do_block(n, si, sj, sk, A, B, C);
      }

Multithreaded DGEMM
  [Figure]
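The slide calls `do_block` without showing it. The sketch below supplies a plain, unoptimized `do_block` so the OpenMP loop can be compiled and tried; that helper, the column-major layout, and the `BLOCKSIZE` value of 32 are assumptions made here for illustration and may differ from the book's tuned version.

```c
#include <assert.h>

#define BLOCKSIZE 32   /* assumed block size; n must be a multiple of it */

/* Assumed helper: C += A * B restricted to one BLOCKSIZE x BLOCKSIZE block,
   with matrices stored in column-major order (element (i,j) at [i + j*n]). */
static void do_block(int n, int si, int sj, int sk,
                     const double *A, const double *B, double *C)
{
    for (int i = si; i < si + BLOCKSIZE; ++i)
        for (int j = sj; j < sj + BLOCKSIZE; ++j) {
            double cij = C[i + j * n];
            for (int k = sk; k < sk + BLOCKSIZE; ++k)
                cij += A[i + k * n] * B[k + j * n];
            C[i + j * n] = cij;
        }
}

/* C += A * B for n x n matrices. */
void dgemm(int n, const double *A, const double *B, double *C)
{
    assert(n > 0 && n % BLOCKSIZE == 0);
    /* Only the outer sj loop is divided among threads, so each thread owns a
       disjoint set of column blocks of C and no two threads write the same
       element. If OpenMP is not enabled, the pragma is ignored and the loop
       runs serially with the same result. */
#pragma omp parallel for
    for (int sj = 0; sj < n; sj += BLOCKSIZE)
        for (int si = 0; si < n; si += BLOCKSIZE)
            for (int sk = 0; sk < n; sk += BLOCKSIZE)
                do_block(n, si, sj, sk, A, B, C);
}
```

With GCC or Clang, OpenMP is typically enabled with `-fopenmp`. Because the routine accumulates into C, callers should zero C first if they want a plain product.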

Multithreaded DGEMM
  [Figure]

6.13 Fallacies and Pitfalls

Fallacies
  Amdahl's Law doesn't apply to parallel computers
    Since we can achieve linear speedup
    But only on applications with weak scaling
  Peak performance tracks observed performance
    Marketers like this approach!
    But compare Xeon with others in example
    Need to be aware of bottlenecks

Pitfalls
  Not developing the software to take account of a multiprocessor architecture
    Example: using a single lock for a shared composite resource
      Serializes accesses, even if they could be done in parallel
      Use finer-granularity locking

6.14 Concluding Remarks

Concluding Remarks
  Goal: higher performance by using multiple processors
  Difficulties
    Developing parallel software
    Devising appropriate architectures
  SaaS importance is growing and clusters are a good match
  Performance per dollar and performance per Joule drive both mobile and WSC

Concluding Remarks (cont'd)
  SIMD and vector operations match multimedia applications and are easy to program
