Architectures and Programming Models for Multicore

1 Architectures and Programming Models for Multicore. 17 September 2008, Castellón. Eduard Ayguadé, Alex Ramírez

2 Opening statements*. Some visionaries already predicted multicores 30 years ago, and they have been working on programming them since then. However, false prophets have been predicting automatic parallelization for those 30 years, and they still have to deliver. Now it's too late: multicores are here. But are they here to stay? (* These statements are my own, and they do not necessarily represent anyone else's opinion.)

3 Outline. Multicore Architecture; Shared memory vs. Distributed memory; Homogeneous vs. Heterogeneous; Overview of current commercial products; Research proposals; Parallel Programming; Architecture support for parallel programming.

4 Shared Memory Multiprocessors. All processors share the same memory. Common memory access network, usually a bus. Caches maintain coherent copies of data (snooping bus protocol). Scalability issues: adding processors does not increase memory size; shared memory bandwidth; bus traffic due to coherency messages.
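
The bus traffic caused by coherency messages can be illustrated with a toy model. The sketch below assumes a simplified MSI (Modified / Shared / Invalid) snooping protocol, which the slides do not specify; the class and method names are hypothetical, and write-backs and data transfers are not modeled. It only counts how many broadcasts a sequence of reads and writes places on the shared bus.

```python
class SnoopyBus:
    """Toy MSI snooping model: tracks per-cache line states and counts bus broadcasts."""

    def __init__(self, num_caches):
        # One dict per cache: address -> 'M' or 'S'. A missing entry means Invalid.
        self.caches = [dict() for _ in range(num_caches)]
        self.broadcasts = 0

    def read(self, cpu, addr):
        if addr in self.caches[cpu]:
            return  # hit in M or S: no bus traffic
        self.broadcasts += 1  # a read miss is broadcast on the bus
        for other, lines in enumerate(self.caches):
            if other != cpu and lines.get(addr) == "M":
                lines[addr] = "S"  # the owner snoops the read and downgrades
        self.caches[cpu][addr] = "S"

    def write(self, cpu, addr):
        if self.caches[cpu].get(addr) == "M":
            return  # already the exclusive owner: no bus traffic
        self.broadcasts += 1  # an invalidation is broadcast on the bus
        for other, lines in enumerate(self.caches):
            if other != cpu:
                lines.pop(addr, None)  # every other copy becomes Invalid
        self.caches[cpu][addr] = "M"


bus = SnoopyBus(num_caches=4)
for cpu in range(4):  # each processor updates the same shared location in turn
    bus.read(cpu, 0x100)
    bus.write(cpu, 0x100)
print("bus broadcasts:", bus.broadcasts)
```

In this access pattern every read and every write reaches the bus, because each write invalidates the copies the other caches had just obtained. That is the kind of sharing that limits bus-based designs as processors are added.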

5 Distributed Memory. Scalable solution. [Diagram: nodes attached to the interconnection network, each labeled NI] Each processor has its own cache + memory. Non-coherent memories: processors explicitly send copies of data through the interconnection network; copies are non-coherent. Programmability issues: data and processing must be explicitly distributed; communication is also explicit.

6 Distributed Shared Memory. Best of both worlds: each processor has its own memory; single address space; non-uniform memory access time (NUMA); caches maintain coherent copies of data. Performance scalability issues: data must still be distributed; excessive data sharing pushes NUMA too far; a large amount of memory is spent on directories for the coherency protocol.
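
The memory cost of directories can be estimated with simple arithmetic. This is a minimal sketch assuming a full-bit-vector directory (one presence bit per processor plus a few state bits for every memory line); the organization, the 64-byte line size, the two state bits and the function name are illustrative assumptions, not taken from the slides.

```python
def directory_overhead(num_procs, line_bytes=64, state_bits=2):
    """Directory bits per memory line, as a fraction of the line's data bits.

    Assumes one presence bit per processor plus `state_bits` per line.
    """
    entry_bits = num_procs + state_bits
    return entry_bits / (line_bytes * 8)


for n in (16, 64, 256, 1024):
    print(f"{n:5d} processors: directory = {directory_overhead(n):.1%} of memory")
```

Under these assumptions the directory grows linearly with the processor count, so at a few hundred processors it approaches the size of the memory it describes.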

7 Homogeneous CMP. All processors are equal: same architecture, same ISA, same caches. Ease of programming: no need to target a specific processor. Ease of runtime management: threads can be allocated to any processor. Alternative designs: few complex cores, or many simple cores.

8 Few complex cores vs. Many simple cores. Complex processors: high single-thread performance; good for serial (non-parallel) code; poor parallel performance, because they simply can't exploit parallelism. Simple processors: high parallelism; good and efficient for parallel segments; poor serial performance.

9 Heterogeneous CMP, Single ISA. Amdahl's law: performance ends up limited by the serial phase. Exploit both ILP + TLP: the sequential phase runs on the large core; the parallel phase runs on the small cores. Runtime management complexity: where to allocate each thread?

10 Big vs. Small vs. Heterogeneous. Assumption: a big core has 4x the size and 2x the performance of a small core. Small cores are better for fully parallel code. Heterogeneous offers the best compromise.
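
The comparison can be reproduced with Amdahl's law. The sketch below uses the slide's assumption (a big core costs 4 small-core areas and delivers 2x the performance) and adds assumptions of its own: a chip budget of 16 small-core areas, a heterogeneous design with exactly one big core, and a parallel phase that runs only on the small cores. Function names are hypothetical.

```python
BIG_AREA = 4     # from the slide: a big core is 4x the size of a small core
BIG_PERF = 2.0   # from the slide: and 2x its performance
CHIP_AREA = 16   # assumption: total budget, in small-core areas


def speedup(f, serial_perf, parallel_perf):
    """Amdahl's law. f is the parallel fraction; perf is relative to one small core."""
    return 1.0 / ((1.0 - f) / serial_perf + f / parallel_perf)


def small_only(f):
    return speedup(f, 1.0, CHIP_AREA * 1.0)


def big_only(f):
    n_big = CHIP_AREA // BIG_AREA
    return speedup(f, BIG_PERF, n_big * BIG_PERF)


def heterogeneous(f):
    n_small = CHIP_AREA - BIG_AREA  # one big core, small cores in the remaining area
    return speedup(f, BIG_PERF, n_small * 1.0)


print(" f     small   big   hetero")
for f in (0.0, 0.5, 0.9, 0.99, 1.0):
    print(f"{f:4.2f}  {small_only(f):6.2f} {big_only(f):6.2f} {heterogeneous(f):6.2f}")
```

With these numbers the small-core chip wins only when the code is almost fully parallel (16x at f = 1.0), while at f = 0.9 the heterogeneous chip reaches 8x against roughly 6.4x for small-only and 6.2x for big-only.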

11 Heterogeneous CMP, Multiple ISA. Two paths for increased performance: parallelism and specialization. One step beyond running the serial part on a complex processor: run every task on a custom processor (vector SIMD, multithreaded, special opcodes).

12 Heterogeneous CMP: Multiple ISA, Distributed Memory. [Diagram: array of blocks labeled A and M] Caches are unpredictable: huge variations in memory latency. Cache coherency is not fully scalable: snoopy protocols rely on a bus; directories take lots of space. General-purpose processor: "classic" approach based on caches. Accelerators: work on local (private) memories; move data in/out through DMA.

13 On-chip shared resources. [Diagram: SMP, CMP and SMT compared across Memory, L2 Cache, L1 Cache, Functional Units and Register File] Shared Memory Multiprocessors: shared main memory (DRAM); shared memory bandwidth. Chip Multiprocessors: shared L2 / L3 cache (capacity, bandwidth). Multithreaded Processors: shared everything. Who takes responsibility for managing shared resources? Hardware policy, demand-based.

14 Commercial CMP: Homogeneous + Shared Cache. Intel Core2 Duo: 2 cores. IBM Power5: 2 cores, 4 threads.

15 Commercial CMP: Homogeneous + NUCA L2 Cache. Sun Niagara 2: 8 cores, 32 threads.

16 Commercial CMP: Homogeneous + Private L2 (I). AMD Barcelona: 4 cores. IBM Power6: 2 cores, 4 threads.

17 Commercial CMP: Homogeneous + Private L2 (II). BG/P: homogeneous 4-core; embedded PowerPC cores; private L2 caches.

18 Commercial CMP: Distributed Memory. Intel Polaris: 80 cores.

19 Commercial CMP: Heterogeneous multi-ISA. IBM / Sony / Toshiba Cell: 9 cores (1+8), 2 threads (+ SPE threads).

20 Commercial CMP: Homogeneous + Accelerators. Intel Larrabee: many small cores for high parallel efficiency; each core exploits DLP with wide SIMD units; shared non-uniform L2 cache (?); shared special-purpose hardware for graphics processing. [Diagram labels: Acc., L2 slices, MC]

21 Where are we headed? "It is difficult to make predictions, especially about the future" (Mark Twain*). [Chart: Performance** over time, with successive curves labeled Single thread performance, Multicores, Heterogeneous and Special purpose, each annotated "more active transistors, higher frequency"; the time axis is marked "? 2025?? 2035???"] Credits to Peter Hofstee, IBM Austin, for the original slide. ** Moore's Law is actually about transistor density, not performance. * Also attributed to many others, including Yogi Berra, Albert Einstein, Confucius, Groucho Marx, ...

22 Implicit vs. Explicit Communication. Explicit communication: when the data producer and consumer are known ahead of time. MPI and streaming are well-known examples, but it is not limited to them. Implicit communication: when the data to be consumed is not known ahead of time.

23 H.264: Mixing Explicit and Implicit Communication. [Figure: Frame N-1 and Frame N. Intra-frame dependencies are well known; inter-frame dependencies are unknown until the MB is partially decoded.] H.264/AVC macroblock (MB) decoding exhibits both types of communication. Explicit: every MB depends on its neighboring MBs in the same frame. Implicit: the precise MB in the reference frame(s) is unknown until motion vectors have been computed.
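
Because the intra-frame dependencies are known ahead of time, a runtime can derive a parallel schedule before decoding starts. The sketch below assumes each macroblock depends on its left, top-left, top and top-right neighbors; the slide only says "neighboring MBs", so this exact set, the frame size and the function name are assumptions for illustration.

```python
from collections import Counter


def wavefront_schedule(width, height):
    """Earliest step at which each macroblock (x, y) can be decoded.

    A macroblock becomes ready one step after the last of its intra-frame
    neighbors (left, top-left, top, top-right) has been decoded.
    """
    step = {}
    for y in range(height):          # raster order guarantees neighbors come first
        for x in range(width):
            deps = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
            ready = [step[d] for d in deps if d in step]
            step[(x, y)] = max(ready) + 1 if ready else 0
    return step


schedule = wavefront_schedule(width=8, height=4)   # frame size in macroblocks
per_step = Counter(schedule.values())
print("steps needed:", max(schedule.values()) + 1)
print("max macroblocks decodable in parallel:", max(per_step.values()))
```

The inter-frame dependencies cannot be scheduled this way, since the reference macroblock is only known once the motion vectors have been computed at run time.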

24 On-chip DSM: Globally accessible on-chip local memories (COMA style). [Diagram: address space including DRAM] All memories are mapped to the same address space. The virtual memory system translates logical to physical addresses. All processors can access ALL memory locations, through DMA read / write or load / store. Accesses are routed based on their physical address. There is a single location for any piece of data, so there is no need for cache coherency. Changes to memory management: memory allocation, page migration.
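
Routing by physical address can be sketched as a simple address decode. The memory map below (DRAM at the bottom of the address space, eight 256 KiB local memories at a high base address) is entirely hypothetical, as are the constant and function names; the slides do not give a layout.

```python
NUM_CORES = 8                  # assumption
LOCAL_MEM_SIZE = 256 * 1024    # assumption: 256 KiB of local memory per core
LOCAL_BASE = 0xF000_0000       # assumption: local memories mapped above DRAM
LOCAL_END = LOCAL_BASE + NUM_CORES * LOCAL_MEM_SIZE


def route(phys_addr):
    """Return (target, core, offset) for a physical address.

    Every address maps to exactly one location, so no coherency is needed.
    """
    if LOCAL_BASE <= phys_addr < LOCAL_END:
        core, offset = divmod(phys_addr - LOCAL_BASE, LOCAL_MEM_SIZE)
        return ("local", core, offset)
    if 0 <= phys_addr < LOCAL_BASE:
        return ("dram", None, phys_addr)
    raise ValueError(f"unmapped physical address {phys_addr:#x}")


print(route(0x1000))                               # an off-chip DRAM location
print(route(LOCAL_BASE + 3 * LOCAL_MEM_SIZE + 16)) # byte 16 of core 3's local memory
```

A load, store or DMA transfer issued by any core would go through the same decode, which is what makes every local memory globally accessible.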

25 On-chip DSM: Good for explicit communication. Producer tasks can write data directly to the consumer's local memory; store latency is not critical. Runtime software explicitly manages locality: minimizes data transfers + traffic; avoids a local buffer at the producer side; avoids a DMA transfer of the data. Nice when the dataflow is known at task generation time. What if a task can't tell which data will be needed? Hardware (caches) automatically manages locality.

26 On-chip DSM + Caches: best of both worlds? Add coherent caches to keep local copies of off-chip memory locations. On-chip locations are not cacheable: small latency penalty, since the data is still on-chip; simplifies coherency.

27 The SARC Architecture: Heterogeneous On-Chip DSM + Caches. Take the On-Chip DSM + Caches idea, and take it to 2025. More transistors, but not all of them can be active at the same time. Code runs on the most efficient processor for the task. Dual path for increased performance: parallelism and specialization. Explicit locality management in the presence of explicit communication. Cache coherency in the presence of implicit communication.

28 Research Challenges (I). "May you live in interesting times" (Chinese curse). Architecture challenges: designing the best set of processor cores; supporting a wide variety of applications; supporting a wide range of power / performance / area tradeoffs; Network on Chip (NoC); on-chip memory controllers + DMA + network interfaces; scalable cache coherency protocol; memory bandwidth!! Compiler challenges: retargetable / adaptive compiler optimization, since the processor ISA is a moving target now; a single-source compiler that targets multiple processors, since the programmer is not fully aware of ISA heterogeneity; a single binary portable to many architecture instances. What happens if a particular accelerator is not present / available?

29 Research Challenges (II). Runtime system challenges: thread / task management; runtime dependency checking; task allocation + scheduling; explicit locality management; memory management. And of course programming the beast.

30 Architecture support for the Programming Model. Sample CellSS trace: execution time is dominated by task management overhead. Faster processors / accelerators minimize the computation time, so parallel applications can be dominated by runtime overheads. Synchronization: barriers, semaphores, mutex, ... Dependency graph management: add, remove, schedule, ... Communication: programming the DMA, the NI, ... Provide hardware accelerators for runtime system code. It's an application too :-)
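
The dependency graph operations named above (add, remove, schedule) can be sketched in a few lines. This is not the CellSS implementation: it is a minimal illustration with hypothetical names that tracks only true (read-after-write) dependencies between tasks that declare their inputs and outputs, and it ignores renaming, anti-dependencies and the actual dispatch to processors.

```python
from collections import defaultdict


class TaskGraph:
    """Toy runtime dependency graph: add tasks, query ready tasks, retire tasks."""

    def __init__(self):
        self.pending = {}                    # task -> set of unfinished predecessors
        self.successors = defaultdict(set)   # task -> tasks waiting on it
        self.last_writer = {}                # datum -> unfinished task producing it

    def add_task(self, name, inputs=(), outputs=()):
        # Dependency check: a task waits for the producer of each of its inputs.
        deps = {self.last_writer[d] for d in inputs if d in self.last_writer}
        self.pending[name] = deps
        for dep in deps:
            self.successors[dep].add(name)
        for d in outputs:
            self.last_writer[d] = name

    def ready(self):
        # Schedule: tasks with no unfinished predecessors may run now.
        return sorted(t for t, deps in self.pending.items() if not deps)

    def complete(self, name):
        # Remove: retire the task and release the tasks that waited on it.
        del self.pending[name]
        for succ in self.successors.pop(name, ()):
            self.pending[succ].discard(name)
        for d in [d for d, w in self.last_writer.items() if w == name]:
            del self.last_writer[d]


graph = TaskGraph()
graph.add_task("load", outputs=["block"])
graph.add_task("filter", inputs=["block"], outputs=["result"])
graph.add_task("checksum", inputs=["block"])
graph.add_task("store", inputs=["result"])
print(graph.ready())      # only the producer of "block" can start
graph.complete("load")
print(graph.ready())      # both consumers of "block" are now runnable
```

Every one of these operations runs on the critical path of task creation or completion, which is why the slide argues that the runtime itself is a candidate for hardware acceleration.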

31 Task Management: run OoO and using Local Memory. Detailed simulation of an AddTask burst on alternate PPU architectures.

32 Conclusions. Some people say architecture is dead; I've been hearing that for the last 10 years. The immediate trend is clear: homogeneous multicores, shared memory, coherent caches. Look ahead, beyond Moore's law, where the future is uncertain: architecture specialization; on-chip local memories; runtime management decisions. In the meantime, learn (and teach) parallel programming :-) Task abstraction; explicit communication.
