MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM (PART 1)
Chapter 5, Appendix F, Appendix I
OUTLINE
Introduction (5.1)
Multiprocessor Architecture
Challenges in Parallel Processing
Centralized Shared-Memory Architectures (5.2)
Performance of SMPs (5.3)
INTRODUCTION
INTRODUCTION
The move to multiprocessors (CPE731, Dr. Iyad Jafar)
Figure: technology improvement led to RISC and to new architectures and organizations; power and ILP limitations now drive the move to multiprocessors.
INTRODUCTION
Why multiprocessors?
Increased cost in silicon and energy to exploit ILP
Increasing desktop performance is less important
Advantage of replication rather than unique design
Improved understanding of how to use multiprocessors effectively (especially in servers!)
Growing interest in high-end servers for cloud computing and SaaS
A growth in data-intensive applications
INTRODUCTION
Multiprocessor: tightly coupled processors whose coordination and usage are typically controlled by a single operating system and that share memory through a shared address space
2-32 processors
Single-chip system (multicore) or multiple multicore chips
Multiprocessors exploit thread-level parallelism
Parallel processing: execute tightly coupled threads that collaborate on a single task
Request-level parallelism: execute multiple independent processes
Single program or multiple applications (multiprogramming)
Multicomputers?
INTRODUCTION
To maximize the advantage of a multiprocessor with n processors, we need n threads
Independent threads are created by the programmer or the operating system
TLP may exploit DLP: a thread may execute some iterations of a loop to exploit data-level parallelism
Grain size must be sufficiently large to compensate for the thread overhead!
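A thread's grain can be a contiguous chunk of loop iterations. The sketch below splits a loop-carried sum across threads; the chunk size stands in for the grain size, and all names are illustrative (Python standard library only):

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(data, lo, hi):
    # each thread works on one contiguous chunk of iterations (its grain)
    s = 0
    for i in range(lo, hi):
        s += data[i]
    return s

def parallel_sum(data, n_threads=4):
    n = len(data)
    chunk = (n + n_threads - 1) // n_threads   # grain size per thread
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(partial_sum, data, lo, min(lo + chunk, n))
                   for lo in range(0, n, chunk)]
        return sum(f.result() for f in futures)

print(parallel_sum(list(range(1000))))   # 499500
```

If the chunk were a single iteration, the cost of creating and joining threads would dwarf the useful work, which is the grain-size point above.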
MULTIPROCESSOR ARCHITECTURE
Symmetric Shared-Memory Multiprocessors (SMPs)
Centralized shared-memory multiprocessors
Small number of cores
Share a single memory with uniform memory access latency (UMA)
MULTIPROCESSOR ARCHITECTURE
Distributed Shared-Memory Multiprocessors (DSMs)
Larger number of processors
Memory distributed among processors
Non-uniform memory access/latency (NUMA)
Processors connected via direct (switched) and non-direct (multi-hop) interconnection networks
MULTIPROCESSOR ARCHITECTURE
The term shared memory in both architectures implies that threads communicate with each other through the same address space, i.e., any processor can reference any memory location as long as it has access rights
In DSM, the distributed memory adds communication complexity and overhead
CHALLENGES
Limited Parallelism in Programs
Example. Suppose you want to achieve a speedup of 80 with 100 processors. What fraction of the original computation can be sequential?
How can this be addressed?
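The example is Amdahl's law solved for the parallel fraction: speedup = 1 / ((1 - p) + p/n). A small worked computation (function names are illustrative):

```python
def amdahl_speedup(parallel_frac, n):
    # Amdahl's law: sequential part runs as-is, parallel part sped up n times
    return 1.0 / ((1.0 - parallel_frac) + parallel_frac / n)

def max_sequential_fraction(target_speedup, n):
    # solve 1/((1-p) + p/n) = S for p, then return the sequential share 1-p
    p = (1.0 / target_speedup - 1.0) / (1.0 / n - 1.0)
    return 1.0 - p

seq = max_sequential_fraction(80, 100)
print(f"{seq:.4%}")   # only about 0.25% of the work may be sequential
```

So reaching a speedup of 80 on 100 processors requires that roughly 99.75% of the original computation be parallelizable, which is why limited program parallelism is a first-order challenge.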
CHALLENGES
Communication Overhead
Example. Suppose we have an application running on a 32-processor multiprocessor that takes 200 ns to handle a reference to a remote memory. For this application, assume that all references except those involving communication hit in the local memory hierarchy, which is slightly optimistic. Processors stall on a remote request, and the processor clock rate is 3.3 GHz. If the base CPI (assuming all references hit in the cache) is 0.5, how much faster is the multiprocessor if there is no communication versus if 0.2% of the instructions involve a remote communication reference?
CHALLENGES
Communication Overhead Example.
CPI_comm = CPI_ideal + miss penalty
= 0.5 + remote request rate × remote request cost
= 0.5 + 0.2% × (200 ns / 0.303 ns)
= 0.5 + 0.2% × 660 = 0.5 + 1.32 ≈ 1.8
Speedup = 1.82 / 0.5 ≈ 3.6
The multiprocessor with all local references is about 3.6 times faster
How to address this? (SW and HW)
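The arithmetic can be checked directly; with the exact cycle time of 1/3.3 GHz ≈ 0.303 ns, a 200 ns remote reference costs 660 cycles:

```python
clock_ghz = 3.3
cycle_ns = 1 / clock_ghz               # ~0.303 ns per cycle
remote_cost = 200 / cycle_ns           # ~660 cycles stalled per remote reference
remote_rate = 0.002                    # 0.2% of instructions go remote

cpi_comm = 0.5 + remote_rate * remote_cost   # base CPI + communication stalls
speedup = cpi_comm / 0.5                     # vs. all-local CPI of 0.5
print(round(cpi_comm, 2), round(speedup, 2)) # 1.82 3.64
```

Even a 0.2% remote-reference rate more than triples the effective CPI, which motivates both software (data placement) and hardware (caching of shared data) remedies.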
SMP ARCHITECTURES
SMP ARCHITECTURES
Figure: Intel Nehalem (Nov 2008)
SMP ARCHITECTURES
SMPs support caching of both private and shared data
Reduces latency, BW demand, and contention
Caching private data is not a problem; it behaves just as in a uniprocessor!
Caching shared data raises issues for memory system behavior
Coherence: what values can be returned by a read
Consistency: when a written value will be returned by a read
CACHE COHERENCE
Figure: P1 and P3 read X = 5 from memory; P3 then writes X = 8, leaving stale copies of X = 5 in P1's cache and in memory — the cache coherence problem.
CACHE COHERENCE
A memory system is coherent if
Program order is preserved: a read by processor P to location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P
Coherent view of memory: a read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses
Write serialization: two writes to the same location by any two processors are seen in the same order by all processors. For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1
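The write-serialization rule can be phrased as a check: every processor's observed values for a location must form a subsequence of one global order of writes. A minimal sketch, assuming a hypothetical log of writes and per-processor observations:

```python
def observed_in_order(observed, global_write_order):
    """True if a processor's observed values for one location are
    consistent with the single global order of writes to it."""
    it = iter(global_write_order)
    # 'in' advances the iterator, so later values must appear later
    return all(value in it for value in observed)

# writes to the location occurred in the order 1 then 2
print(observed_in_order([1, 2], [1, 2]))   # True
print(observed_in_order([2], [1, 2]))      # True: a reader may miss a value
print(observed_in_order([2, 1], [1, 2]))   # False: 2 then 1 violates serialization
```

The third case is exactly the forbidden behavior in the slide's example: reading 2 and later reading 1.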
BASIC SCHEMES FOR ENFORCING COHERENCE
A program running on multiple processors will have copies of the same data in several caches
In a coherent multiprocessor, caches provide migration and replication
Migration: move data to a local cache and use it transparently; reduces latency and BW demand
Replication: copy data to individual caches for simultaneous reads; reduces latency and bus contention
Use a HW protocol to keep caches coherent instead of using a SW approach
BASIC SCHEMES FOR ENFORCING COHERENCE
Directory-based protocols
The sharing status of a shared block is kept in one (or more) location, i.e., a directory
In SMP, a centralized directory
In DSM, distributed directories
Snooping-based protocols
Every cache that has a copy of a shared block keeps track of the sharing status
In SMP, caches are accessible via some broadcast medium
Each cache monitors or snoops the medium to determine whether it has a copy of the requested block
Can be used in a multichip multiprocessor on top of a directory protocol within each multicore
SNOOPING COHERENCE PROTOCOLS
Write-update protocol (broadcast)
A write to a cached shared item updates all cached copies via the medium
Less popular; consumes BW!
Write-invalidate protocol
A write to a shared cached item invalidates all other cached copies (exclusive access)
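The write-invalidate idea can be shown with a toy model: a write broadcasts an invalidate, other caches drop their copy, and the next read by another processor misses and fetches the new value. A hypothetical sketch (write-through memory for simplicity, not any real hardware):

```python
memory = {"X": 5}   # shared main memory

class Cache:
    def __init__(self, bus):
        self.lines = {}        # addr -> cached value
        self.bus = bus
        bus.append(self)       # the bus is just the list of caches

    def read(self, addr):
        if addr not in self.lines:        # miss: fetch from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        for other in self.bus:            # broadcast invalidate on the bus
            if other is not self:
                other.lines.pop(addr, None)
        self.lines[addr] = value          # writer keeps the only copy
        memory[addr] = value              # write-through, to keep the toy simple

bus = []
p1, p2 = Cache(bus), Cache(bus)
p1.read("X"); p2.read("X")    # both cache X = 5
p1.write("X", 8)              # invalidates P2's copy
print(p2.read("X"))           # P2 misses and fetches 8
```

A write-update protocol would instead push the new value into every cached copy on each write, which is why it consumes more bandwidth.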
BASIC IMPLEMENTATION TECHNIQUES
A bus or broadcast medium
Perform invalidates by acquiring the bus first, then broadcasting the address
Other processors snoop and check their caches for the broadcast address
Invalidations by different processors are serialized by bus arbitration
Locating shared items on a miss
Simple in write-through! Write-back is more difficult!
However, in write-back, caches can snoop for read requests as well and provide the data if they have it in the dirty state
Write buffers?
BASIC IMPLEMENTATION TECHNIQUES
Tracking state
Use cache tags, valid and dirty bits to implement snooping
One bit tracks the sharing state of each block
Exclusive/Modified: the processor has a modified copy of the block; no need to send invalidates on successive writes by the same processor
Shared: the block is in more than one private cache
Finite state controller in each core
Responds to requests from the core and the medium
Changes the state of a cached block: invalid, modified, or shared
EXAMPLE PROTOCOL (INVALIDATE & WB)
Why write back?
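An invalidate, write-back protocol of this kind can be sketched as a tiny single-block MSI model: each core's copy is Invalid, Shared, or Modified, and memory is updated only when a dirty copy is written back. This is an illustrative model, not a cycle-accurate controller:

```python
class MSI:
    """One cache block, n cores, write-invalidate + write-back."""
    def __init__(self, n_cores):
        self.state = ["I"] * n_cores   # per-core state: I, S, or M
        self.cache = [None] * n_cores  # per-core cached value
        self.memory = 0

    def _writeback(self, core):
        self.memory = self.cache[core]

    def read(self, core):
        if self.state[core] == "I":                 # read miss
            for c, st in enumerate(self.state):     # snoop: a dirty copy
                if st == "M":                       # supplies the data and
                    self._writeback(c)              # is written back,
                    self.state[c] = "S"             # then downgraded to S
            self.cache[core] = self.memory
            self.state[core] = "S"
        return self.cache[core]

    def write(self, core, value):
        if self.state[core] != "M":                 # need exclusive access
            for c in range(len(self.state)):
                if c != core and self.state[c] != "I":
                    if self.state[c] == "M":
                        self._writeback(c)
                    self.state[c] = "I"             # invalidate other copies
            self.state[core] = "M"
        self.cache[core] = value                    # write-back: memory stays stale

sys = MSI(2)
sys.write(0, 8)        # core 0: I -> M; memory still holds the old value
print(sys.read(1))     # core 1's miss forces a write-back; prints 8
print(sys.state)       # ['S', 'S']
```

The write-back choice answers the "why" on the slide: memory is touched only when another core misses or a dirty block is replaced, rather than on every write.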
EXTENSIONS TO THE MSI PROTOCOL
The previous protocol is called MSI
Many extensions exist: add states and/or transactions to improve performance
MESI protocol (Intel i7 uses MESIF)
Exclusive state added to indicate that the cache line is the same as main memory and is the only cached copy
When the state changes on a read miss, there is no need to write the block back to memory
MOESI protocol (AMD Opteron)
MSI and MESI update memory whenever changing a block's state to Shared; in MOESI, a block can be changed from Modified to Owned without writing to memory
MOESI adds the Owned state to indicate that the block is owned by that cache and is out-of-date in memory
The owner should update the block in memory on a miss
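The benefit of the MESI Exclusive state can be isolated in two small decision rules: a read miss with no other sharer installs the line in E rather than S, and a later write from E upgrades silently to M with no invalidate broadcast. A hypothetical fragment in the same style as the MSI sketch above:

```python
def read_miss_state(other_states):
    """State granted to the requester on a read miss under MESI:
    Shared if any other cache holds the line, else Exclusive."""
    return "S" if any(st != "I" for st in other_states) else "E"

def write_needs_broadcast(my_state):
    """Only a write from S or I must put an invalidate on the bus;
    E and M already guarantee the only (clean or dirty) copy."""
    return my_state in ("S", "I")

print(read_miss_state(["I", "I"]))   # E: sole cached copy
print(read_miss_state(["S", "I"]))   # S: line shared with another cache
print(write_needs_broadcast("E"))    # False: silent E -> M upgrade
```

Under plain MSI the first case would install the line in S, and the subsequent private write would still cost a bus transaction; that saved transaction is the point of the E state.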
LIMITATIONS
Centralized memory can become a bottleneck as the number of processors or their memory demands increase
A high-BW connection to a shared L3 cache has allowed 4 to 8 cores; however, it is not likely to scale!
Multiple buses and interconnection networks such as crossbars or small point-to-point networks
Banked memory or cache
LIMITATIONS
Snooping BW could become a problem, since each processor must examine every miss
Snooping may interfere with cache operation: duplicate the cache tags
Centralized directory in the outermost cache level
Does not eliminate the bottleneck at the bus
PERFORMANCE OF SMPS
Performance is determined by
Traffic caused by cache misses of processors
Traffic of communication
Both are affected by processor count, cache size, and block size
Coherence adds a fourth C to the 3Cs misses
Types of coherence misses
True sharing misses
False sharing misses: with a single valid bit per block, writing one word in a block invalidates the entire block
PERFORMANCE OF SMPS
Coherence Misses Example
Assume that words x1 and x2 are in the same cache block, which is in the shared state in the caches of both P1 and P2. Assuming the following sequence of events, identify each miss as a true sharing miss, a false sharing miss, or a hit.
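One way to mechanize the classification: a miss is true sharing if the word accessed was itself communicated, and false sharing if the block was only invalidated by a write to a different word. The event sequence below and the assumption that each processor had previously read both words are illustrative; the slide's actual event table is not reproduced here.

```python
class BlockTracker:
    """One cache block holding two words, invalidate-based protocol."""
    def __init__(self, procs, words):
        self.valid = {p: True for p in procs}         # block cached by p?
        self.excl = {p: False for p in procs}         # exclusive/modified?
        self.used = {p: set(words) for p in procs}    # words p touched while valid
        self.lost = {p: set() for p in procs}         # words written remotely
                                                      # while p's copy was invalid
    def access(self, proc, op, word):
        others = [p for p in self.valid if p != proc]
        if not self.valid[proc]:
            # miss on an invalidated block
            kind = ("true sharing miss" if word in self.lost[proc]
                    else "false sharing miss")
            self.valid[proc] = True
            self.used[proc] = set()
            self.lost[proc] = set()
            if op == "read":
                for o in others:
                    self.excl[o] = False              # remote M copy downgraded
        elif op == "write" and not self.excl[proc]:
            # upgrade of a shared block also counts as a coherence miss
            kind = ("true sharing miss"
                    if any(self.valid[o] and word in self.used[o] for o in others)
                    else "false sharing miss")
        else:
            kind = "hit"
        self.used[proc].add(word)
        if op == "write":
            self.excl[proc] = True
            for o in others:
                self.valid[o] = False                 # invalidate other copies
                self.lost[o].add(word)
        return kind

t = BlockTracker(["P1", "P2"], ["x1", "x2"])
for proc, op, word in [("P1", "write", "x1"), ("P2", "read", "x2"),
                       ("P1", "write", "x1"), ("P2", "write", "x2"),
                       ("P1", "read", "x2")]:
    print(proc, op, word, "->", t.access(proc, op, word))
# -> true, false, false, false, true sharing misses
```

The false-sharing cases are exactly those that would disappear if the block size were one word, matching the definition in the previous slide.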
PERFORMANCE OF SMPS
1998 Study
Processor: AlphaServer 4100 with four Alpha 21164 processors, 4 IPC at 300 MHz
Workload:
TPC-B: online transaction processing (OLTP)
TPC-D: decision support system (DSS)
AltaVista: Web index search
PERFORMANCE OF SMPS
OLTP has the poorest performance due to memory hierarchy problems
Consider evaluating OLTP while varying L3 cache size, block size, and number of processors
PERFORMANCE OF SMPS
Biggest improvement when moving from 1 to 2 MB L3?
PERFORMANCE OF SMPS
Instruction and capacity misses drop, but true sharing, false sharing, and compulsory misses are unaffected!
PERFORMANCE OF SMPS
Increase of true sharing misses!
PERFORMANCE OF SMPS
Reduce true sharing misses!