CS/COE1541: Intro. to Computer Architecture


CS/COE1541: Intro. to Computer Architecture
Multiprocessors
Sangyeun Cho, Computer Science Department
(Title-slide images: Tilera TILE64, IBM BlueGene/L, NVIDIA GPGPU, Intel Core 2 Duo)

Why multiprocessors?
For improved latency: a meteorologist running a weather analysis program.
For improved throughput: transaction processing systems at banks; the Google search engine.
For improved reliability: the computer system in a spacecraft should continue working correctly even when some CPUs fail.

Is MP a single system?
The boundary is blurred. With fast advances in networking technology, physical dimensions become less of a concern, and distributed systems look more like a single system (e.g., "the datacenter as a computer"). In our discussions we will be more concerned with processors placed physically close to each other, within a single building. We will focus on systems where processors see a common memory space (a "shared-memory multiprocessor"). Today, multicore processors are prevalent: each desktop is a multiprocessor!

Topics
Programming a parallel machine.
Parallel machine structures.
Shared-memory multiprocessor issues: the memory consistency model and cache coherence enforcement.

Parallelizing apps
What is parallelization? Structuring a program to be suitable for execution on a parallel system: a parallel algorithm + programming directives.
Where is parallelism? Transaction processing systems; scientific code (massive computations); image/video compression; word processors.
There are multiple levels or granularities of parallelism: task or thread, loop iteration, instruction.
We need to "think parallel."

(Figure from Mattson's "Gentle Intro. to Parallel Programming.")

Tuning your parallel codes
Perhaps writing parallel code itself is not too difficult; the more challenging task is debugging.
Functional debugging: correctness of execution.
Performance debugging (or performance tuning): e.g., the result of parallelization with four processors is a slowdown!
Guaranteeing correctness with parallel codes can be hard: it may not be easy to identify the necessary synchronization points, or to reason about data access ordering.
Obtaining speedup may be hard: it requires understanding bottlenecks and then resolving them.

Parallel programming paradigms: shared memory
(Diagram: processors P, P, P sharing one Memory.)
Processors have a view of a globally shared memory space; each processor can access any address in the global address space.
This is a natural extension of our sequential programming model.
Communication between different processors is through memory. How can we synchronize?
The presence of caches poses the cache coherence problem.
Examples: thread libraries (e.g., pthread), OpenMP.

Parallel programming paradigms: message passing
(Diagram: nodes, each a processor P with its own Memory, joined by an Interconnect.)
Processors do not share the address space; each has its own memory.
Communication is explicit, e.g., Send(), Receive(); synchronization is implicit.
How do you partition the work and data and assign them to the available nodes?
Many high-performance cluster systems have this structure.
Example: MPI (Message Passing Interface).

Shared memory primitives (OpenMP)
#pragma omp parallel: to declare a parallel code section
#pragma omp parallel private: to declare variables private to each thread
#pragma omp barrier: to synchronize all threads
#pragma omp critical: to declare a critical section
#pragma omp atomic: to declare an atomic operation

Shared memory primitives, cont'd
(Example code on slide not transcribed; see the sketch below.)
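The cont'd slide's code is lost in transcription; below is a minimal sketch (not the slide's original) showing how these directives combine, with an assumed shared counter. Compile with gcc -fopenmp.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int count = 0;                        /* shared among all threads */
    #pragma omp parallel                  /* parallel code section */
    {
        int id = omp_get_thread_num();    /* private: declared inside the region */

        #pragma omp critical              /* one thread at a time */
        count++;

        #pragma omp barrier               /* wait until every thread has incremented */

        if (id == 0)
            printf("%d threads, count = %d\n", omp_get_num_threads(), count);
    }
    return 0;
}

For the single increment, #pragma omp atomic would work equally well; critical generalizes to multi-statement sections.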

Shared memory primitives (pthread)
pthread = POSIX thread (library).
Threads are light-weight: it takes much less time to create a thread than a process.
Threads share their memory space (except the stack; why?).
pthread_create(): to create parallel threads
pthread_join(): to join parallel threads
pthread_exit(), pthread_self()
pthread_mutex_lock() / pthread_mutex_unlock()
pthread_barrier_wait()

Shared memory primitives, cont'd
(Example code on slide not transcribed; see the sketch below.)
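Again the example slide is not transcribed; a minimal sketch (assumed worker/counter structure, not the slide's code) using the calls above. Compile with -pthread.

#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
int count = 0;                       /* shared: threads share their memory space */

void *worker(void *arg) {
    long id = (long)arg;             /* private: lives on this thread's stack */
    pthread_mutex_lock(&lock);
    count++;                         /* critical section */
    pthread_mutex_unlock(&lock);
    printf("thread %ld done\n", id);
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);  /* wait for every worker to finish */
    printf("count = %d\n", count);
    return 0;
}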

Message passing primitives (MPI)
MPI_Init(): initialize
MPI_Comm_size(): how many processes do we have?
MPI_Comm_rank(): who am I?
MPI_Send(), MPI_Recv()
MPI_Finalize()
(A minimal sketch follows this page.)

Memory consistency
Memory consistency is about what value you read from a shared variable in the presence of potentially conflicting memory accesses.
This is an issue in a shared-memory multiprocessor; it is not an issue for a message-passing machine (why?).
Questions:
When must a processor see a value that has been updated by another processor?
In what order must a processor observe the data writes of another processor?
What properties must be enforced among reads and writes to different locations by different processors?
Answering these questions gives a model, both for programmers and for machine designers.
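Returning to the MPI primitives listed above, a minimal sketch (illustrative, not the slide's code) in which rank 0 explicitly sends one integer to rank 1; run with, e.g., mpirun -np 2. Note that communication is explicit while the matching Send/Recv pair supplies the synchronization.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, token = 42;
    MPI_Init(&argc, &argv);                   /* initialize */
    MPI_Comm_size(MPI_COMM_WORLD, &size);     /* how many processes? */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* who am I? */
    if (rank == 0) {
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 of %d received %d from rank 0\n", size, token);
    }
    MPI_Finalize();
    return 0;
}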

Intuition
The value returned by a read (i.e., a load instruction) should be the "last" value written. However, "last" is not well defined: the last write issued to the memory system? The last one in the program? The last write in time?
A memory consistency model is about program behavior, so "last" should be defined in terms of program order.
In a sequential program, program order is the order of operations in the machine binary presented to the processor.
In a multi-threaded program, program order is defined within a process (or a thread); we need to make sense of orders across processes.

Sequential consistency
The most intuitive consistency model.
Definition (by Lamport): a multiprocessor system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
(Diagram: processors taking turns accessing a single Memory.)

Sequential consistency
Write atomicity: a write event is visible to all processes (or processors) immediately.
Serialization: memory access events within a process are presented in the order prescribed by the program, and accesses are globally ordered; the same global order is observed by all processes (or processors).

Sequential consistency
Consider the order of memory operations: can both P1 and P2 enter their critical sections?
(Example code on slide not transcribed; a standard version is sketched below.)
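The standard form of this question is the Dekker-style litmus test below (a C11 sketch; relaxed ordering is used deliberately so that non-SC outcomes are legal to observe). thread0 and thread1 are assumed to run concurrently on two processors.

#include <stdatomic.h>

atomic_int flag0 = 0, flag1 = 0;   /* both initially 0 */

void thread0(void) {               /* runs on P1 */
    atomic_store_explicit(&flag0, 1, memory_order_relaxed);
    if (atomic_load_explicit(&flag1, memory_order_relaxed) == 0) {
        /* critical section */
    }
}

void thread1(void) {               /* runs on P2 */
    atomic_store_explicit(&flag1, 1, memory_order_relaxed);
    if (atomic_load_explicit(&flag0, memory_order_relaxed) == 0) {
        /* critical section */
    }
}

Under SC at least one of the two loads must return 1, so at most one thread enters its critical section; with write buffers, or the relaxed ordering used here, both loads can return 0 and both threads enter.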

Sequential consistency
Consider the atomicity of memory operations: can we have B = 1, register1 = 0?
(Example code on slide not transcribed.)

Breaking SC
(Slide content not transcribed.)

Sequential consistency summary
A straightforward implementation of SC has two requirements.
Program order requirement: one memory operation should complete before the next one in program order.
Write atomicity (in cache-based systems): a write should invalidate/update all other copies, and it completes only after all the others are invalidated/updated.
SC is intuitive; however, it can limit hardware optimizations. Today's processors already break sequential consistency; the most notable example is the various write buffers.

Relaxing sequential consistency
Relaxing the memory order: relax the following orders when the two operations are to different locations:
Write-to-read program order
Write-to-write program order
Read-to-read or read-to-write program order
Relaxing write atomicity:
Local update: can I read my own write while it is still pending in the write buffer (before it is propagated to other processors)?
Remote update: can I read a write value from Px before other processors, such as Py, see the value?

Under a relaxed model
There is a set of rules (specific to a memory consistency model) that must be observed for correct programming. The rules are mostly about inserting synchronization primitives, such as those that acquire and release a lock.
For example, in the release consistency model ("→" shows a precedence relation; "special" refers to synchronization operations):
acquire → all
all → release
special → special
Programmers should place synchronization primitives such that a certain ordering is enforced on a selected set of memory operations for the correctness of the program they write. Programming complexity? Program portability? This complexity is typically hidden by the programming framework one writes a program on, e.g., pthread, OpenMP, MPI. (A C11 sketch follows this page.)

MP (hardware) design issues
Most MPs are built from off-the-shelf commodity microprocessors.
How do we connect processors? An interconnection network.
How do we design microprocessors if we want to build MPs with them? Which primitives to support? Support for the interconnection network? Support for cache coherence?
MP memory consistency models? MP cache coherence?
Chip multiprocessors (CMP) vs. simultaneous multithreading (SMT).
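Mapping the release-consistency rules above onto C11 atomics, a producer/consumer sketch (illustrative variable names): the release store keeps all earlier accesses before it (all → release), and the acquire load keeps all later accesses after it (acquire → all).

#include <stdatomic.h>

int payload;                       /* ordinary shared data */
atomic_int ready = 0;

void producer(void) {
    payload = 42;                  /* must stay before the release store: all -> release */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

void consumer(void) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                          /* nothing below may move above this: acquire -> all */
    /* payload is guaranteed to read as 42 here */
}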

A taxonomy (by Flynn)
Count instruction streams and data streams:
SISD (Single Instruction stream, Single Data stream): uniprocessors (ILP)
MISD: not really used
SIMD: vector processors, multimedia processors
MIMD: this is the multiprocessor

Interconnection network
To connect processors and memories.
Issues: latency; bandwidth; cost (wires, switches, ports, ...); scalability.
Topology has been a focus of architects.

Direct vs. indirect connection
Direct connection: provides a direct inter-processor communication path. Examples: complete connection, ring, mesh, bus, ...
Indirect connection: provides a physically separate switching network for inter-processor communication. Examples: crossbar, multistage network, ...

Important network properties
Degree (relevant in direct connections): how many wires per node?
Diameter: the maximum distance between any pair of nodes; related to (worst-case) network latency.
Bisection width: given two partitions of equal size, what is the minimum number of edges that must be removed to separate them? Related to the aggregate bandwidth between the two halves of the network.
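To make the three properties concrete (standard textbook values, not from the slides): a bidirectional ring of N nodes has degree 2, diameter ⌊N/2⌋, and bisection width 2, while a hypercube of N = 2^d nodes has degree d = log2 N, diameter log2 N, and bisection width N/2. For N = 64, the ring's worst-case path is 32 hops against the hypercube's 6, and cutting the ring in half severs only 2 links against the hypercube's 32; the hypercube buys this with log2 N wires per node instead of 2.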

Cost issues
How many wires? How many input/output ports per node? How many switches? How complex is each switch: how many ports, and how are the ports connected inside?

Bus interconnect
A direct interconnection: endpoints are connected directly.
Discussion points: cost, bandwidth, scalability.

Crossbar
An indirect interconnect; the switch is implemented as big MUXes.
Discussion points: latency, bandwidth, cost.

Multistage network
An indirect interconnect; routing is done by address decoding.
Discussion points: latency, bandwidth, cost.

2-D torus
A direct interconnect.
Discussion points: latency, bandwidth, cost.
Variants: 1-D (ring), 3-D, mesh (wraparound cut).

Hypercube
A direct interconnect.
Discussion points: latency, bandwidth, cost.
Good scalability, but expensive switches.
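One consequence worth spelling out (a sketch, not from the slides): hypercube node IDs differ in one bit per dimension, so minimal routing fixes the differing bits one at a time, and the hop count is the Hamming distance between the source and destination IDs.

#include <stdio.h>

/* Minimum hop count between two hypercube nodes = Hamming distance of IDs. */
int hypercube_hops(unsigned src, unsigned dst) {
    unsigned diff = src ^ dst;     /* bits where the two IDs disagree */
    int hops = 0;
    while (diff) {
        hops += diff & 1u;         /* one hop per differing dimension */
        diff >>= 1;
    }
    return hops;
}

int main(void) {
    /* 4-D hypercube (16 nodes): node 0 to node 15 is 4 hops */
    printf("%d\n", hypercube_hops(0x0, 0xF));
    return 0;
}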

Fat tree
(Diagram only.)

Real world examples
IBM Blue Gene: 3-D torus interconnect
SGI Altix: dual fat tree
Earth Simulator / Dell PowerEdge cluster: crossbar

Network latency basics
Latency: the time for the first bit to hit the target.
Bandwidth: how fast can we inject the entire message into the network? B bits per second.
To send a message of L bits: Time = latency + L / B.
Total latency: Time = sender overhead + latency + L / B + receiver overhead.

Circuit switching
(Time-space diagram: a routing header propagates hop by hop (t_setup), an acknowledgement returns, then the data flows (t_data).)
A hardware path is set up by a routing header; an end-to-end acknowledgement initiates the data transfer at full hardware bandwidth.
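Plugging assumed numbers into the total-latency formula above: with sender and receiver overheads of 5 µs each, a network latency of 5 µs, B = 1 Gbit/s, and L = 1 Mbit, Time = 5 + 5 + (10^6 / 10^9 s) + 5 = 5 + 5 + 1000 + 5 = 1015 µs. For large messages the L / B term dominates; for short messages the fixed overheads do, which is why per-message overhead matters so much in fine-grained communication.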

Packet switching
(Time-space diagram: each packet is fully buffered at a node, t_r per hop, before moving on.)
The blocking delays of circuit switching are avoided, at the cost of increased storage requirements (full packets) at the nodes, plus packetization and in-order delivery requirements.

Virtual cut-through
(Time-space diagram: the header advances as soon as the next router can accept it; t_blocking is paid only when blocked.)
Messages cut through to the next router when possible; without blocking, messages are pipelined. The pipelining cycle time is the larger of the intra-router and inter-router flow-control delays. High-load behavior approaches that of packet switching.

Wormhole switching
(Time-space diagram: flits pipelined behind the header flit across the links.)
Messages are pipelined, but buffer space is on the order of a few flits. Messages cannot be interleaved over a channel, since routing information is associated only with the header flit. Base latency is equivalent to that of virtual cut-through.

Interconnect routing
Store-and-forward: switches buffer the entire message before forwarding it.
Latency = ((message_length / B) + fixed switch overhead) × (# of hops)
Buffer requirements?
Wormhole: split a message into smaller flits (flow control digits) and pipeline their transmission.
Latency = (message_length / B) + fixed switch overhead × (# of hops)
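A quick comparison of the two formulas with assumed numbers: a 2,000-bit message, B = 1 Gbit/s (so L / B = 2 µs), 0.5 µs of switch overhead, and 3 hops. Store-and-forward pays the full serialization at every hop: (2 + 0.5) × 3 = 7.5 µs. Wormhole pipelines the flits and pays the serialization once: 2 + 0.5 × 3 = 3.5 µs. The gap widens with message length and hop count, which is why modern interconnects pipeline.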

Programming a multiprocessor
Shared-memory model: implicit communication, explicit synchronization.
Message-passing model: explicit communication, implicit synchronization.

Example 1
Find the largest number among 100,000 numbers with 10 processors.
Steps? (A sketch follows below.)
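The steps follow the usual partition/compute/combine pattern; a pthread sketch (assumed structure, not the slide's steps): split the 100,000 numbers into 10 chunks, let each thread compute its chunk's maximum, then combine the 10 partial results serially.

#include <stdio.h>
#include <pthread.h>

#define N 100000
#define P 10

int data[N];                      /* the input numbers (filled elsewhere) */
int local_max[P];                 /* one slot per thread: no sharing, no lock needed */

void *chunk_max(void *arg) {
    long p = (long)arg;
    int lo = p * (N / P), hi = lo + N / P;
    int m = data[lo];
    for (int i = lo + 1; i < hi; i++)
        if (data[i] > m)
            m = data[i];
    local_max[p] = m;             /* publish this chunk's maximum */
    return NULL;
}

int main(void) {
    pthread_t t[P];
    for (long p = 0; p < P; p++)
        pthread_create(&t[p], NULL, chunk_max, (void *)p);
    for (int p = 0; p < P; p++)
        pthread_join(t[p], NULL); /* all partial maxima are ready after this */
    int best = local_max[0];
    for (int p = 1; p < P; p++)   /* combine: 10 comparisons, done serially */
        if (local_max[p] > best)
            best = local_max[p];
    printf("max = %d\n", best);
    return 0;
}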

Example 2
P processors process data from a shared queue. When a processor becomes ready, it accesses the queue to do more work. The queue data and queue pointers are shared.

Critical section
A piece of code that accesses shared data. Example: count = count + 1; // count is a shared variable
Before entering a critical section, get permission to do so: if two processes enter the same critical section, inconsistency may occur. Get a lock.
When exiting a critical section, release the lock.
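A sketch of the queue access under a lock (hypothetical circular-buffer layout): the whole pointer manipulation is one critical section, so two ready processors cannot pop the same item.

#include <pthread.h>

#define QSIZE 1024

pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
int queue[QSIZE];
int head = 0, tail = 0;              /* shared queue pointers */

/* Returns 1 and stores an item in *out, or 0 if the queue is empty. */
int dequeue(int *out) {
    int ok = 0;
    pthread_mutex_lock(&qlock);      /* enter the critical section */
    if (head != tail) {              /* queue data and pointers are shared */
        *out = queue[head];
        head = (head + 1) % QSIZE;
        ok = 1;
    }
    pthread_mutex_unlock(&qlock);    /* release the lock */
    return ok;
}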

Primitives
Atomic exchange, test-and-set, fetch-and-increment, barrier.
MIPS R4000 example:
LL (load linked): load a value and set a link bit associated with the cache line containing the target lock (the lock is again a memory value).
SC (store conditional): returns 1 if the store is done with the link bit still set; otherwise returns 0.
The link bit can be reset by an external event (e.g., another processor gets the lock by performing its SC before mine).

Test-and-set
(Example code on slide not transcribed; see the sketch below.)
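A C11 sketch (not the slide's code) of a test-and-set spinlock: atomic_exchange atomically writes 1 and returns the old value, and on an LL/SC machine such as MIPS the compiler lowers it to exactly the retry loop described above.

#include <stdatomic.h>

atomic_int lockvar = 0;            /* 0 = free, 1 = held */

void lock_acquire(void) {
    /* test-and-set: atomically write 1 and examine the old value;
       old value 0 means the lock was free and is now ours */
    while (atomic_exchange(&lockvar, 1) != 0)
        ;                          /* someone else holds it: spin */
}

void lock_release(void) {
    atomic_store(&lockvar, 0);     /* hand the lock back */
}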

Multiprocessor cache coherence
(Diagram: P0 and P1 each with a private cache; over time T0..T4 both cache A=x, then one processor writes A=y, leaving a stale copy in the other cache.)
A cache coherence mechanism provides a way to keep the caches coherent with respect to each other. It implements a set of memory consistency model semantics. The efficiency of lock mechanisms is determined by the cache coherence mechanism.

Cache coherence solutions
No caches? All traffic goes to main memory (a single coherent memory).
Do not allow caching of shared data: a simple software solution is feasible, but what about the performance penalty?
Flush shared data from the caches at important synchronization points: relatively simple but potentially costly.
Hardware-based cache coherence: enforce coherence with explicit communication when needed, at the cache block granularity.

Cache coherence protocols
A set of rules to keep caches coherent. Each processor node implements an FSM:
Monitor local references; when needed, tell other processors about local write operations.
Associate sharing information with each cache block.
Monitor external communications; invalidate or update local cache blocks.
Writes are the problem. Two basic approaches: write-update and write-invalidate.

Write-invalidate vs. write-update
Write-invalidate: on a write, send an invalidate command to all other processors that may have cached the same address. This is an address-only transaction.
Write-update: on a write, send an update command to all other processors that may have cached the same address. This is an address + data transaction.
Both schemes can be efficiently implemented in a bus-based multiprocessor: the bus is a broadcast medium, each bus transaction is seen by all, and each processor snoops on the bus.

MESI vs. MOESI
MESI: each cache block is in one of four states:
M(odified): the block has been modified by the local processor; this is the up-to-date copy and no one else caches it.
E(xclusive): this is an exclusive (clean) copy of the block.
S(hared): the block may be cached elsewhere; to update its content, first invalidate or update the other cached copies.
I(nvalid).
MOESI adds one more state, O(wner): the node holding a block in the owner state is responsible for supplying the data when it is requested.

MESI example
(Diagram: P0 and P1 each cache block A; when P0 writes A=y it sends "Invalidate!", its copy moves toward M while P1's copy goes S to I, and P1's next access to A is a cache miss.)
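A sketch of the per-block FSM each node implements (simplified: the bus interaction is folded away, and a read miss always loads into S, where a real protocol loads into E when no other cache holds the block).

typedef enum { I, S, E, M } mesi_t;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } event_t;

mesi_t mesi_next(mesi_t cur, event_t ev) {
    switch (ev) {
    case LOCAL_READ:    /* read miss fetches the block (simplified to S) */
        return (cur == I) ? S : cur;
    case LOCAL_WRITE:   /* invalidate all other copies, then modify */
        return M;
    case REMOTE_READ:   /* another cache reads: demote our M/E copy to shared */
        return (cur == M || cur == E) ? S : cur;
    case REMOTE_WRITE:  /* another cache writes: our copy is now stale */
        return I;
    }
    return cur;
}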

Snooping vs. directory
Snooping: resorts to broadcasting messages. Each node monitors the broadcast network to capture coherence requests; on each request, it looks up the L2 cache tags to see whether further action is needed.
Directory: the directory keeps track of who has cached which cache blocks. Central directory vs. distributed directory. What is the storage overhead of a directory? (A worked estimate follows this page.)

Multi-threading
Threads? They can simply be processes, or they can be more light-weight: shared data, shared heap, ...
Multi-threading better utilizes the available hardware when there is a long-latency event (e.g., a cache miss); the hardware supports fast switching between multiple threads.
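A quick answer to the directory storage question above (illustrative arithmetic, not from the slides): a full bit-vector directory keeps one presence bit per node for every memory block. With 64 nodes and 64-byte (512-bit) blocks, that is 64 directory bits per 512 data bits, a 12.5% overhead, and it grows linearly with node count; this is what motivates limited-pointer and hierarchical directory organizations.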

Multi-threading models
Software-based context switching: hardware traps on a long-latency operation, and software then switches the running process to another; there are issues with the overhead.
Hardware-based context switching: the hardware has data structures and control logic to support fast switching.
Multi-threading has also been extended to improve single-program performance. Simultaneous multithreading (SMT) was proposed and implemented to support multiple threads at the same time, improving throughput (and possibly single-program performance). IBM Power5 supports SMT. (The following slides are from an IBM Power5 presentation.)

Simultaneous multithreading (SMT)
(Slide content not transcribed.)

Superscalar to SMT
Relatively straightforward:
Separate PCs share the i-cache bandwidth.
The GPR/FPR mapper is expanded to handle two threads (a single high-order bit tells which thread).
Completion logic is replicated to track two threads.
A thread-id bit is added to most address/tag busses.

SMT design issues
We have a single i-cache: caching multiple threads may thrash each other's code. For fetching, we can take turns and fetch from different threads in different cycles.
For issuing, we can allow full competition or allocate issue slots to different threads; we may need to monitor workloads to find the available parallelism and allocate issue slots accordingly.
The data cache is also shared: do we need more associativity? Do we need more cache ports?

Resource sharing
Resources are shared among threads; the number of each resource is an important parameter to tune. Enforcing a balancing policy on resource usage may result in better utilization and performance.

Thread priority
Sometimes imbalance can help: no work for the opposite thread, a thread waiting on a lock, software-determined nonuniform balance, power management.
Solution: control the instruction decode rate, with 8 priority levels.

SMT summary
SMT has been adopted in mainstream processor designs: Intel Pentium 4 and others (hyperthreading), IBM Power5.
Implementing SMT on an existing superscalar processor was straightforward, and we don't have applications that fully utilize all the resources anyway.
SMT poses an interesting cost-performance point; its main goal is throughput, though. There are proposals to use a helper thread to improve the main thread's speed.
