COMP 322: Principles of Parallel Programming. Lecture 17: Understanding Parallel Computers (Chapter 2) Fall 2009



1 COMP 322: Principles of Parallel Programming. Lecture 17: Understanding Parallel Computers (Chapter 2). Fall 2009. Vivek Sarkar, Department of Computer Science, Rice University. October 2009.


2 Acknowledgments for today's lecture. Course text: Principles of Parallel Programming, Calvin Lin & Lawrence Snyder; includes resources available at 0,3110, ,00.html. Parallel Architectures, Calvin Lin, Lectures 5 & 6, CS380P, Spring 2009, UT Austin. A Gentler, Kinder Guide to the Multi-core Galaxy, ECE 4100/6100 guest lecture by Prof. Hsien-Hsin S. Lee, School of Electrical and Computer Engineering, Georgia Tech.

3 Design Space. How should we build parallel computers? There are many dimensions to consider; here we discuss just a few of them: interconnection network, memory model, and number of processors. Can we build parallel computers that scale to large sizes?

4 Interconnection Networks: Buses. Simple design. Each processor typically has its own cache, and these caches are typically kept coherent using some protocol. The bus is a single shared resource, which limits scalability. [Figure: four processors (P) sharing a bus to memory.] Examples: Core Duo, Sun Enterprise 5500, SGI Power Challenge, IBM Power4.

5 Interconnection Networks: MIN. MIN: multi-stage interconnection network. O(log P) stages of switches provide more concurrency than a bus, hence better scalability. Latency is longer, since each transfer requires O(log P) hops. [Figure: processors (P) connected to memories (M) through log P switch stages.] Examples: IBM SP-2, IBM RP3, BBN Butterfly GP1000.

6 Interconnection Networks: Meshes. Fixed point-to-point communication links arranged in a mesh or torus. Simpler to build than a MIN. Bisection bandwidth does not scale well with P. Bisection bandwidth: the minimum bandwidth available across any cut that divides the processors into two equal halves. Latency is O(√P) for a two-dimensional mesh. [Figure: a mesh of processor/memory (P, M) nodes with the bisection marked.] Examples: Intel Paragon, Cray T3E, Meiko.
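The scaling claims on the MIN and mesh slides can be made concrete with a small calculation. The following is a minimal sketch, not taken from the lecture: the function names are hypothetical, the MIN is assumed to be built from 2x2 switches, and the mesh is assumed to be a square two-dimensional mesh without wraparound links.

```python
import math

def min_hops(p: int) -> int:
    """Switch stages crossed in a MIN with p endpoints (assumes 2x2 switches)."""
    return math.ceil(math.log2(p))

def mesh_max_hops(p: int) -> int:
    """Worst-case hop count, corner to corner, in a sqrt(p) x sqrt(p) mesh."""
    side = math.isqrt(p)
    return 2 * (side - 1)

def mesh_bisection_links(p: int) -> int:
    """Links cut when a sqrt(p) x sqrt(p) mesh is split into two equal halves."""
    return math.isqrt(p)

# Compare how the three quantities grow with the processor count.
for p in (16, 64, 1024):
    print(p, min_hops(p), mesh_max_hops(p), mesh_bisection_links(p))
```

Under these assumptions the mesh's worst-case hop count and its bisection both grow like √P, while the MIN's hop count grows like log P.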

7 Hierarchical Interconnection Networks. Fat trees: Thinking Machines CM-5. Ring of rings: Kendall Square Research KSR-2. Clusters and hybrids.

8 Modern Interconnection Networks. Commercially available: Myrinet (point-to-point routed network), Quadrics (fat tree topology), InfiniBand (designed for I/O systems), ...

9 Interconnection Networks. Is the topology important? It used to be the focus of portability studies, e.g., how do you embed a tree structure in a mesh? In reality, the topology does not matter too much to programmers; instead, software overhead dominates. What is important? Latency and bandwidth. Topologies affect both of these characteristics. As the number of processors grows, latency tends to grow.

10 Impact of the Network. Network performance has two components: bandwidth and latency. Effects on the programming model: the high cost of communication changes what you compute. Compare the Pony Express with instant messaging.

11 Dealing with Slow Communication. Long latency, low bandwidth (the Pony Express): send as little mail as possible and don't send time-sensitive material; expend effort to reduce communication. Long latency, high bandwidth (slow boat to China): send as much material as you like, but avoid time-sensitive material; expend effort to reduce the number of communication operations. Low latency, low bandwidth (instant messaging): send whatever you want whenever you want, especially if Daddy is paying!
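The advice to reduce the number of communication operations on a long-latency network follows from a simple cost model. The sketch below assumes the common approximation that each message costs a fixed latency plus its size divided by the bandwidth; the function name and the numbers are illustrative and do not come from the lecture.

```python
def transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s, n_messages=1):
    """Estimated time to move n_bytes split across n_messages.

    Assumed model: every message pays the full latency once, and the
    payload streams at the given bandwidth.
    """
    return n_messages * latency_s + n_bytes / bandwidth_bytes_per_s

# Hypothetical network: 1 ms latency, 1 GB/s bandwidth, 1 MB of data.
many_small = transfer_time(1_000_000, 1e-3, 1e9, n_messages=1000)
one_large = transfer_time(1_000_000, 1e-3, 1e9, n_messages=1)
print(many_small, one_large)
```

In this model the same data costs roughly 1 s when sent as 1000 small messages and roughly 2 ms when sent as one large message, which is why batching matters when latency is long.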

12 Memory Models. Memory model: the view of memory that is presented to the programmer. Shared address space: all processors share a single address space; also known as shared memory (SM). Distributed memory: each processor sees a disjoint view of memory; also known as non-shared memory (NSM).

13 Shared Address Space Machines. Goal: simplify the programmer's task. Disadvantage: it is difficult to build really large SM machines. Symmetric vs. asymmetric: Uniform Memory Access (UMA) vs. Non-Uniform Memory Access (NUMA); NUMA machines have physically distributed memory and are more scalable. These are performance characteristics. Cache coherence: cached values are transparently kept consistent by the hardware. This is a semantic issue. Some SM machines do not provide cache coherence (e.g., Cray T3E).

14 Intel Core Duo. Single chip containing: two 32-bit Pentium processors (P0, P1); a 32KB Level 1 instruction cache (L1-I) and a 32KB Level 1 data cache (L1-D) per processor; a shared unified 2MB or 4MB Level 2 cache (L2), which supports fast on-chip communication between P0 and P1 and supports a sequential consistency memory model for sharing between P0 and P1; and a memory bus controller, which accesses off-chip memory via the Front Side Bus (FSB).

15 Intel Core 2 Duo. Homogeneous cores. Bus-based on-chip interconnect. Shared on-die cache memory. Traditional I/O. Classic OOO: reservation stations, issue ports, schedulers, etc. Large, shared, set-associative cache with prefetch, etc. Source: Intel Corp.

16 Core 2 Duo Microarchitecture. [Figure: Core 2 Duo microarchitecture diagram.]

17 Cache Coherence on Bus-Based Machines. Snooping caches: each cache snoops on the bus, looking for accesses to data that it holds. On a read, the cache can return the value faster than main memory. On a write, the cache either invalidates or updates the value that it holds; invalidates are the norm because they reduce network traffic. [Figure: four processors (P) with snooping caches on a bus to memory.] Can we provide cache coherence with other types of interconnection networks? Yes: use broadcast and multicast operations to support snooping.
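The write-invalidate behaviour described on this slide can be illustrated with a toy simulation. This is a minimal sketch under simplifying assumptions that are not in the lecture: caches are write-through, have unlimited capacity, and track whole addresses rather than cache lines. The class and method names are hypothetical.

```python
class SnoopingCache:
    """One processor's private cache, modelled as address -> value."""

    def __init__(self, name):
        self.name = name
        self.lines = {}

    def snoop_write(self, addr):
        # Another processor wrote addr: drop our stale copy (invalidate).
        self.lines.pop(addr, None)


class Bus:
    """Shared bus connecting all caches to main memory."""

    def __init__(self, n_caches):
        self.memory = {}
        self.caches = [SnoopingCache(f"P{i}") for i in range(n_caches)]

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache.lines:               # miss: fetch from memory
            cache.lines[addr] = self.memory.get(addr, 0)
        return cache.lines[addr]

    def write(self, cpu, addr, value):
        for i, cache in enumerate(self.caches):   # every other cache snoops
            if i != cpu:
                cache.snoop_write(addr)
        self.caches[cpu].lines[addr] = value
        self.memory[addr] = value                 # write-through (assumption)


bus = Bus(2)
bus.read(0, 0x10)
bus.read(1, 0x10)          # both caches now hold address 0x10
bus.write(0, 0x10, 7)      # P1's copy is invalidated
print(0x10 in bus.caches[1].lines, bus.read(1, 0x10))
```

After the write, P1 no longer holds the address, so its next read misses and picks up the new value from memory instead of a stale copy.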

18 AMD Dual Core Opteron. Single chip containing: two AMD64 processors (P0, P1); a 64KB Level 1 I-cache (L1-I) and a 64KB Level 1 D-cache (L1-D) per processor; a separate 1MB Level 2 cache (L2) per processor; a System Request Interface (SRI), which handles coherence between P0's and P1's caches and also supports fast intra-chip communication between them; and a memory bus controller, which accesses off-chip memory via the HyperTransport interface.

19 Directory-Based Cache Coherency. Use indirection: a directory manages access to each page of memory. It maintains the state of each page (e.g., shared, exclusive, dirty) and keeps track of the various cached copies. All memory accesses go through the directory. The directory can be distributed to increase concurrency and reduce contention. The added indirection increases latency. Scalability? Early studies (DASH) used 64 processors; there are few studies on larger numbers of processors.
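The bookkeeping a directory performs can be sketched as follows. This is an illustrative simplification, not the DASH protocol: it tracks only the states "uncached", "shared", and "exclusive" (the slide's "dirty" state is omitted), keeps a single centralized directory, and uses hypothetical names throughout.

```python
class Directory:
    """Tracks, for each page, its state and which processors cache it."""

    def __init__(self):
        self.entries = {}  # page -> {"state": str, "sharers": set of cpu ids}

    def _entry(self, page):
        return self.entries.setdefault(
            page, {"state": "uncached", "sharers": set()}
        )

    def read(self, cpu, page):
        """Record a read; any exclusive owner is downgraded to a sharer."""
        entry = self._entry(page)
        entry["sharers"].add(cpu)
        entry["state"] = "shared"
        return set(entry["sharers"])

    def write(self, cpu, page):
        """Record a write; return the processors whose copies must be invalidated."""
        entry = self._entry(page)
        to_invalidate = entry["sharers"] - {cpu}
        entry["sharers"] = {cpu}
        entry["state"] = "exclusive"
        return to_invalidate


directory = Directory()
directory.read(0, "A")
directory.read(1, "A")
print(directory.write(2, "A"))   # copies held by processors 0 and 1 are invalidated
```

Unlike snooping, only the processors listed as sharers need to be notified, which is what lets the scheme work without a broadcast medium, at the cost of the extra lookup on every access.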


20 Case Study: The KSR-2. Goal: the best of both worlds. Provide the scalability of distributed physical memory and the programmability of an SM machine. COMA (Cache-Only Memory Architecture): instead of allocating each memory location to a fixed home, allow the data to move to where it's used, as is done with caches.

21 The KSR-2 (cont). Performance: typically exhibits poor performance when more than one ring is used, i.e., not very scalable. Problems: the worst of both worlds? Distributed physical memory implies large non-local access times. The COMA protocol makes it impossible for the programmer to control locality. The ping-pong effect can kill performance.
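The ping-pong effect can be illustrated by counting how often a single data item has to move between processors. The sketch below assumes a deliberately crude model in which every access by a processor other than the current holder migrates the item; the function name is hypothetical and real COMA protocols are more elaborate.

```python
def count_migrations(access_sequence, initial_owner=0):
    """Count moves of one data item, given the sequence of accessing processors."""
    owner = initial_owner
    migrations = 0
    for cpu in access_sequence:
        if cpu != owner:      # the item must move to the accessing processor
            owner = cpu
            migrations += 1
    return migrations


alternating = [0, 1] * 4              # two processors take turns
grouped = [0, 0, 0, 0, 1, 1, 1, 1]    # same accesses, grouped by processor
print(count_migrations(alternating), count_migrations(grouped))
```

The same eight accesses cause seven migrations when interleaved and one when grouped, and each migration pays the large non-local access time mentioned above.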

22 Case Study: The Tera MTA (aka Cray MTA). The logical extreme in SM computers: provide the illusion of uniform access to memory even as P scales to large values. The key idea: use parallelism to hide latency. Each processor supports multiple threads. At each clock cycle, the processor switches to another thread. Latency is hidden because by the time a thread is scheduled again, any expensive memory access it issued has already completed. [Figure: a multithreaded processor with one register file per thread, 128 threads.]
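The latency-hiding idea can be checked with a small simulation. This is a sketch under assumptions that are not from the lecture: every instruction is a memory access with the same fixed latency, a thread cannot issue again until that access completes, and the processor picks ready threads in round-robin order. The function name is hypothetical.

```python
def utilization(n_threads, mem_latency, cycles=12_800):
    """Fraction of cycles in which some thread was ready to issue."""
    ready_at = [0] * n_threads   # cycle at which each thread may issue again
    busy = 0
    start = 0                    # round-robin starting position
    for cycle in range(cycles):
        for k in range(n_threads):
            t = (start + k) % n_threads
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + mem_latency   # thread waits on memory
                busy += 1
                start = (t + 1) % n_threads
                break
    return busy / cycles


print(utilization(128, 128))   # enough threads to cover the latency
print(utilization(32, 128))    # too few threads: the processor idles
```

In this model the processor stays fully busy once the number of threads reaches the memory latency in cycles; with fewer threads, utilization drops to roughly the ratio of threads to latency.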

23 The Tera MTA (cont). Massive parallelism: how do you get so much parallelism? Exploit parallelism at many levels: instruction level, within basic blocks, across different processes, and between user code and OS code. Advantage: supports hard-to-parallelize applications. Disadvantage: everything is custom designed, using GaAs instead of CMOS technology.

24 The Tera MTA (cont). Interconnection topology: sparsely populated 3D torus. Memory: randomized memory allocation to reduce contention; no caches. Why? With P processors and latency L to memory, the network must hold P × L messages if each processor is to be busy each cycle. As L grows, we need to reduce P. This is why urban sprawl is bad.
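The P × L requirement is simple arithmetic, shown below as a sketch. The function names and the example numbers are hypothetical, and latency is measured in processor cycles with one memory request issued per processor per cycle.

```python
def messages_in_flight(processors, latency_cycles):
    """Messages the network must hold to keep every processor busy each cycle."""
    return processors * latency_cycles


def max_busy_processors(network_capacity, latency_cycles):
    """Processors a network of fixed capacity can keep busy at a given latency."""
    return network_capacity // latency_cycles


capacity = messages_in_flight(256, 100)      # 256 processors, 100-cycle latency
print(capacity)
print(max_busy_processors(capacity, 200))    # latency doubles, capacity fixed
```

With the network capacity held fixed, doubling the latency halves the number of processors that can be kept busy, which is the trade-off the slide describes.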

25 The Tera Computer Epilogue. MTA-1: delivered in the late 1990s; set a record for integer sort in 1997. MTA-2: follow-on to the MTA-1, implemented in CMOS technology; impressive speedups on hard problems [Anderson, et al SC2003]. Lessons: with a good design, good performance can be delivered for a variety of application domains. Aside: Tera recognized the importance of good tools and made a large compiler effort with excellent personnel. In 2000, Tera Computer Co. bought Cray Research from SGI and renamed itself Cray Inc.

26 Distributed Memory. Goal: provide a scalable architecture. Processes communicate through messages. Disadvantage: often considered more difficult to program. The distributed memory model is often mistakenly used synonymously with message passing; this is a short-sighted view, as we can imagine divorcing the programming model from the hardware substrate. Examples: most of the larger machines are distributed memory machines.

27 The Law of Nature: big fish eat little fish.

28 The Killer Micros. Economies of scale: sales of microprocessors took off in the 80s. Supercomputers with custom-designed processors found it difficult to compete against those with commodity processors.

29 Networks of Workstations (NOW, COW). Use a distributed system as a supercomputer. Don't just reuse the CPU; reuse the entire workstation, including the CPU, memory, and I/O interface. This views parallel computing as an extension of distributed computing. Some claim that Networks of Workstations provide parallel computing for free. Problems: software is still not a commodity part; moreover, the simpler the hardware, the more the software needs to do. Workstations are typically not designed with NOWs in mind, so some components are not quite right, e.g., the network interface needs to be redesigned.

30 Clusters. Basic idea: build distributed memory machines from commodity parts, perhaps with some redesign, e.g., different form factors for rack-mounting. Connect these workstations with high-speed commodity networks. Advantages: scalable price/performance; can grow the system incrementally; relatively low cost. Disadvantages: relatively high communication latency compared to CPU speed.

31 The Landscape of Parallel Architectures. [Figure: machines plotted by memory model (vector, coherent shared memory, shared address space, NSM) against number of processors, with tick labels "16" and "64,". Machines shown: Cray Y-MP vector supercomputer, Core Duo, KSR-2, Cray MTA-2, Cell, GPUs, clusters, Earth Simulator, BlueGene/L.]
