Parallel Programming Platforms

Size: px

Start display at page:

Download "Parallel Programming Platforms"

Brittany Cain
5 years ago
Views:

1 arallel rogramming latforms Ananth Grama Computing Research Institute and Department of Computer Sciences, urdue University Reference: Introduction to arallel Computing, Ananth Grama, Anshul Gupta, Vipin Kumar, George Karypis, Addison Wesley, ISBN: , 2003

2 Motivating arallelism The Computational Speed Argument: For some applications, this is the only means of achieving needed performance The Memory/Disk Speed Argument: For some other applications, the needed I/O throughput can be provided only by a collection of nodes The Data Communication Argument: In yet other applications, the distributed nature of data implies that it is unreasonable to collect data to process it at a single location

3 Implicit arallelism: Trends in Microprocessor Architectures All microprocessors currently available rely on parallelism to varying degrees These are generally hidden from the programmer ipelining and Superscalar Execution Very Long Instruction Word rocessors

4 Superscalar Execution 1 load 2 load 3 add 4 add 5 add R1, R2 6 store (i) 1 load 2 add 3 add 4 add 5 store (ii) 1 load 2 add 3 load 4 add 5 add R1, R2 6 store (iii) (a) Three different code fragments for adding a list of four numbers Instruction cycles IF IF ID ID OF OF load load IF ID OF E IF ID OF E add add IF: Instruction Fetch ID: Instruction Decode OF: Operand Fetch E: Instruction Execute WB: Write back NA: No Action IF ID NA E add R1, R2 IF ID NA WB store (b) Execution schedule for code fragment (i) above Clock cycle Horizontal waste Vertical waste Full issue slots Empty issue slots Adder Utilization (c) Hardware utilization trace for schedule in (b)

5 Limitations of Memory System erformance Often, the primary bottleneck to performance is not the CU, rather, it is the memory system Typical computations only run at 10-50% of peak CU utilization because of memory bottlenecks The key question here is how to connect a 50 ns latency memory to a processor that runs a 05 ns clock! Improving Effective Memory Latency Using Caches: A hierarchy of small, fast stores bridge the gap between processor and memory These stores rely on repeated accesses to data to deliver higher aggregate performance Improving Memory System erformance by Threading: Find something else to do while you are waiting for data to arrive from memory

6 Impact of Strided Access on rogram erformance // Fragment 1: Summing columns of a matrix for (i = 0; i < 1000; i++) column_sum[i] = 00; for (j = 0; j < 1000; j++) column_sum[i] += b[j][i]; // Fragment 2: Fragment 1, rewritten for (i = 0; i < 1000; i++) column_sum[i] = 00; for (j = 0; j < 1000; j++) for (i = 0; i < 1000; i++) column_sum[i] += b[j][i]; = A b A b A b A b (a) Column major data access = = = = A b A b A b A b (b) Row major data access

7 Dichotomy of arallel Computing latforms Control Structure of arallel latforms: What is the nature of concurrent tasks? Communication Model of arallel latforms: How do multiple tasks cooperate with each other? Message assing latforms Shared-Address-Space latforms

8 Control Structure of arallel Machines: E: rocessing Element Global control unit E E E E INTERCONNECTION NETWORK E + control unit E + control unit E + control unit INTERCONNECTION NETWORK E E + control unit (a) (b) In a Single Instruction Multiple Data (SIMD) paradigm (a), all processing elements execute the same instruction In a Multiple Instruction Multiple Data paradigm (b), all processing elements execute possibly different instructions, independently An intermediate paradigm, called Single rogram Multiple Data (SMD) is the most popular programming paradigm

9 Communication Model of arallel latforms Interconnection Network (a) M M M C C C Interconnection Network (b) M M M C M C M C M (c) Interconnection Network In a uniform memory access (UMA) shared memory model (a), all processors access memory through an interconnect In (b), we show a UMA shared memory machine with caches, and in (c), we show a non-uniform memory access (NUMA) shared memory machine with private memories only The logical name for all of these platforms is a shared address space machine

10 hysical Organization of arallel latforms The characterizing feature of parallel platforms is the underlying interconnection network These networks can be static (a) or dynamic (b) Static network Indirect network Network interface/switch rocessing node Switching element

11 Direct Interconnection Networks: Address Data Shared Memory rocessor 0 rocessor 1 (a) Address Data Shared Memory Cache / Local Memory Cache / Local Memory rocessor 0 rocessor 1 (b) Bus-based interconnects (without (a) and with caches (b)) were the first networks in early commercially available platforms (Sequent Symmetry/Balance)

12 Direct Interconnection Networks: Memory Banks b 1 A switching element rocessing Elements p 1 The other extreme in terms of performance and cost compared to buses, is the crossbar network

13 Direct Interconnection Networks rocessors Multistage interconnection network Memory banks Stage 1 Stage 2 Stage n p-1 b-1 Multistage networks such as the Omega network fall between buses and crossbars in terms of cost and performance

14 Static Interconnection Networks: (a) (b) A completely connected network (a) is the strongest model for a static network A star connected network (b) is the other extreme (a) (b) (c) Meshes (2 and 3-D) are popular interconnects because of their desirable layout properties and performance for physical simulations

15 Static Interconnection Networks: Hypercubes provide another popular interconnect D hypercube 1-D hypercube 2-D hypercube 3-D hypercube D hypercube

16 Metrics for Interconnection Networks erformance Metrics - Link Bandwidth: how fat is each link? - Arc Connectivity: how many links do I have to remove to separate a network - Bisection Bandwidth: how many pairs of people can have conversations at any given time, independently A B C Cost Metrics - Number of Links - Layout Costs

17 Cache Coherence in Multiprocessor Systems: How do you deal with multiple copies of the same data item, being manipulated by different processors? 0 load x 1 load x 0 write #3, x 1 x = 1 x = 1 x = 3 x = 1 x = 1 Memory Invalidate x = 1 Memory 0 load x 1 load x (a) 0 write #3, x 1 x = 1 x = 1 x = 3 x = 3 x = 1 Memory Update x = 3 Memory (b) We can invalidate copies (a), or update them (b) when a processor changes a copy

18 Coherence rotocols: read Shared read write C_read C_write Invalid write Dirty read/write C_write flush A simple three-state protocol for implementing cachecoherence

19 Coherence rotocols: An example of the coherence protocol Time Instruction at rocessor 0 Instruction at rocessor 1 Variables and their states at rocessor 0 Variables and their states at rocessor 1 Variables and their states in Global mem x = 5, D y = 12, D read x x = 5, S x = 5, S read y y = 12, S y = 12, S x = x + 1 y = y + 1 x = 6, D y = 13, D x = 5, I y = 12, I read y y = 13, S y = 13, S y = 13, S read x x = 6, S x = 6, S x = 6, S x = x + y x = 19, D x = 6, I x = 6, I y = x + y y = 13, I y = 19, D y = 13, I x = x + 1 x = 20, D x = 6, I y = y + 1 y = 20, D y = 13, I

20 Implementing Coherence rotocols: rocessor rocessor rocessor Snoop H/W Tags Cache Snoop H/W Tags Cache Snoop H/W Tags Cache Dirty Address/dat Memory Using a snoopy bus to implement the coherence protocol Snoopy buses do not scale to large configurations

21 Implementing Coherence rotocols: Directories provide a more scalable solution than snoopy buses rocessor rocessor Cache Cache rocessor Cache Interconnection Network Memory rocessor resence bits / State Interconnection Network rocessor resence Bits Data Cache Cache State Directory Memory resence bits / State (a) (b)

22 Routing Messages in arallel latforms: - Store-and-forward routing - acket Routing - Cut-Through routing Time 0 1 (a) A single message sent over a store-and-forward network 2 3 Time 0 1 (b) The same message broken into two parts and sent over the network 2 3 Time 0 1 (c) The same message broken into four parts and sent over the network 2 3

23 Routing in arallel latforms: - Dimension ordered / E-cube routing pd pd pd ps ps ps 011 Step 1 ( ) Step 2 ( ) - Hot-potato routing - Randomized Routing

24 Embeddings and Overhead: It is often possible to map a weaker architecture on to a stronger one with no performance overhead Conversely, mapping a stronger architecture on to a weaker one results in performance penalties, depending on the program Algorithms for mapping between structured networks are well studied (see handout)

25 Case Studies: The IBM Blue Gene (a) CU (1GF) (b) Chip (32 GF) (c) Board (2 TF) (d) Tower (16 TF) (e) Blue Gene (1 F)

26 The Cray T3E Memory Control Router (a) (b)

27 SGI Origin 3000 I//D/X Brick C-Brick C-Brick R-Brick rocessor Crossbar C-Brick C-Brick C-Brick Memory/ Directory C-Brick R-Brick C-Brick C-Brick C-Brick 32 rocessor Configuration R-Brick To 4 C-Bricks (16 processors) Metarouter To 8 other R-Bricks To meta-router 1 R-Brick, 4 C-Bricks, and 16 processors at each vertex 128 rocessor Configuration 128 processors 512 rocessor Configuration

28 Sun Enterprise Servers Address bus System Board System Board System Board 32 byte data bus Four address buses System Board System Board System Board 16 x 16 non-blocking crossbar Sun Ultra 6000 (6-30 processors) Starfire Ultra 1000 (up to 64 processors)

29 Commonly used networks: Myrinet: Interface Interface Interface Interface Interface Switch Switch Switch Switch Switch Switch Interface T Fiber to external network Interface Switch Switch Switch Interface Interface Interface Interface Other networks include Gigabit Ethernet, FiberChannel, HiI, etc

Limitations of Memory System Performance

Limitations of Memory System Performance Slides taken from arallel Computing latforms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar! " To accompany the text ``Introduction to arallel Computing'', Addison Wesley, 2003. Limitations