Parallel Computing Platforms
Jinkyu Jeong (jinkyu@skku.edu)
Computer Systems Laboratory, Sungkyunkwan University
http://csl.skku.edu

Elements of a Parallel Computer
- Hardware
  - Multiple processors
  - Multiple memories
  - Interconnection network
- System software
  - Parallel operating system
  - Programming constructs to express/orchestrate concurrency
- Application software
  - Parallel algorithms
- Goal: utilize the hardware, system, and application software to
  - Achieve speedup: ideally $T_p = T_s / p$, where $T_s$ is the serial execution time and $T_p$ is the execution time on $p$ processors
  - Solve problems requiring a large amount of memory
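As a worked illustration of the ideal speedup relation on this slide, the numbers below are assumed for the sake of the example and do not come from the lecture. Real programs usually fall short of this bound because of communication and synchronization costs.

```latex
\[
T_s = 120\ \text{s},\quad p = 8
\;\Longrightarrow\;
T_p = \frac{T_s}{p} = \frac{120\ \text{s}}{8} = 15\ \text{s},
\qquad
S = \frac{T_s}{T_p} = \frac{120\ \text{s}}{15\ \text{s}} = 8 = p .
\]
```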

Parallel Computing Platform
- Logical organization
  - The user's view of the machine as presented via its system software
- Physical organization
  - The actual hardware architecture
- Physical architecture is to a large extent independent of the logical architecture
  - Ex) message passing on a shared memory architecture, distributed shared memory system

Logical Organization Elements
- Control mechanism: Flynn's taxonomy
  - SISD: Single Instruction stream, Single Data stream (single-core processor)
  - MISD: Multiple Instruction stream, Single Data stream (not covered)
  - SIMD: Single Instruction stream, Multiple Data stream
  - MIMD: Multiple Instruction stream, Multiple Data stream (multi-core processor)

SIMD vs. MIMD
[Figure: SIMD architecture and MIMD architecture]

SIMD
- Exploit data parallelism
  - The same instruction on multiple data items

    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i];

[Figure: arrays a, b, and c laid out on 16-byte boundaries. The SIMD unit loads b0..b3 into vector register vr1 and c0..c3 into vr2, adds them lane by lane into vr3 (b0+c0, b1+c1, b2+c2, b3+c3), and stores the result to a0..a3. The following groups of four elements are handled the same way.]
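The loop on this slide can be sketched in portable C as a scalar model of what the SIMD unit does. This is only an illustration: `vec_add` and `LANES` are hypothetical names, the four-lane width is assumed from the 16-byte boundaries in the figure (four 4-byte floats), and a real build would rely on compiler auto-vectorization or on intrinsics rather than on these inner loops.

```c
#include <stddef.h>

#define LANES 4  /* assumed: four 4-byte floats per 16-byte vector register */

/* Model of a[i] = b[i] + c[i] executed four lanes at a time. */
void vec_add(float *a, const float *b, const float *c, size_t n)
{
    size_t i = 0;

    /* Main loop: each iteration stands for one vector load/add/store group. */
    for (; i + LANES <= n; i += LANES) {
        float vr1[LANES], vr2[LANES], vr3[LANES];
        for (size_t l = 0; l < LANES; l++) vr1[l] = b[i + l];         /* load b  */
        for (size_t l = 0; l < LANES; l++) vr2[l] = c[i + l];         /* load c  */
        for (size_t l = 0; l < LANES; l++) vr3[l] = vr1[l] + vr2[l];  /* SIMD add */
        for (size_t l = 0; l < LANES; l++) a[i + l] = vr3[l];         /* store a */
    }

    /* Remainder loop: leftover elements when n is not a multiple of LANES. */
    for (; i < n; i++)
        a[i] = b[i] + c[i];
}
```

Calling `vec_add(a, b, c, n)` gives the same result as the original scalar loop. The remainder loop is there because `n` need not be a multiple of the vector width.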

SIMD
- Exploit data parallelism
  - The same instruction on multiple data items
- SIMD units in processors
  - Supercomputers: BlueGene/Q
  - PC: MMX/SSE/AVX (x86), AltiVec/VMX (PowerPC)
  - Embedded systems: Neon (ARM), VLIW+SIMD DSPs
  - Co-processors: NVIDIA GPGPU

MIMD
- Multiple instructions on multiple data items
  - A collection of independent processing elements (or cores)
- Usually exploits thread-level parallelism
- Modern parallel computing platforms
- SIMD can also be used within such a system

Programming Model
- What the programmer uses in coding applications
- Specifies communication and synchronization
  - Instructions, APIs, defined data structures
- Programming model examples
  - Shared address space
    - Load/store instructions to access the data for communication
  - Message passing
    - Special system library, APIs for data transmission
  - Data parallel
    - Well-structured data, same operation applied to multiple data in parallel
    - Implemented with shared address space or message passing

Shared Address Space Architecture
- Shared address space
  - Any processor can directly reference any memory location
  - Communication occurs implicitly as a result of loads and stores
- Location transparency (flat address space)
- Similar programming model to time-sharing on uniprocessors
  - Except processes run on different processors
  - Good throughput on multi-programmed workloads
- Popularly known as the shared memory machine/model
  - Memory may be physically distributed among processors

Shared Address Space Architecture
- Multi-processing
  - One or more threads in a virtual address space
  - Portions of the address spaces of processes are shared
  - Writes to a shared address are visible to other threads/processes
- Natural extension of the uniprocessor model
  - Conventional memory operations for communication
  - Special atomic operations for synchronization
[Figure: virtual address spaces for a collection of processes communicating via shared addresses, mapped onto the machine physical address space]
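The two mechanisms named on this slide, conventional memory operations for communication and atomic operations for synchronization, can be sketched as follows. This is a minimal sketch assuming an OpenMP-capable C compiler; `count_matches` is a hypothetical function name. If OpenMP is not enabled, the pragmas are ignored and the loop simply runs serially with the same result.

```c
#include <stddef.h>

/*
 * Count how many elements of data[] equal key.
 * data[] and hits live in one address space shared by all threads.
 */
long count_matches(const int *data, size_t n, int key)
{
    long hits = 0;  /* shared: every thread updates this same location */

    /* With OpenMP enabled, the iterations are divided among threads. */
    #pragma omp parallel for shared(hits)
    for (long i = 0; i < (long)n; i++) {
        if (data[i] == key) {      /* ordinary load of shared data */
            /* Atomic read-modify-write, so no increment is lost. */
            #pragma omp atomic
            hits++;
        }
    }
    return hits;
}
```

No data is sent explicitly: threads communicate only by reading `data` and updating `hits`. Without the atomic update, two threads could read the same old value of `hits` and one increment would be lost.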

x86 Examples: Shared Address Space
- Quad-core processors
- Highly integrated, commodity systems
- Multiple cores on a chip
  - Low-latency, high-bandwidth communication via shared cache
[Figure: two multi-core chips, Intel i7 (Nehalem) and AMD Phenom II (Barcelona), with cores connected through a shared cache (labels: Shared L2 Cache, Shared L3 Cache)]

Earlier x86 Example: Intel Pentium Pro Quad
- All coherence and multiprocessing glue in the processor module
- High latency and low bandwidth
[Figure: four P-Pro modules (each with CPU, interrupt controller, 256-KB L2 cache, and bus interface) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), together with two PCI bridges (each to a PCI bus with PCI I/O cards) and a memory controller with MIU to 1-, 2-, or 4-way interleaved DRAM]

Shared Address Space Architecture
- Physical organization
  - Shared memory system
    - Uniform memory access (UMA)
    - Non-uniform memory access (NUMA)
  - Distributed memory system
    - Cluster of shared memory systems
    - Hardware- or software-based distributed shared memory (DSM)
[Figure: UMA system, NUMA system, distributed memory system]

Scaling Up
[Figure: "Dance Hall" (UMA): processors (P) with caches ($) on one side of the network and memories (M) on the other. Distributed Memory (NUMA): each node has a processor, a cache, and a local memory, and the nodes are connected by the network.]
- Problem is the interconnect: cost (crossbar) or bandwidth (bus)
- Shared memory (uniform memory access, UMA)
  - Latencies to memory are uniform, but uniformly large
- Distributed memory (non-uniform memory access, NUMA)
  - Construct a shared address space out of simple message transactions across a general-purpose network
- Cache: keeps shared data (local data, and non-local data in NUMA)

Example: SGI Altix UV 1000
- Scales up to 262,144 cores
- 16 TB shared memory
- 15 GB/sec links
- Multistage interconnection network
- Hardware cache coherence
  - ccNUMA

Parallel Programming Models
- Shared Address Space
- Message Passing
- Data Parallel

Message Passing Architectures
- Message passing architectures
  - Complete computer as building block
  - Communication via explicit I/O operations
- Programming model
  - Directly access only private address space (local memory)
  - Communicate via explicit messages (send/receive)
- High-level block diagram similar to a distributed-memory shared address space (SAS) system
  - But communication is integrated at the I/O level, not the memory level
  - Easier to build than scalable SAS

Message Passing Abstraction
[Figure: process P executes "Send X, Q, t" on address X in its local process address space; process Q executes "Receive Y, P, t" on address Y in its local process address space; the send and the receive are matched.]
- Message passing
  - Send specifies the buffer to be transmitted and the receiving process
  - Recv specifies the sending process and the buffer to receive into
  - Can be a memory-to-memory copy, but processes need to be named
  - Optional tag on send and matching rule on receive
  - Many overheads: copying, buffer management, protection
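The send/receive matching rule above can be sketched with a toy, single-threaded mailbox. Everything here is hypothetical and for illustration only: `msg_send`, `msg_recv`, `mailbox`, and the size limits are invented names and values, not the API of any real message passing library. The global array stands in for the network and the system buffers.

```c
#include <stddef.h>
#include <string.h>

#define MAX_MSGS  8    /* assumed mailbox capacity */
#define MAX_BYTES 64   /* assumed maximum payload size */

struct message {
    int src, dst, tag;
    size_t len;
    char data[MAX_BYTES];
    int used;
};

static struct message mailbox[MAX_MSGS];  /* stands in for network + system buffers */

/* Send: copy len bytes from buf into a system buffer addressed to process dst. */
int msg_send(int src, int dst, int tag, const void *buf, size_t len)
{
    if (len > MAX_BYTES) return -1;
    for (int i = 0; i < MAX_MSGS; i++) {
        if (!mailbox[i].used) {
            mailbox[i].src = src;
            mailbox[i].dst = dst;
            mailbox[i].tag = tag;
            mailbox[i].len = len;
            memcpy(mailbox[i].data, buf, len);  /* first copy: user to system */
            mailbox[i].used = 1;
            return 0;
        }
    }
    return -1;  /* no free buffer */
}

/* Receive: find a message for `me` whose sender and tag match, then copy it out. */
int msg_recv(int me, int src, int tag, void *buf, size_t cap)
{
    for (int i = 0; i < MAX_MSGS; i++) {
        struct message *m = &mailbox[i];
        if (m->used && m->dst == me && m->src == src && m->tag == tag) {
            if (m->len > cap) return -1;
            memcpy(buf, m->data, m->len);       /* second copy: system to user */
            m->used = 0;
            return (int)m->len;
        }
    }
    return -1;  /* nothing matches yet */
}
```

Because a receive matches on sender and tag rather than on arrival order, process Q can pick up the message tagged 2 before the one tagged 1. The two `memcpy` calls and the buffer bookkeeping are small-scale versions of the copying and buffer management overheads listed on the slide.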

Message Passing Architectures
- Physical organization
  - Shared memory system
    - Uniform memory access (UMA)
    - Non-uniform memory access (NUMA)
  - Distributed memory system
    - Cluster of shared memory systems
[Figure: UMA system, NUMA system, distributed memory system]

Example: IBM Blue Gene/L
- Nodes: 2 PowerPC 400s
- Everything (except DRAM) on one chip

Example: IBM SP-2
- Made out of essentially complete RS6000 workstations
- Network interface integrated in the I/O bus
  - Bandwidth limited by the I/O bus
[Figure: IBM SP-2 node: Power 2 CPU with L2 cache on the memory bus; memory controller to 4-way interleaved DRAM; MicroChannel I/O bus with a NIC containing an i860, DMA, NI, and DRAM. Nodes are connected by a general interconnection network formed from 8-port switches.]

Taxonomy of Common Systems
- Large-scale shared address space and message passing systems
- Large multiprocessors
  - Shared address space
    - Symmetric shared memory (SMP): Ex) IBM eserver, SUN Sunfire
    - Distributed shared memory (DSM)
      - Cache coherent (ccNUMA): Ex) SGI Origin/Altix
      - Non-cache coherent: Ex) Cray T3E, X1
  - Distributed address space (aka message passing)
    - Commodity clusters: Ex) Beowulf
    - Custom clusters
      - Uniform cluster: Ex) IBM Blue Gene
      - Constellation cluster of DSMs or SMPs: Ex) SGI Altix, ASC Purple

Parallel Programming Models
- Shared Address Space
- Message Passing
- Data Parallel

Data Parallel Systems
- Programming model
  - Operations performed in parallel on each element of a data structure
  - Logically single thread of control
  - Alternate sequential steps and parallel steps
- Architectural model
  - Array of many simple, cheap processors with little memory each
  - Attached to a control processor that issues instructions
  - Cheap global synchronization
  - Centralize high cost of instruction fetch & sequencing
- Perfect fit for differential equation solvers
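The alternation of sequential and parallel steps can be sketched with a Jacobi-style averaging sweep on a 1-D grid, a common building block of simple differential equation solvers. This particular stencil is chosen here as an assumed example and is not taken from the lecture; `jacobi_1d` is a hypothetical name. The sketch assumes an OpenMP-capable compiler, and without OpenMP the pragmas are ignored and the sweeps run serially.

```c
#include <stddef.h>

/*
 * Run `steps` Jacobi sweeps on a 1-D grid u[0..n-1] with fixed end points.
 * tmp is scratch space of the same length. The result ends up in u.
 */
void jacobi_1d(double *u, double *tmp, size_t n, int steps)
{
    if (n < 3) return;  /* no interior points to update */

    for (int s = 0; s < steps; s++) {  /* sequential step: single control loop */
        /* Parallel step: every interior point is updated independently. */
        #pragma omp parallel for
        for (long i = 1; i < (long)n - 1; i++)
            tmp[i] = 0.5 * (u[i - 1] + u[i + 1]);
        /* The end of the parallel loop acts as a global synchronization point. */

        /* Second parallel step: copy the new values back into u. */
        #pragma omp parallel for
        for (long i = 1; i < (long)n - 1; i++)
            u[i] = tmp[i];
    }
}
```

The outer loop plays the role of the single logical thread of control, and each inner loop applies the same operation to every element. Writing into `tmp` first keeps each sweep from reading values that the same sweep has already overwritten.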

Evolution and Convergence
- Architectures converge to SAS/DAS (shared/distributed address space) architecture
  - Rigid control structure is a minus for general-purpose use
  - Simple, regular applications have good locality and can do well anyway
  - Loss of applicability due to hardwired data parallelism
- Programming model converges with SPMD
  - Single Program Multiple Data (SPMD)
  - Contributes the need for fast global synchronization
  - Can be implemented on either shared address space or message passing systems
  - Same program on different PEs, behavior conditional on thread ID
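The last point, the same program on different PEs with behavior conditional on an ID, can be sketched as below. This is a minimal sketch under stated assumptions: `spmd_body` and `spmd_sum` are hypothetical names, the PEs are modeled as iterations of an OpenMP loop on a shared address space, and the block partitioning is just one possible choice. Without OpenMP the pragma is ignored and the logical PEs run one after another.

```c
#include <stddef.h>

/* The one program every PE runs; id selects which block of data it works on. */
static void spmd_body(int id, int nprocs, const double *data, size_t n,
                      double *partial)
{
    size_t chunk = n / (size_t)nprocs;
    size_t lo = (size_t)id * chunk;
    size_t hi = lo + chunk;

    if (id == nprocs - 1)  /* behavior conditional on ID: last PE takes leftovers */
        hi = n;

    double sum = 0.0;
    for (size_t i = lo; i < hi; i++)
        sum += data[i];
    partial[id] = sum;
}

/* Launch nprocs logical PEs, then combine their partial sums. */
double spmd_sum(const double *data, size_t n, int nprocs, double *partial)
{
    /* With OpenMP enabled, the logical PEs run on different threads. */
    #pragma omp parallel for
    for (int id = 0; id < nprocs; id++)
        spmd_body(id, nprocs, data, n, partial);
    /* Global synchronization point: all partial sums are complete here. */

    double total = 0.0;
    for (int id = 0; id < nprocs; id++)
        total += partial[id];
    return total;
}
```

On a message passing system the same structure would apply, with the ID supplied by the runtime and the final combination done by sending the partial sums to one process instead of reading a shared array.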