Parallel Computing Platforms
Jinkyu Jeong (jinkyu@skku.edu)
Computer Systems Laboratory, Sungkyunkwan University
http://csl.skku.edu
SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)
Elements of a Parallel Computer
  - Hardware
      - Multiple processors
      - Multiple memories
      - Interconnection network
  - System software
      - Parallel operating system
      - Programming constructs to express/orchestrate concurrency
  - Application software
      - Parallel algorithms
  - Goal: utilize the hardware, system, and application software to
      - Achieve speedup: Tp = Ts/p (ideal case: p processors cut the serial time Ts by a factor of p)
      - Solve problems requiring a large amount of memory
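As a worked instance of the ideal speedup relation Tp = Ts/p, here is a small calculation. The serial time of 120 s and the processor count of 8 are made-up numbers chosen only for illustration.

```latex
\[
S = \frac{T_s}{T_p}, \qquad
T_p = \frac{T_s}{p} = \frac{120\,\mathrm{s}}{8} = 15\,\mathrm{s}
\quad\Rightarrow\quad
S = \frac{120\,\mathrm{s}}{15\,\mathrm{s}} = 8 = p .
\]
```

This is the best case. Real programs usually fall short of it because of communication and synchronization overheads.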
Parallel Computing Platform
  - Logical organization
      - The user's view of the machine as presented by its system software
  - Physical organization
      - The actual hardware architecture
  - The physical architecture is to a large extent independent of the logical architecture
      - Ex) message passing on a shared memory architecture, distributed shared memory system
Logical Organization Elements
  - Control mechanism: Flynn's taxonomy
      - SISD: Single Instruction stream, Single Data stream (single-core processor)
      - MISD: Multiple Instruction stream, Single Data stream (not covered)
      - SIMD: Single Instruction stream, Multiple Data stream
      - MIMD: Multiple Instruction stream, Multiple Data stream (multi-core processor)
SIMD vs. MIMD
  [Figure: block diagrams of a SIMD architecture and a MIMD architecture]
SIMD
  - Exploit data parallelism
      - The same instruction on multiple data items

      for (i=0; i<n; i++)
          a[i] = b[i] + c[i];

  [Figure: arrays b, c, and a laid out in memory with 16-byte boundaries; the SIMD unit loads b0..b3 into vr1 and c0..c3 into vr2, adds them lane by lane (b0+c0, b1+c1, b2+c2, b3+c3) into vr3, and stores the result to a0..a3]
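The loop on this slide can be sketched with x86 SSE intrinsics, which process four floats (16 bytes) per instruction. This is a minimal sketch, not a tuned implementation: the function name vec_add is hypothetical, unaligned loads are used so the arrays do not need to sit on 16-byte boundaries, and a scalar fallback is taken when the compiler does not define __SSE__.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* Hypothetical helper: a[i] = b[i] + c[i] for 0 <= i < n. */
void vec_add(float *a, const float *b, const float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    size_t i = 0;
#if defined(__SSE__)
    /* One SIMD add handles four floats (16 bytes) at a time. */
    for (; i + 4 <= n; i += 4) {
        __m128 vr1 = _mm_loadu_ps(&b[i]);   /* b[i..i+3] */
        __m128 vr2 = _mm_loadu_ps(&c[i]);   /* c[i..i+3] */
        __m128 vr3 = _mm_add_ps(vr1, vr2);  /* four lane-wise sums */
        _mm_storeu_ps(&a[i], vr3);          /* a[i..i+3] */
    }
#endif
    /* Scalar loop: leftover elements, or everything when SSE is unavailable. */
    for (; i < n; i++)
        a[i] = b[i] + c[i];
}
```

Calling vec_add(a, b, c, n) gives the same result as the scalar loop. With aligned data, the aligned load/store variants (_mm_load_ps, _mm_store_ps) could be used instead.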
SIMD
  - Exploit data parallelism
      - The same instruction on multiple data items
  - SIMD units in processors
      - Supercomputers: BlueGene/Q
      - PC: MMX/SSE/AVX (x86), AltiVec/VMX (PowerPC)
      - Embedded systems: Neon (ARM), VLIW+SIMD DSPs
      - Co-processors: GPGPUs
MIMD
  - Multiple instructions on multiple data items
      - A collection of independent processing elements (or cores)
      - Usually exploits thread-level parallelism
  - Modern parallel computing platforms
      - E.g., multicore processors
      - SIMD can also be used on such systems (each core may have its own SIMD unit)
Programming Model
  - What the programmer uses in coding applications
  - Specifies communication and synchronization
      - Instructions, APIs, defined data structures
  - Programming model examples
      - Shared address space
          - Load/store instructions to access the data for communication
      - Message passing
          - Special system library, APIs for data transmission
      - Data parallel
          - Well-structured data, same operation applied to multiple data items in parallel
          - Implemented with shared address space or message passing
Shared Address Space Architecture
  - Shared address space
      - Any processor can directly reference any memory location
      - Communication occurs implicitly as a result of loads and stores
  - Location transparency (flat address space)
  - Similar programming model to time-sharing on uniprocessors
      - Except processes run on different processors
      - Good throughput on multi-programmed workloads
  - Popularly known as the shared memory machine/model
      - Memory may be physically distributed among processors
Shared Address Space Architecture
  - Multi-processing
      - One or more threads on a virtual address space
      - Portions of the address spaces of processes are shared
      - Writes to a shared address are visible to other threads/processes
  - Natural extension of the uniprocessor model
      - Conventional memory operations for communication
      - Special atomic operations for synchronization
  [Figure: virtual address spaces for a collection of processes communicating via shared addresses, mapped onto the machine physical address space]
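The two mechanisms named on this slide, ordinary memory operations for communication and atomic operations for synchronization, can be sketched with POSIX threads and C11 atomics. This is an illustrative sketch only: the names shared_counter, worker, and run_shared_counter, and the constants NUM_THREADS and INCREMENTS, are all hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NUM_THREADS 4       /* hypothetical thread count */
#define INCREMENTS  10000   /* hypothetical per-thread work */

/* Lives in the single address space that all threads share. */
static atomic_long shared_counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++)
        atomic_fetch_add(&shared_counter, 1);  /* atomic read-modify-write */
    return NULL;
}

/* Hypothetical driver: starts the threads and returns the final count. */
long run_shared_counter(void)
{
    pthread_t threads[NUM_THREADS];
    atomic_store(&shared_counter, 0);
    for (int t = 0; t < NUM_THREADS; t++) {
        int rc = pthread_create(&threads[t], NULL, worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
    /* Every thread's writes to the shared address are visible here. */
    return atomic_load(&shared_counter);
}
```

No data is sent explicitly: each thread simply updates the same memory location. Replacing atomic_fetch_add with a plain increment on a non-atomic variable would introduce a data race and could lose updates.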
x86 Examples: Shared Address Space
  - Quad-core processors
  - Highly integrated, commodity systems
  - Multiple cores on a chip
      - Low-latency, high-bandwidth communication via shared cache
  [Figure: multi-core chip diagrams with cores sharing an L2 or L3 cache; labels: Intel i7 (Nehalem), AMD Phenom II (Barcelona)]
Earlier x86 Example
  - Intel Pentium Pro Quad
      - All coherence and multiprocessing glue in processor module
      - High latency and low bandwidth
  [Figure: P-Pro modules (each with CPU, interrupt controller, 256-KB L2 $, and bus interface) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), together with two PCI bridges (PCI buses with PCI I/O cards) and a memory controller with MIU to 1-, 2-, or 4-way interleaved DRAM]
Shared Address Space Architecture
  - Physical organization
      - Shared memory system
          - Uniform memory access (UMA)
          - Non-uniform memory access (NUMA)
      - Distributed memory system
          - Cluster of shared memory systems
          - Hardware- or software-based distributed shared memory (DSM)
  [Figure: UMA system, NUMA system, distributed memory system]
Scaling Up
  [Figure: "Dance Hall" organization (UMA), with processors and their caches on one side of the network and memories on the other; Distributed Memory organization (NUMA), where each processor has a cache and a local memory and the nodes are connected by the network]
  - Problem is the interconnect: cost (crossbar) or bandwidth (bus)
  - Shared memory (uniform memory access, UMA)
      - Latencies to memory are uniform, but uniformly large
  - Distributed memory (non-uniform memory access, NUMA)
      - Construct a shared address space out of simple message transactions across a general-purpose network
  - Cache: keeps shared data (local data, and non-local data in NUMA)
Example: SGI Altix UV 1000
  - Scales up to 262,144 cores
  - 16 TB shared memory
  - 15 GB/sec links
  - Multistage interconnection network
  - Hardware cache coherence
      - ccNUMA
Parallel Programming Models
  - Shared Address Space
  - Message Passing
  - Data Parallel
Message Passing Architectures
  - Message passing architectures
      - Complete computer as building block
      - Communication via explicit I/O operations
  - Programming model
      - Directly access only the private address space (local memory)
      - Communicate via explicit messages (send/receive)
  - High-level block diagram similar to a distributed-memory shared address space system
      - But communication is integrated at the I/O level, not the memory level
      - Easier to build than a scalable shared address space (SAS) system
Message Passing Abstraction
  [Figure: process P executes "Send X, Q, t" from address X in its local process address space; process Q executes "Receive Y, P, t" into address Y in its local process address space; the send and the receive are matched]
  - Message passing
      - Send specifies the buffer to be transmitted and the receiving process
      - Recv specifies the sending process and the buffer to receive into
  - Can be a memory-to-memory copy, but processes need to be named
  - Optional tag on send and matching rule on receive
  - Many overheads: copying, buffer management, protection
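The send/receive abstraction above can be sketched with two POSIX processes and a pipe. This is a minimal sketch under stated assumptions, not how a message passing library such as MPI is implemented: struct message, send_msg, recv_msg, and exchange are hypothetical names, and the pipe itself identifies the peer, so no explicit process name is passed.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical message format: a tag plus a small payload. */
struct message {
    int tag;
    int payload;
};

/* Send: copy the whole message into the channel. Returns 0 on success. */
static int send_msg(int fd, const struct message *m)
{
    return write(fd, m, sizeof *m) == (ssize_t)sizeof *m ? 0 : -1;
}

/* Receive: read one message and accept it only if the tag matches. */
static int recv_msg(int fd, int expected_tag, struct message *m)
{
    if (read(fd, m, sizeof *m) != (ssize_t)sizeof *m)
        return -1;
    return m->tag == expected_tag ? 0 : -1;
}

/* Process P (child) sends x with the given tag; process Q (parent)
   receives it into *y. Returns 0 on success, -1 on failure. */
int exchange(int x, int tag, int *y)
{
    assert(y != NULL);
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                 /* process P: sender, private address space */
        close(fds[0]);
        struct message out = { tag, x };
        int rc = send_msg(fds[1], &out);
        close(fds[1]);
        _exit(rc == 0 ? 0 : 1);
    }

    close(fds[1]);                  /* process Q: receiver */
    struct message in;
    int rc = recv_msg(fds[0], tag, &in);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    if (rc != 0)
        return -1;
    *y = in.payload;
    return 0;
}
```

After fork the two processes no longer share memory, so the value reaches Q only through the explicit write/read pair. The copies into and out of the kernel buffer are an instance of the copying overhead listed on the slide.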
Message Passing Architectures
  - Physical organization
      - Shared memory system
          - Uniform memory access (UMA)
          - Non-uniform memory access (NUMA)
      - Distributed memory system
          - Cluster of shared memory systems
  [Figure: UMA system, NUMA system, distributed memory system]
Example: IBM Blue Gene/L
  - Nodes: 2 PowerPC 440s
  - Everything (except DRAM) on one chip
Example: IBM SP-2
  - Made out of essentially complete RS6000 workstations
  - Network interface integrated in the I/O bus
      - Bandwidth limited by the I/O bus
  [Figure: IBM SP-2 node: Power 2 CPU with L2 $ on the memory bus; memory controller to 4-way interleaved DRAM; MicroChannel I/O bus with a NIC containing an i860, DMA, NI, and DRAM]
Taxonomy of Common Systems
  - Large-scale shared address space and message passing systems
  - Large multiprocessors
      - Shared address space
          - Symmetric shared memory (SMP)
              - Ex) IBM eserver, SUN Sunfire
          - Distributed shared memory (DSM)
              - Cache coherent (ccNUMA)
                  - Ex) SGI Origin/Altix
              - Non-cache coherent
                  - Ex) Cray T3E, X1
      - Distributed address space, aka message passing
          - Commodity clusters
              - Ex) Beowulf
          - Custom clusters
              - Uniform cluster
                  - Ex) IBM Blue Gene
              - Constellation: cluster of DSMs or SMPs
                  - Ex) SGI Altix, ASC Purple
Parallel Programming Models
  - Shared Address Space
  - Message Passing
  - Data Parallel
Data Parallel Systems
  - Programming model
      - Operations performed in parallel on each element of a data structure
      - Logically a single thread of control
      - Alternates sequential steps and parallel steps
  - Architectural model
      - Array of many simple, cheap processors with little memory each
      - Attached to a control processor that issues instructions
      - Cheap global synchronization
      - Centralizes the high cost of instruction fetch & sequencing
      - Perfect fit for differential equation solvers
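The alternation of sequential and parallel steps can be sketched with a Jacobi relaxation for a 1-D differential equation (u'' = 0 with fixed end values), chosen here only as a simple example of the equation solvers the slide mentions. The names relax_step and jacobi are hypothetical, and the parallel step is written as an ordinary loop: on a data parallel machine, every element update in it would run at the same time.

```c
#include <assert.h>
#include <stddef.h>

/* Parallel step: the same operation is applied to every interior element.
   Iterations are independent of each other, so each could run on its own
   processor. The two boundary values are copied unchanged. */
static void relax_step(const double *in, double *out, size_t n)
{
    out[0] = in[0];
    out[n - 1] = in[n - 1];
    for (size_t i = 1; i + 1 < n; i++)
        out[i] = 0.5 * (in[i - 1] + in[i + 1]);
}

/* Hypothetical driver. `u` holds the initial guess on entry and the result
   on return; `scratch` is a caller-provided work buffer of n doubles. */
void jacobi(double *u, double *scratch, size_t n, int steps)
{
    assert(u != NULL && scratch != NULL && n >= 2 && steps >= 0);
    double *cur = u, *next = scratch;
    for (int s = 0; s < steps; s++) {   /* sequential step: single control loop */
        relax_step(cur, next, n);       /* parallel step over all elements */
        double *tmp = cur;              /* swap buffers for the next round */
        cur = next;
        next = tmp;
    }
    if (cur != u)                       /* result ended up in scratch: copy back */
        for (size_t i = 0; i < n; i++)
            u[i] = cur[i];
}
```

The single loop in jacobi plays the role of the control processor: it issues one parallel step at a time, and the end of each step acts as the global synchronization point.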
Evolution and Convergence
  - Architectures converge to shared/distributed address space (SAS/DAS) architectures
      - Rigid control structure is a minus for general-purpose use
      - Simple, regular apps have good locality and can do well anyway
      - Loss of applicability due to hardwired data parallelism
  - Programming model converges with SPMD
      - Single Program Multiple Data (SPMD)
      - Contributes the need for fast global synchronization
      - Can be implemented on either shared address space or message passing systems
      - Same program on different PEs, behavior conditional on thread ID
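The SPMD idea, one program whose behavior depends on the thread ID, can be sketched on a shared address space with POSIX threads. This is only an illustrative sketch: struct pe_args, spmd_body, spmd_sum, and the constant NUM_PES are hypothetical names, and a parallel sum is used as a stand-in workload.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_PES 4  /* hypothetical number of processing elements (threads) */

struct pe_args {
    int id;            /* thread ID: the only input that differs between PEs */
    const int *data;   /* shared input array */
    size_t n;
    long partial;      /* result written by this PE */
};

/* The single program: every PE runs this and picks its slice from its ID. */
static void *spmd_body(void *p)
{
    struct pe_args *a = p;
    size_t chunk = (a->n + NUM_PES - 1) / NUM_PES;
    size_t lo = (size_t)a->id * chunk;
    size_t hi = lo + chunk < a->n ? lo + chunk : a->n;
    long sum = 0;
    for (size_t i = lo; i < hi; i++)   /* empty when lo >= n */
        sum += a->data[i];
    a->partial = sum;
    return NULL;
}

/* Hypothetical driver: sums n integers using NUM_PES threads. */
long spmd_sum(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    pthread_t tid[NUM_PES];
    struct pe_args args[NUM_PES];
    for (int id = 0; id < NUM_PES; id++) {
        args[id] = (struct pe_args){ id, data, n, 0 };
        int rc = pthread_create(&tid[id], NULL, spmd_body, &args[id]);
        assert(rc == 0);
        (void)rc;
    }
    long total = 0;
    for (int id = 0; id < NUM_PES; id++) {
        pthread_join(tid[id], NULL);   /* synchronization before combining */
        total += args[id].partial;
    }
    return total;
}
```

The same structure carries over to a message passing system: each process would compute its slice from its rank and send its partial sum to one process instead of writing it to shared memory.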