Parallel Computing Convergence of Parallel Architecture Hwansoo Han
History Parallel architectures tied closely to programming models Divergent architectures, with no predictable pattern of growth Uncertainty of direction paralyzed parallel software development [Figure: application software and system software layered atop divergent architectures — systolic arrays, dataflow, SIMD, shared memory, message passing] 2
Today Extension of computer architecture OLD: Instruction Set Architecture NEW: Communication Architecture Communication architecture Organizational structures which implement interfaces Can be implemented with HW or SW Compilers, libraries and OS are important bridges to communication architecture 3
Modern Layered Framework [Figure: layered framework — parallel applications (CAD, database, scientific modeling); programming models (multiprogramming, shared address, message passing, data parallel); compilation or library; communication abstraction (user/system boundary); operating systems support; communication hardware (hardware/software boundary); physical communication medium] 4
Programming Model What the programmer uses in coding applications Specifies communication and synchronization Instructions, APIs, defined data structures Programming model examples Shared address space Load/store instructions access the data for communication Message passing Special system libraries, APIs for data transmission Data parallel Well-structured data, same operation applied to multiple data elements in parallel Implemented with shared address space or message passing 5
Shared Address Space Architecture Shared address space Any processor can directly reference any memory location Communication occurs implicitly as result of loads and stores Location transparency (flat address space) Similar programming model to time-sharing on uniprocessors Except processes run on different processors Good throughput on multiprogrammed workloads Popularly known as shared memory machine/model Memory may be physically distributed among processors 6
Shared Address Space Architecture Multi-Processing One or more threads in a virtual address space Portions of the address spaces of processes are shared Writes to shared addresses visible to other threads/processes Natural extension of the uniprocessor model Conventional memory operations for communication Special atomic operations for synchronization Virtual address spaces of a collection of processes communicating via shared addresses map onto a single machine physical address space 7
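The shared address space model above can be sketched with Python threads: all threads see the same memory, ordinary loads and stores communicate implicitly, and an atomic primitive (here a lock) provides the synchronization. A minimal sketch; the names `counter` and `worker` are illustrative, not from the slides.

```python
# Sketch of the shared address space model: threads share one address
# space, communicate via plain loads/stores, and synchronize with an
# atomic primitive (the lock stands in for special atomic operations).
import threading

counter = 0                 # shared location, visible to every thread
lock = threading.Lock()     # synchronization primitive

def worker(n):
    global counter
    for _ in range(n):
        with lock:          # make the read-modify-write atomic
            counter += 1    # communicate implicitly via a store

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)              # 4 threads x 10,000 increments each
```

Without the lock, the increments of the four threads could interleave and lose updates, which is exactly why the model needs special atomic operations alongside ordinary memory accesses.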
x86 Examples Shared Address Space Quad-core processors Highly integrated, commodity systems Multiple cores on a chip Low-latency, high-bandwidth communication via shared cache [Figure: quad-core dies with a shared on-chip cache — Intel i7 (Nehalem) with shared L3 cache, AMD Phenom II (Barcelona) with shared L2 cache] 8
Earlier x86 Example Intel Pentium Pro Quad All coherence and multiprocessing glue in the processor module High latency and low bandwidth [Figure: four P-Pro modules (CPU, interrupt controller, 256-KB L2 cache, bus interface) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz); memory controller and MIU to 1-, 2-, or 4-way interleaved DRAM; PCI bridges to PCI I/O cards] 9
Example: Sun SPARC Enterprise M9000 64 SPARC64 VII+ quad-core processors (256 cores) Crossbar bandwidth: 245 GB/sec (snoop bandwidth) Memory latency: 437~532 ns (1050~1277 cycles @ 2.4 GHz) Higher bandwidth but higher latency 10
Scaling Up [Figure: 'dance hall' organization (UMA) — processors with caches on one side of the network, memory modules on the other; distributed memory (NUMA) — a memory module attached to each processor node] Problem is interconnect: cost (crossbar) or bandwidth (bus) Shared memory (uniform memory access, UMA) Latencies to memory uniform, but uniformly large Distributed memory (non-uniform memory access, NUMA) Construct shared address space out of simple message transactions across a general-purpose network Cache keeps shared data (local, and non-local data in NUMA) 11
Example: SGI Altix UV 1000 Scales up to 262,144 cores 16 TB shared memory 15 GB/sec links Multistage interconnection network Hardware cache coherence (ccNUMA) 12
Parallel Programming Models Shared Address Space Message Passing Data Parallel Dataflow Systolic Arrays 13
Message Passing Architectures Message passing architectures Complete computer as building block Communication via explicit I/O operations Programming model Directly access only private address space (local memory) Communicate via explicit messages (send/receive) High-level block diagram similar to distributed-memory SAS But communication integrated at the I/O level, not the memory level Easier to build than scalable SAS 14
Message Passing Abstraction Match Receive Y, P, t Send X, Q, t Addr ess Y Addr ess X Local pr ocess addr ess space Local pr ocess addr ess space Pr ocess P Pr ocess Q Message passing Send specifies buffer to be transmitted and receiving process Recv specifies sending process and buffer to receive Can be memory to memory copy, but need to name processes Optional tag on send and matching rule on receive Many overheads: copying, buffer management, protection 15
Example: IBM Blue Gene/L Nodes: 2 PowerPC 440s Everything (except DRAM) on one chip 16
Example: IBM SP-2 Made out of essentially complete RS/6000 workstations Network interface integrated on the I/O bus BW limited by the I/O bus [Figure: SP-2 node — POWER2 CPU, L2 cache, memory bus, memory controller, 4-way interleaved DRAM; MicroChannel bus with DMA, I/O, and an i860-based NIC attached to a general interconnection network formed from 8-port switches] 17
Taxonomy of Common Systems Large-scale SAS and MP systems Shared address space Large multiprocessors Symmetric shared memory (SMP) Ex) IBM eServer, Sun Fire Distributed shared memory (DSM) Cache coherent (ccNUMA) Ex) SGI Origin/Altix Non-cache coherent Ex) Cray T3E, X1 Distributed address space aka message passing Commodity clusters Ex) Beowulf Custom clusters Uniform cluster Ex) IBM Blue Gene Constellation (cluster of DSMs or SMPs) Ex) SGI Altix, ASC Purple 18
Parallel Programming Models Shared Address Space Message Passing Data Parallel Dataflow Systolic Arrays 19
Data Parallel Systems Programming model Operations performed in parallel on each element of data structure Logically single thread of control Alternate sequential steps and parallel steps Architectural model Array of many simple, cheap processors with little memory each Attached to a control processor that issues instructions Cheap global synchronization Centralize high cost of instruction fetch & sequencing Perfect fit for differential equation solver 20
Evolution and Convergence Architectures converged to SAS/DAS architectures Rigid control structure is a minus for general purpose Simple, regular apps have good locality, can do well anyway Loss of applicability due to hardwired data parallelism Programming model converges with SPMD Single program multiple data (SPMD) Contributes need for fast global synchronization Can be implemented on either SAS or MP 21
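The SPMD idea above can be sketched in a few lines: every "processor" runs the same program body on its own partition of the data, then a join acts as the global synchronization point before a sequential reduction step. A sketch only; `NPROCS`, `spmd_body`, and the data are illustrative.

```python
# SPMD sketch: one program, multiple data partitions. Each rank runs
# the identical body on its own slice; the pool join is the global
# synchronization between the parallel phase and the sequential phase.
from concurrent.futures import ThreadPoolExecutor

data = list(range(16))
NPROCS = 4

def spmd_body(rank):
    # every rank executes this same code on its own contiguous slice
    chunk = len(data) // NPROCS
    lo, hi = rank * chunk, (rank + 1) * chunk
    return sum(x * x for x in data[lo:hi])

with ThreadPoolExecutor(max_workers=NPROCS) as pool:
    partials = list(pool.map(spmd_body, range(NPROCS)))  # parallel step

total = sum(partials)    # sequential reduction after the barrier
print(total)             # sum of squares of 0..15
```

The same pattern maps onto shared address space (threads, as here) or message passing (each rank sends its partial sum to a root), which is the convergence point the slide makes.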
Parallel Programming Models Shared Address Space Message Passing Data Parallel Dataflow Systolic Arrays 22
Dataflow Architectures Dataflow architecture Represent computation as a graph of essential dependences Logical processor at each node, activated by availability of operands Messages (tokens) carrying tag of next instruction sent to next processor [Figure: dataflow graph for a = (b+1)(b−c), d = c·e, f = a·d; per-processor pipeline — token queue, waiting/matching against the token store, program store, instruction fetch, execute, form token, network] 23
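The firing rule can be sketched for the slide's example graph, a = (b+1)(b−c), d = c·e, f = a·d: a node executes as soon as all of its operand tokens are present, with no fixed program order. The function `run_dataflow` and the temporary names `t1`/`t2` are illustrative.

```python
# Dataflow firing-rule sketch: nodes wait until every operand token has
# arrived, then fire, emitting a result token that may enable others.
from collections import deque

def run_dataflow(b, c, e):
    values = {"b": b, "c": c, "e": e}              # initial tokens
    # each node: (output name, operation, operand names)
    nodes = [
        ("t1", lambda v: v["b"] + 1,        ("b",)),
        ("t2", lambda v: v["b"] - v["c"],   ("b", "c")),
        ("a",  lambda v: v["t1"] * v["t2"], ("t1", "t2")),
        ("d",  lambda v: v["c"] * v["e"],   ("c", "e")),
        ("f",  lambda v: v["a"] * v["d"],   ("a", "d")),
    ]
    pending = deque(nodes)
    while pending:
        out, op, ins = pending.popleft()
        if all(i in values for i in ins):   # all operand tokens present?
            values[out] = op(values)        # fire: execute, emit token
        else:
            pending.append((out, op, ins))  # keep waiting for operands
    return values["f"]

print(run_dataflow(b=3, c=1, e=2))  # a = 4*2 = 8, d = 2, f = 8*2 = 16
```

In hardware the `values` dictionary corresponds to the token store, and the availability check is the waiting/matching stage of the pipeline in the figure.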
Systolic Architecture VLSI enables inexpensive special-purpose chips Represent algorithms directly by chips connected in regular pattern Replace single processor with array of regular processing elements Orchestrate data flow for high throughput with less memory access [Figure: systolic array for 1D convolution y(i) = w1·x(i) + w2·x(i+1) + w3·x(i+2) + w4·x(i+3); x values stream past cells holding weights w1..w4, and each cell computes x_out = x; x = x_in; y_out = y_in + w·x_in] 24
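The convolution the array computes can be written directly as a reference: y(i) = w1·x(i) + w2·x(i+1) + w3·x(i+2) + w4·x(i+3). The systolic array produces the same values by streaming x past cells that each apply y_out = y_in + w·x_in; this plain loop, with an illustrative input stream, is the result the array must match.

```python
# Reference for the slide's 1D convolution (0-indexed here):
# y[i] = sum over j of w[j] * x[i + j], one output per window position.
def conv1d(w, x):
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]

w = [1, 2, 3, 4]                  # the four cell weights w1..w4
x = [1, 0, 2, 0, 1, 3, 0, 1]     # an example input stream x(1)..x(8)
print(conv1d(w, x))
```

The systolic version reads each x value from memory once and reuses it across all four cells as it flows through the array, which is the "high throughput with less memory access" point of the slide.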
Generic Parallel Architecture Convergence to a generic parallel multiprocessor Node: processor(s), memory system, communication assist Scalable network Convergence allows lots of innovation Integration of communication Efficient operation across nodes [Figure: nodes — processor, cache, memory, communication assist (CA) — connected by a scalable network] 25