Learning Curve for Parallel Applications  [11]
- AMBER molecular dynamics simulation program
- Starting point was vector code for the Cray-1
- 145 MFLOPS on the Cray C90; 406 MFLOPS for the final version on a 128-processor Paragon; 891 MFLOPS on a 128-processor Cray T3D

500 Fastest Computers  [33]
- [Bar chart: number of systems of each architecture class in the 500-fastest list at 11/93, 11/94, 11/95, and 11/96; MPP counts grow while vector (PVP) counts decline]
Shared Address Space Model  [44]
- Process: virtual address space plus one or more threads of control
- Portions of the address spaces of processes are shared
- [Figure: virtual address spaces for a collection of processes P0..Pn communicating via shared addresses; each space has a private portion and a shared portion mapped to common physical addresses, accessed by ordinary loads and stores]
- Writes to shared addresses are visible to other threads (in other processes too)
- Natural extension of the uniprocessor model: conventional memory operations for communication; special atomic operations for synchronization
- OS uses shared memory to coordinate processes

Communication Hardware  [45]
- Also a natural extension of the uniprocessor
- Already have processor, one or more memory modules, and I/O controllers connected by some hardware interconnect
- [Figure: processors, memory modules, and I/O controllers attached to a common interconnect]
- Memory capacity increased by adding modules, I/O by adding controllers
- Add processors for processing!
- For higher-throughput multiprogramming, or for parallel programs
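The shared-address-space model above can be sketched in a few lines. This is an illustrative example, not from the slides: two threads communicate through an ordinary shared variable, and a lock stands in for the "special atomic operations" used for synchronization.

```python
# Sketch of the shared address space model: threads in one process
# communicate via ordinary loads/stores to shared data; a lock
# provides the atomic operation needed for synchronization.
import threading

counter = 0                      # lives in the shared portion of the address space
lock = threading.Lock()          # atomic synchronization primitive

def worker(n):
    global counter
    for _ in range(n):
        with lock:               # atomic read-modify-write
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                   # 40000: one thread's writes are visible to the others
```

Without the lock, the four increments race on the shared location and the final count is generally below 40000, which is exactly why the model pairs conventional memory operations with atomic synchronization primitives.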
History  [46]
- Mainframe approach
  - Motivated by multiprogramming
  - Extends the crossbar used for memory bandwidth and I/O
  - Originally limited to small configurations by processor cost; later by the cost of the crossbar
  - Bandwidth scales with p
  - High incremental cost; use multistage networks instead
- Minicomputer approach
  - Almost all microprocessor systems have a bus
  - Motivated by multiprogramming and transaction processing (TP)
  - Used heavily for parallel computing
  - Called symmetric multiprocessor (SMP)
  - Latency larger than for a uniprocessor
  - Bus is the bandwidth bottleneck: caching is key, which raises the coherence problem
  - Low incremental cost

Example: Intel Pentium Pro Quad  [47]
- [Figure: four P-Pro modules, each with CPU, interrupt controller, 256-KB L2, and bus interface, on a shared P-Pro bus (64-bit data, 36-bit address, 66 MHz); two PCI bridges to PCI buses and cards; memory controller and MIU to 1-, 2-, or 4-way interleaved DRAM]
- All coherence and multiprocessing glue in the processor module
- Highly integrated, targeted at high volume
- Low latency and bandwidth
Example: SUN Enterprise  [48]
- [Figure: CPU/memory cards (two processors with L2 caches, memory controller, memory) and I/O cards (bus interface, 100bT, SCSI, three SBUS slots, 2 FiberChannel) on the Gigaplane bus (256-bit data, 41-bit address, 83 MHz)]
- 16 cards of either type: processors + memory, or I/O
- All memory accessed over the bus, so symmetric
- Higher-bandwidth, higher-latency bus

Scaling Up  [49]
- [Figure: "dance hall" organization (processors on one side of the network, memory modules on the other) vs. distributed memory (a memory module attached to each processor)]
- Problem is the interconnect: cost (crossbar) or bandwidth (bus)
- Dance hall: bandwidth still scalable, at lower cost than a crossbar
  - Latencies to memory uniform, but uniformly large
- Distributed memory, or non-uniform memory access (NUMA)
  - Construct a shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)
- Caching shared (particularly nonlocal) data?
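The NUMA idea above, building a shared address space out of read-request/read-response message transactions, can be mimicked in software. A minimal sketch, with hypothetical names (`home_node`, `remote_load`, and the queues are illustrative, not any real machine's interface):

```python
# Sketch: a "remote load" in a distributed-memory (NUMA) machine,
# built from the read-request / read-response transactions named in
# the slide. Threads stand in for nodes; queues stand in for the network.
import queue
import threading

home_memory = {0x100: 42}                 # memory local to the "home" node
request_q, response_q = queue.Queue(), queue.Queue()

def home_node():
    # The home node's memory controller services one read-request.
    op, addr = request_q.get()
    if op == "read-request":
        response_q.put(("read-response", addr, home_memory[addr]))

def remote_load(addr):
    # On the requesting node, a nonlocal reference becomes a
    # request/response message exchange instead of a local load.
    request_q.put(("read-request", addr))
    _, _, value = response_q.get()
    return value

t = threading.Thread(target=home_node)
t.start()
value = remote_load(0x100)
t.join()
print(value)                              # 42
```

On a machine like the Cray T3E this exchange is generated by the memory controller in hardware, which is why the latency of a remote reference differs from a local one (the "non-uniform" in NUMA).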
Example: Cray T3E  [50]
- [Figure: node with processor and cache, memory, a combined memory controller and network interface, and external I/O, attached to the network via an X/Y/Z switch]
- Scales up to 1024 processors, 480 MB/s links
- Memory controller generates a communication request for nonlocal references
- No hardware mechanism for coherence (SGI Origin etc. provide this)

Message Passing Architectures  [51]
- Complete computer as building block, including I/O
  - Communication via explicit I/O operations
- Programming model: directly access only the private address space (local memory); communicate via explicit messages (send/receive)
- High-level block diagram similar to distributed-memory SAS
  - But communication is integrated at the I/O level; needn't be integrated into the memory system
  - Like networks of workstations (clusters), but tighter integration
  - Easier to build than scalable SAS
- Programming model further removed from basic hardware operations
  - Library or OS intervention
Message-Passing Abstraction  [52]
- [Figure: process P executes "Send X, Q, t" from local address X; process Q executes a matching "Receive Y, P, t" into local address Y; each process has its own local address space]
- Send specifies the buffer to be transmitted and the receiving process
- Recv specifies the sending process and the application storage to receive into
- Memory-to-memory copy, but need to name processes
- Optional tag on send and matching rule on receive
- User process names local data and entities in process/tag space too
- In simplest form, the send/recv match achieves a pairwise synchronization event
  - Other variants too
- Many overheads: copying, buffer management, protection

Evolution of Message-Passing Machines  [53]
- [Figure: 3-cube of nodes labeled 000 through 111]
- Early machines: FIFO on each link
  - Hardware close to the programming model; synchronous ops
  - Replaced by DMA, enabling non-blocking ops
  - Buffered by system at destination until recv
- Diminishing role of topology
  - Store-and-forward routing: topology important
  - Introduction of pipelined routing made it less so
  - Cost is in the node-network interface
  - Simplifies programming
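The send/recv matching rule from the slide can be sketched as follows. All names here (`send`, `recv`, `mailbox`) are illustrative, not a real library's API: send names the destination process and a tag; recv names the expected sender and tag, and only a matching message is delivered.

```python
# Sketch of the message-passing abstraction: per-process mailboxes,
# a tag on send, and a (sender, tag) matching rule on receive.
import queue
import threading

mailbox = {"P": queue.Queue(), "Q": queue.Queue()}   # one mailbox per process

def send(dest, src, tag, data):
    # Memory-to-memory copy into the destination's space.
    mailbox[dest].put((src, tag, data))

def recv(me, src, tag):
    # Deliver only a message matching (src, tag); set others aside.
    deferred = []
    while True:
        msg = mailbox[me].get()
        if (msg[0], msg[1]) == (src, tag):
            for m in deferred:                       # requeue non-matches
                mailbox[me].put(m)
            return msg[2]
        deferred.append(msg)

def process_q():
    send("P", "Q", tag=7, data=[1, 2, 3])            # "Send X, Q, t"

t = threading.Thread(target=process_q)
t.start()
x = recv("P", src="Q", tag=7)                        # "Receive Y, P, t"
t.join()
print(x)                                             # [1, 2, 3]
```

The blocking `get` inside recv is what makes the simplest send/recv pair a pairwise synchronization event: the receiver cannot proceed until the matching send has happened.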
Example: IBM SP-2  [54]
- [Figure: SP-2 node (POWER2 CPU, L2, memory bus, memory controller, 4-way interleaved DRAM) with a network interface card (i860, DMA, NI, DRAM) on the MicroChannel I/O bus; general interconnection network formed from 8-port switches]
- Made out of essentially complete RS/6000 workstations
- Network interface integrated on the I/O bus (bandwidth limited by the I/O bus)

Example: Intel Paragon  [55]
- [Figure: Paragon node with two i860 processors (each with an L1 cache) on a 64-bit, 50 MHz memory bus, memory controller, 4-way interleaved DRAM, and a DMA/driver/NI block; photo of Sandia's Intel Paragon XP/S-based supercomputer]
- 2D grid network with a processing node attached to every switch
- Links 8 bits wide, 175 MHz, bidirectional
Toward Architectural Convergence  [56]
- Evolution and the role of software have blurred the boundary
  - Send/recv supported on SAS machines via buffers
  - Can construct a global address space on message-passing machines using hashing
  - Page-based (or finer-grained) shared virtual memory
- Hardware organization converging too
  - Tighter NI integration even for message passing (low-latency, high-bandwidth)
  - At the lowest level, even hardware SAS passes hardware messages
- Even clusters of workstations/SMPs are parallel systems
  - Emergence of fast system area networks (SANs)
- Programming models remain distinct, but organizations are converging
  - Nodes connected by a general network and communication assists
  - Implementations also converging, at least in high-end machines

Convergence: Generic Parallel Architecture  [64]
- A generic modern multiprocessor
- [Figure: nodes, each with processor(s), cache, memory, and a communication assist (CA), connected by a scalable network]
- Node: processor(s), memory system, plus communication assist
  - Network interface and communication controller
- Scalable network
- Convergence allows lots of innovation, now within a common framework
  - Integration of the assist with the node, what operations it supports, how efficiently...
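The first convergence point above, send/recv layered on shared memory via buffers, is easy to illustrate. A minimal sketch with illustrative names, where a shared bounded buffer plays the role of the system message buffer:

```python
# Sketch: message passing implemented on top of shared memory.
# The bounded queue is the shared buffer; send copies in, recv copies out.
import queue
import threading

channel = queue.Queue(maxsize=4)   # shared-memory buffer between "nodes"

def send(data):
    channel.put(data)              # copy into the shared buffer (blocks when full)

def recv():
    return channel.get()           # copy out on the receiving side (blocks when empty)

def producer():
    for i in range(3):
        send(i)

t = threading.Thread(target=producer)
t.start()
received = [recv() for _ in range(3)]
t.join()
print(received)                    # [0, 1, 2]
```

The symmetric construction, building a shared address space over messages, appeared in the NUMA discussion; together they show why the two programming models can share one generic hardware organization.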