CS 775 - Parallel Algorithms in Scientific Computing
Parallel Architectures
January 2, 2004 - Lecture 2

References
Parallel Computer Architecture: A Hardware/Software Approach, Culler, Singh, and Gupta, Morgan Kaufmann.
Introduction to Parallel Computing: Design and Analysis of Algorithms, Kumar, Grama, Gupta, and Karypis, Benjamin Cummings.
Performance goals
(figure)

Microprocessor performance
(figure)
What is a Parallel Computer?
Almasi and Gottlieb (1989): a parallel computer is a "collection of processing elements that communicate and cooperate to solve large problems fast".

Why parallel architecture?
Parallelism adds a new dimension to the design space: the number of processors.
In principle, higher performance is achieved by using more processors.
How much additional performance is gained, and at what additional cost, depends on several factors.

Questions
How large is the collection?
How powerful are the individual processing elements (PEs)?
Can the number of PEs be increased in a straightforward manner?
How do the PEs communicate and cooperate?
How is data transmitted between PEs?
What interconnection topology is used?
Taxonomy of Parallel Architectures
I. By control mechanism: instruction stream and data stream
II. By process granularity: coarse vs. fine grain
III. By address space organization: shared vs. distributed memory
IV. By interconnection network: dynamic vs. static

(I) Control Mechanism (Flynn's taxonomy)
SISD: Single Instruction stream, Single Data stream, e.g. conventional sequential computers.
SIMD: Single Instruction stream, Multiple Data stream.
MIMD: Multiple Instruction stream, Multiple Data stream.
MISD: Multiple Instruction stream, Single Data stream.
SIMD
Multiple processing elements operate under the supervision of a single control unit.
Examples: Thinking Machines CM-2, MasPar MP-2, Quadrics.
SIMD extensions are also present in commercial microprocessors: MMX and SSE (Katmai) in Intel x86, 3DNow! in AMD K6 and Athlon, AltiVec in Motorola G4. (A minimal SSE sketch appears after the next slide.)

MIMD
Each processing element is capable of executing a different program, independent of the other processors.
Most multiprocessors can be classified in this category.
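To make the SIMD extensions above concrete, here is a minimal sketch in C using the x86 SSE intrinsics (an illustrative example, assuming an SSE-capable CPU and compiler; not from the original slides). A single instruction stream operates on four float lanes at once:

    #include <stdio.h>
    #include <xmmintrin.h>  /* SSE intrinsics (Katmai / Pentium III and later) */

    int main(void) {
        float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        float c[4];

        /* One instruction, multiple data: a single addps instruction
           adds all four float lanes in parallel. */
        __m128 va = _mm_loadu_ps(a);
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);
        _mm_storeu_ps(c, vc);

        printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
        return 0;
    }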
(II) Process Granularity
Coarse grain: small number of very powerful processors, e.g. Cray C90, Fujitsu.
Fine grain: large number of relatively less powerful processors, e.g. CM-2, Quadrics.
Medium grain: between the two extremes, e.g. IBM SP2, CM-5.
Communication cost >> computation cost -> coarse grain.
Communication cost << computation cost -> fine grain.

(III) Address Space Organization
Single/shared address space:
  Uniform Memory Access (UMA): SMP.
  Non-Uniform Memory Access (NUMA).
Message passing: distributed memory.
Shared-Memory SIMD
Vector processors; some Cray machines.

SMP Architecture
Processors connected to memory and I/O through a bus or crossbar switch.
SMP uses shared system resources (memory, I/O) that can be accessed equally from all the processors; coherence is maintained.
Expensive to build with many processors.
Example: Compaq GS AlphaServers.
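As a sketch of shared-memory programming on an SMP of this kind, the following C fragment uses OpenMP (an illustrative example, assuming an OpenMP-capable compiler; not from the original slides). All threads read and write the single shared address space, and the hardware keeps their caches coherent:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        const int n = 1000000;
        static double x[1000000];
        double sum = 0.0;

        /* Each thread works on a slice of the shared array; the
           reduction combines the per-thread partial sums. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++) {
            x[i] = 1.0 / (double)(i + 1);
            sum += x[i];
        }

        printf("sum = %f, threads = %d\n", sum, omp_get_max_threads());
        return 0;
    }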
NUMA Architecture
Each processor (or group of processors) has its own memory; the memories are connected through a bus or crossbar switch.
Shared address space; memory latency varies depending on whether a local or a remote memory is accessed.
Coherence (ccNUMA) is maintained using a hardware or software protocol.
Can afford more processors than SMP.
Examples: SGI Origin 2000/3000, Sun Ultra HPC servers.

Message-Passing SIMD
Cambridge Parallel Processing Gamma II, Quadrics.
Message-Passing MIMD
Each processor has its own local memory; processors are connected through a communication network.
Local address space only; no issue of cache coherence. (A minimal MPI sketch appears below.)
Example: IBM SP.

(IV) Interconnection Networks
Dynamic: built from switches and communication links; communication links are connected to one another dynamically by the switches.
Static: point-to-point communication links; typical of message-passing computers.
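The message-passing MIMD model above is what MPI programs target on machines like the IBM SP. A minimal sketch in C (assuming an MPI library is installed; not from the original slides), in which data moves only by explicit messages:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size, value;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size < 2) {
            if (rank == 0) printf("run with at least 2 processes\n");
            MPI_Finalize();
            return 0;
        }

        /* No shared memory: rank 0 ships a value to rank 1. */
        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 received %d from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }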
Dynamic Interconnections
Crossbar switching: the most expensive and most extensive interconnection; any processor (P1, P2, ...) can be connected to any memory module (M1, M2, ...).
Bus connected: processors are connected to memory through a common datapath.
Multistage interconnection: butterfly, Omega network, perfect shuffle, etc.

Static Interconnection
Completely-connected
Star-connected
Linear array
Mesh: 2D/3D mesh, 2D/3D torus
Tree and fat-tree network
Hypercube network
Characteristics of Static Networks
Diameter: the maximum distance between any two processors in the network.
  D = 1 for a complete connection
  D = N - 1 for a linear array
  D = N/2 for a ring
  D = 2(sqrt(N) - 1) for a 2D mesh
  D = 2 floor(sqrt(N)/2) for a 2D torus
  D = log N for a hypercube
(A worked check of these formulas follows after the next slide.)

Characteristics of Static Networks (cont.)
Bisection width: the minimum number of communication links that have to be removed to partition the network into two halves.
Channel rate: the peak rate at which a single wire can deliver bits.
Channel bandwidth: the product of channel rate and channel width.
Bisection bandwidth B: the product of bisection width and channel bandwidth.
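A small sketch verifying the diameter formulas above for N = 64 processors (an illustrative example, assuming N is a perfect square for the mesh/torus and a power of two for the hypercube):

    #include <stdio.h>

    int main(void) {
        int N = 64;    /* number of processors; 8x8 for mesh/torus */
        int side = 8;  /* sqrt(N) */
        int d = 0;     /* log2(N), computed exactly for powers of two */
        for (int m = N; m > 1; m >>= 1) d++;

        printf("complete connection: D = %d\n", 1);
        printf("linear array:        D = %d\n", N - 1);           /* 63 */
        printf("ring:                D = %d\n", N / 2);           /* 32 */
        printf("2D mesh:             D = %d\n", 2 * (side - 1));  /* 14 */
        printf("2D torus:            D = %d\n", 2 * (side / 2));  /* 8  */
        printf("hypercube:           D = %d\n", d);               /* 6  */
        return 0;
    }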
Linear Array, Ring, Mesh, Torus
Processors are arranged as a d-dimensional grid or torus.

Tree, Fat-Tree
Tree network: there is only one path between any pair of processors.
Fat-tree network: the number of communication links increases close to the root.
Hypercube
(figures: 1-D, 2-D, and 3-D hypercubes)
A d-dimensional hypercube has 2^d processors, each connected to d neighbors.

Binary Reflected Gray Code
G(i,d) denotes the i-th entry in the sequence of d-bit Gray codes.
G(i,d+1) is derived from G(i,d) by reflecting the table and prefixing the reflected entries with 1 and the original entries with 0.
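A compact way to compute the BRG code directly, rather than by table reflection, is the standard identity G(i) = i XOR (i >> 1). A short C sketch (the function name gray() is illustrative), which reproduces the table on the next slide:

    #include <stdio.h>

    /* i-th binary reflected Gray code; the low d bits give G(i,d). */
    unsigned gray(unsigned i) {
        return i ^ (i >> 1);
    }

    int main(void) {
        for (unsigned i = 0; i < 8; i++)
            printf("G(%u,3) = %u%u%u (decimal %u)\n", i,
                   (gray(i) >> 2) & 1, (gray(i) >> 1) & 1,
                   gray(i) & 1, gray(i));
        return 0;
    }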
Example of BRG Code

    i   1-bit   2-bit   3-bit   8-proc ring   8-proc hypercube node
    0   0       00      000    0             0
    1   1       01      001    1             1
    2           11      011    2             3
    3           10      010    3             2
    4                   110    4             6
    5                   111    5             7
    6                   101    6             5
    7                   100    7             4

Topology Embedding
Mapping a linear array onto a hypercube: a linear array (or ring) of 2^d processors can be embedded into a d-dimensional hypercube by mapping processor i onto processor G(i,d) of the hypercube.
Mapping a 2^r x 2^s mesh onto a hypercube: processor (i,j) -> G(i,r) || G(j,s), where || denotes concatenation.
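A sketch of both embeddings from the slide above, reusing gray() from the earlier sketch (the helper names ring_to_hypercube and mesh_to_hypercube are illustrative):

    #include <stdio.h>

    unsigned gray(unsigned i) { return i ^ (i >> 1); }

    /* Ring/linear array of 2^d procs: proc i -> hypercube node G(i,d). */
    unsigned ring_to_hypercube(unsigned i) {
        return gray(i);
    }

    /* 2^r x 2^s mesh: proc (i,j) -> G(i,r) concatenated with G(j,s). */
    unsigned mesh_to_hypercube(unsigned i, unsigned j, unsigned r, unsigned s) {
        (void)r;  /* r only fixes the width of the high bit field */
        return (gray(i) << s) | gray(j);
    }

    int main(void) {
        /* Embed a 4x4 mesh (r = s = 2) into a 4-dimensional hypercube. */
        for (unsigned i = 0; i < 4; i++)
            for (unsigned j = 0; j < 4; j++)
                printf("(%u,%u) -> node %u\n", i, j,
                       mesh_to_hypercube(i, j, 2, 2));
        return 0;
    }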
Trade-off Among Different Networks

    Network               Min latency  Max BW per proc  Wires       Switches    Example
    Completely connected  Constant     Constant         O(p^2)      -           -
    Crossbar              Constant     Constant         O(p)        O(p^2)      Cray
    Bus                   Constant     O(1/p)           O(p)        O(p)        SGI Challenge
    Mesh                  O(sqrt p)    Constant         O(p)        -           Intel ASCI Red
    Hypercube             O(log p)     Constant         O(p log p)  -           SGI Origin
    Switched              O(log p)     Constant         O(p log p)  O(p log p)  IBM SP-2

Beowulf
Cluster built with commodity hardware components:
PC hardware (x86, Alpha, PowerPC).
Commercial high-speed interconnect (100Base-T, Gigabit Ethernet, Myrinet, SCI).
Linux or FreeBSD operating system.
http://www.beowulf.org
Clusters of SMP
The next generation of supercomputers will have thousands of SMP nodes connected:
Increase the computational power of the single node.
Keep the number of nodes low.
A new programming approach is needed: MPI + threads (OpenMP, threads, ...).
Examples: ASCI White, Compaq SC, IBM SP3.
http://www.llnl.gov/asci
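The MPI + threads style mentioned above pairs message passing between SMP nodes with shared-memory threading inside each node. A minimal hybrid sketch in C (an illustrative example, assuming both MPI and OpenMP are available; not from the original slides):

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv) {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* One MPI process per SMP node; OpenMP threads share that
           node's memory, so only inter-node data needs messages. */
        #pragma omp parallel
        {
            printf("node %d of %d, thread %d of %d\n",
                   rank, size, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }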