Parallel Architectures
Part 1: The rise of parallel machines
Intel Core i7: 4 CPU cores, 2 hardware threads per core (8 hardware threads)
Lab Cluster
Intel Xeon: 4/10/16/18 CPU cores, 2 hardware threads per core (8/20/32/36 hardware threads), plus multi-socket boards
SUN UltraSPARC T3: 16 CPU cores, 8 hardware threads per core (128 hardware threads)
IBM Power 8
GPUs: 2,000+ cores on one chip
NVIDIA TITAN Z
Top500.org
Part 2: Taxonomies for Parallel Architectures
Taxonomies for Parallel Architectures
Flynn's Taxonomy (program control and memory access)
Taxonomy Based on Memory Organization
Taxonomy Based on Processor Granularity
Taxonomy Based on Processor Synchronization
Taxonomy Based on Interconnection Architecture
Flynn's Taxonomy Computer architectures: SISD, SIMD, MISD, MIMD. Based on the method of program control and memory access (the number of instruction streams and data streams).
SISD Computers Standard sequential computer. A single processing unit receives a single stream of instructions that operate on a single stream of data.
MISD Computers p processors, each with its own control unit, share a common memory; the p instruction streams all operate on a single stream of data.
SIMD Computers All p identical processors operate under the control of a single instruction stream issued by a central control unit. There are p data streams, one per processor, so each processor can operate on different data.
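A rough software analogy (a sketch, not from the slides): a vectorized NumPy operation behaves SIMD-like, with one instruction applied to p data elements at once, one element per notional processor.

```python
import numpy as np

# One instruction stream (the vectorized add) applied to p data streams:
# each element position plays the role of a processor holding its own datum.
a = np.arange(8)      # per-"processor" data: 0..7
b = np.full(8, 10)    # second operand, one value per "processor"
c = a + b             # a single instruction, executed across all elements

print(c.tolist())     # [10, 11, 12, 13, 14, 15, 16, 17]
```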
MIMD Computers p processors, p streams of instructions, p streams of data.
Taxonomy Based on Memory Organization Distributed memory; Shared memory: UMA (uniform memory access), NUMA (non-uniform memory access)
Distributed Memory Each processor has its own memory. Communication is usually performed by message passing. Each processor can access its own memory directly, and the memory of another processor only via message passing over the interconnect.
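The distributed-memory model can be sketched with OS processes standing in for processors (an illustration, assuming Python's `multiprocessing` as the "interconnect"): each process has private memory, and data moves only as explicit messages.

```python
from multiprocessing import Process, Pipe

# Each process has private memory; the only way another process sees
# this data is an explicit message over the interconnect (here, a Pipe).
def worker(conn):
    local = [x * x for x in range(5)]   # lives in this process's own memory
    conn.send(sum(local))               # ship the result as a message
    conn.close()

if __name__ == "__main__":
    parent, child = Pipe()
    p = Process(target=worker, args=(child,))
    p.start()
    print(parent.recv())                # 30 = 0 + 1 + 4 + 9 + 16
    p.join()
```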
Shared Memory Provides hardware support for read/write to a shared memory space; there is a single address space shared by all processors. (Figure: processors, memory modules, and I/O controllers attached to a common interconnect.)
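The shared-memory model maps naturally onto threads (a minimal sketch, assuming Python threads as the "processors"): all threads read and write one variable in a single address space, which is why updates must be synchronized.

```python
import threading

# All threads share one address space: `counter` is a single variable
# visible to every thread, so concurrent updates need a lock even though
# the hardware supports reads/writes to the shared location directly.
counter = 0
lock = threading.Lock()

def bump(n):
    global counter
    for _ in range(n):
        with lock:
            counter += 1

threads = [threading.Thread(target=bump, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # 4000
```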
Scaling Up Problem is the interconnect: cost (crossbar) or bandwidth (bus). Dance-hall: bandwidth still scalable, but lower cost than crossbar; latencies to memory uniform, but uniformly large. Distributed memory or non-uniform memory access (NUMA): construct a shared address space out of simple message transactions across a general-purpose network (e.g., read-request, read-response). Caching shared (particularly nonlocal) data?
Taxonomy Based on Processor Granularity Coarse Grained: Few powerful processors Fine Grained: Many small processors (massively parallel) Medium Grained: between the two...
Taxonomy Based on Processor Synchronization Asynchronous: Processors run on independent clocks. The user has to synchronize via message passing or shared variables. Fully Synchronous: Processors run in sync on one global clock. Bulk-synchronous: Hybrid. Processors have independent clocks; support is provided for global synchronization to be called by the user's application program.
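The bulk-synchronous style can be sketched with a barrier (an illustration using Python's `threading.Barrier`): each thread computes independently, then all wait at a global synchronization point before reading one another's results.

```python
import threading

# Bulk-synchronous superstep: independent local computation, then a
# global barrier; after the barrier every thread may safely read all
# results written before it.
NUM = 4
barrier = threading.Barrier(NUM)
results = [0] * NUM

def superstep(i):
    results[i] = i * i        # local computation phase (independent "clock")
    barrier.wait()            # global synchronization requested by the program
    if i == 0:
        print(sum(results))   # 14 = 0 + 1 + 4 + 9

threads = [threading.Thread(target=superstep, args=(i,)) for i in range(NUM)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```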
Taxonomy Based on Interconnection Architectures Static: point-to-point connections Dynamic: network with switches (crossbars, buses)
Static Interconnection Topologies Linear Array, Ring Diameter (max distance between processors) Bisection Width (min number of links cut to break the network into two equal halves) Cost (number of links)
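The three metrics have closed forms for these two topologies; a small sketch (the formulas are standard, the function names are my own) makes them concrete.

```python
# Metrics for a linear array and a ring of p processors.
# Diameter: max hop count between any two nodes.
# Bisection width: min links cut to split the network into equal halves.
# Cost: total number of links.

def linear_array(p):
    # End-to-end path is p-1 hops; cutting the middle link halves it.
    return {"diameter": p - 1, "bisection": 1, "cost": p - 1}

def ring(p):
    # Worst case is halfway around; any bisection cuts two links.
    return {"diameter": p // 2, "bisection": 2, "cost": p}

print(linear_array(8))  # {'diameter': 7, 'bisection': 1, 'cost': 7}
print(ring(8))          # {'diameter': 4, 'bisection': 2, 'cost': 8}
```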
Static Interconnection Topologies Mesh Torus Diameter? Bisection Width? Cost?
Static Interconnection Topologies Tree Diameter? Bisection Width? Cost?
Static Interconnection Topologies Complete Network Diameter? Bisection Width? Cost?
Static Interconnection Topologies d-dimensional Hypercube: 2^d processors (examples for d = 0 through 5) Diameter? Bisection Width? Cost?
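For the hypercube the answers follow from its bit-label structure (a sketch of the standard formulas; the function name is my own): neighbors differ in exactly one bit, so the shortest path between two nodes is the Hamming distance of their d-bit labels.

```python
# d-dimensional hypercube with 2**d processors.
def hypercube(d):
    p = 2 ** d
    return {
        "processors": p,
        "diameter": d,          # at most d bit flips between any two labels
        "bisection": p // 2,    # cutting one dimension severs 2**(d-1) links
        "cost": d * p // 2,     # each node has d links, each counted once
    }

print(hypercube(4))
# {'processors': 16, 'diameter': 4, 'bisection': 8, 'cost': 32}
```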
Static Interconnection Topologies Fat Tree Diameter? Bisection Width? Cost?
Switch based interconnection network
Summary
Taxonomy of parallel machines, combining the axes above (SIMD vs. MIMD, distributed vs. shared memory, fine vs. coarse grained): Massively parallel cluster: MIMD, distributed memory, fine grained. Coarse grained cluster: MIMD, distributed memory, coarse grained. Multi-core processor: MIMD, shared memory, coarse grained. GPU: SIMD, shared memory, fine grained.