Shared-Memory Hardware


1 Shared-Memory Hardware Parallel Programming Concepts Winter Term 2013 / 2014 Dr. Peter Tröger, M.Sc. Frank Feinbube

2 Shared-Memory Hardware
Hardware architecture: processor(s), memory system(s), data path(s)
- Each component may become the performance bottleneck
- Each component can be replicated
- Each parallelization target must be handled separately
Modern processors support:
- Multiple instructions in the same cycle
- Multiple concurrent instruction streams per functional unit
- Multiple functional units (cores)
- The combination of multiple processors with one shared memory
The logical hardware setup as seen by the software typically differs from the physical hardware organization.

3 Shared-Memory Hardware
The basic instruction cycle:
1. Fetch instruction, update program counter (PC)
2. Decode instruction
3. Execute instruction
4. Write back result

4 Shared-Memory Hardware [Stallings]
- Central processing unit (CPU) + volatile memory + I/O devices
- Fetch an instruction and execute it - typically memory access, computation, and / or I/O
- I/O devices and the memory controller may interrupt the instruction processing
- Processor utilization is improved by asynchronous operations

5 RISC vs. CISC
RISC - Reduced Instruction Set Computer (MIPS, ARM, DEC Alpha, Sparc, IBM 801, Power, etc.)
- Small number of instructions
- Few data types in hardware
- Constant instruction size, few addressing modes
- Relies on optimization in software
CISC - Complex Instruction Set Computer (VAX, Intel x86, IBM 360/370, etc.)
- Large number of complex instructions, which may take multiple cycles
- Variable-length instructions
- Smaller code size
- Focus on optimization in hardware
RISC designs lend themselves to the exploitation of instruction-level parallelism.

6 Shared-Memory Hardware
The major constraints of memory are capacity, speed, and cost:
- Faster access time results in greater cost per bit
- Greater capacity results in smaller cost per bit
- Greater capacity results in slower access
Going down the memory hierarchy:
- Decreasing cost per bit
- Increasing capacity for fixed cost
- Increasing access time
I/O devices provide non-volatile memory on the lower levels, which is an additional advantage.

7 Shared-Memory Hardware [Stallings]
Principle of locality:
- Memory referenced by a processor (program and data) tends to cluster (e.g. loops, subroutines); operations on tables and arrays involve access to clustered data sets
- Temporal locality: if a memory location is referenced, it will tend to be referenced again soon
- Spatial locality: if a memory location is referenced, locations whose addresses are close by will tend to be referenced soon
Data should be organized so that the percentage of accesses to the lower levels is substantially less than to the level above. Typically implemented by caching.
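
The effect of locality is easy to demonstrate. The following minimal C sketch (array size and names are our illustrative choices, not from the slides) sums the same matrix twice: the row-major traversal walks memory sequentially and exploits cache lines, while the column-major traversal strides across memory and defeats them.

    #include <stdio.h>

    #define N 2048
    static double a[N][N];

    int main(void) {
        double sum = 0.0;

        /* Row-major traversal: consecutive accesses fall into the
           same cache line (spatial locality). */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        /* Column-major traversal: each access jumps N * 8 bytes ahead,
           so almost every access misses the cache. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        printf("%f\n", sum);
        return 0;
    }

On typical hardware the second loop nest runs several times slower, although both perform exactly the same 2 * N * N additions.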

8 Shared-Memory Hardware [Stallings]
Caching:
- Offers a portion of the lower-level memory as a copy in the faster, smaller memory
- Leverages the principle of locality
- Processor caches work in hardware, but must be considered by the operating system

9 Shared-Memory Hardware [Stallings]
Conflicting caching design goals:
- Cache size per level
- Number of cache levels
- Block size exchanged with the lower-level memory
- Replacement algorithm
- Mapping function
- Write policy for modified cache lines
All decisions are made by the hardware vendor, but can be taken into account by software. Cache-optimized software is needed when parallelization improvements start to depend on memory bottlenecks (see the blocking sketch below).
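
As an illustration of such cache-conscious code, here is a minimal loop-blocking (tiling) sketch in C; the tile size B is a tunable assumption of ours, not a value from the lecture. Transposing a large matrix tile by tile keeps the working set of each step inside the cache:

    #define N 2048
    #define B 64   /* tile edge, chosen so that a pair of tiles fits into the cache */

    /* Cache-friendly transpose: process B x B tiles so that both the
       reads from 'in' and the writes to 'out' stay within cached lines. */
    void transpose_blocked(const double in[N][N], double out[N][N]) {
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        out[j][i] = in[i][j];
    }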

10 Parallel Processing
- Inside the processor: instruction-level parallelism (ILP) and multicore - shared memory
- With multiple processing elements in one machine: multiprocessing - shared memory
- With multiple processing elements in many machines: multicomputer - shared nothing (in terms of a globally accessible memory)

11 Instruction-Level Parallelism
Hardware optimizes the execution of a sequential instruction stream:
- Pipelining: sub-steps of sequential instructions are overlapped in their execution to increase throughput; a traditional concept in processor hardware design; relies on mechanisms such as branch prediction or the out-of-order execution of instructions
- Superscalar architecture: execution of multiple instructions in parallel, based on redundant functional units of the processor
- Very Long Instruction Word (VLIW)
- Explicitly Parallel Instruction Computing (EPIC)
- SIMD vectorization support with special instructions

12 Pipelining

13 Pipelining
Pipelining overlaps the stages of instruction execution:
- Fetch, decode, and execute happen in parallel for different instructions
- Increases instruction throughput at the same clock speed
- Analogous to the assembly-line concept
Pipelining hazards are temporal dependencies between sub-steps that limit the achievable speedup:
- Structural hazard: multiple instructions access the same resource
- Data hazard: an instruction needs the result of a previous instruction
- Control hazard: an instruction result changes the control flow (interrupt, branch)

14 Multi-Cycle Pipelining

15 Pipelining Conflicts
Data hazards come in three forms (illustrated in the sketch below):
- Read-after-write: an instruction relies on the result of a previous instruction
- Write-after-read: an instruction uses a register value that is overwritten by a subsequent instruction
- Write-after-write: two subsequent instructions write into the same register
Conflict resolution strategies:
- Replicate commonly needed hardware units
- Insert NOPs to change the timing
- Reorder the instructions to change the timing
- Stall the pipeline until the conflict is resolved
Control conflicts are targeted by branch prediction: static branch prediction (forward never, backward ever) vs. dynamic branch prediction (based on previous jumps). These issues are typically handled by the compiler and the processor hardware.
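
A hedged illustration in C (variable names are ours): assuming each statement compiles to one register instruction, the comments mark the dependency the pipeline has to respect.

    /* Toy sequence illustrating the three data-hazard types. */
    int hazards(int x, int y, int z) {
        int a = x * y;   /* I1: writes a                                   */
        int b = a + z;   /* I2: read-after-write - needs the result of I1  */
        int c = x + 1;   /* I3: reads x                                    */
        x = z * 2;       /* I4: write-after-read - must not overwrite x
                            before I3 has read the old value               */
        int d = y + z;   /* I5: writes d                                   */
        d = b + c;       /* I6: write-after-write - if I6 completed before
                            I5, d would end up with the wrong final value  */
        return d;
    }

Register renaming in hardware removes the write-after-read and write-after-write cases; read-after-write is a true dependency that can only be hidden by forwarding, reordering of independent instructions, or stalls.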

16 Superscalar Architectures

17 SIMD Architectures
- Good for problems with a high degree of regularity, such as graphics / image processing
- Typically exploit data parallelism
- Historic examples: ILLIAC IV (1974), Cray Y-MP, Thinking Machines CM-2 (1985)
- Today: GPGPU computing (e.g. the Fermi GPU), the Cell processor, SSE, AltiVec

18 SIMD Instructions
- Vector instructions for high-level operations on data sets
- Became famous with the Cray architecture in the 70s
- Today, vector instructions are part of standard instruction sets, e.g. AltiVec and the Streaming SIMD Extensions (SSE)
Example: vector addition, first in scalar C, then as three SSE instructions:

    vec_res.x = v1.x + v2.x;
    vec_res.y = v1.y + v2.y;
    vec_res.z = v1.z + v2.z;
    vec_res.w = v1.w + v2.w;

    movaps xmm0, address-of-v1      ; xmm0 = (v1.w, v1.z, v1.y, v1.x)
    addps  xmm0, address-of-v2      ; xmm0 = (v1.w+v2.w, v1.z+v2.z, v1.y+v2.y, v1.x+v2.x)
    movaps address-of-vec_res, xmm0

19 Streaming SIMD Extensions (SSE)
- Introduced by Intel with the Pentium III (1999)
- Specifically designed for floating-point and vector operations
- New 128-bit registers can be packed with four 32-bit scalars; an operation is performed simultaneously on all of them
- Typical operations: move data between SSE registers and 32-bit registers / memory; add, subtract, multiply, divide, square root, maximum, minimum, reciprocal, compare, bitwise AND / OR / XOR
- Available as compiler intrinsics - functions known by the compiler that map directly to assembler, with better performance than a linked library
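
The slide's vector addition looks as follows with SSE compiler intrinsics; this is our own minimal sketch (unaligned loads are used so that it works for any array alignment):

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Add two packed 4-float vectors: one SSE instruction each for
       load, add, and store. */
    void vec_add(const float v1[4], const float v2[4], float vec_res[4]) {
        __m128 a = _mm_loadu_ps(v1);               /* load v1.x .. v1.w */
        __m128 b = _mm_loadu_ps(v2);               /* load v2.x .. v2.w */
        _mm_storeu_ps(vec_res, _mm_add_ps(a, b));  /* addps, then store */
    }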

20 Other Instruction Set Extensions
Fused multiply-add (FMA) instructions:
- Supported in different variations by all processors
- A floating-point multiply-add operation performed in one step
- Improves speed and accuracy of product accumulation: scalar products, matrix multiplication, efficient software implementations of square root and division
Intel Advanced Vector Extensions (AVX):
- Extension of the SSE instruction set
- Introduced with the Sandy Bridge architecture (2011)
- Registers are now 256 bits wide
- 512-bit support announced for the 2015 version of the Xeon Phi
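
The scalar product is the canonical FMA use case. In this small sketch of ours, the C99 fma() function from <math.h> performs the multiply-add with a single rounding; compilers map it to an FMA instruction where the hardware provides one:

    #include <math.h>     /* fma(): fused multiply-add, rounded once */
    #include <stddef.h>

    /* Accumulate a dot product with one fused multiply-add per element. */
    double dot(const double *x, const double *y, size_t n) {
        double acc = 0.0;
        for (size_t i = 0; i < n; i++)
            acc = fma(x[i], y[i], acc);   /* acc = x[i] * y[i] + acc */
        return acc;
    }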

21 Very Long Instruction Word (VLIW)
Very Long Instruction Word (VLIW), Fisher et al., 1980s:
- The compiler identifies instructions to be executed in parallel
- One VLIW instruction encodes several operations (at least one for each redundant execution unit)
- Less hardware complexity, higher compiler complexity
- VLIW processors are typically designed with multiple RISC execution units
- Very popular in the embedded market and in GPU hardware
Explicitly Parallel Instruction Computing (EPIC):
- Term coined by the HP-Intel alliance, since 1997
- Foundational concept of the Intel Itanium architecture
- Extended version of the VLIW concept
- Turned out to be extremely difficult to target for compilers

22 EPIC
- 64-bit, register-rich, explicitly parallel architecture
- Implements predication, speculation, and branch prediction
- Hardware register renaming for parameter passing
- Parallel execution of loops
- Speculation, prediction, and renaming are controlled by the compiler
- Each 128-bit instruction word contains three instructions; stop bits control parallel execution
- The processor can execute six instructions per clock cycle
- Thirty execution units for subsets of the instruction set, organized in eleven groups; each unit executes at a rate of one instruction per cycle (unless stalled); common instructions can be executed in multiple units

23 Itanium: 30 Functional Units

24 Simultaneous Multi-Threading (SMT) [Tullsen et al., 1995]
- The reasons for bad performance in superscalar architectures depend on the application
- Dynamically schedule the usage of the functional units
- Support multiple instruction streams in one pipeline

25 Hyperthreading
- Intel's implementation of simultaneous multi-threading (SMT)
- Allows one execution core to function as two logical processors
- Main goal is to reduce the number of dependent instructions in the pipeline at the same time
- Works nicely on a cache miss, branch misprediction, or data dependency in one of the threads
- Most core hardware resources are shared: caches, execution units, buses
- Each logical processor has its own architectural state; the register bank is mirrored
- Mainly enables a very fast thread context switch purely in hardware
- More than two logical threads per core would saturate the memory connection and pollute the caches

26 Hyperthreading [Intel]

27 Parallel Processing
- Inside the processor: instruction-level parallelism (ILP) and multicore - shared memory
- With multiple processing elements in one machine: multiprocessing - shared memory
- With multiple processing elements in many machines: multicomputer - shared nothing (in terms of a globally accessible memory)

28 Chip Multi-Processing
- One integrated circuit die (socket) contains multiple computational engines (cores); called a multi-core or many-core architecture
- Cores share some / all cache levels and the memory connection; all other parts are dedicated per core (pipeline, registers, ...)
- An increasing core count leads to a resource contention problem with caches and memory
- Besides Intel / AMD, also available with ARM, MIPS, PPC
Multi-core vs. SMP:
- SMP demands more replicated hardware (fans, bus, ...)
- SMP is a choice, multi-core is given by default
- Cores typically have a lower clock frequency
- Multi-core and SMP programming problems are very similar
- Recent trend towards heterogeneous cores

29 Many-Core / Multi-Core (figures: Intel Core i7, SPARC64 VIIIfx)

30 Parallel Processing
- Inside the processor: instruction-level parallelism (ILP) and multicore - shared memory
- With multiple processing elements in one machine: multiprocessing - shared memory
- With multiple processing elements in many machines: multicomputer - shared nothing (in terms of a globally accessible memory)

31 Multiprocessor: Flynn's Taxonomy (1966)
Classifies multiprocessor architectures along the instruction and data processing dimensions:
- Single Instruction, Single Data (SISD)
- Single Instruction, Multiple Data (SIMD)
- Multiple Instruction, Single Data (MISD)
- Multiple Instruction, Multiple Data (MIMD)
(C) Blaise Barney

32 Multiprocessor Systems
Symmetric multiprocessing (SMP):
- Set of equal processors in one system (more SM-MIMD than SIMD)
- Traditionally a memory bus, today an on-chip network
- Demands synchronization and operating system support
Asymmetric multiprocessing (ASMP):
- Specialized processors for I/O, interrupt handling, or the operating system (DEC VAX-11, OS/360, IBM Cell processor)
- Typically a master processor with main memory access, and slaves

33 Symmetric Multi-Processing [Stallings]
- Two or more processors in one system that can perform the same operations (symmetric)
- Processors share the same main memory and all devices
- Increased performance and scalability for multi-tasking
- No master: any processor can cause another to reschedule
- Challenges for an SMP operating system: reentrant kernel, scheduling policies, synchronization, memory re-use, ...

34 Shared Memory
All processors act independently and use the same global address space; changes in one memory location are visible to all others.
Uniform memory access (UMA) system:
- Equal load and store access times for all processors to all memory
- Default approach for the SMP systems of the past
Non-uniform memory access (NUMA) system:
- Groups of physical processors (called nodes) that have local memory, connected by some interconnect
- Still an SMP system (any processor can access all of the memory), but node-local memory is faster
- The OS tries to schedule related activities on the same node
- Became the default model in shared-memory architectures, as cache-coherent NUMA (CC-NUMA) in hardware
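
On NUMA hardware it matters on which node data is physically placed. Most operating systems use a first-touch policy: a page lands on the node of the thread that first writes it. A common idiom, sketched here in C with OpenMP under that assumption, is therefore to initialize data with the same thread layout that later computes on it:

    #include <stdlib.h>

    #define N (1 << 24)

    int main(void) {
        double *a = malloc(N * sizeof *a);

        /* First touch: each thread writes its own chunk, so the OS places
           those pages in the local memory of the node the thread runs on. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 0.0;

        /* Same static schedule: each thread now works mostly on node-local
           memory instead of crossing the interconnect. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = a[i] * 2.0 + 1.0;

        free(a);
        return 0;
    }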

35 UMA Example
Two dual-core chips (2 cores/socket):
- P = processor core
- L1D = level 1 data cache (fastest)
- L2 = level 2 cache (fast)
- Memory = main memory (slow)
- Chipset = enforces cache coherence and mediates connections to memory

36 NUMA Example
Eight cores (4 cores/socket); L3 = level 3 cache
Memory interface = establishes a coherent link to enable one logical single address space over physically distributed memory

37 NUMA Example: Intel Nehalem
(Figure: multiple sockets, each with several cores sharing an L3 cache and a memory controller with local memory, interconnected with each other and with the I/O hubs by QPI links.)

38 CC-NUMA [Schöne et al.]
- The uncore part of the chip contains a central crossbar for the interaction of cores, memory controller, and other processors via QPI
- Similar approach by other vendors
- Extended versions of the MESI cache coherence protocol are used for L3 management

39 CC-NUMA
Cache coherency in a multi-core multi-socket system is an extended version of the traditional cache coherency problem in multi-socket SMP systems. QPI applies an extended MESI cache coherence protocol, where each cache line is in one of the following states:
- Modified: written by the local core
- Exclusive: first read by the local core
- Shared: read by two or more cores (cache hit); a write attempt in this state leads to cache invalidation elsewhere, and the new local state is Modified
- Invalid: the cache line contains no valid data (read miss)
- Forwarding (new): direct L3-to-L3 exchange of data
Can be optimized by snooping into other caches.
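
As a toy model (ours, not from the lecture) of how the state of a single cache line reacts to local and remote events, the following C sketch encodes the basic MESI transitions listed above; a real protocol adds the bus transactions and the Forwarding state:

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

    /* The local core reads the line. */
    mesi_t on_local_read(mesi_t s, int other_caches_hold_copy) {
        if (s == INVALID)                 /* read miss: fetch the line */
            return other_caches_hold_copy ? SHARED : EXCLUSIVE;
        return s;                         /* M, E, S: read hit, no change */
    }

    /* The local core writes the line: every remote copy must be
       invalidated first; the local state becomes Modified. */
    mesi_t on_local_write(mesi_t s) {
        (void)s;
        return MODIFIED;
    }

    /* Another core writes the same line: our copy becomes stale. */
    mesi_t on_remote_write(mesi_t s) {
        (void)s;
        return INVALID;
    }

    /* Another core reads the line: a Modified copy is written back
       (or forwarded), and all holders end up in Shared. */
    mesi_t on_remote_read(mesi_t s) {
        return (s == INVALID) ? INVALID : SHARED;
    }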

40 HyperTransport [hypertransport.org]
- Specification of an I/O interconnect, originally developed by AMD, Alpha Processor, and API NetWorks in 2001
- Point-to-point unidirectional links between components
- At least one host device (typically a processor)
- Bridge functionality to PCI, PCI-X, PCI Express, ...
- Tunnel devices connect a link to other HyperTransport devices
- Extremely low overhead, suitable for inter-processor communication in SMP hardware

41 HyperTransport [hypertransport.org]

42 Quick Path Interconnect (QPI) [intel.com]
Competing technology from Intel, since 2008; the result of a continuous improvement of Intel's processor interconnect technology:
- Traditional shared frontside bus (until 2004): up to 4.2 GB/s platform bandwidth
- Dual independent buses with a snoop filter in the chipset (until 2005): up to 12.8 GB/s platform bandwidth
(Figure: processors attached via the frontside buses to the chipset and its memory interface.)

43 Quick Path Interconnect (QPI) [intel.com]
- Dedicated interconnects with a snoop filter in the chipset (until 2007): up to 34 GB/s platform bandwidth
- Quick Path Interconnect: each processor carries its own memory interface; processors are connected to each other and to the I/O chipsets by uni-directional QPI links instead of bi-directional buses

44 Scalable Coherent Interface
- ANSI / IEEE standard for a NUMA interconnect, used in the HPC world
- 64-bit global address space, translation by the SCI bus adapter (I/O window)
- Used as a 2D / 3D torus
(Figure: processors with caches on per-node memory buses; an SCI cache and an SCI bridge per node link the node memories into one system.)

45 Theoretical Models for Parallel Hardware
- Better to use a simplified parallel machine model than a real hardware specification for parallelization optimization
- Allows the theoretical investigation of algorithms, and generic optimization regardless of products
- Should improve algorithm robustness by avoiding optimizations for hardware layout specialties (e.g. network topology)
- Became popular in the 70s and 80s, due to the large diversity in parallel hardware design
- The resulting computational model is independent of the programming model used for the implementation
- Vast body of theoretical research results
- Typically, formal models adapt to hardware developments

46 (Parallel) Random Access Machine
- RAM assumptions: constant memory access time, unlimited memory
- PRAM assumptions: non-conflicting shared bus, no assumption on synchronization support, unlimited number of processors
- Alternative models: BSP, LogP
(Figure: a RAM as one CPU between input, memory, and output; a PRAM as several CPUs on a shared bus in front of the same memory.)

47 PRAM Extensions
Rules for memory interaction classify the hardware support needed by a PRAM algorithm. Memory access is assumed to be in lockstep (synchronous PRAM):
- Concurrent Read, Concurrent Write (CRCW): multiple tasks may read from / write to the same location at the same time; can be simulated with EREW
- Concurrent Read, Exclusive Write (CREW): only one task may write to a given memory location at any time
- Exclusive Read, Concurrent Write (ERCW): only one task may read from a given memory location at any time
- Exclusive Read, Exclusive Write (EREW): only one task may read from / write to a memory location at any time; the memory management must know about the concurrency

48 PRAM Extensions
A concurrent write scenario needs further specification by the algorithm:
- Ensure that the same value is written
- Select an arbitrary value from the parallel write attempts
- Derive the priority of the written value from the processor ID
- Store the result of a combining operation (e.g. sum) into memory
A PRAM algorithm can act as the starting point for a real implementation. The unlimited resource assumption:
- Allows mapping of 'logical' PRAM processors to a restricted number of physical processors
- Enables the design of scalable algorithms based on the unlimited memory assumption
- Focuses only on concurrency opportunities; synchronization and communication come later

49 Example: Parallel Sum
The general parallel sum operation works with any associative and commutative combining operation: multiplication, maximum, minimum, logical operations, ...
Sequential version:

    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += A[i];
    }

PRAM solution:
- Build a binary tree, with the input data items as leaf nodes
- Internal nodes hold partial sums; the root node holds the global sum
- Additions on one level are independent of each other
PRAM algorithm: one processor per leaf node, in-place summation; computation in O(log2 n)

50 Example: Parallel Sum
Example for n = 8:
- l = 1: partial sums in X[1], X[3], X[5], X[7]
- l = 2: partial sums in X[3] and X[7]
- l = 3: the parallel sum result is in X[7]
Correctness relies on the PRAM lockstep assumption (no synchronization).

    for all l levels (1..log2 n) {
        for all i items (0..n-1) in parallel {
            if (((i+1) mod 2^l) == 0) then
                X[i] := X[i - 2^(l-1)] + X[i]
        }
    }
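
Simulated sequentially, the PRAM algorithm is a few lines of C. This sketch of ours assumes n is a power of two and leaves the total in X[n-1]:

    /* In-place tree summation, a sequential simulation of the PRAM
       algorithm above; at level l the stride is 2^(l-1). */
    void parallel_sum(int X[], int n) {
        for (int stride = 1; stride < n; stride *= 2)
            for (int i = 2 * stride - 1; i < n; i += 2 * stride)
                X[i] += X[i - stride];   /* independent within one level */
    }
    /* Afterwards X[n-1] holds the sum of all n elements. */

On a PRAM, the inner loop runs in parallel with one processor per item, so each level costs constant time and the whole sum takes O(log2 n) steps.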

51 Bulk-Synchronous Parallel (BSP) Model
Leslie G. Valiant: A Bridging Model for Parallel Computation, 1990.
The success of the von Neumann model comes from being a bridge between hardware and software: high-level languages can be efficiently compiled onto this model, while hardware designers can optimize its realization. BSP aims to be a similar model for parallel machines:
- Should be neutral about the number of processors
- Programs are written for v virtual processors that are mapped to p physical ones
- When v >> p, the compiler has options
A BSP computation consists of a series of supersteps:
1. Concurrent computation on all processors
2. Exchange of data between all processes
3. Barrier synchronization

52 Bulk-Synchronous Parallel (BSP) Model
The costs of a superstep depend on:
- The cost of the slowest local computation
- The cost of the communication between all processes
- The cost of the barrier synchronization
The cost of an algorithm is the sum of all superstep costs (see the formula below). Synchronization may only involve some of the processes, so long-running serial tasks are not slowed down from the model's perspective.
Recent industrial uptake with Pregel and the ML language; the Apache Hama project implements BSP on top of Hadoop.
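
In the usual formulation of the model (not spelled out on the slide), a superstep in which processor i performs w_i local operations and the largest number of words sent or received by any processor is h costs

    cost(superstep) = max_i(w_i) + g * h + l

where g is the network's cost per word and l is the cost of the barrier; the cost of the whole algorithm is the sum of these terms over all supersteps.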

53 Bulk-Synchronous Parallel (BSP) Model
A bulk-synchronous parallel computer (BSPC) is defined by:
- Components, each performing processing and / or memory functions
- A router that delivers messages between pairs of components
- Facilities to synchronize components at regular intervals of L time units (the periodicity)
A computation consists of a number of supersteps; every L time units, a global check is made whether the current superstep is completed. The router concept separates the computation from the communication aspects, and models memory / storage access explicitly. L is controlled by the application, even at run time.

54 LogP [Culler et al., 1993]
Criticism of the over-simplification in PRAM-based approaches, which encourage the exploitation of 'formal loopholes' (e.g. free communication), and response to the trend towards multicomputer systems with large local memories. LogP characterizes a parallel machine by:
- P: number of processors
- g (gap): minimum time between two consecutive transmissions; its reciprocal corresponds to the per-processor communication bandwidth
- L (latency): upper bound on the messaging time
- o (overhead): exclusive processor time needed for a send / receive operation
L, o, and g are measured in multiples of processor cycles.

55 LogP Architecture Model

56 LogP
An algorithm must produce correct results under all message interleavings, and prove its space and time demands per processor. Simplifications:
- With infrequent communication, the bandwidth limit (g) is not relevant
- With streaming communication, the latency (L) may be disregarded
- Convenient approximation: increase the overhead (o) to be as large as the gap (g)
Encourages careful scheduling of computation, and the overlapping of computation and communication. Can be mapped to shared-memory and shared-nothing machines: reading a remote location requires 2L + 4o processor cycles (see below).
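
The 2L + 4o figure follows directly from the model's parameters, reading a remote access as a request message plus a reply: the request costs o (send) + L (network) + o (receive), and the reply costs the same, so the total is 2 * (o + L + o) = 2L + 4o.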

57 LogP
Matching the model to real machines:
- Saturation effects: latency increases as a function of the network load, with a sharp increase at the saturation point - captured by a capacity constraint
- The internal network structure is abstracted, so 'good' vs. 'bad' communication patterns are not distinguished - can be modeled by multiple values of g
- LogP does not model specialized hardware communication primitives; everything is mapped to send / receive operations
- Separate network processors can be explicitly modeled
The model defines a 4-dimensional parameter space of machines; a vendor's product line can be identified by a curve in this space.

58 LogP Optimal Broadcast Tree

59 LogP Optimal Summation
