COMP4300/8300: Overview of Parallel Hardware
Alistair Rendell
COMP4300/8300 Lecture 2. Copyright © 2015 The Australian National University



2 2.1 Lecture Outline
Review of Single Processor Design
- So we talk the same language
- Many things happen in parallel even on a single processor
- Identify potential issues for parallel hardware: why use 2 CPUs if you can double the speed of one!
Multiple Processor Design
- Hardware models
- Shared/distributed memory
- Hierarchical/flat memory
- Dynamic/static processor connectivity
- Evaluating static networks
- Routing mechanisms

3 2.2 The Processor
Performs:
- Floating point operations (FLOPS): add, mult, division (maybe sqrt!)
- Integer operations (MIPS): adds etc, also logical ops and instruction processing
- MIPS (Millions of Instructions Per Second) was historically measured relative to the very old VAX 11/780. Anyway, what counts as a machine instruction differs between CPUs!
- Our primary focus will be on floating point operations
Clock:
- All ops take a fixed number of clock ticks to complete
- Clock speed is measured in GHz (10^9 cycles/second), i.e. clock periods of order nanoseconds (10^-9 seconds)
- Apple iPhone 6 ARM A8 1.4 GHz (0.71 ns), NCI Raijin Intel Xeon Sandy Bridge 2.6 GHz (0.38 ns), IBM zEC12 processor 5.5 GHz (0.18 ns)
- Clock speed is limited by etching (feature size) and the speed of light, which motivates going parallel (e.g. dual-core systems)
- (To my knowledge) the IBM zEC12 is the fastest commodity processor at 5.5 GHz
- Light travels about 10 cm in 0.32 ns, and a chip is a few cm across!
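To make the speed-of-light point concrete, here is a minimal C sketch (mine, not from the slides) that converts each quoted clock rate into a period and the distance light covers in one tick:

```c
#include <stdio.h>

int main(void) {
    const double c = 3.0e8;               /* speed of light, m/s */
    const double ghz[] = {1.4, 2.6, 5.5}; /* clock rates from the slide */
    for (int i = 0; i < 3; i++) {
        double period_ns = 1.0 / ghz[i];  /* 1/GHz gives nanoseconds */
        double dist_cm = c * period_ns * 1e-9 * 100.0;
        printf("%.1f GHz: period %.2f ns, light travels %.1f cm\n",
               ghz[i], period_ns, dist_cm);
    }
    return 0;
}
```

At 5.5 GHz a signal can cross only about 5 cm of chip per cycle even at light speed, which is the physical limit the slide alludes to.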

4 2.3 Processor Performance

FLOP/s   Prefix   Occurrence
10^3     kilo     very badly written code
10^6     mega     badly written code
10^9     giga     single core
10^12    tera     multiple chips (NCI)
10^15    peta     23 machines in the Top500 (Nov 2012, measured)
10^18    exa      around 2020!

Peak performance examples:
- PC 2.5 GHz Core2 Quad: 4 (cores) * 4 (ops) * 2.5 GHz = 40 GFLOP/s
- Bunyip Pentium III: 96 (nodes) * 2 (sockets) * 1 (op) * 550 MHz = 105 GFLOP/s
- NCI Raijin: 3592 (nodes) * 2 (sockets) * 8 (cores) * 8 (ops) * 2.6 GHz = 1.19 PFLOP/s
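These peak figures are just products of machine parameters. A small C sketch (parameters copied from the Raijin line above) reproduces the arithmetic:

```c
#include <stdio.h>

/* Peak FLOP/s = nodes * sockets * cores * (FLOPs per cycle) * clock.
   The values below are the Raijin figures quoted on the slide. */
int main(void) {
    double nodes = 3592, sockets = 2, cores = 8, ops = 8, ghz = 2.6;
    double peak = nodes * sockets * cores * ops * ghz * 1e9;
    printf("Raijin peak: %.3f PFLOP/s\n", peak / 1e15);  /* ~1.195 */
    return 0;
}
```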

5 2.4 Adding Numbers
Consider adding two double precision (8 byte) numbers, stored as sign (+/-), exponent and mantissa.
Possible steps:
- Determine the largest exponent
- Normalize the smaller exponent to the larger
- Add the mantissas
- Renormalize the mantissa and exponent of the result
Multiple steps, each taking 1 tick, implies 4 ticks per addition (FLOP)
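The same four steps can be mimicked in software; a minimal C sketch (mine, using the standard frexp/ldexp mantissa/exponent helpers) purely to illustrate the sequence:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double a = 6.5, b = 0.375;
    int ea, eb;
    double ma = frexp(a, &ea);    /* a = ma * 2^ea, ma in [0.5, 1) */
    double mb = frexp(b, &eb);    /* step 1: exponents now known   */
    if (ea > eb) {                /* step 2: shift the smaller     */
        mb = ldexp(mb, eb - ea);  /*   mantissa to the larger exp  */
        eb = ea;
    } else {
        ma = ldexp(ma, ea - eb);
        ea = eb;
    }
    double m = ma + mb;           /* step 3: add mantissas          */
    double r = ldexp(m, ea);      /* step 4: renormalize the result */
    printf("%g + %g = %g\n", a, b, r);   /* 6.5 + 0.375 = 6.875 */
    return 0;
}
```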

6 2.5 Pipeline Operations #1
[Figure: operands X(6)...X(1) flowing through the waiting, in-pipeline and done stages of a 4-stage pipeline]
- X(1) takes 4 ticks to appear (startup latency)
- X(2) appears 1 tick after X(1)
- Asymptotically we achieve 1 result per clock tick
- The operation is said to be pipelined
- The steps in the pipeline are running in parallel
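The timing follows a simple model: with s stages at 1 tick each and n independent operations, the total is s + (n - 1) ticks. A sketch (my model, assuming no stalls):

```c
#include <stdio.h>

/* Time for n independent ops through an s-stage pipeline:
   s ticks for the first result, then one result per tick. */
long pipeline_ticks(long s, long n) { return s + (n - 1); }

int main(void) {
    long s = 4;  /* stages, as in the addition example above */
    for (long n = 1; n <= 1000000; n *= 100)
        printf("n=%7ld  ticks=%8ld  results/tick=%.3f\n",
               n, pipeline_ticks(s, n),
               (double)n / pipeline_ticks(s, n));
    return 0;
}
```

The throughput column climbs toward 1 result per tick as n grows, which is the asymptote stated above.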

7 2.6 Pipeline Operations #2
- Requires the same op consecutively on different (independent) data items
  - good for vector operations
  - note the limitations on chaining output data to input
- Tendency to increase the number of stages in the pipeline if each stage can then run faster
- The more stages in a pipeline, the greater the startup latency
- UltraSPARC II has a 9-stage pipeline, UltraSPARC III a 14-stage pipeline
- The Prescott Pentium 4 processor had a 31-stage pipeline
- Not all operations are pipelined, e.g. integer multiplication, division, sqrt

Clock cycles for different operations on the Alpha EV6:
Operation   Latency   Repeat
+, -, *     4         1
/, sqrt     (much higher; not fully pipelined)

8 2.7 Instruction Parallelism
- The processor issues multiple instructions per clock cycle that are executed in parallel on different parts of the chip hardware
- Grouping rules restrict what can be issued in parallel, e.g. UltraSPARC: 4 instructions drawn from 2 floating point, 2 integer, 1 load/store and 1 branch
[Figure: separate multiply and addition units consuming inputs and producing results in parallel]
- Pentium III: a single FLOP per cycle
- Opteron, UltraSPARC and Alpha: 2 (different) FLOPs per cycle
- Core2, Itanium2 and IBM Power5: 4 (DP) FLOPs per cycle
- Xeon Sandy Bridge: 8 (DP) FLOPs per cycle
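As an illustration (mine, not from the slides), a loop whose iterations contain an independent multiply and an independent add, the pattern a superscalar core with separate multiply and add units can dual-issue:

```c
#include <stddef.h>

/* Each iteration issues an independent multiply (a[i]*b[i]) and an
   independent add (c[i]+d[i]); with separate multiply and addition
   units these can execute in the same cycle. */
void mul_add_streams(size_t n, const double *a, const double *b,
                     const double *c, const double *d,
                     double *p, double *s) {
    for (size_t i = 0; i < n; i++) {
        p[i] = a[i] * b[i];   /* multiply unit */
        s[i] = c[i] + d[i];   /* addition unit */
    }
}
```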

9 2.8 Memory Structure
Consider DAXPY: Y(i) = a*X(i) + Y(i)
- If, theoretically, the CPU can perform the 2 FLOPs in 1 cycle, memory must deliver (load) two doubles (X(i) and Y(i), 16 bytes) and store one (Y(i), 8 bytes) every clock cycle
- On a 1 GHz system this implies 16 GB/sec of load traffic and 8 GB/sec of store traffic
- Typically a processor core can only issue one load OR store instruction in a clock cycle
- DDR3 SDRAM is available clocked at 1066 MHz, with access times to match
- Memory latency and bandwidth are critical performance issues
- Caches reduce latency and provide improved cache-to-CPU bandwidth
- Memory banks improve bandwidth
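The kernel itself is one line; a C version (standard DAXPY, with the per-iteration traffic from the slide noted in comments):

```c
#include <stddef.h>

/* DAXPY: y <- a*x + y. Per iteration: 2 FLOPs, two 8-byte loads
   (x[i], y[i]) and one 8-byte store (y[i]), i.e. 24 bytes of memory
   traffic for every 2 FLOPs if run at peak. */
void daxpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```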

10 2.9 Cache
Memory hierarchy:
- Main memory: large, cheap; high latency, low bandwidth
- Cache: small, fast, expensive; lower latency, higher bandwidth
- CPU registers
This memory hierarchy gives Non-Uniform Memory Access (NUMA):
- Cache hit: the data is in cache and received in a few cycles
- Cache miss: the data is fetched from main memory (or a higher level cache)
- Try to ensure data is in cache (or as close to the CPU as possible)
- Can we block the algorithm to minimize memory traffic?
- Cache is effective because algorithms often use data that are close together in memory
- (Note: duplication of data in caches will have implications for parallel systems!)
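Blocking restructures loops so a small tile of data is reused while it is still cache resident. A hedged sketch (sizes N and B are illustrative choices of mine) using matrix transpose:

```c
#define N 64
#define B 8   /* block edge, chosen so a BxB tile fits in cache */

/* Blocked transpose: each BxB tile of 'a' is read once and reused
   while cache-resident, instead of striding through whole rows. */
void transpose_blocked(const double a[N][N], double t[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    t[j][i] = a[i][j];
}
```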

11 2.10 Cache Mapping
- Blocks of main memory are mapped to a cache line
- A cache line is typically 32-128 bytes wide
- The mapping may be direct, or n-way associative
- The entire cache line is fetched from memory, not just one element
- Structure code to try and use an entire cache line of data
  - best to have unit stride
  - pointer chasing is very bad
[Figure: main memory blocks mapping onto cache lines 1-4, each block able to occupy one of two lines (1 or 3, 2 or 4)]
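A hedged C sketch (mine) contrasting unit-stride traversal, which uses every element of each fetched line, with strided traversal, which touches a new line on almost every access:

```c
#define ROWS 1024
#define COLS 1024

/* Row-major C array: a[i][j] and a[i][j+1] are adjacent in memory,
   so summing row by row is unit stride and uses whole cache lines. */
double sum_unit_stride(const double a[ROWS][COLS]) {
    double s = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += a[i][j];
    return s;
}

/* Column by column jumps COLS*8 bytes between accesses, touching a
   new cache line almost every time: same result, far more traffic. */
double sum_strided(const double a[ROWS][COLS]) {
    double s = 0.0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += a[i][j];
    return s;
}
```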

12 2.11 Memory Banks
Memory bandwidth can be improved by having multiple parallel paths to/from memory.
[Figure: CPU connected in parallel to memory banks 1-4]
- The traditional solution used by vector processors
- High initial latency
- Good performance for unit stride
- Very bad performance on bank conflicts

13 2.12 Going Parallel
- Inevitably the performance of a single processor is limited by the clock speed
- Improved manufacturing increases the clock rate, but it is ultimately limited by the speed of light
- Superscalar execution allows multiple ops at once, but is not always applicable
- It's time to go parallel!
Hardware issues:
- Flynn's taxonomy of parallel processors
- SIMD/MIMD
- Shared/distributed memory
- Hierarchical/flat memory
- Dynamic/static processor connectivity
- Characteristics of static networks

14 2.13 Architecture Classification: Flynn's Taxonomy
Why classify?
- What kind of parallelism is employed?
- Which architecture has the best prospects for the future?
- What has already been achieved by current architecture types?
- Reveal configurations that have not yet been considered by system architects
- Enable the building of performance models
Flynn's taxonomy is based on the degree of parallelism, with 4 categories determined by the number of instruction and data streams:

                                 Data stream
                        Single              Multiple
Instruction   Single    SISD                SIMD
stream                  (1 CPU)             (array/vector processor)
              Multiple  MISD                MIMD
                        (pipelined?)        (multiple processors)

15 2.14 SIMD and MIMD
SIMD: Single Instruction Multiple Data
- Also known as data parallel processors or array processors
- Vector processors (to some extent)
- Current examples include SSE instructions, the SPEs on the CellBE, and GPUs
- NVIDIA's SIMT (T = Threads) is a slight variation
MIMD: Multiple Instruction Multiple Data
- Examples include a quad-core PC and the octa-core Xeons on Raijin
[Figure: SIMD with one global control unit driving all CPUs over an interconnect, versus MIMD where each CPU has its own control unit]

16 2.15 MIMD
- The most successful parallel model
- More general purpose than SIMD (e.g. the CM5 could emulate the CM2)
- Harder to program, as processors are not synchronized at the instruction level
Design issues for MIMD machines:
- Scheduling: efficient allocation of processors to tasks in a dynamic fashion
- Synchronization: prevent processors accessing the same data simultaneously
- Interconnection design: processor-to-memory and processor-to-processor interconnects; also the I/O network, where often processors are dedicated to I/O devices
- Overhead: inevitably there is some overhead associated with coordinating activities between processors, e.g. resolving contention for resources
- Partitioning: identifying parallelism in algorithms that can exploit concurrent processing streams is non-trivial
(Aside: SPMD, Single Program Multiple Data, is more restrictive than MIMD, implying that all processors run the same executable. It simplifies the use of a shared address space.)

17 2.16 Address Space Organization: Message Passing
- Each processor has local (private) memory
- Processors interact solely by message passing
- Commonly known as distributed memory machines
- Memory bandwidth scales with the number of processors
- Example: between nodes of the NCI Raijin system
[Figure: processors, each with its own memory, connected by an interconnect]
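As a sketch of the model (standard MPI point-to-point calls; the program itself is mine, not from the slides), one process sends a value from its private memory to another, which cannot see it any other way:

```c
#include <mpi.h>
#include <stdio.h>

/* Rank 0 sends a value from its private memory to rank 1; the only
   way rank 1 can observe it is via an explicit message.
   Run with: mpirun -np 2 ./a.out */
int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double x = 0.0;
    if (rank == 0) {
        x = 3.14;
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", x);
    }
    MPI_Finalize();
    return 0;
}
```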

18 2.17 Address Space Organization: Shared Address Space
- Processors interact by modifying data objects stored in a shared address space
- Flat, uniform memory access (UMA)
- Scalability of memory bandwidth and processor-processor communication is a problem
- Example: a dual/quad core PC (ignoring cache)
[Figure: memories and processors all attached to a common interconnect]
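A hedged OpenMP sketch (mine) in which all threads update one shared array directly through the common address space, with no messages:

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

/* Every thread updates part of the same shared array; no explicit
   communication is needed because all threads see one address space. */
int main(void) {
    static double y[N];
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = 2.0 * i;          /* y is shared, visible to all threads */
    printf("y[N-1] = %f\n", y[N - 1]);
    return 0;
}
```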

19 2.18 Non-Uniform Memory Access (NUMA)
- The machine includes some hierarchy in its memory structure
- All memory is local to the programmer (a single address space), but some memory takes longer to access than other memory
- Cache introduces one level of NUMA
- Example: between the sockets of an NCI Raijin node, or in a multi-socket Opteron system
[Figure: two groups of processors with caches, each group with its own memories, joined by interconnects]

20 2.19 Shared Address Space Access
Parallel Random Access Machine (PRAM): an idealized model covering any shared memory machine.
What happens when multiple processors try to read/write the same memory location at the same time?
PRAM models:
- Exclusive-read, exclusive-write (EREW) PRAM
- Concurrent-read, exclusive-write (CREW) PRAM
- Exclusive-read, concurrent-write (ERCW) PRAM
- Concurrent-read, concurrent-write (CRCW) PRAM
Concurrent read is OK, but concurrent write requires arbitration:
- Common: the write is allowed if all values being written are identical
- Arbitrary: an arbitrary processor is allowed to proceed and the rest fail
- Priority: processors are organized into a predefined priority list; the processor with the highest priority succeeds and the rest fail
- Sum: the sum of all the quantities being written is stored
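The Sum rule can be imitated on real hardware with atomic addition. A hedged C11 sketch (mine; assumes the optional <threads.h> is available):

```c
#include <stdatomic.h>
#include <threads.h>
#include <stdio.h>

atomic_int cell;   /* the single, concurrently written location */

/* Each thread "writes" its value; atomic addition plays the role
   of the CRCW Sum arbitration rule. */
int writer(void *arg) {
    atomic_fetch_add(&cell, *(int *)arg);
    return 0;
}

int main(void) {
    thrd_t t[4];
    int vals[4] = {1, 2, 3, 4};
    for (int i = 0; i < 4; i++) thrd_create(&t[i], writer, &vals[i]);
    for (int i = 0; i < 4; i++) thrd_join(t[i], NULL);
    printf("cell = %d\n", atomic_load(&cell));   /* 10 = 1+2+3+4 */
    return 0;
}
```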

21 2.20 Dynamic Processor Connectivity: Crossbar
- A non-blocking network: connecting two processors does not block connections between other processors
- Complexity grows as O(p^2)
- May be used to connect processors each with their own local memory
[Figure: eight processor-and-memory units connected through a crossbar]

22 2.21 Dynamic Processor Connectivity: Multistaged Networks
[Figure: processors connected to memories through a switching network; the example shown is an omega network]
- Consists of log2(p) stages, where p is the number of processors
- Let s and t be the binary representations of a message's source and destination
- At stage 1, route straight through if the most significant bits of s and t are the same, and cross over if they differ
- The process is repeated at each following stage using the next most significant bit, etc
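A hedged C sketch of that routing rule (the helper is hypothetical, mine): at each stage one bit of s and t, most significant first, picks pass-through or crossover:

```c
#include <stdio.h>

/* Omega-network routing for p = 2^stages endpoints: compare one bit
   of source s and destination t per stage, most significant first;
   equal bits route straight through, unequal bits cross over. */
void omega_route(unsigned s, unsigned t, int stages) {
    for (int k = stages - 1; k >= 0; k--) {
        unsigned sb = (s >> k) & 1u, tb = (t >> k) & 1u;
        printf("stage %d: %s\n", stages - k,
               sb == tb ? "pass through" : "crossover");
    }
}

int main(void) {
    omega_route(2u, 5u, 3);   /* 010 -> 101 in an 8-way network */
    return 0;
}
```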

23 2.22 Dynamic Processor Connectivity: Bus
- A processor gains exclusive access to the bus for some period
- The performance of the bus limits scalability
[Figure: processors with caches sharing a single bus to the memories]
Performance: Crossbar > Multistage > Bus
Cost:        Crossbar > Multistage > Bus

24 2.23 Static Processor Connectivity: Complete, Mesh, Tree
- Completely connected (becomes very complex!)
- Linear array/ring, mesh/2-D torus
- Tree (static if the nodes are processors)
[Figure: example topologies; legend distinguishes switches from processors]

25 2.24 Static Processor Connectivity: Hypercube
- A multidimensional mesh with exactly two processors in each dimension
- p = 2^d, where d is the dimension of the hypercube
- Disadvantage: the number of connections per processor increases rapidly with the dimension
- Examples: Intel iPSC Hypercube, nCUBE and SGI Origin

26 2.25 Static Processor Connectivity: Hypercube Characteristics
- Two processors are connected directly ONLY IF their binary labels differ by one bit
- In a d-dimensional hypercube each processor connects directly to d others
- A d-dimensional hypercube can be partitioned into two (d-1)-dimensional subcubes, etc
- The number of links in the shortest path between two processors is the Hamming distance between their labels
- The Hamming distance between two processors labeled s and t is the number of bits that are set in the binary representation of s XOR t, where XOR is the bitwise exclusive-or operation
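A minimal C sketch (mine; the example labels are chosen here, not taken from the slide) computing the hop count as the population count of s XOR t:

```c
#include <stdio.h>

/* Hops between hypercube nodes s and t = number of set bits in s^t. */
int hamming(unsigned s, unsigned t) {
    unsigned x = s ^ t;
    int bits = 0;
    while (x) { bits += x & 1u; x >>= 1; }
    return bits;
}

int main(void) {
    /* In a 3-cube: nodes 010 and 101 are 3 hops apart,
       nodes 011 and 110 are 2 hops apart. */
    printf("%d\n", hamming(2u, 5u));   /* prints 3 */
    printf("%d\n", hamming(3u, 6u));   /* prints 2 */
    return 0;
}
```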

27 2.26 Evaluating Static Interconnection Networks #1
Diameter
- The maximum distance between any two processors in the network
- The diameter directly affects worst-case communication time
Connectivity
- The multiplicity of paths between any two processors
- High connectivity is desirable as it minimizes contention
- Arc connectivity of the network: the minimum number of arcs that must be removed to break the network into two disconnected networks
  - 1 for linear arrays and binary trees
  - 2 for rings and 2-D meshes
  - 4 for a 2-D torus
  - d for d-dimensional hypercubes

28 2.27 Evaluating Static Interconnection Networks #2
Channel width
- The number of bits that can be communicated simultaneously over a link connecting two processors
Bisection width and bandwidth
- Bisection width is the minimum number of communication links that must be removed to partition the network into two equal halves
- Bisection bandwidth is the minimum volume of communication allowed between two halves of the network with equal numbers of processors
Cost
- Many criteria can be used; we will use the number of communication links or wires required by the network

29 2.28 Summary: Static Interconnection Characteristics

Network                Diameter             Bisection   Arc            Cost
                                            Width       Connectivity   (No. of Links)
Completely connected   1                    p^2/4       p-1            p(p-1)/2
Binary tree            2*log2((p+1)/2)      1           1              p-1
Linear array           p-1                  1           1              p-1
Ring                   floor(p/2)           2           2              p
2-D mesh               2(sqrt(p)-1)         sqrt(p)     2              2(p-sqrt(p))
2-D torus              2*floor(sqrt(p)/2)   2*sqrt(p)   4              2p
Hypercube              log2(p)              p/2         log2(p)        (p*log2(p))/2
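A small C sketch (mine; the formulas are transcribed from the table above) comparing a 2-D mesh with a hypercube at p = 64:

```c
#include <stdio.h>
#include <math.h>

/* Evaluate the summary-table formulas for a 2-D mesh and a hypercube. */
int main(void) {
    double p = 64.0;
    double rp = sqrt(p), lg = log2(p);
    printf("p = %.0f\n", p);
    printf("2-D mesh:  diameter %.0f, bisection %.0f, links %.0f\n",
           2.0 * (rp - 1.0), rp, 2.0 * (p - rp));
    printf("hypercube: diameter %.0f, bisection %.0f, links %.0f\n",
           lg, p / 2.0, p * lg / 2.0);
    return 0;
}
```

For p = 64 the hypercube more than halves the diameter (6 vs 14) and quadruples the bisection width (32 vs 8), at the price of more links (192 vs 112), which is the cost/performance trade-off the table summarizes.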
