COMP4300/8300: Overview of Parallel Hardware. Alistair Rendell
COMP4300/8300 Lecture 2. Copyright (c) 2015 The Australian National University.

2.1 Lecture Outline
Review of single processor design
- So we talk the same language
- Many things happen in parallel even on a single processor
- Identify potential issues for parallel hardware: why use 2 CPUs if you can double the speed of one!
Multiple processor design
- Hardware models
- Shared/distributed memory
- Hierarchical/flat memory
- Dynamic/static processor connectivity
- Evaluating static networks
- Routing mechanisms

2.2 The Processor
Performs:
- Floating point operations (FLOPS): add, mult, division (sqrt maybe!)
- Integer operations (MIPS): adds etc, also logical ops and instruction processing
- MIPS: Machine Instructions Per Second, historically measured against the very old VAX-11/780 - and anyway, what counts as a machine instruction differs between CPUs!
- Our primary focus will be on floating point operations
Clock:
- All operations take a fixed number of clock ticks to complete
- Clock speed is measured in GHz (10^9 cycles/second) or nsec (10^-9 seconds)
- Apple iPhone 6 ARM A8 1.4GHz (0.71ns); NCI Raijin Intel Xeon Sandy Bridge 2.6GHz (0.38ns); IBM zEC12 processor 5.5GHz (0.18ns)
- Clock speed is limited by etching feature size and the speed of light, hence the motivation for parallel (e.g. dual-core) systems
- (To my knowledge) the IBM zEC12 is the fastest commodity processor at 5.5GHz
- Light travels about 10cm in 0.32ns, and a chip is a few cm across!

2.3 Performance
FLOPS/sec   Prefix   Occurrence
10^3        kilo     very badly written code
10^6        mega     badly written code
10^9        giga     single-core
10^12       tera     multiple chip (NCI)
10^15       peta     23 machines in Top500 (Nov 2012, measured)
10^18       exa      around 2020!
Peak examples:
- PC 2.5GHz Core2 Quad: 4(cores) * 4(ops) * 2.5GHz = 40 GF
- Bunyip Pentium III: 96(nodes) * 2(sockets) * 1(op) * 550MHz = 105 GF
- NCI Raijin: 3592(nodes) * 2(sockets) * 8(cores) * 8(ops) * 2.6GHz = 1.19 PF
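The peak figures above are just products of those factors. A throwaway sketch (not course code) that recomputes two of them:

```c
#include <stdio.h>

/* Peak FLOP/s = nodes * sockets * cores * FLOPs-per-cycle * clock (Hz).
   The factors below are the ones quoted on the slide. */
int main(void) {
    double raijin = 3592.0 * 2 * 8 * 8 * 2.6e9;  /* NCI Raijin */
    double pc     = 1.0 * 1 * 4 * 4 * 2.5e9;     /* 2.5GHz Core2 Quad */
    printf("Raijin peak: %.2f PFLOP/s\n", raijin / 1e15);  /* ~1.19 */
    printf("PC peak:     %.1f GFLOP/s\n", pc / 1e9);       /* 40 */
    return 0;
}
```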
2.4 Adding Numbers
- Consider adding two double precision (8 byte) numbers: [sign | exponent | mantissa]
- Possible steps:
  1. Determine the largest exponent
  2. Normalize the smaller exponent to the larger
  3. Add the mantissas
  4. Renormalize the mantissa and exponent of the result
- Multiple steps each taking 1 tick implies 4 ticks per addition (FLOP)
- (A toy code sketch of these four steps appears after section 2.7 below)

2.5 Pipeline Operations #1
(Diagram: elements X(6)..X(2) waiting or moving through a 4-step pipeline, with X(1) done)
- X(1) takes 4 ticks to appear (startup latency)
- X(2) appears 1 tick after X(1)
- Asymptotically we achieve 1 result per clock tick
- The operation is said to be pipelined: the steps in the pipeline are running in parallel

2.6 Pipeline Operations #2
- Requires the same operation consecutively on different (independent) data items
  - good for vector operations
  - note limitations on chaining output data to input
- Tendency to increase the number of stages in a pipeline, since each shorter stage can run faster
  - the more stages in a pipeline, the greater the startup latency
  - UltraSPARC II has a 9 stage pipeline, UltraSPARC III a 14 stage pipeline
  - the Prescott Pentium 4 processor had a 31 stage pipeline
- Not all operations are pipelined, e.g. integer multiplication, division, sqrt
- Clock cycles for different operations on the Alpha EV6:
  Operation   Latency   Repeat
  +, -, *     4         1
  /, sqrt     (much larger, and not pipelined: the repeat interval is comparable to the latency)

2.7 Instruction Parallelism
- The hardware issues multiple instructions per clock cycle, executed in parallel on different parts of the chip
- Grouping rules restrict what can be done in parallel, e.g. UltraSPARC: 4 instructions drawn from 2*floating point, 2*integer, 1*load/store, 1*branch
(Diagram: independent multiply and addition units consuming inputs 1-3 and producing a result)
- Pentium III: single FLOP per cycle
- Opteron, UltraSPARC, Alpha: 2 (different) FLOPs per cycle
- Core2, Itanium2, IBM Power5: 4 (DP) FLOPs per cycle
- Xeon Sandy Bridge: 8 (DP) FLOPs per cycle
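As promised under 2.4, a toy sketch of the four addition steps. The decimal (mantissa, exponent) format below is hypothetical, chosen for readability; real hardware uses binary IEEE 754, but the align/add/renormalize flow is the same:

```c
#include <stdio.h>
#include <stdlib.h>

/* Toy float: value = mantissa * 10^exponent, mantissa kept below 1000
   (a made-up 3-digit decimal format, purely for illustration). */
typedef struct { long mantissa; int exponent; } Toy;

Toy toy_add(Toy a, Toy b) {
    /* Step 1: determine the largest exponent. */
    if (a.exponent < b.exponent) { Toy t = a; a = b; b = t; }
    /* Step 2: normalize the smaller operand to the larger exponent. */
    while (b.exponent < a.exponent) { b.mantissa /= 10; b.exponent++; }
    /* Step 3: add the mantissas. */
    Toy r = { a.mantissa + b.mantissa, a.exponent };
    /* Step 4: renormalize the result to fit the mantissa width. */
    while (labs(r.mantissa) >= 1000) { r.mantissa /= 10; r.exponent++; }
    return r;
}

int main(void) {
    Toy r = toy_add((Toy){95, -1}, (Toy){472, -2});   /* 9.5 + 4.72 */
    printf("%ld x 10^%d\n", r.mantissa, r.exponent);
    /* Prints 142 x 10^-1, i.e. 14.2: step 2 discarded a digit while
       aligning, which is exactly where rounding error comes from. */
    return 0;
}
```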
2.8 Memory Structure
- Consider DAXPY: Y(i) = a*X(i) + Y(i)
- If, theoretically, the CPU can perform the 2 FLOPs in 1 cycle, memory must deliver (load) two doubles (X(i) and Y(i), or 16 bytes) and store one (Y(i), 8 bytes) every clock cycle
- On a 1GHz system this implies 16GB/sec of load traffic and 8GB/sec of store traffic
- Typically a processor core can only issue one load OR store instruction in a clock cycle
- DDR3-SDRAM memory is available clocked at 1066MHz, with access times to match
- Latency and bandwidth are critical performance issues
  - caches: reduce latency and provide improved cache-to-CPU bandwidth
  - memory banks: improve bandwidth
- (A DAXPY code sketch appears after section 2.11 below)

2.9 Memory Hierarchy
- Main memory: large, cheap; large latency, small bandwidth
- Cache: small, fast, expensive; lower latency, higher bandwidth
- Gives a memory hierarchy, or Non-Uniform Memory Access (NUMA)
- Cache hit: data is in cache and received in a few cycles
- Cache miss: data is fetched from main memory (or a higher level cache)
- Try to ensure data is in cache (or as close to the CPU as possible): can we block the algorithm to minimize memory traffic?
- Cache is effective because algorithms often use data that are close in memory
- (Note: duplication of data in cache will have implications for parallel systems!)

2.10 Cache Mapping
- Blocks of main memory are mapped to a cache line
- A cache line is typically 32-128 bytes wide
- The mapping may be direct, or n-way associative
- The entire cache line is fetched from memory, not just one element
- Structure code to try to use an entire cache line of data
  - best to have unit stride
  - pointer chasing is very bad
(Diagram: blocks of main memory mapped alternately onto four cache lines)

2.11 Memory Banks
- Memory bandwidth is improved by having multiple parallel paths to/from memory
(Diagram: CPU connected to Banks 1-4 in parallel)
- The traditional solution used by vector processors
- High initial latency
- Good performance for unit stride
- Very bad performance if there is a bank conflict
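The DAXPY loop referenced in 2.8, written out in plain C; the comment tallies the per-iteration memory traffic the slide derives. A minimal sketch, not the course's code:

```c
#include <stddef.h>

/* DAXPY: y(i) = a*x(i) + y(i). Each iteration performs 2 FLOPs but
   needs two 8-byte loads (x[i], y[i]) and one 8-byte store (y[i]):
   24 bytes of memory traffic per 2 FLOPs, which is why this loop is
   memory-bound on any machine that can retire 2 FLOPs per cycle. */
void daxpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```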
2.12 Going Parallel
- Inevitably the performance of a single processor is limited by the clock speed
- Improved manufacturing increases the clock rate, but it is ultimately limited by the speed of light
- Superscalar execution allows multiple operations at once, but is not always applicable
- It's time to go parallel!
Hardware issues:
- Flynn's taxonomy of parallel processors
- SIMD/MIMD
- Shared/distributed memory
- Hierarchical/flat memory
- Dynamic/static processor connectivity
- Characteristics of static networks

2.13 Architecture Classification: Flynn's Taxonomy
Why classify?
- What kind of parallelism is employed?
- Which architecture has the best prospects for the future?
- What has already been achieved by current architecture types?
- Reveal configurations not yet considered by system architects
- Enable building of performance models
Flynn's taxonomy is based on the degree of parallelism, with 4 categories determined by the number of instruction and data streams:

                              Data Stream
                        Single            Multiple
Instruction   Single    SISD (1 CPU)      SIMD (array/vector)
Stream        Multiple  MISD (pipelined?) MIMD (multiple processors)

2.14 SIMD and MIMD
SIMD: Single Instruction Multiple Data
- Also known as data parallel processors or array processors
- Vector processors (to some extent)
- Current examples include SSE instructions, SPEs on the CellBE, GPUs
- NVIDIA's SIMT (T = Threads) is a slight variation
- (A small SSE sketch appears after section 2.15 below)
MIMD: Multiple Instruction Multiple Data
- Examples include a quad-core PC, and the octa-core Xeons on Raijin
(Diagram: SIMD - one global control unit driving many CPUs; MIMD - each CPU has its own control unit)

2.15 MIMD
- The most successful parallel model
- More general purpose than SIMD (e.g. the CM5 could emulate the CM2)
- Harder to program, as processors are not synchronized at the instruction level
Design issues for MIMD machines:
- Scheduling: efficient allocation of processors to tasks in a dynamic fashion
- Synchronization: prevent processors accessing the same data simultaneously
- Interconnection design: processor to memory and processor to processor interconnects; also the I/O network - often processors are dedicated to I/O devices
- Overhead: inevitably there is some overhead in coordinating activities between processors, e.g. resolving contention for resources
- Partitioning: identifying parallelism in algorithms that can exploit concurrent processing streams is non-trivial
(Aside - SPMD, Single Program Multiple Data: more restrictive than MIMD, implying that all processors run the same executable. Simplifies use of a shared address space.)
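Slide 2.14 cites SSE as a current SIMD example. A minimal sketch using SSE2 intrinsics, in which a single add instruction performs two double-precision additions at once (the data values are arbitrary):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdio.h>

int main(void) {
    double a[2] = {1.0, 2.0}, b[2] = {10.0, 20.0}, c[2];
    __m128d va = _mm_loadu_pd(a);     /* load two doubles into one register */
    __m128d vb = _mm_loadu_pd(b);
    __m128d vc = _mm_add_pd(va, vb);  /* ONE instruction, TWO additions */
    _mm_storeu_pd(c, vc);
    printf("%g %g\n", c[0], c[1]);    /* 11 22 */
    return 0;
}
```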
2.16 Address Space Organization: Message Passing
- Each processor has local or private memory
- Processors interact solely by message passing
- Commonly known as distributed memory machines
- Memory bandwidth scales with the number of processors
- Example: between nodes on the NCI Raijin system
- (An illustrative message-passing sketch appears after section 2.19 below)
(Diagram: processors, each with its own private memory, joined by an interconnect)

2.17 Address Space Organization: Shared Address Space
- Processors interact by modifying data objects stored in a shared address space
- Flat, uniform memory access (UMA)
- Scalability of memory bandwidth and of processor-processor communication is a problem
- Example: a dual/quad core PC (ignoring cache)
(Diagram: four processors all connected to four shared memories)

2.18 Non-Uniform Memory Access (NUMA)
- The machine includes some hierarchy in its memory structure
- All memory is local to the programmer (a single address space), but some memory takes longer to access than other memory
- Cache introduces one level of NUMA
- Example: between sockets on the NCI Raijin system, or in a multisocket Opteron system
(Diagram: processors with caches attached to nearby memories, with slower paths to remote memories)

2.19 Shared Address Space Access
- Parallel Random Access Machine (PRAM): an idealized model of any shared memory machine
- What happens when multiple processors try to read/write the same memory location at the same time?
PRAM models:
- Exclusive-read, exclusive-write (EREW) PRAM
- Concurrent-read, exclusive-write (CREW) PRAM
- Exclusive-read, concurrent-write (ERCW) PRAM
- Concurrent-read, concurrent-write (CRCW) PRAM
Concurrent read is OK, but concurrent write requires arbitration:
- Common: allowed if all values being written are identical
- Arbitrary: an arbitrary processor is allowed to proceed, the rest fail
- Priority: processors are organized into a predefined prioritized list; the processor with the highest priority succeeds, the rest fail
- Sum: the sum of all written quantities is stored
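To make the message-passing model of 2.16 concrete, a minimal sketch using MPI (the slide names no API; MPI is simply the standard one). Two processes with private memories exchange a value only via explicit messages:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double x = 0.0;  /* each rank has its own private copy of x */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        x = 3.14;
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* rank 1 cannot see rank 0's memory; the message is the only link */
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %g\n", x);
    }
    MPI_Finalize();
    return 0;
}
```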
2.20 Dynamic Connectivity: Crossbar
- A non-blocking network, in that connecting two processors does not block connections between other processors
- Complexity grows as O(p^2)
- May be used to connect processors each with their own local memory
(Diagram: crossbar switch between processors and memories)

2.21 Dynamic Connectivity: Multistage Networks
(Diagram: processors connected to memories through a switching network; the example shown is an omega network)
- An omega network consists of log2(p) stages, where p is the number of processors
- Let s and t be the binary representations of the message source and destination
- At stage 1: route straight through if the most significant bits of s and t are the same; crossover if they differ
- The process is repeated at each subsequent stage using the next most significant bit, etc
- (A routing sketch appears after section 2.23 below)

2.22 Dynamic Connectivity: Bus
- A processor gains exclusive access to the bus for some period
- The performance of the bus limits scalability
(Diagram: processors with caches and memories all attached to a single shared bus)
- Performance: Crossbar > Multistage > Bus
- Cost: Crossbar > Multistage > Bus

2.23 Static Connectivity: Complete, Mesh, Tree
- Completely connected (becomes very complex!)
- Linear array/ring, mesh/2D torus
- Tree (static if the internal nodes are processors, dynamic if they are switches)
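A sketch of the omega-network routing rule described in 2.21, comparing bits of the source and destination labels from the most significant bit downwards (the function and variable names are mine, not from the slides):

```c
#include <stdio.h>

/* At stage k, compare bit (d-1-k) of source s and destination t.
   Same bit: pass straight through the switch; different: cross over. */
void omega_route(unsigned s, unsigned t, int d /* d = log2 p */) {
    for (int k = 0; k < d; k++) {
        int bit = d - 1 - k;
        int sb = (s >> bit) & 1, tb = (t >> bit) & 1;
        printf("stage %d: %s\n", k + 1, sb == tb ? "through" : "crossover");
    }
}

int main(void) {
    omega_route(2 /* 010 */, 5 /* 101 */, 3);  /* 8-processor network */
    return 0;  /* all three bits differ, so every stage crosses over */
}
```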
2.24 Static Connectivity: Hypercube
- A multidimensional mesh with exactly two processors in each dimension
- p = 2^d, where d is the dimension of the hypercube
- Disadvantage: the number of connections per processor increases rapidly with d
- Examples: Intel iPSC Hypercube, nCUBE, SGI Origin

2.25 Static Connectivity: Hypercube Characteristics
- Two processors are connected directly ONLY IF their binary labels differ by one bit
- In a d-dimensional hypercube each processor connects directly to d others
- A d-dimensional hypercube can be partitioned into two (d-1)-dimensional subcubes, etc
- The number of links in the shortest path between two processors is the Hamming distance between their labels
- The Hamming distance between two processors labeled s and t is the number of bits that are set in the binary representation of s XOR t, where XOR is the bitwise exclusive or operation (e.g. the distance is 3 between labels 000 and 111)
- (A code sketch appears after section 2.27 below)

2.26 Evaluating Static Interconnection Networks #1
Diameter:
- The maximum distance between any two processors in the network
- Diameter directly determines communication time
Connectivity:
- The multiplicity of paths between any two processors
- High connectivity is desirable as it minimizes contention
- Arc connectivity of the network: the minimum number of arcs that must be removed to break the network into two disconnected networks
  - 1 for linear arrays and binary trees
  - 2 for rings and 2-D meshes
  - 4 for a 2-D torus
  - d for d-dimensional hypercubes

2.27 Evaluating Static Interconnection Networks #2
Channel width:
- The number of bits that can be communicated simultaneously over a link connecting two processors
Bisection width and bisection bandwidth:
- Bisection width is the minimum number of communication links that must be removed to partition the network into two equal halves
- Bisection bandwidth is the minimum volume of communication allowed between any two halves of the network with equal numbers of processors
Cost:
- Many criteria can be used; we will use the number of communication links or wires required by the network
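The Hamming-distance rule from 2.25 as a short function (a sketch; the helper name is mine):

```c
#include <stdio.h>

/* Shortest-path length between hypercube nodes s and t:
   the number of 1 bits in s XOR t. */
int hamming(unsigned s, unsigned t) {
    unsigned x = s ^ t;
    int d = 0;
    while (x) { d += x & 1; x >>= 1; }
    return d;
}

int main(void) {
    /* Nodes 000 and 111 of a 3-cube are 3 hops apart. */
    printf("%d\n", hamming(0, 7));  /* 3 */
    return 0;
}
```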
2.28 Summary of Static Interconnection Characteristics

Network                Diameter             Bisection Width   Arc Connectivity   Cost (No of Links)
Completely connected   1                    p^2/4             p-1                p(p-1)/2
Binary tree            2*log2((p+1)/2)      1                 1                  p-1
Linear array           p-1                  1                 1                  p-1
Ring                   floor(p/2)           2                 2                  p
2-D mesh               2(sqrt(p)-1)         sqrt(p)           2                  2(p-sqrt(p))
2-D torus              2*floor(sqrt(p)/2)   2*sqrt(p)         4                  2p
Hypercube              log2(p)              p/2               log2(p)            (p*log2(p))/2
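As a cross-check of the hypercube row, a throwaway sketch (assuming p is a power of two; compile with -lm):

```c
#include <math.h>
#include <stdio.h>

/* Recompute the hypercube row of the summary table for p = 2^d. */
int main(void) {
    int p = 64;                                  /* 64 = 2^6 processors */
    int d = (int)round(log2(p));
    printf("diameter     = %d\n", d);            /* log2 p        = 6   */
    printf("bisection    = %d\n", p / 2);        /* p/2           = 32  */
    printf("connectivity = %d\n", d);            /* log2 p        = 6   */
    printf("cost (links) = %d\n", p * d / 2);    /* (p log2 p)/2  = 192 */
    return 0;
}
```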