COMP4300/8300 L2-3: Classical Parallel Hardware
Overview; Processor Performance; The Processor; Adding Numbers; Review of Single Processor Design
Overview: Classical Parallel Hardware

Review of Single Processor Design
- so we talk the same language
- many things happen in parallel even on a single processor
- identify potential issues for parallel hardware: why use 2 CPUs if you can double the speed of one?

Multiple Processor Design
- architectural classifications
- shared/distributed memory
- hierarchical/flat memory
- dynamic/static processor connectivity
- evaluating static networks
- case study: the NCI Raijin supercomputer

Processor Performance

FLOP/s   Prefix   Occurrence
10^3     kilo     very badly written code
10^6     mega     badly written code
10^9     giga     single core
10^12    tera     multiple chips (NCI)
10^15    peta     23 machines in the Top500 (Nov 2012, measured)
10^18    exa      around 2020!

Examples (peak rates):
- PC, 2.5 GHz Core2 Quad: 4 (cores) x 4 (ops) x 2.5 GHz = 40 GFLOP/s
- Bunyip, Pentium III cluster: 96 (nodes) x 2 (sockets) x 1 (op) x 550 MHz = 105 GFLOP/s
- NCI Raijin: 3592 (nodes) x 2 (sockets) x 8 (cores) x 8 (ops) x 2.6 GHz = 1.19 PFLOP/s

The Processor

Performs:
- floating point operations (FLOPS): add, multiply, division (sqrt maybe!)
- integer operations (MIPS): adds etc., also logical ops and instruction processing
- MIPS = Machine Instructions Per Second (originally relative to a very old VAX 11/780)
- but what counts as a machine instruction differs between CPUs!
- our primary focus will be on floating point operations

The clock:
- all ops take a fixed number of clock ticks to complete
- clock speed is measured in GHz (10^9 cycles/second), i.e. cycle times in nsec (10^-9 seconds)
- Apple iPhone 6 ARM A8: 1.4 GHz (0.71 ns); NCI Raijin Intel Xeon Sandy Bridge: 2.6 GHz (0.38 ns); IBM zEC12: 5.5 GHz (0.18 ns)
- clock speed is limited by etching feature size and the speed of light, which motivates parallelism
- (to my knowledge) the IBM zEC12 is the fastest commodity processor at 5.5 GHz
- light travels about 1 m in 3.2 ns, i.e. only a few cm per clock tick, and a chip is a few cm across!

COMP4300/8300 L2-3: Classical Parallel Hardware
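The peak-rate examples above (PC, Bunyip, Raijin) are just products of the hardware parameters. A minimal sketch (my own illustration; the function name and units are not from the slides):

```python
def peak_flops(nodes, sockets, cores, ops_per_cycle, ghz):
    """Theoretical peak FLOP/s: every unit issues its maximum ops every cycle."""
    return nodes * sockets * cores * ops_per_cycle * ghz * 1e9

# The three systems from the slide.
pc     = peak_flops(1,    1, 4, 4, 2.5)    # Core2 Quad -> 40 GFLOP/s
bunyip = peak_flops(96,   2, 1, 1, 0.55)   # Pentium III cluster -> ~105 GFLOP/s
raijin = peak_flops(3592, 2, 8, 8, 2.6)    # NCI Raijin -> ~1.19 PFLOP/s
```

Remember these are peak figures; no real code sustains them.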
Adding Numbers

Consider adding two double precision (8 byte) numbers, stored as [sign | exponent | significand].

Possible steps:
1. determine the largest exponent
2. normalize the significand of the number with the smaller exponent to the larger
3. add the significands
4. renormalize the significand and exponent of the result

Multiple steps, each taking 1 tick, implies 4 ticks per addition (FLOP).
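The four steps can be mimicked in software. A sketch using Python's `math.frexp`/`math.ldexp` to expose each number as significand x 2^exponent (an illustration only; real hardware works on raw bit fields):

```python
import math

def fp_add(a, b):
    """Add two floats following the slide's four steps."""
    sa, ea = math.frexp(a)          # a = sa * 2**ea, with 0.5 <= |sa| < 1
    sb, eb = math.frexp(b)
    e = max(ea, eb)                 # 1. determine the largest exponent
    sa = math.ldexp(sa, ea - e)     # 2. normalize the significand of the
    sb = math.ldexp(sb, eb - e)     #    smaller exponent to the larger
    s = sa + sb                     # 3. add the significands
    s, de = math.frexp(s)           # 4. renormalize significand and exponent
    return math.ldexp(s, e + de)
```

Each of the four steps is a candidate pipeline stage, which is what the next slide exploits.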
Operation Pipelining

[diagram: a 4-step pipeline with items X(6)..X(1) moving from Waiting through the pipeline steps to Done]

- X(1) takes 4 ticks to appear (startup latency); X(2) appears 1 tick after X(1)
- asymptotically achieves 1 result per clock tick
- the operation (X) is said to be pipelined: the steps in the pipeline run in parallel
- requires the same op applied consecutively to different (independent) data items
- good for vector operations (note the limitations on chaining output data to input)
- not all operations are pipelined, e.g. integer multiply, division, sqrt
- e.g. Alpha EV6: +, -, * have latency 4 and repeat rate 1; / and sqrt are not pipelined

Pipelining: Dependent Instructions

- principle: the CPU must ensure the result is the same as if there were no pipelining (or parallelism)
- instructions requiring only 1 cycle in the EX stage (e.g. simple register-to-register ops) can be handled by pipeline feedback from the EX stage to the next cycle
- (important) instructions requiring c cycles for the EX stage (i.e. f.p. *, +, load, store) are normally implemented by having c EX stages; this requires any dependent instruction to be delayed by c cycles
- e.g. c = 3: [diagram: instructions I0-I3 each passing through FI DI FO EX1 EX2 EX3 WB, staggered]

Instruction Pipelining

- break instruction execution into k stages; can get k-way parallelism (generally, the circuitry for each stage is independent)
- e.g. (k = 5) stages: FI = Fetch Instruction, DI = Decode Instruction, FO = Fetch Operand, EX = Execute Instruction, WB = Write Back

  (branch): FI DI FO EX WB
  (guess):     FI DI FO EX WB
  (guess):        FI DI FO EX WB
  (guess):           FI DI FO EX WB
  (sure):               FI DI FO EX WB

- note: the FO & WB stages may involve memory accesses (and may stall the pipeline)
- conditional branch instructions are problematic: a wrong guess may require flushing the succeeding instructions from the pipeline
- tendency is to increase the number of stages, but this increases the startup latency, e.g.
number of stages: UltraSPARC II (9), UltraSPARC III (14), Intel Prescott (31)

Superscalar (Multiple Instruction Issue)

- a small number (w) of instructions are scheduled by the hardware to execute together
- groups must have an appropriate instruction mix, e.g. UltraSPARC (w = 4): up to 2 different floating point, 1 load/store, 1 branch, 2 integer/logical instructions per group
- gives w-way parallelism over different types of instructions
- generally requires: multiple (w) instruction fetches, and an extra grouping (G) stage in the pipeline
- problem: amplifies dependencies and other problems of pipelining by a factor of w!
- issue: the instruction mix must be balanced for maximum performance, i.e. floating point * and + must be balanced
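The startup-latency behaviour of a pipeline, and the effect of issuing w instructions per tick, can be captured in a toy timing model. `stages=4` matches the 4-tick addition example; everything else is my own illustration:

```python
def pipeline_ticks(n, stages=4, width=1):
    """Ticks to produce n results on a pipelined unit: the first result
    takes `stages` ticks (startup latency), after which up to `width`
    results complete per tick."""
    if n == 0:
        return 0
    return stages + (n - 1 + width - 1) // width   # ceiling division

# X(1) appears at tick 4, X(6) at tick 9 -- as in the slide's diagram.
# For large n the rate approaches width results per tick.
```

The model also shows why deep pipelines (e.g. Prescott's 31 stages) hurt short runs: the startup term dominates until n is large.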
Review: Instruction Level Parallelism

- instruction level parallelism (ILP): pipelining and superscalar together offer kw-way parallelism
- branch prediction alleviates the issue of conditional branches: uses a table recording the results of recently-taken branches
- out-of-order execution alleviates the issue of dependencies: pulls fetched instructions into a buffer of size W (W >= w) and executes them in any order provided dependencies are not violated; must keep track of all in-flight instructions and associated registers (O(W^2) area and power!)
- in most situations, the compiler can do as good a job as a human at exposing this parallelism (ILP was part of the "Free Lunch")
- an exception: how to get an UltraSPARC III to execute 1 f.p. multiply (and add!) per cycle

Cache Memory

- memory hierarchy: Main Memory (large, cheap, large latency / small bandwidth) -> Cache (small, fast, expensive, lower latency / higher bandwidth) -> CPU Registers
- cache hit: data is in cache and received in a few cycles
- cache miss: data must be fetched from main memory (or a higher-level cache)
- try to ensure data is in cache (or as close to the CPU as possible): can we block the algorithm to minimize memory traffic?
- cache memory is effective because algorithms often use data that was recently accessed from memory (temporal locality) or was close to other recently accessed data (spatial locality)
- Note: duplication of data in the cache will have implications for parallel systems!
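The blocking question above can be explored with a toy cache model. The line size, capacity, and access patterns below are assumptions chosen to make the effect visible, not measurements of any real machine:

```python
from collections import OrderedDict

LINE = 8          # doubles per cache line (assumed 64-byte lines)
CAPACITY = 64     # lines the toy cache can hold

def misses(addresses):
    """Count misses in a fully associative LRU cache of CAPACITY lines."""
    cache = OrderedDict()
    n = 0
    for a in addresses:
        line = a // LINE
        if line in cache:
            cache.move_to_end(line)        # refresh LRU position
        else:
            n += 1
            cache[line] = True
            if len(cache) > CAPACITY:
                cache.popitem(last=False)  # evict least-recently-used line
    return n

N = 128
# Column-major walk of a row-major N x N array: stride N, poor spatial locality.
naive = misses(i * N + j for j in range(N) for i in range(N))
# Same accesses visited in B x B blocks: each fetched line is fully reused.
B = 16
blocked = misses(i * N + j
                 for jb in range(0, N, B) for ib in range(0, N, B)
                 for j in range(jb, jb + B) for i in range(ib, ib + B))
```

With these parameters the naive walk misses on every access, while blocking cuts traffic by the line-length factor of 8.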
Memory Structure

Consider the DAXPY computation: y(i) = a*x(i) + y(i)
- if the CPU can theoretically perform the 2 FLOPs in 1 cycle, the memory system must load two doubles (x(i) and y(i), 16 bytes) and store one (y(i), 8 bytes) each clock cycle
- on a 1 GHz system this implies 16 GB/s load traffic and 8 GB/s store traffic
- typically, a processor core can only issue one load OR store instruction in a clock cycle
- DDR3 SDRAM is available clocked at 1066 MHz, with access times to match
- memory latency and bandwidth are critical performance issues
- caches: reduce latency and provide improved cache-to-CPU bandwidth
- multiple memory banks: improve bandwidth (by parallel access)

Cache Mapping

- blocks of main memory are mapped to a cache line; cache lines are typically 32-128 bytes wide
- the mapping may be direct, or n-way associative
- [diagram: 2-way associative mapping, main memory blocks mapping to cache lines 1-4, each line able to hold either of two alternative blocks]
- the entire cache line is fetched from memory, not just one element (why?)
- try to structure code to use an entire cache line of data; best to have unit stride
- pointer chasing is very bad! (large address strides guaranteed!)
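The 16 GB/s and 8 GB/s figures follow directly from the operand sizes; a check of the arithmetic:

```python
BYTES_PER_DOUBLE = 8
CLOCK_HZ = 1e9   # the 1 GHz system from the slide

# Per DAXPY iteration y(i) = a*x(i) + y(i): load x(i) and y(i), store y(i).
load_bytes_per_cycle  = 2 * BYTES_PER_DOUBLE   # 16 bytes
store_bytes_per_cycle = 1 * BYTES_PER_DOUBLE   #  8 bytes

load_traffic  = load_bytes_per_cycle  * CLOCK_HZ   # -> 16 GB/s
store_traffic = store_bytes_per_cycle * CLOCK_HZ   # ->  8 GB/s
```

That is 24 GB/s total for a core that can issue at most one memory operation per cycle, which is why DAXPY is memory-bound.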
Memory Banks

- memory bandwidth can be improved by having multiple parallel paths to/from memory, organized in b banks
- [diagram: CPU connected in parallel to Banks 1-4]
- traditional solution used by vector processors
- good performance for unit stride; very bad performance if bank conflicts occur
- worst case: power-of-2 strides, where b is a power of 2

Going Parallel

- inevitably the performance of a single processor is limited by the clock speed
- improved manufacturing increases the clock, but it is ultimately limited by the speed of light
- ILP allows multiple ops at once, but is limited
- it's time to go parallel!

Hardware issues:
- Flynn's taxonomy of parallel processors: SIMD/MIMD
- shared/distributed memory
- hierarchical/flat memory
- dynamic/static processor connectivity
- characteristics of static networks

Parallelism in Memory Access

- Dual Inline Memory Module (DIMM)
- a processor may have 2 or 4 DIMMs (i.e. b = 16, 32), depending on its lowest-level cache line size
- each DRAM chip access is pipelined (select row address, select column address, access byte)

Architecture Classification: Flynn's Taxonomy

Why classify?
- what kind of parallelism is employed? which architecture has the best prospects for the future?
- what has already been achieved by current architecture types?
- reveal configurations that have not yet been considered by system architects
- enable building of performance models

Flynn's taxonomy is based on the degree of parallelism, with 4 categories determined by the number of instruction and data streams:

                            Single Data Stream    Multiple Data Streams
  Single Instruction Stream   SISD (1 CPU)          SIMD (array/vector processor)
  Multiple Instruction Stream MISD (pipelined?)     MIMD (multiple processor)
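The bank-conflict worst case described above can be illustrated by counting which banks a strided access pattern touches. The bank-assignment rule below (consecutive elements interleaved across banks) is the usual one, but it is an assumption here:

```python
def banks_hit(b, stride, n=1024):
    """Distinct banks touched by n accesses with the given stride,
    assuming access i goes to bank (i * stride) % b."""
    return len({(i * stride) % b for i in range(n)})

unit     = banks_hit(8, 1)   # unit stride: all 8 banks -> full bandwidth
conflict = banks_hit(8, 8)   # power-of-2 stride equal to b: 1 bank only
```

With 8 banks, a stride of 8 serializes every access onto a single bank, giving 1/8 of the available bandwidth.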
SIMD and MIMD

SIMD: Single Instruction Multiple Data
- also known as data parallel processors or array processors; vector processors (to some extent)
- current examples include SSE instructions, SPEs on the CellBE, GPUs
- NVIDIA's SIMT (T = Threads) is a slight variation

MIMD: Multiple Instruction Multiple Data
- examples include quad-core PCs, octa-core Xeons on Raijin
- most successful parallel model; more general purpose than SIMD (e.g. the CM5 could emulate the CM2)
- harder to program, as processors are not synchronized at the instruction level

Address Space Organization: Message Passing

- each processor has local or private memory; processors interact solely by message passing
- commonly known as distributed memory parallel computers
- memory bandwidth scales with the number of processors
- example: between nodes on the NCI Raijin system
- [diagram: processors P, each with private memory, connected by a network]

Design issues for MIMD machines:
- scheduling: efficient allocation of processors to tasks in a dynamic fashion
- synchronization: prevent processors accessing the same data simultaneously
- interconnect design: processor-to-memory and processor-to-processor interconnects; also the I/O network (often processors are dedicated to I/O devices)
- overhead: inevitably there is some overhead associated with coordinating activities between processors, e.g.
resolving contention for resources
- partitioning: identifying parallelism in algorithms that is capable of exploiting concurrent processing streams is non-trivial

MIMD Address Space Organization: Shared Address Space

- processors interact by modifying data objects stored in a shared address space
- the simplest solution is a flat or uniform memory access (UMA) memory
- scalability of memory bandwidth and processor-processor communications (arising from cache line transfers) are problems; so is synchronizing access to shared data objects
- example: dual/quad core PC (ignoring cache)
- (Aside: SPMD, Single Program Multiple Data: more restrictive than MIMD, implying that all processors run the same executable. Simplifies use of a shared address space.)
- [diagram: processors P all connected to a single shared memory]
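A small sketch of the synchronization issue in a shared address space: several threads updating one shared counter, serialized with a lock. Python threads stand in for processors here; this is an illustration, not how a real machine arbitrates memory:

```python
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    """Increment the shared counter n times; the lock serializes the
    read-modify-write sequence so no updates are lost."""
    global counter
    for _ in range(n):
        with lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter is now exactly 40_000; without the lock, concurrent
# read-modify-write sequences could interleave and lose updates.
```

The lock is exactly the "synchronizing access to shared data objects" cost the slide warns about: correctness is preserved, but the processors serialize.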
Non-Uniform Memory Access (NUMA)

- the machine includes some hierarchy in its memory structure
- all memory is visible to the programmer (single address space), but some memory takes longer to access than other memory
- in this sense, cache introduces one level of NUMA
- between sockets on the NCI Raijin system, or in a multi-socket Opteron system (some ANU research on an Opteron (2011))

Dynamic Processor Connectivity: Crossbar

- a non-blocking network: the connection of two processors does not block connections between other processors
- complexity grows as O(p^2)
- may be used to connect processors each with their own local memory
- may also be used to connect processors with various memory banks, e.g. the 8 cores on an UltraSPARC T2 access the 8 banks of its level 2 cache
- [diagram: crossbar switch connecting processors P to memory banks]

Shared Address Space Access

- Parallel Random Access Machine (PRAM): an idealized model of any shared memory machine
- what happens when multiple processors try to read/write the same memory location at the same time?
PRAM models:
- Exclusive-read, exclusive-write (EREW PRAM)
- Concurrent-read, exclusive-write (CREW PRAM)
- Exclusive-read, concurrent-write (ERCW PRAM)
- Concurrent-read, concurrent-write (CRCW PRAM)

Concurrent read is OK, but concurrent write requires arbitration:
- Common: allowed if all values being written are identical
- Arbitrary: an arbitrary processor is allowed to proceed and the rest fail
- Priority: processors are organized into a predefined prioritized list; the processor with the highest priority succeeds and the rest fail
- Sum: the sum of all the quantities is written

Dynamic Processor Connectivity: Multi-staged Networks

- consist of lg p stages, where p is the number of processors (lg p = log2(p))
- s and t are the binary representations of the message source and destination
- at stage 1: route straight through if the most significant bits of s and t are the same; otherwise, crossover
- the process repeats at each subsequent stage using the next most significant bit, etc.
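The bit-by-bit routing rule above can be sketched as follows. This is a simplified model that records only each stage's switch setting, not the full network wiring:

```python
def route(s, t, p):
    """Switch settings, stage by stage, for a message from source s to
    destination t in a p-processor multi-staged network (p a power of 2):
    at stage k, pass straight through if bit k (counting from the most
    significant) of s and t agree, otherwise cross over."""
    stages = p.bit_length() - 1            # lg p stages
    settings = []
    for k in range(stages):
        bit = stages - 1 - k               # most significant bit first
        same = (s >> bit) & 1 == (t >> bit) & 1
        settings.append("through" if same else "cross")
    return settings

# p = 8: routing from processor 0b010 to 0b110 crosses at the first
# stage (top bits differ) and passes through the remaining two.
```

Note that, unlike a crossbar, such a network is blocking: two simultaneous messages can demand conflicting settings of the same switch.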
Dynamic Processor Connectivity: Bus

- a processor gains exclusive access to the bus for some period
- the performance of the bus limits scalability
- Performance: crossbar > multistage > bus; Cost: crossbar > multistage > bus
- Discussion point: which would you use for small, medium and large numbers of processors (p)? What values of p would (roughly) be the cutoff points?

Static Processor Connectivity: Hypercube

- a multidimensional mesh with exactly two processors in each dimension
- p = 2^d, where d is the dimension of the hypercube
- advantage: at most d hops between any two processors
- disadvantage: the number of connections per processor is d
- can be generalized to k-ary d-cubes, e.g. a 4x4 torus is a 4-ary 2-cube
- examples: Intel iPSC Hypercube, NCube, SGI Origin, Cray T3D, Tofu

Static Processor Connectivity: Complete, Mesh, Tree

- completely connected (becomes very complex!)
- linear array/ring, mesh/2D torus
- tree (static if the nodes are processors)

Static Processor Connectivity: Hypercube Characteristics

- two processors are connected directly ONLY IF their binary labels differ by one bit
- in a d-dimensional hypercube, each processor directly connects to d others
- a d-dimensional hypercube can be partitioned into two (d-1)-dimensional sub-cubes, etc.
- the number of links in the shortest path between two processors is the Hamming distance between their labels
- the Hamming distance between two processors labeled s and t is the number of bits that are on in the binary representation of s XOR t, where XOR is the bitwise exclusive or operation
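The Hamming-distance rule is one line of code:

```python
def hamming(s, t):
    """Hops between hypercube nodes s and t: count the 1 bits in s XOR t."""
    return bin(s ^ t).count("1")

# In a 3-cube (d = 3), node 0b000 to 0b111 is 3 hops (the diameter, d),
# while 0b101 and 0b100 are direct neighbours: labels differ in one bit.
```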
Evaluating Static Interconnection Networks #1

Diameter:
- the maximum distance between any two processors in the network
- directly determines communication time (latency)

Connectivity:
- the multiplicity of paths between any two processors
- high connectivity is desirable as it minimizes contention (and also enhances fault-tolerance)
- arc connectivity of the network: the minimum number of arcs that must be removed to break the network into two disconnected networks
  - 1 for linear arrays and binary trees
  - 2 for rings and 2D meshes
  - 4 for a 2D torus
  - d for d-dimensional hypercubes

Summary: Static Interconnection Characteristics

  Network               Diameter         Bisection width  Arc connectivity  Cost (no. of links)
  Completely-connected  1                p^2/4            p-1               p(p-1)/2
  Binary tree           2 lg((p+1)/2)    1                1                 p-1
  Linear array          p-1              1                1                 p-1
  Ring                  floor(p/2)       2                2                 p
  2D mesh               2(sqrt(p)-1)     sqrt(p)          2                 2(p-sqrt(p))
  2D torus              2 floor(sqrt(p)/2)  2 sqrt(p)     4                 2p
  Hypercube             lg p             p/2              lg p              (p lg p)/2

Note: the binary tree suffers from a bottleneck: all traffic between the left and right sub-trees must pass through the root. The fat tree interconnect alleviates this:
- usually, only the leaf nodes contain a processor
- it can easily be partitioned into sub-networks without degraded performance (useful for supercomputers)

Discussion point: which would you use for small, medium and large numbers of processors (p)? What values of p would (roughly) be the cutoff points?
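The table's formulas can be checked mechanically. A sketch (it assumes p fits the topology, e.g. a power of two for the hypercube and a perfect square for mesh/torus):

```python
import math

def metrics(network, p):
    """(diameter, bisection width, arc connectivity, cost in links) for the
    topologies in the summary table."""
    lg = int(math.log2(p))
    r = math.isqrt(p)                   # sqrt(p) for mesh/torus
    return {
        "complete":     (1, p * p // 4, p - 1, p * (p - 1) // 2),
        "linear array": (p - 1, 1, 1, p - 1),
        "ring":         (p // 2, 2, 2, p),
        "2d mesh":      (2 * (r - 1), r, 2, 2 * (p - r)),
        "2d torus":     (2 * (r // 2), 2 * r, 4, 2 * p),
        "hypercube":    (lg, p // 2, lg, p * lg // 2),
    }[network]
```

For example, a 64-processor hypercube has diameter 6 and 192 links, while a 64-processor ring needs only 64 links but has diameter 32: exactly the cost/latency trade-off the discussion point asks about.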
Evaluating Static Interconnection Networks #2

Channel width:
- the number of bits that can be communicated simultaneously over a link connecting two processors

Bisection width and bandwidth:
- bisection width: the minimum number of communication links that have to be removed to partition the network into two equal halves
- bisection bandwidth: the minimum volume of communication allowed between any two halves of the network with equal numbers of processors

Cost:
- many criteria can be used; we will use the number of communication links or wires required by the network

NCI's Raijin: A Petascale Supercomputer

- 57,472 cores (dual socket, 8-core Intel Xeon Sandy Bridge, 2.6 GHz) in 3592 compute nodes
- approx. 160 TBytes of main memory
- Mellanox Infiniband FDR interconnect (52 km of cable); interconnects: ring (cores), full (sockets), fat tree (nodes)
- approx. 10 PBytes of usable fast filesystem (for short-term scratch space, apps, home directories)
- power: 1.5 MW max. load; cooling systems: 100 tonnes of water
- 24th fastest in the world at debut (November 2012); first petaflop system in Australia
- fastest (parallel) filesystem in the southern hemisphere
- custom monitoring and deployment; custom Linux kernel; highly customised PBS Pro scheduler
NCI's Integrated High Performance Environment

[diagram: NCI's integrated high performance environment]

Further Reading: Parallel Hardware

- "The Free Lunch Is Over!"
- Ch 1 of Introduction to Parallel Computing
- Ch 1, 2 of Principles of Parallel Programming
- lecture notes from Calvin Lin: "A Success Story: ISA", "Parallel Architectures"
COMP4300/8300: Overview of Parallel Hardware. Alistair Rendell. COMP4300/8300 Lecture 2-1. Copyright (c) 2015 The Australian National University.
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationParallel Processing & Multicore computers
Lecture 11 Parallel Processing & Multicore computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing Parallel Processor Architectures Flynn s taxonomy from 1972 (Sta09 Fig 17.1)
More informationMultiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.
Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network
More informationParallel Numerics, WT 2013/ Introduction
Parallel Numerics, WT 2013/2014 1 Introduction page 1 of 122 Scope Revise standard numerical methods considering parallel computations! Required knowledge Numerics Parallel Programming Graphs Literature
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationCOMPUTER ORGANIZATION AND DESI
COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler
More informationCS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS
CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight
More informationNon-Uniform Memory Access (NUMA) Architecture and Multicomputers
Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico February 29, 2016 CPD
More informationComputer Organization. Chapter 16
William Stallings Computer Organization and Architecture t Chapter 16 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data
More informationParallel Architectures
Parallel Architectures Instructor: Tsung-Che Chiang tcchiang@ieee.org Department of Science and Information Engineering National Taiwan Normal University Introduction In the roughly three decades between
More informationPipeline and Vector Processing 1. Parallel Processing SISD SIMD MISD & MIMD
Pipeline and Vector Processing 1. Parallel Processing Parallel processing is a term used to denote a large class of techniques that are used to provide simultaneous data-processing tasks for the purpose
More informationCOMP Parallel Computing. SMM (1) Memory Hierarchies and Shared Memory
COMP 633 - Parallel Computing Lecture 6 September 6, 2018 SMM (1) Memory Hierarchies and Shared Memory 1 Topics Memory systems organization caches and the memory hierarchy influence of the memory hierarchy
More informationCSE Introduction to Parallel Processing. Chapter 4. Models of Parallel Processing
Dr Izadi CSE-4533 Introduction to Parallel Processing Chapter 4 Models of Parallel Processing Elaborate on the taxonomy of parallel processing from chapter Introduce abstract models of shared and distributed
More informationNon-Uniform Memory Access (NUMA) Architecture and Multicomputers
Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico September 26, 2011 CPD
More informationMultiprocessors & Thread Level Parallelism
Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationPerformance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.
Performance of Computer Systems CSE 586 Computer Architecture Review Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp Performance metrics Use (weighted) arithmetic means for execution
More informationCS Parallel Algorithms in Scientific Computing
CS 775 - arallel Algorithms in Scientific Computing arallel Architectures January 2, 2004 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan
More informationTypes of Parallel Computers
slides1-22 Two principal types: Types of Parallel Computers Shared memory multiprocessor Distributed memory multicomputer slides1-23 Shared Memory Multiprocessor Conventional Computer slides1-24 Consists
More informationOrganisasi Sistem Komputer
LOGO Organisasi Sistem Komputer OSK 14 Parallel Processing Pendidikan Teknik Elektronika FT UNY Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple
More informationLecture 9: MIMD Architectures
Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected
More informationCOSC 6374 Parallel Computation. Parallel Computer Architectures
OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Spring 2010 Flynn s Taxonomy SISD:
More informationProcessor Performance and Parallelism Y. K. Malaiya
Processor Performance and Parallelism Y. K. Malaiya Processor Execution time The time taken by a program to execute is the product of n Number of machine instructions executed n Number of clock cycles
More informationPortland State University ECE 588/688. Cray-1 and Cray T3E
Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector
More informationTDT 4260 lecture 3 spring semester 2015
1 TDT 4260 lecture 3 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU http://research.idi.ntnu.no/multicore 2 Lecture overview Repetition Chap.1: Performance,
More informationOnline Course Evaluation. What we will do in the last week?
Online Course Evaluation Please fill in the online form The link will expire on April 30 (next Monday) So far 10 students have filled in the online form Thank you if you completed it. 1 What we will do
More informationMULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationNon-Uniform Memory Access (NUMA) Architecture and Multicomputers
Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing MSc in Information Systems and Computer Engineering DEA in Computational Engineering Department of Computer
More information3/24/2014 BIT 325 PARALLEL PROCESSING ASSESSMENT. Lecture Notes:
BIT 325 PARALLEL PROCESSING ASSESSMENT CA 40% TESTS 30% PRESENTATIONS 10% EXAM 60% CLASS TIME TABLE SYLLUBUS & RECOMMENDED BOOKS Parallel processing Overview Clarification of parallel machines Some General
More informationCOSC 6374 Parallel Computation. Parallel Computer Architectures
OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Edgar Gabriel Fall 2015 Flynn s Taxonomy
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationParallel Processing. Computer Architecture. Computer Architecture. Outline. Multiple Processor Organization
Computer Architecture Computer Architecture Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr nizamettinaydin@gmail.com Parallel Processing http://www.yildiz.edu.tr/~naydin 1 2 Outline Multiple Processor
More informationPipelining and Vector Processing
Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline
More informationInterconnection Networks
Lecture 17: Interconnection Networks Parallel Computer Architecture and Programming A comment on web site comments It is okay to make a comment on a slide/topic that has already been commented on. In fact
More informationLecture 9: MIMD Architecture
Lecture 9: MIMD Architecture Introduction and classification Symmetric multiprocessors NUMA architecture Cluster machines Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is
More informationComputer organization by G. Naveen kumar, Asst Prof, C.S.E Department 1
Pipelining and Vector Processing Parallel Processing: The term parallel processing indicates that the system is able to perform several operations in a single time. Now we will elaborate the scenario,
More informationParallel Programming Platforms
arallel rogramming latforms Ananth Grama Computing Research Institute and Department of Computer Sciences, urdue University ayg@cspurdueedu http://wwwcspurdueedu/people/ayg Reference: Introduction to arallel
More informationENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design
ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationAdvanced Parallel Architecture. Annalisa Massini /2017
Advanced Parallel Architecture Annalisa Massini - 2016/2017 References Advanced Computer Architecture and Parallel Processing H. El-Rewini, M. Abd-El-Barr, John Wiley and Sons, 2005 Parallel computing
More informationLecture 9: MIMD Architectures
Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction A set of general purpose processors is connected together.
More informationComplexity and Advanced Algorithms. Introduction to Parallel Algorithms
Complexity and Advanced Algorithms Introduction to Parallel Algorithms Why Parallel Computing? Save time, resources, memory,... Who is using it? Academia Industry Government Individuals? Two practical
More informationTools and techniques for optimization and debugging. Fabio Affinito October 2015
Tools and techniques for optimization and debugging Fabio Affinito October 2015 Fundamentals of computer architecture Serial architectures Introducing the CPU It s a complex, modular object, made of different
More informationCS4961 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/30/11. Administrative UPDATE. Mary Hall August 30, 2011
CS4961 Parallel Programming Lecture 3: Introduction to Parallel Architectures Administrative UPDATE Nikhil office hours: - Monday, 2-3 PM, MEB 3115 Desk #12 - Lab hours on Tuesday afternoons during programming
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More information