Cray XE6 Performance Workshop
Modern HPC Architectures
David Henty d.henty@epcc.ed.ac.uk
EPCC, University of Edinburgh

Overview
- Components
- History
- Flynn's Taxonomy: SIMD and MIMD
- Classification via memory: Distributed Memory, Shared Memory, Clusters
- Summary
Building Blocks of Parallel Machines
- Processors to calculate
- Memory for temporary storage of data
- Interconnect so processors can talk to each other and the outside world
- Storage: disks and tapes for long-term archiving of data
- These are the basic components, but how do we put them together?

Processors
- Most are RISC architecture (Reduced Instruction Set Computer): simplify instructions to maximise speed
- Calculations performed on values in registers: separate integer and floating-point registers
- Loading and storing from memory must be done explicitly
- a = b + c is not an atomic operation: it involves 2 loads, an addition and a store
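The load/add/store decomposition of a = b + c can be seen even without RISC assembly: Python's standard `dis` module shows the interpreter's own separate load, add and store steps for that one source line. A minimal sketch (instruction names vary between Python versions, so only the load/add/store pattern is checked):

```python
import dis

def add(b, c):
    a = b + c  # one source line, several machine-level steps
    return a

# Collect the operation names generated for the function body.
ops = [ins.opname for ins in dis.get_instructions(add)]
print(ops)  # contains load, binary-add and store operations
```

The exact opcodes differ by interpreter version (e.g. BINARY_ADD vs BINARY_OP), but the loads of b and c, the addition, and the store into a always appear as distinct steps, just as on a RISC processor.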
Clock Speed
- Rate at which instructions are issued: modern chips are around 2-3 GHz
- Integer and floating-point calculations done in parallel
- Can also have multiple issue, e.g. simultaneous add and multiply
- Whole series of hardware innovations: pipelining, out-of-order execution, speculative computation...
- Details become important for top performance; most features are fairly generic

Moore's Law
- CPU power doubles every 24 months (strictly speaking, applies to transistor density)
- Held true for ~35 years now: maybe self-fulfilling?
- People have predicted its demise many times, but it hasn't happened yet
- Increases in power are due to increases in parallelism as well as in clock rate:
  - fine-grain parallelism (pipelining)
  - medium-grain parallelism (hardware multithreading)
  - coarse-grain parallelism (multiple processors on a chip)
- First two seem to be (almost) exhausted: main trend is now towards multicore
Memory
- Memory speed is often the limiting factor for HPC applications: keeping the CPU fed with data is the key to performance
- Memory is a substantial contributor to the cost of systems: typical HPC systems have a few GBytes of memory per processor; technically possible to have much more than this, but it is too expensive and power-hungry
- Basic characteristics:
  - latency: how long you have to wait for data to arrive
  - bandwidth: how fast it actually comes in
  - ballpark figures: 100s of nanoseconds and a few GBytes/s

Cache memory
- Memory latencies are very long: 100s of processor cycles
- Fetching data from main memory is 2 orders of magnitude slower than doing arithmetic
- Solution: introduce cache memory
  - much faster than main memory...
  - but much smaller than main memory
  - keeps copies of recently used data
- Modern systems use a hierarchy of two or three levels of cache
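The latency and bandwidth characteristics above combine into a simple transfer-time model: time = latency + bytes/bandwidth. A minimal sketch, using the slide's ballpark figures (200 ns and 4 GB/s are illustrative values, not any specific machine):

```python
# Simple latency/bandwidth model of a memory transfer:
# time = latency + bytes / bandwidth.
LATENCY_S = 200e-9        # ~100s of nanoseconds (illustrative)
BANDWIDTH_B_PER_S = 4e9   # a few GBytes/s (illustrative)

def transfer_time(nbytes):
    """Estimated seconds to fetch nbytes from main memory."""
    return LATENCY_S + nbytes / BANDWIDTH_B_PER_S

# A single 8-byte double is dominated by latency: the 2 ns of actual
# transfer is tiny next to the 200 ns wait...
print(transfer_time(8))
# ...while a 1 MB block is dominated by bandwidth.
print(transfer_time(1024 * 1024))
```

This is why caches keep copies of recently used data: paying the latency once per cache line, rather than once per word, is the whole game.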
Memory hierarchy (speed and cost fall, capacity grows, moving down):

  Access time   Level           Capacity
  1 cycle       CPU Registers   ~1 KB
  2-3 cycles    L1 Cache        ~100 KB
  ~20 cycles    L2 Cache        ~1-10 MB
  ~50 cycles    L3 Cache        ~10-50 MB
  ~300 cycles   Main Memory     ~1 GB

Serial v Parallel Computers
- Serial computers are easier to program than parallel computers, but there are limits on single-processor performance:
  - physical: speed of light, uncertainty principle
  - practical: design, manufacture
- Parallel computers dominate HPC because they allow the highest performance and they are more cost effective
- Achieving good performance requires high-quality algorithms, decomposition and programming
Flynn's Taxonomy
- Classification of architectures by instruction stream and data stream:
  - SISD: Single Instruction Single Data (serial machines)
  - MISD: Multiple Instructions Single Data ((probably) no real examples)
  - SIMD: Single Instruction Multiple Data
  - MIMD: Multiple Instructions Multiple Data

SIMD Architecture
- Single Instruction Multiple Data
- Every processor synchronously executes the same instructions on different data
- Instructions issued by a front-end
- Each processor has its own memory where it keeps its data
- Processors can communicate with each other
- Usually thousands of simple processors
- Examples: DAP, MasPar, CM-200
SIMD Architecture (schematic): a front-end issues instructions to an array of processors connected by a network, with peripherals attached to the front-end.

MIMD Architecture
- Multiple Instructions Multiple Data
- Several independent processors capable of executing separate programs
- Subdivided by the relationship between processors and memory
Distributed Memory
- MIMD-DM: each processor has its own local memory
- Processors connected by some interconnect mechanism
- Processors communicate via explicit message passing (effectively sending emails to each other)
- Highly scalable architecture: allows Massively Parallel Processing (MPP)
- Examples: Cray XE, IBM BlueGene, workstation/PC clusters (Beowulf)

Distributed Memory (schematic): processors, each with local memory, attached to a common interconnect.
Distributed Memory
- Processors behave like distinct workstations: each runs its own copy of the operating system; no interaction except via the interconnect
- Pros:
  - adding processors increases memory bandwidth
  - can grow to almost any size
- Cons:
  - scalability relies on a good interconnect
  - jobs are placed by the user and remain on the same processors
  - potential for high system-management overhead

Shared Memory
- MIMD-SM: each processor has access to a global memory store
- Communication via writes/reads to memory
- Caches are automatically kept up-to-date, or coherent
- Simple to program (no explicit communications)
- Scaling is difficult because of the memory-access bottleneck
- Usually modest numbers of processors
Symmetric MultiProcessing
- Each processor in an SMP has equal access to all parts of memory: same latency and bandwidth
- Processors connected to memory by a bus
- Examples: IBM servers, Sun HPC Servers, multicore PCs

Shared Memory
- Looks like a single machine to the user: a single operating system covers all the processors; the OS automatically moves jobs around the CPU cores
- Pros:
  - simple to use and maintain
  - CC-NUMA architectures allow scaling to 100s of CPUs
- Cons:
  - potential problems with simultaneous access to memory
  - sophisticated hardware required to maintain cache coherency
  - scalability ultimately limited by this
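The "no explicit communications" and "simultaneous access" points above can be sketched with threads, which simply read and write the same variables; in this minimal Python sketch threads stand in for processors, and the lock plays the coordination role that coherency hardware and synchronisation play on a real SMP:

```python
import threading

counter = 0                 # shared data: every "processor" sees it
lock = threading.Lock()

def work(n):
    """Each thread updates the shared counter directly --
    no messages, just reads and writes to common memory."""
    global counter
    for _ in range(n):
        with lock:          # without this, simultaneous access to the
            counter += 1    # same location can lose updates

threads = [threading.Thread(target=work, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000: correct only because access is coordinated
```

The convenience is real (no send/receive calls anywhere), but so is the hazard: shared data that is updated concurrently must be protected, which is the software face of the cache-coherency cost mentioned above.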
Shared Memory Clusters (schematic): several shared-memory nodes connected by an interconnect.

Shared Memory Clusters
- Technology pyramid: HPC systems at the top, SMP servers in the middle, workstations at the base
- The pyramid encouraged clustering of SMP nodes, i.e. the top-end nodes are the mid-range systems
- Recent trend towards multicore processors: both low-end clusters and custom HPC systems now have SMP nodes
Shared Memory Clusters
- Combine features of two architectures:
  - shared memory within a node
  - distributed memory between nodes
- Pros:
  - constructed as a standard distributed-memory machine, but with more powerful nodes
- Cons:
  - may be hard to take advantage of the mixed architecture
  - more complicated to understand performance: a combination of interconnect and memory-system behaviour
- Examples: clusters of Intel servers, Bull machines, all modern PC clusters

HECToR: Cray XE6
- Built from 16-core AMD Interlagos CPUs, each a mini 16-way SMP with internal bus
- A bespoke Cray interconnect: essentially a high-end SMP cluster, but the network is a 3D torus, not a switch
(Node schematic: 6.4 GB/sec direct-connect HyperTransport; 2-8 GB main memory; Cray SeaStar2+ interconnect; 12.8 GB/sec direct-connect memory (DDR 800).)

HECToR System Specifications cont.
- Cray XE6 parallel processors
- 2816 compute nodes, each containing two AMD 2.3 GHz 16-core Opteron processors => 90,112 cores
- Theoretical peak of 827 Tflops
- 32 GB main memory per node, shared between the 32 cores => total memory of 90 TB
- 10 login nodes
- Gemini interconnect
- 12 I/O nodes
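The headline figures on this slide are easy to cross-check from the per-node numbers; a small sketch using only the values quoted above:

```python
# Cross-checking the HECToR figures quoted above.
nodes = 2816
procs_per_node = 2
cores_per_proc = 16

cores = nodes * procs_per_node * cores_per_proc
print(cores)                  # 90112, matching the quoted core count

mem_per_node_gb = 32          # shared between each node's 32 cores
total_mem_gb = nodes * mem_per_node_gb
print(total_mem_gb)           # 90112 GB, i.e. ~90 TB as quoted
```

This kind of back-of-the-envelope arithmetic (cores per node times nodes, memory per node times nodes) is the standard way to sanity-check any machine's published specifications.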
Summary
- Flynn's taxonomy looks somewhat dated: SIMD likely to remain a niche market
- Wide variety of memory architectures for MIMD: need to sub-classify by memory
- Many parallel systems based on commodity microprocessors, or clusters of SMPs, providing leverage with commercial products
- Parallel architectures appear to be the present and future of HPC

Message Passing Model
- The message-passing model is based on the notion of processes: can think of a process as an instance of a running program, together with the program's data
- In the message-passing model, parallelism is achieved by having many processes co-operate on the same task
- Each process has access only to its own data
- Processes communicate with each other by sending and receiving messages
Process Communication (schematic):
  Process 1 runs: a = 23; Send(2, a)
  Process 2 runs: Recv(1, b); a = b + 1
  Afterwards process 1 still has a = 23, while process 2 has received b = 23 and computed a = 24.

Quantifying Performance
- Serial computing is concerned with complexity: how execution time varies with problem size N
  - adding two arrays (or vectors) is O(N)
  - matrix times vector is O(N^2), matrix-matrix is O(N^3)
- Look for clever algorithms:
  - naive sort is O(N^2)
  - divide-and-conquer approaches are O(N log N)
- Parallel computing is also concerned with scaling: how time varies with the number of processors
  - different algorithms can have different scaling behaviour
  - but always remember that we are interested in minimum time!
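The Send/Recv exchange in the Process Communication diagram above can be sketched in a few lines; here threads with a queue stand in for the processes and the interconnect (a self-contained sketch of the model, not an MPI binding), and the Send/Recv names follow the slide:

```python
import threading
import queue

link = queue.Queue()   # stands in for the interconnect
result = {}            # so we can inspect each side's data afterwards

def process_1():
    a = 23
    link.put(a)            # Send(2, a): a *copy* of the data is sent
    result["p1_a"] = a     # process 1's own data is unchanged: still 23

def process_2():
    b = link.get()         # Recv(1, b): blocks until the message arrives
    a = b + 1
    result["p2_b"] = b     # 23, the received copy
    result["p2_a"] = a     # 24

t1 = threading.Thread(target=process_1)
t2 = threading.Thread(target=process_2)
t2.start(); t1.start()     # order doesn't matter: Recv waits for Send
t1.join(); t2.join()
print(result)
```

Note the two key properties of the model: the receive blocks until a matching send occurs, and only copies travel, so each process's own data stays private.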
Performance Measures
- T(N,P) is the time for a problem of size N on P processors
- Speedup: S(N,P) = T(N,1) / T(N,P); typically S(N,P) < P
- Parallel Efficiency: E(N,P) = S(N,P) / P; typically E(N,P) < 1
- Serial Efficiency: E(N) = Tbest(N) / T(N,1), where Tbest is the time for the best serial algorithm; typically E(N) <= 1

The Serial Component
- Amdahl's law: "the performance improvement to be gained by parallelisation is limited by the proportion of the code which is serial" (Gene Amdahl, 1967)
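The speedup and parallel-efficiency definitions translate directly into code; a minimal sketch (the timing values are made up for illustration):

```python
def speedup(t1, tp):
    """S(N,P) = T(N,1) / T(N,P)."""
    return t1 / tp

def parallel_efficiency(t1, tp, p):
    """E(N,P) = S(N,P) / P."""
    return speedup(t1, tp) / p

# Illustrative numbers: 100 s serially, 30 s on 4 processors.
s = speedup(100.0, 30.0)
e = parallel_efficiency(100.0, 30.0, 4)
print(s, e)   # speedup of ~3.33, efficiency of ~0.83
```

As the slide notes, the typical case is s < P and e < 1: the 4 processors here deliver a 3.33x speedup, not 4x.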
Amdahl's law
- Assume a fraction a of the code is completely serial; time is the sum of the serial and potentially parallel parts
- Parallel time, with the parallel part 100% efficient:
    T(N,P) = a T(N,1) + (1 - a) T(N,1) / P
- Parallel speedup:
    S(N,P) = T(N,1) / T(N,P) = P / (a P + 1 - a)
- For a = 0, S = P as expected (i.e. E = 100%)
- Otherwise, speedup is limited by 1/a for any P: impossible to effectively utilise large parallel machines?

Gustafson's Law
- Need larger problems for larger numbers of CPUs
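The Amdahl speedup formula just derived is a one-liner; a minimal sketch showing how quickly the 1/a ceiling bites:

```python
def amdahl_speedup(a, p):
    """S(N,P) = P / (a*P + 1 - a) for serial fraction a on p processors."""
    return p / (a * p + 1.0 - a)

print(amdahl_speedup(0.0, 512))   # 512.0: perfect scaling when nothing is serial
print(amdahl_speedup(0.1, 512))   # ~9.83: already close to the 1/a = 10 ceiling
```

Even a 10% serial fraction caps the speedup below 10, no matter how many processors are thrown at the problem, which is the pessimistic conclusion the slide flags.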
Utilising Large Parallel Machines
- Assume the parallel part is O(N) and the serial part is O(1):
    time: T(N,P) = a T(1,1) + (1 - a) N T(1,1) / P
    speedup: S(N,P) = (a + (1 - a) N) / (a + (1 - a) N / P)
- Scale the problem size with the number of CPUs, i.e. set N = P:
    speedup: S(P,P) = a + (1 - a) P
    efficiency: E(P,P) = a/P + (1 - a)
- Maintains constant efficiency (1 - a) for large P

Scaling
(Figure: real speed-up graph, speed-up vs number of PEs up to ~300, with the "actual" curve turning over away from the "linear" ideal.)
- Improving the load balance or the algorithm moves the turn-over to a higher number of processors
- Better scaling = the ability to utilise larger computers
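The scaled (Gustafson) speedup and efficiency derived above make a striking contrast with the fixed-size Amdahl picture; a minimal sketch using the same 10% serial fraction:

```python
def scaled_speedup(a, p):
    """Gustafson: S(P,P) = a + (1 - a)*P when problem size scales with P."""
    return a + (1.0 - a) * p

def scaled_efficiency(a, p):
    """E(P,P) = S/P = a/P + (1 - a), approaching (1 - a) for large P."""
    return scaled_speedup(a, p) / p

print(scaled_speedup(0.1, 512))     # 460.9: far beyond Amdahl's 1/a = 10 ceiling
print(scaled_efficiency(0.1, 512))  # ~0.900, near the (1 - a) = 0.9 limit
```

The difference is entirely in what is held fixed: Amdahl fixes the problem size as P grows, Gustafson grows the problem with P, so the serial O(1) part shrinks as a fraction of the total work.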
Summary
- Useful definitions: speed-up, efficiency
- Amdahl's Law: the performance improvement to be gained by parallelisation is limited by the proportion of the code which is serial
- Gustafson's Law: to maintain constant efficiency we need to scale the problem size with the number of CPUs