Modern computer architecture: From multicore to petaflops
1 Modern computer architecture From multicore to petaflops
2 Motivation: Multi-cores where and why
3 Introduction: Moore's law. Intel Sandy Bridge EP: 2.3 billion transistors; NVIDIA Fermi: 3 billion transistors. 1965: G. Moore claimed that the number of transistors on a microchip doubles every 24 months. Computer Architecture 3
4 Introduction: Moore's law → faster cycles and beyond. Moore's law → transistors are getting smaller → run them faster → faster clock speed → higher throughput (Ops/s). [Figure: Intel x86 clock speed, frequency (MHz) over year.] Increasing transistor count and clock speed allows / requires architectural changes: pipelining, superscalarity, SIMD / vector ops, multi-core/threading, complex on-chip caches.
5 Welcome to the multi-/many-core era. The game is over, but Moore's law continues (by courtesy of D. Vrsalovic, Intel): over-clocked (+20%): 1.13x performance at 1.73x power (N transistors); max frequency: 1.00x performance at 1.00x power; dual-core (-20% clock): 1.73x performance at 1.02x power (2N transistors). Power envelope: Max W. Power consumption: P = f * (V_core)^2, and since the minimum V_core depends on f, at the same process technology P ~ f^3.
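The power and performance figures on this slide follow from the P ~ f^3 rule. A minimal sketch; the assumption that a 20% frequency change maps to a ~13% performance change is inferred from the 1.13x number above, not stated explicitly:

```python
def rel_power(freq_scale, n_cores=1):
    # P = f * V_core^2 with minimum V_core ~ f  =>  P ~ f^3 per core
    return n_cores * freq_scale ** 3

# Over-clocking a single core by 20% costs almost the cube in power:
print(round(rel_power(1.2), 2))              # 1.73x power for only 1.13x performance

# Two cores (2N transistors) under-clocked by 20% stay in the power envelope:
print(round(rel_power(0.8, n_cores=2), 2))   # 1.02x power

# Assuming ~13% performance per 20% frequency, the dual-core delivers
# about 2 * 0.87, i.e. the ~1.73x performance quoted on the slide:
print(round(2 * 0.87, 2))
```

This is the whole argument for multi-core: cubing hurts a single fast core far more than doubling slightly slower cores.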
6 Multi-core: Intel Xeon 2600 (2012). Xeon 2600 Sandy Bridge EP : 8 cores running at 2.7 GHz (max. 3.2 GHz); Simultaneous Multithreading reports as 16-way chip; 2.3 billion transistors / 32 nm; die size: 435 mm^2; 2-socket server.
7 From UMA to ccNUMA: basic architecture of commodity compute cluster nodes. Yesterday (2006): dual-socket Intel Core2 node: Uniform Memory Architecture (UMA); flat memory ; symmetric multiprocessing; but: system anisotropy. Today: dual-socket Intel (Westmere) node: cache-coherent Non-Uniform Memory Architecture (ccNUMA). HT / QPI provide scalable bandwidth at the price of ccNUMA architectures: where does my data finally end up? On AMD it is even more complicated: ccNUMA within a socket!
8 Back to the 2-chip-per-case age: 12-core AMD Magny-Cours, a 2x6-core ccNUMA socket. AMD: single-socket ccNUMA since Magny-Cours. 1 socket: 12-core Magny-Cours built from two 6-core chips → 2 NUMA domains; 2-socket server → 4 NUMA domains; 4-socket server → 8 NUMA domains. WHY? Shared resources are hard to scale: 2 x 2 memory channels vs. 1 x 4 memory channels per socket.
9 Current AMD design: AMD Interlagos / Bulldozer. Up to 16 cores (8 Bulldozer modules) in a single socket; max. 2.6 GHz (+ Turbo Core): P_max = (2.6 x 8 x 8) GF/s = 166.4 GF/s. Each Bulldozer module: 2 lightweight cores; 1 FPU: 4 MULT & 4 ADD (double precision) per cycle; supports AVX; supports FMA4. 16 kB L1D cache per core, 2048 kB shared L2 cache per module, 8 (6) MB shared L3 cache. 2 DDR3 (shared) memory channels, > 15 GB/s; 2 NUMA domains per socket.
10 Cray XE6 Interlagos 32-core dual-socket node. Two 8-(integer-)core chips per socket (2.3 GHz turbo); separate DDR3 memory interface per chip → ccNUMA on the socket! Shared FP unit per pair of integer cores ( module ): 256-bit FP unit, SSE4.2, AVX, FMA4. 16 kB L1 data cache per core; 2 MB L2 cache per module; 8 MB L3 cache per chip (6 MB usable).
11 The x86 multicore evolution so far: Intel single-/dual-/quad-/hexa-cores (one-socket view), at approximately constant clock speed. 2005: fake dual-core (two cores communicating via the chipset). 2006: true dual-core (Woodcrest, Core2 Duo, 65 nm; later Harpertown, Core2 Quad, 45 nm). 2008: Simultaneous Multi-Threading (SMT) and on-chip memory interface (Nehalem EP, Core i7, 45 nm). 2010: 6-core chip (Westmere EP, Core i7, 32 nm). 2012: wider SIMD units, AVX: 256 bit (Sandy Bridge EP, Core i7, 32 nm). [Figure: one-socket diagrams showing cores (P), SMT threads (T0, ...), chipset or memory interface (MI), memory, and links to the other socket.]
12 There is no single driving force for chip performance! Floating-point (FP) performance: P = n_core * F * S * ν. n_core, number of cores: 8; F, FP instructions per cycle: 2 (1 MULT and 1 ADD); S, FP ops per instruction: 4 (dp) / 8 (sp) (256-bit SIMD registers, AVX ); ν, clock speed: 2.7 GHz. (Intel Xeon Sandy Bridge EP socket; 4-, 6- and 8-core variants available.) P = 173 GF/s (dp) / 346 GF/s (sp), which would have been TOP500 rank 1 in 1995. But: P = 5.4 GF/s (dp) for serial, non-SIMD code.
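The formula can be checked in a few lines; function and parameter names are mine, the numbers are the slide's:

```python
def peak_gflops(n_core, fp_inst_per_cycle, ops_per_inst, clock_ghz):
    # P = n_core * F * S * nu  (result in GF/s when the clock is in GHz)
    return n_core * fp_inst_per_cycle * ops_per_inst * clock_ghz

# Sandy Bridge EP socket with AVX:
print(peak_gflops(8, 2, 4, 2.7))   # ~172.8 GF/s double precision
print(peak_gflops(8, 2, 8, 2.7))   # ~345.6 GF/s single precision

# Serial, non-SIMD code: one core, one FP operand per instruction:
print(peak_gflops(1, 2, 1, 2.7))   # ~5.4 GF/s
```

The 32x gap between the last number and the first is exactly the point of the slide: most of the peak comes from parallelism (cores and SIMD lanes), not from the clock.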
13 Specifications of the NVIDIA Fermi GPU. 14 multiprocessors (MP), each with: 32 processors (SP) driven by Single Instruction Multiple Data (SIMD) / Single Instruction Multiple Thread (SIMT); explicit in-order architecture; 32 K registers; 48 KB of local on-chip memory; 1st- and 2nd-level cache hierarchy; clock rate of 1.15 GHz; 1030 GFLOP/s (single precision) / 515 GFLOP/s (double precision). Up to 6 GB of global memory (DRAM): 1500 MHz DDR, 384-bit bus, global gather/scatter, 144 GB/s bandwidth; 16 GB/s PCIe 2.0 x16 (bidirectional). [Table: clock (MHz), peak (GFLOP/s), memory (GB), memory clock (MHz), memory interface (bit) and memory bandwidth (GB/s), compared for Tesla, GeForce GTX, GeForce 8800 GTX and the host (Westmere).] September 2012, Parallel multi- and manycore programming 13
14 Trading single-thread performance for parallelism: GPGPUs vs. CPUs. GPU vs. CPU light-speed estimate: 1. compute bound: 2-5x; 2. memory bandwidth: 1-5x.
                     Intel Core i (Sandy Bridge) | Intel Xeon E DP node (Sandy Bridge) | NVIDIA C2070 (Fermi)
Cores@clock:         3.3 GHz                     | 2 x 2.7 GHz                         | 1.1 GHz
Performance+/core:   52.8 GFlop/s                | 43.2 GFlop/s                        | 2.2 GFlop/s
Threads@stream:      <4                          | <16                                 | >8000
Total performance+:                              | 691 GFlop/s                         | 1,000 GFlop/s
Stream BW:           18 GB/s                     | 2 x 36 GB/s                         | 90 GB/s (ECC=1)
Transistors / TDP:   1 billion* / 95 W           | 2 x (2.27 billion / 130 W)          | 3 billion / 238 W
+ single precision; * includes on-chip GPU and PCI-Express; complete compute device.
15 Parallelism in a modern compute node: parallel and shared resources within a shared-memory node (two sockets, GPU #1 and GPU #2 on PCIe links, other I/O). Parallel resources: execution/SIMD units (1), cores (2), inner cache levels (3), sockets / memory domains (4), multiple accelerators (5). Shared resources: outer cache level per socket (6), memory bus per socket (7), intersocket link (8), PCIe bus(es) (9), other I/O resources (10). How does your application react to all of those details?
16 Distributed-memory computers & hybrid systems
17 Parallel distributed-memory computers: Basics. Pure distributed-memory parallel computer: each processor P is connected to exclusive local memory (MM) and a network interface (NI); a (dedicated) communication network connects all nodes. No global cache-coherent shared address space: No Remote Memory Access (NORMA). Data exchange between nodes: passing messages via the network ( message passing ). Some architectures provide limited remote memory access to speed up message passing, e.g. through a global NON-COHERENT address space (NUMA). Prototype of the first PC clusters: node: single-core CPU PC; network: Ethernet. First Massively Parallel Processing architectures: Cray T3D/E, Intel Paragon.
18 Parallel distributed-memory computers: Hybrid systems. Standard concept of most modern large parallel computers: hybrid/hierarchical. Compute nodes are 2- or 4-socket shared-memory nodes with an NI; a communication network (GBit, InfiniBand) connects the nodes. Price / (peak) performance is optimal, but network capability / (peak) performance gets worse. Parallel programming? Pure message passing is the standard; hybrid programming? Today, GPUs / accelerators are added to the nodes to further increase complexity. [Figure: distributed-memory parallel computer built from shared-memory nodes.]
19 Networks What are the basic ideas and performance characteristics of modern networks?
20 Networks: basic performance characteristics. Evaluate the network's capability to transfer data using the same idea as for main memory access: the total transfer time for a message of N bytes is T = T_L + N/B, where T_L is the latency (transfer setup time [sec]) and B is the asymptotic (N → ∞) network bandwidth [MBytes/sec]. Consider the simplest case ( Ping-Pong ): two processors in different nodes communicate via the network ( point-to-point ); a single message of N bytes is sent forward and backward, so the overall data transfer is 2N bytes!
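The model is easy to explore numerically. A sketch; the GigE-like parameter values are illustrative assumptions, not measurements:

```python
def transfer_time(n_bytes, t_lat, bandwidth):
    # T = T_L + N/B for a single message of N bytes
    return t_lat + n_bytes / bandwidth

def effective_bandwidth(n_bytes, t_lat, bandwidth):
    # B_eff = N / (T_L + N/B)
    return n_bytes / transfer_time(n_bytes, t_lat, bandwidth)

T_L = 50e-6   # assumed latency: 50 microseconds
B = 111e6     # assumed asymptotic bandwidth: 111 MBytes/s

# Small messages are latency-dominated; large ones approach B:
for n in (1_000, 100_000, 10_000_000):
    print(n, round(effective_bandwidth(n, T_L, B) / 1e6, 1), "MBytes/s")
```

With these numbers, a 1 kB message reaches only a small fraction of B, while a 10 MB message is essentially bandwidth-limited.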
21 Networks: basic performance characteristics. Ping-Pong benchmark (pseudo-code):
  myid = get_process_id()
  if (myid .eq. 0) then
    targetID = 1
    S = get_walltime()
    call Send_message(buffer, N, targetID)
    call Receive_message(buffer, N, targetID)
    E = get_walltime()
    MBYTES = 2*N/(E-S)/1.d6   ! eff. BW: MBytes/sec rate
    TIME = (E-S)/2*1.d6       ! transfer time in microsecs for single message
  else
    targetID = 0
    call Receive_message(buffer, N, targetID)
    call Send_message(buffer, N, targetID)
  endif
Effective BW: B_eff = N / (T_L + N/B)
22 Networks: basic performance characteristics. Ping-Pong benchmark for a GBit-Ethernet (GigE) network: B_eff = 2*N/(E-S)/1.d6. N_1/2: message size where 50% of the peak bandwidth is achieved. Asymptotic bandwidth B = 111 MBytes/sec (≈ 0.89 GBit/s). Latency (N → 0): only qualitative agreement: 44 µs vs. 76 µs.
23 Networks: basic performance characteristics. Ping-Pong benchmark for a DDR InfiniBand (DDR-IB) network: determine B and T_L independently and combine them.
24 Networks: basic performance characteristics. First-principles modeling of B_eff(N) provides good qualitative results, but the quantitative description, in particular of the latency-dominated region (small N), may fail because: overhead for transmission protocols, e.g. message headers; minimum frame size for message transmission, e.g. TCP/IP over Ethernet always transfers frames with N > 1; message setup/initialization involves multiple software layers and protocols, each software layer adds to latency, and the hardware-only latency is often small; as the message size increases, the software may switch to a different protocol, e.g. from eager to rendezvous. Typical message sizes in applications are neither small nor large, so the N_1/2 value is also important: N_1/2 = B * T_L. Network balance: relate the network bandwidth (B or B_eff(N_1/2)) to the compute power (or main memory bandwidth) of the nodes.
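That N_1/2 = B * T_L really marks the 50% point follows directly from the model; a small check, with latency and bandwidth values assumed for illustration:

```python
T_L = 50e-6   # assumed latency [s]
B = 111e6     # assumed asymptotic bandwidth [bytes/s]

n_half = B * T_L                      # N_1/2 = B * T_L  (~5550 bytes here)
b_eff = n_half / (T_L + n_half / B)   # B_eff evaluated at N = N_1/2

print(b_eff / B)   # ~0.5: exactly half the asymptotic bandwidth
```

Algebraically: B_eff(N_1/2) = B*T_L / (T_L + T_L) = B/2, independent of the particular B and T_L.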
25 Latency and bandwidth in modern computer environments. [Figure: latencies from ns to ms and bandwidths around 1 GB/s and beyond across the data paths of a modern system.]
26 Networks: Topologies & bisection bandwidth. Network bisection bandwidth B_b is a general metric for the data-transfer capability of a system: the minimum sum of the bandwidths of all connections cut when splitting the system into two equal parts. A more meaningful metric when comparing systems is the bisection BW per core or per node, B_b/N. Bisection BW depends on: bandwidth per link; network topology; uni- or bi-directional bandwidth?!
27 Network topologies: Bus. A bus can be used by one connection at a time; bandwidth is shared among all devices. Bisection BW is constant, so B_b/N ~ 1/N. Collision detection and bus arbitration protocols must be in place. Examples: PCI bus, memory bus of multi-core chips, diagnostic buses, internal ring bus of the Cell processor. Advantages: low latency; easy to implement. Disadvantages: shared bandwidth, not scalable; problems with failure resiliency (one defective agent may block the bus); fast buses for large N require large signal power.
28 Network topologies: Switches and fat trees. Standard clusters are built with switched networks: compute nodes ( devices ) are split up into groups, and each group is connected to a single (small) non-blocking crossbar switch ( leaf switches ). Leaf switches are connected with each other using an additional switch hierarchy ( spine switches ) or directly (for small configurations). In switched networks the distance between any two devices is heterogeneous (number of hops in the switch hierarchy). Diameter of a network: the maximum number of hops required to connect two arbitrary devices; example: diameter of a bus = 1. Perfect world: fully non-blocking, i.e. any choice of N/2 disjoint device pairs can communicate at full speed.
29 Non-blocking crossbar. A non-blocking crossbar can mediate a number of connections between a group of input and a group of output elements; built from 2x2 switching elements, it can be used as a 4-port non-blocking switch (fold at the diagonal). Switches can be cascaded to form hierarchies (the common case). Crossbars can also be used directly as interconnects in computer systems. Example: scalable UMA memory access (NEC SX); (historic) example: Hitachi SR8000.
30 Fat-tree switch hierarchies. Fully non-blocking: N/2 end-to-end connections with full bandwidth B; B_b = B * N/2, so B_b/N = const. = B/2. Sounds good, but see the next slide. Oversubscribed: the spine does not support N/2 full-bandwidth end-to-end connections; B_b/N = const. = B/(2k), where k is the oversubscription factor (k = 3 in the sketch); intelligent resource management is crucial. [Figure: leaf-switch and spine-switch levels.]
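The per-node bisection bandwidth of both variants fits in one function; the link bandwidth value below is an assumption for illustration:

```python
def fat_tree_bb_per_node(link_bw, k=1):
    # Fully non-blocking fat tree (k = 1): B_b/N = B/2.
    # Spine oversubscribed by factor k:    B_b/N = B/(2k).
    return link_bw / (2 * k)

B = 4.0  # assumed per-link bandwidth in GB/s

print(fat_tree_bb_per_node(B))        # 2.0 GB/s per node, non-blocking
print(fat_tree_bb_per_node(B, k=3))   # ~0.67 GB/s per node at 3x oversubscription
```

The key property is that B_b/N stays constant as the machine grows; oversubscription only scales that constant down by k.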
31 Fat trees and static routing. If all end-to-end data paths are preconfigured ( static routing ), not all possible combinations of N agents will get full bandwidth. Example: the pattern 2→6, 3→7 is collision-free here; changing 2→6, 3→7 to 2→7, 3→6 gives collisions if no other connections are re-routed at the same time. Static routing is still a quasi-standard in commodity interconnects; however, things are slowly starting to improve.
32 Full fat tree: 288-port IB DDR switch. Basic building blocks: 24-port switches. Spine switch level: 12 switches; leaf switch level: 24 switches with 24*12 = 288 ports to devices.
33 Fat-tree networks: examples. Ethernet: 1 GBit/s and 10 GBit/s variants; 41% of all Top500 entries (June 2012). InfiniBand: dominant high-performance commodity interconnect (42% of Top500 entries); SDR: 10 GBit/s per link and direction (10 bits/byte); DDR: 20 GBit/s per link and direction (building blocks: 24-port switches); QDR: you figure that out by yourself (building blocks: 36-port switches; large 36*18 = 648-port switches); QDR IB is used in the RRZE's TinyBlue and Lima clusters. Myrinet: current version: 10 GBit/s per link and direction; interoperable with 10 GBit/s Ethernet; waning importance for HPC. Fat trees are expensive and complex to scale continuously to very high node counts.
34 Meshes. Fat trees can become prohibitively expensive in large systems. Compromise: meshes: n-dimensional hypercubes, toruses (2D / 3D), many others (including hybrids). Each node is a router; direct connections exist only between direct neighbors: this is not a non-blocking crossbar! Intelligent resource management and routing algorithms are essential. Example: 2D torus mesh. Toruses are used in very large systems (Cray XT, IBM Blue Gene): B_b ~ N^((d-1)/d), so B_b/N → 0 for large N. Sounds bad, but those machines show good scaling for many codes: well-defined and predictable bandwidth behavior!
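The scaling contrast with a fat tree can be illustrated numerically (relative units; the link bandwidth is normalized to 1):

```python
def torus_bb_per_node(n_nodes, d):
    # B_b ~ N^((d-1)/d)  =>  B_b/N ~ N^(-1/d), vanishing for large N
    return n_nodes ** ((d - 1) / d) / n_nodes

# 3D torus: per-node bisection bandwidth shrinks as the machine grows,
# while a non-blocking fat tree would keep B_b/N constant:
for n in (1_000, 10_000, 100_000):
    print(n, round(torus_bb_per_node(n, 3), 4))
```

For d = 3 the per-node value falls as N^(-1/3): growing the machine by a factor of 1000 cuts the per-node bisection bandwidth by a factor of 10.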
35 Meshes: advantages of toroidal/cubic meshes. Limited cabling is required, and cables can be kept short. Meshes can come in all shapes and sizes. Example: a 4-socket dual-core AMD Opteron node with a HyperTransport fabric; this mesh is asymmetric, since two sockets use one HT link each for I/O. A 4-socket 2x hexa-core AMD Magny-Cours forms a 3D cube.
More informationCSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand
More informationINTERCONNECTION TECHNOLOGIES. Non-Uniform Memory Access Seminar Elina Zarisheva
INTERCONNECTION TECHNOLOGIES Non-Uniform Memory Access Seminar Elina Zarisheva 26.11.2014 26.11.2014 NUMA Seminar Elina Zarisheva 2 Agenda Network topology Logical vs. physical topology Logical topologies
More informationPARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5) Rob van Nieuwpoort
PARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5) Rob van Nieuwpoort rob@cs.vu.nl Schedule 2 1. Introduction, performance metrics & analysis 2. Many-core hardware 3. Cuda class 1: basics 4. Cuda class
More informationCS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics
CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors
More informationBlue Gene/Q. Hardware Overview Michael Stephan. Mitglied der Helmholtz-Gemeinschaft
Blue Gene/Q Hardware Overview 02.02.2015 Michael Stephan Blue Gene/Q: Design goals System-on-Chip (SoC) design Processor comprises both processing cores and network Optimal performance / watt ratio Small
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationMULTI-CORE PROCESSORS: CONCEPTS AND IMPLEMENTATIONS
MULTI-CORE PROCESSORS: CONCEPTS AND IMPLEMENTATIONS Najem N. Sirhan 1, Sami I. Serhan 2 1 Electrical and Computer Engineering Department, University of New Mexico, Albuquerque, New Mexico, USA 2 Computer
More informationFundamentals of Quantitative Design and Analysis
Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature
More informationAim High. Intel Technical Update Teratec 07 Symposium. June 20, Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group
Aim High Intel Technical Update Teratec 07 Symposium June 20, 2007 Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group Risk Factors Today s s presentations contain forward-looking statements.
More informationParallel Systems I The GPU architecture. Jan Lemeire
Parallel Systems I The GPU architecture Jan Lemeire 2012-2013 Sequential program CPU pipeline Sequential pipelined execution Instruction-level parallelism (ILP): superscalar pipeline out-of-order execution
More informationIntel Workstation Technology
Intel Workstation Technology Turning Imagination Into Reality November, 2008 1 Step up your Game Real Workstations Unleash your Potential 2 Yesterday s Super Computer Today s Workstation = = #1 Super Computer
More informationComputer Architecture
Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,
More informationHow to Write Fast Code , spring th Lecture, Mar. 31 st
How to Write Fast Code 18-645, spring 2008 20 th Lecture, Mar. 31 st Instructor: Markus Püschel TAs: Srinivas Chellappa (Vas) and Frédéric de Mesmay (Fred) Introduction Parallelism: definition Carrying
More informationHPC Architectures. Types of resource currently in use
HPC Architectures Types of resource currently in use Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationParallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence
Parallel Computer Architecture Spring 2018 Shared Memory Multiprocessors Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture
More informationSMD149 - Operating Systems - Multiprocessing
SMD149 - Operating Systems - Multiprocessing Roland Parviainen December 1, 2005 1 / 55 Overview Introduction Multiprocessor systems Multiprocessor, operating system and memory organizations 2 / 55 Introduction
More informationOverview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy
Overview SMD149 - Operating Systems - Multiprocessing Roland Parviainen Multiprocessor systems Multiprocessor, operating system and memory organizations December 1, 2005 1/55 2/55 Multiprocessor system
More informationMIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer
MIMD Overview Intel Paragon XP/S Overview! MIMDs in the 1980s and 1990s! Distributed-memory multicomputers! Intel Paragon XP/S! Thinking Machines CM-5! IBM SP2! Distributed-memory multicomputers with hardware
More informationCommercially Available Chip Mul3processors for Research. Welcome to the MulE core Era
4/2/11 ommercially Available hip Mul3processors for Research Bruce hilders University of Pi9sburgh h9p://www.cs.pi9.edu/~childers AAO h9p://www.cs.pi9.edu h9p://www.cacao team.org h9p://www.cs.pi9.edu/pm
More informationCAMA: Modern processors. Memory hierarchy: Caches basics Data access locality Cache management
CAMA: Modern processors Memory hierarchy: Caches basics Data access locality Cache management Gerhard Wellein, Department for Computer Science and Erlangen Regional Computing Center Johannes Hofmann/Dietmar
More informationProcessor Performance. Overview: Classical Parallel Hardware. The Processor. Adding Numbers. Review of Single Processor Design
Overview: Classical Parallel Hardware Processor Performance Review of Single Processor Design so we talk the same language many things happen in parallel even on a single processor identify potential issues
More informationCS 152, Spring 2011 Section 10
CS 152, Spring 2011 Section 10 Christopher Celio University of California, Berkeley Agenda Stuff (Quiz 4 Prep) http://3dimensionaljigsaw.wordpress.com/2008/06/18/physics-based-games-the-new-genre/ Intel
More informationAgenda. System Performance Scaling of IBM POWER6 TM Based Servers
System Performance Scaling of IBM POWER6 TM Based Servers Jeff Stuecheli Hot Chips 19 August 2007 Agenda Historical background POWER6 TM chip components Interconnect topology Cache Coherence strategies
More informationMaster Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms.
Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi Multithreading Contents Main features of explicit multithreading
More informationHigh Performance Computing: Blue-Gene and Road Runner. Ravi Patel
High Performance Computing: Blue-Gene and Road Runner Ravi Patel 1 HPC General Information 2 HPC Considerations Criterion Performance Speed Power Scalability Number of nodes Latency bottlenecks Reliability
More informationComputer parallelism Flynn s categories
04 Multi-processors 04.01-04.02 Taxonomy and communication Parallelism Taxonomy Communication alessandro bogliolo isti information science and technology institute 1/9 Computer parallelism Flynn s categories
More informationMulti-core Programming Evolution
Multi-core Programming Evolution Based on slides from Intel Software ollege and Multi-ore Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts, Evolution
More informationSlides compliment of Yong Chen and Xian-He Sun From paper Reevaluating Amdahl's Law in the Multicore Era. 11/16/2011 Many-Core Computing 2
Slides compliment of Yong Chen and Xian-He Sun From paper Reevaluating Amdahl's Law in the Multicore Era 11/16/2011 Many-Core Computing 2 Gene M. Amdahl, Validity of the Single-Processor Approach to Achieving
More informationParallel Computing. Hwansoo Han (SKKU)
Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo
More informationNon-Uniform Memory Access (NUMA) Architecture and Multicomputers
Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico September 26, 2011 CPD
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance
More informationEARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA
EARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA SUDHEER CHUNDURI, SCOTT PARKER, KEVIN HARMS, VITALI MOROZOV, CHRIS KNIGHT, KALYAN KUMARAN Performance Engineering Group Argonne Leadership Computing Facility
More informationAccelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing
Accelerating HPC (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing SAAHPC, Knoxville, July 13, 2010 Legal Disclaimer Intel may make changes to specifications and product
More informationLecture 1: Introduction
Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline
More information