Coherent HyperTransport Enables The Return of the SMP


1 Coherent HyperTransport Enables the Return of the SMP - Einar Rustad

2 Top500 History: the expensive SMPs used to rule (Cray X-MP, Convex Exemplar, Sun ES); now the clusters are dominating.

3 HPC History
In the early 1990s, expensive SMPs ruled in HPC: Cray MPs, Convex Exemplar, Sun ES.
The MPPs were in the shadow: Intel Paragon, Thinking Machines CM, Cray T3 - similar price, more complex programming.
Then came the clusters (early 2000s): 50x cheaper, complex programming (MPI).
Clusters made HPC affordable and widespread.

4 Distributed Memory (Clusters): partial views of the data set, with nodes connected through a network switch.

5 Shared Memory: a view of the whole data set, with memory physically distributed across nodes connected through the NumaConnect fabric.

6 SMPs returning - Why?
- Compelling programming model: less code; large memories mean less effort and no data domain decomposition (a small shared-memory sketch follows below).
- Inexpensive multi-core drives availability of threaded software.
- NumaConnect: the same programming model across an entire system, and reduced effort for system management.
- Virtualization: more efficient utilization of resources.
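To make the slide's point concrete, here is a minimal shared-memory sketch in C with OpenMP: one address space, one parallel loop, no explicit domain decomposition or message passing. The array size and the reduction are illustrative choices, not taken from the presentation.

/* Minimal OpenMP shared-memory sketch: the whole array lives in one address
 * space, so no data domain decomposition is needed. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1L << 24)                 /* illustrative problem size */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; i++) {
        a[i] = (double)i;            /* each thread writes its share of the array */
        sum += a[i];                 /* reduction handled by the runtime */
    }

    printf("sum = %.0f using up to %d threads\n", sum, omp_get_max_threads());
    free(a);
    return 0;
}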

7 Coherent HT enables SMP-Cluster
- Multi-core with coherent HT: cache coherency for efficient scaling with shared memory beyond a single mainboard.
- Uses the SCI cache coherence protocol, since ccHT was not designed to scale (far).
- SCI means Scalable Coherent Interface: a 64-bit architecture and cache coherence protocol.

8 NumaConnect - Real SMP
- System-wide cache coherence in hardware, with 64-byte cache line granularity (illustrated below).
- Standard Linux (or any other x86-64 OS); runs any application - shared memory or message passing.
- Virtualization: all system resources can be used by all processors; run any number of virtual OS instances.
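A small, hypothetical illustration of why the 64-byte coherence granularity matters to application code: two per-thread counters that land in the same cache line will bounce that line between caches (false sharing), while padding each counter to its own 64-byte line avoids it. The struct names and padding scheme are assumptions made for this sketch.

/* False-sharing illustration: assumed 64-byte line size, hypothetical structs. */
#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE 64

struct counter_bad {                         /* a and b share one cache line */
    uint64_t a;
    uint64_t b;
};

struct counter_good {                        /* each counter gets its own line */
    uint64_t a;
    char pad_a[CACHE_LINE - sizeof(uint64_t)];
    uint64_t b;
    char pad_b[CACHE_LINE - sizeof(uint64_t)];
};

int main(void)
{
    printf("bad: %zu bytes (shared line), good: %zu bytes (one line each)\n",
           sizeof(struct counter_bad), sizeof(struct counter_good));
    return 0;
}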

9 Technology Background
- Convex Exemplar (acquired by HP): first implementation of the SCI-based CC-NUMA architecture from Dolphin, in 1994.
- Data General Aviion (acquired by EMC): designed in 1996 (deliveries followed); used Dolphin's SCI chips with 3 generations of processor/memory buses with Intel processors.
- Attached products for clustering.

10 Clusters vs. Mainframe Servers
- Distributed memory (clusters), connected by a network: price scales with the number of nodes at USD 1,500-5,000 per node.
- Shared memory/shared I/O (mainframes - scalable servers): mainframe price USD 0.5M - 5M.

11 Clusters - NO Shared Resources: individual instances of the operating system (OS instance 1, OS instance 2, OS instance 3, ... OS instance n) connected through a network switch.

12 Cache Coherent Shared Memory - Shared Everything: one single operating system image. Capabilities like a mainframe - price like a cluster.

13 Principal Operation (block diagram): each node has CPU cores with 64 kB L1 and 512 kB L2 caches, a shared 6 MB L3 cache, a memory controller, and an HT interface to a NumaChip with a 2 or 4 GB remote cache; NumaChips connect to/from other nodes in the same dimension. Local accesses are served by the L1/L2/L3 caches or local memory over HT; remote accesses probe the remote cache for shared data and go to the remote node on a miss.
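The following is a conceptual sketch, not the NumaChip's actual control logic: it just encodes, in plain C, the lookup order the slide's diagram implies, from the local cache hierarchy through local memory and the node's remote cache, out to a coherent fetch from a remote node.

/* Conceptual lookup order only; the predicates are inputs, not hardware probes. */
#include <stdbool.h>
#include <stdio.h>

enum where { L1_HIT, L2_HIT, L3_HIT, LOCAL_MEMORY, REMOTE_CACHE_HIT, REMOTE_NODE };

enum where resolve_load(bool in_l1, bool in_l2, bool in_l3,
                        bool local_addr, bool in_remote_cache)
{
    if (in_l1) return L1_HIT;
    if (in_l2) return L2_HIT;
    if (in_l3) return L3_HIT;                      /* shared 6 MB L3 on the slide      */
    if (local_addr) return LOCAL_MEMORY;           /* home memory on this node, via HT */
    if (in_remote_cache) return REMOTE_CACHE_HIT;  /* 2-4 GB per-node remote cache     */
    return REMOTE_NODE;                            /* coherent fetch across the fabric */
}

int main(void)
{
    /* A remote address that misses everywhere ends up crossing the fabric. */
    printf("%d\n", resolve_load(false, false, false, false, false));
    return 0;
}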

14 It's all about NUMA - LMBENCH. [Chart: LMBENCH memory latency on an HP DL165 G6 with NumaConnect; latency in nanoseconds vs. array size (MB) for NumaChip same socket, NumaChip different socket, NumaChip remote node, and standard same socket.]
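For readers who want to reproduce the shape of such a curve, here is a minimal pointer-chasing latency probe in C. It is a sketch in the spirit of LMBENCH's memory-latency test, not LMBENCH itself; running it with the buffer placed on a local versus a remote NUMA node (for example with numactl) exposes the same local/remote gap the chart illustrates.

/* Pointer-chasing latency sketch: dependent loads over a random cycle. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const size_t n = (size_t)1 << 22;             /* 4M entries = 32 MB working set */
    size_t *chain = malloc(n * sizeof *chain);

    /* Build a random single-cycle permutation (Sattolo) so every load depends
     * on the previous one and hardware prefetchers cannot hide the latency. */
    for (size_t i = 0; i < n; i++) chain[i] = i;
    uint64_t s = 88172645463325252ULL;
    for (size_t i = n - 1; i > 0; i--) {
        s ^= s << 13; s ^= s >> 7; s ^= s << 17;  /* xorshift64 */
        size_t j = (size_t)(s % i);
        size_t t = chain[i]; chain[i] = chain[j]; chain[j] = t;
    }

    const long loads = 50 * 1000 * 1000;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (long i = 0; i < loads; i++) p = chain[p]; /* serialized dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per load (checksum %zu)\n", ns / (double)loads, p);
    free(chain);
    return 0;
}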

15 NumaConnect Main Features
- 256 TBytes physical address space.
- Scalable, directory-based cache coherency protocol (toy sketch below).
- Scalable on-chip switch fabric (2-D, 3-D torus).
- Configurable cache for remote data (1-16 GB/node).
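As a rough idea of what "directory based" means here, the toy C sketch below models a directory entry that tracks, per 64-byte line, which nodes hold a copy and which node (if any) owns it. The states and field widths are assumptions for illustration, not the NumaChip's real tag format.

/* Toy directory entry for a directory-based coherence protocol (illustrative). */
#include <stdint.h>
#include <stdio.h>

enum line_state { INVALID, SHARED, MODIFIED };

struct dir_entry {
    enum line_state state;
    uint64_t sharers;      /* bit i set => node i caches the line      */
    uint16_t owner;        /* meaningful only when state == MODIFIED   */
};

/* Before granting a write, every other sharer's copy must be invalidated. */
uint64_t sharers_to_invalidate(const struct dir_entry *e, unsigned requester)
{
    return e->sharers & ~(1ULL << requester);
}

int main(void)
{
    struct dir_entry e = { SHARED, 0x15, 0 };      /* nodes 0, 2 and 4 share the line */
    printf("invalidate mask for node 2's write: 0x%llx\n",
           (unsigned long long)sharers_to_invalidate(&e, 2));
    return 0;
}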

16 NumaChip Top Block Diagram (blocks): SM, SPI init module, HyperTransport ccHT cave (designed by Prof. Brühning's team), H2S with LC config data, microcode and CSRs, SRAM fast tags, SDRAM cache, SDRAM tags, SCC, and a crossbar switch with LCs and SERDES for the XA, XB, YA, YB, ZA, ZB links.

17 NumaConnect Server Configuration: multi-core CPUs and a bridge connected over coherent HyperTransport to the NumaChip, with its remote cache+tags memory and 6 x4 SERDES links to the fabric.

18 NumaChip System Architecture: a multi-socket node (Opterons with local DRAM, linked by HT) attaches to a NumaChip, and NumaChips are connected in 2-D or 3-D torus fabrics; 6 links allow flexible system configurations in multi-dimensional topologies (neighbour addressing sketched below).
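A small sketch of how neighbour addressing works in such a torus: each node has six links (+/-X, +/-Y, +/-Z) and coordinates wrap in every dimension. The coordinate scheme below is a generic illustration, not NumaChip routing.

/* Generic 3-D torus neighbour computation; dimension sizes are parameters. */
#include <stdio.h>

struct coord { int x, y, z; };

static int wrap(int v, int size) { return (v % size + size) % size; }

/* dir: 0..5 = +X, -X, +Y, -Y, +Z, -Z (one per link) */
struct coord torus_neighbor(struct coord c, int dir, int nx, int ny, int nz)
{
    switch (dir) {
    case 0: c.x = wrap(c.x + 1, nx); break;
    case 1: c.x = wrap(c.x - 1, nx); break;
    case 2: c.y = wrap(c.y + 1, ny); break;
    case 3: c.y = wrap(c.y - 1, ny); break;
    case 4: c.z = wrap(c.z + 1, nz); break;
    case 5: c.z = wrap(c.z - 1, nz); break;
    }
    return c;
}

int main(void)
{
    struct coord c = { 0, 0, 0 };
    struct coord n = torus_neighbor(c, 1, 4, 4, 2);   /* -X wraps around to x = 3 */
    printf("(%d,%d,%d)\n", n.x, n.y, n.z);            /* prints (3,0,0)           */
    return 0;
}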

19 2-D Dataflow (diagram): request and response paths routed between NumaChips across the 2-D fabric.

20 Operating Modes
- Operating system: single system image, or multiple-system-image partitions (individual nodes or multiple nodes).
- User applications: shared memory (ccNUMA), shared memory (NUMA, non-coherent), pure message passing (MPI or others), or any combination.

21 Standard Operating Systems: Linux, Windows Server, Solaris, Unix.

22 Application Interfaces: use any programming model available for the node on the whole system - OpenMP, MPI, threads. NO application changes required!
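To underline the "no changes required" point: a stock MPI program like the sketch below runs unmodified; on a single shared-memory NumaConnect system the MPI library can move messages through memory rather than a network, but the source code stays the same.

/* Ordinary MPI program; nothing NumaConnect-specific in the source. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}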

23 NumaChip has Arrived

24 And so has the NumaConnect Card (photo with labels): remote cache DRAM, remote cache and local tag DRAM, switch fabric connectors, voltage regulators, NumaChip with heatsink/fan, fast tag SRAM, HyperTransport connector.
