Performance of Variant Memory Configurations for Cray XT Systems

Presented by Wayne Joubert

Motivation
Design trends are leading to non-power-of-2 core counts for multicore processors due to layout constraints, e.g., AMD Istanbul (6 cores), Intel Dunnington (6 cores). This complicates memory configuration choices, which often carry multiple-of-2 restrictions on populating the memory slots of compute nodes. In 2009 NICS will upgrade from the Barcelona CPU (4 cores/socket) to Istanbul (6 cores/socket), to bring Kraken to a 1 PF system. The purpose of this study is to evaluate memory configurations for the new system, to determine how to maintain at least 1 GB of memory per processor core while also giving good memory performance.

Cray XT5 Compute Blade
Each blade has 4 compute nodes. Each node has 2 CPU sockets, which access a total of 8 DDR2 memory slots.

Cray XT5 Compute Node Architecture
The node's 2 CPU sockets are connected to the SeaStar2+ interconnect, and to each other via a HyperTransport link. Each CPU has its own on-chip memory controller to access its memory.

Cray XT5 Compute Node Memory
Each CPU socket directly accesses 2 banks of 2 memory slots each, for a total of 4 slots per socket. In this NUMA configuration each socket can also access the other socket's memory, but at lower bandwidth. Each bank should be either empty or fully populated with DIMMs of the same capacity. What is the best way to populate the slots?
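The NUMA behavior described above (each socket reaches its own DIMMs directly and the other socket's memory at lower bandwidth) can also be inspected from software. Below is a minimal sketch, not part of the original slides, that assumes libnuma is available on the compute node's Linux kernel; it prints the node count (one NUMA node per CPU socket here) and the relative access distances the kernel advertises.

/* Minimal sketch (not part of the slides) of inspecting the NUMA layout with
 * libnuma, assuming the library is available on the compute node's Linux
 * kernel. Reports the node count and relative access distances (10 = local).
 * Build with: cc numa_info.c -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        printf("NUMA not available on this node\n");
        return 1;
    }
    int nodes = numa_max_node() + 1;   /* one NUMA node per CPU socket here */
    printf("NUMA nodes: %d\n", nodes);
    for (int i = 0; i < nodes; i++)
        for (int j = 0; j < nodes; j++)
            printf("distance(%d,%d) = %d\n", i, j, numa_distance(i, j));
    return 0;
}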

Processor Types Considered
AMD Barcelona, 4 cores/die, 2.3 GHz. AMD Istanbul, 6 cores/die.

Benchmark Systems
For the experiments we use Chester, an ORNL Cray XT5 with 448 compute cores, a TDS (test and development system) that is a small version of JaguarPF. We swap memory configurations and run the experiments.

Memory Types
Chester: DDR2-800 4 GB DIMMs, DDR2-667 2 GB DIMMs, DDR2-533 8 GB DIMMs. Goal: evaluate the performance of the XT5 system for different configurations of memory DIMMs for each socket of a compute node.

Memory Configurations
Chester node configurations (DIMM capacities in GB for the 4 slots of each socket):
4-4-0-0, 4-4-0-0 (2 GB/core, balanced)
4-4-0-0, 2-2-0-0 (1.5 GB/core, unbalanced)
2-2-2-2, 2-2-0-0 (1.5 GB/core, unbalanced)
2-2-2-0, 2-2-2-0 (1.5 GB/core, balanced)
4-4-2-2, 4-4-2-2 (3 GB/core, balanced)
8-8-0-0, 8-8-0-0 (4 GB/core, balanced)
Our particular concern: what is the penalty of using off-socket memory in the unbalanced cases?

Codes for Benchmark Tests
1. STREAM benchmark: measures memory bandwidth for several kernels; we use the TRIAD kernel, z = y + a*x.
2. DAXPY kernel: y = y + a*x.
3. LMBENCH: measures memory latency.
4. S3D application code: a petascale combustion application that uses structured 3-D meshes; performance is typically memory-bound.
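For concreteness, here is a minimal sketch of the TRIAD and DAXPY kernels named in items 1 and 2 above, written in C with OpenMP and a simple bandwidth estimate. It is an illustration of the kernels, not the actual STREAM source; the array size N is an arbitrary choice and would be ramped up in the spillage experiments that follow.

/* Minimal sketch (not the STREAM source) of the TRIAD and DAXPY kernels, with
 * a simple bandwidth estimate. Each kernel touches 3 arrays' worth of data
 * per iteration (2 reads + 1 write). Build with: cc -O2 -fopenmp kernels.c */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1L << 26)              /* 64M doubles per array; raise this to push
                                     memory use past one socket's DIMMs */

int main(void) {
    double *x = malloc(N * sizeof *x), *y = malloc(N * sizeof *y),
           *z = malloc(N * sizeof *z);
    const double a = 3.0;

    /* parallel first-touch initialization: the thread that first writes a
       page determines which socket's memory it lands on (Linux first touch) */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; z[i] = 0.0; }

    double t = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++) z[i] = y[i] + a * x[i];        /* TRIAD */
    t = omp_get_wtime() - t;
    printf("TRIAD bandwidth: %.0f MB/s\n", 3.0 * N * sizeof(double) / t / 1e6);

    t = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++) y[i] = y[i] + a * x[i];        /* DAXPY */
    t = omp_get_wtime() - t;
    printf("DAXPY bandwidth: %.0f MB/s\n", 3.0 * N * sizeof(double) / t / 1e6);

    free(x); free(y); free(z);
    return 0;
}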

Experiments: Memory Spillage Effects
Execute the benchmark on one CPU socket. Ramp up the problem size / memory usage until memory spills off-socket. Measure the effects of using off-socket memory.

Memory Spillage Effects: STREAM
(Chart: Chester/800, STREAM Triad, total memory bandwidth; bandwidth in MB/sec vs. vector length per core in bytes, for runs on 1-8 cores.)
Run on 1-8 cores on Chester with the balanced configuration 4-4-0-0, 4-4-0-0. Peak memory bandwidth is 25.6 GB/sec theoretical, 21.2 GB/sec actual. We see up to a 15% decrease in performance when memory references spill off-socket. Note: STREAM puts related array entries z(i), x(i), y(i) all on the same memory page (Linux first-touch policy).
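The first-touch behavior noted above can be observed directly: Linux places each page on the NUMA node of the core that first writes it, so once the local socket's DIMMs fill up, newly touched pages land on the other socket. The sketch below is not from the slides and assumes libnuma's move_pages() query interface is available; it touches a large buffer from one core and asks the kernel where sample pages ended up. Running it under something like numactl --cpunodebind=0 keeps the touching core on socket 0.

/* Sketch (not from the slides) of observing the Linux first-touch policy with
 * the move_pages() query interface: touch a large buffer from one core, then
 * ask the kernel which NUMA node each sampled page landed on.
 * Build with: cc first_touch.c -lnuma */
#include <numaif.h>               /* move_pages() */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    size_t page  = (size_t)sysconf(_SC_PAGESIZE);
    size_t bytes = 1UL << 30;     /* 1 GB; raise this to force off-socket spill */
    char *buf = malloc(bytes);

    for (size_t i = 0; i < bytes; i += page)      /* first touch from this core */
        buf[i] = 0;

    for (size_t i = 0; i < bytes; i += bytes / 8) {
        void *p = buf + i;
        int status = -1;
        /* nodes == NULL means "query placement only", no page migration */
        move_pages(0, 1, &p, NULL, &status, 0);
        printf("offset %4zu MB -> NUMA node %d\n", i >> 20, status);
    }
    free(buf);
    return 0;
}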

Memory Spillage Effects: STREAM
(Chart: Chester/800, STREAM Triad, total memory bandwidth; bandwidth in MB/sec vs. vector length per core in bytes, for runs on 1-8 cores.)
Same experiment, but the STREAM memory initialization is changed to put z(i), x(i), y(i) for the same i on different pages, and potentially different sockets. There is a performance uptick: the cores can get higher performance by accessing on-socket and off-socket memory concurrently. This is not helpful for the typical use case of using all cores for computation.
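To make the idea behind this uptick concrete, the sketch below places the two arrays on different NUMA nodes explicitly with libnuma. This is an illustration only; the slides obtain the placement by changing STREAM's initialization, not by calling libnuma. With the arrays split this way, cores on one socket stream data through both memory controllers at once.

/* Sketch of the idea behind the uptick: place the two arrays on different
 * sockets explicitly (illustration only; not how the slides do it), so cores
 * on one socket draw bandwidth from both memory controllers concurrently.
 * Build with: cc -O2 onoff.c -lnuma */
#include <numa.h>
#include <stdio.h>

#define N (1L << 26)

int main(void) {
    if (numa_available() < 0) return 1;
    double *x = numa_alloc_onnode(N * sizeof *x, 0);   /* on-socket memory  */
    double *y = numa_alloc_onnode(N * sizeof *y, 1);   /* off-socket memory */
    const double a = 3.0;

    for (long i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    double sum = 0.0;
    for (long i = 0; i < N; i++) {       /* DAXPY-like sweep drawing bandwidth */
        y[i] = y[i] + a * x[i];          /* from both controllers at once      */
        sum += y[i];
    }
    printf("checksum %f\n", sum);

    numa_free(x, N * sizeof *x);
    numa_free(y, N * sizeof *y);
    return 0;
}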

Memory Spillage Effects: DAXPY
(Chart: bandwidth in GB/s vs. vector length, for runs on 1-8 cores.)
Memory for y(i), x(i) is placed on different pages. We see a similar uptick in bandwidth for off-socket memory references.

Memory Spillage Effects: S3D
(Chart: microseconds per cell per core vs. cells per core, for Chester/800 on 8 and 4 cores with the 4-4-0-0, 4-4-0-0 DDR2-800 configuration, and Chester/533 on 8 and 4 cores with the 8-8-0-0, 8-8-0-0 DDR2-533 configuration.)
S3D application code, varying the number of grid cells per core. The graph shows wallclock time in microseconds per grid cell per core. Runs use 1 socket or 2 sockets of the node; spillage effects are observed for the 1-socket case. The effects of memory spillage are minimal.

Memory Configuration Effects: S3D: Chester 4-4-0-0, 4-4-0-0
(Chart: S3D normalized performance, 8 cores; microseconds per cell per core vs. cells per core for Chester 4/4/0/0, 4/4/0/0.)
Now we compare different memory configurations, running on both sockets. The baseline case is 4-4-0-0, 4-4-0-0, balanced between sockets; all memory references are on-socket.

Memory Configuration Effects: S3D: Chester 4-4-0-0, 2-2-0-0
(Chart: S3D normalized performance, 8 cores; microseconds per cell per core vs. cells per core for Chester 4/4/0/0, 4/4/0/0 and Chester 4/4/0/0, 2/2/0/0; the region using cross-socket memory is marked.)
This configuration is unbalanced between sockets. For large memory cases, the socket with less memory takes memory from the other socket. Memory performance is slightly worse overall, and slightly worse again when the thin-memory socket uses off-socket memory.

Memory Configuration Effects: S3D: Chester 2-2-2-2, 2-2-0-0
(Chart: S3D normalized performance, 8 cores; microseconds per cell per core vs. cells per core for Chester 4/4/0/0, 4/4/0/0 and Chester 2/2/2/2, 2/2/0/0; annotations mark points 6% and 17% slower than the baseline in the region using cross-socket memory.)
This configuration is unbalanced between sockets. Memory performance is slightly worse overall, and slightly worse again when the thin-memory socket uses off-socket memory.

Memory Configuration Effects: S3D: Chester 2-2-2-0, 2-2-2-0
(Chart: S3D normalized performance, 8 cores; microseconds per cell per core vs. cells per core for Chester 4/4/0/0, 4/4/0/0 and Chester 2/2/2/0, 2/2/2/0; up to 39% slower than the baseline.)
This configuration is balanced between sockets but is an unsupported memory configuration: one bank is half-full. Memory performance is significantly worse, believed to be due to the way memory is striped across the DIMMs in the bank.

Memory Configuration Effects: S3D: Chester 4-4-2-2, 4-4-2-2
(Chart: S3D normalized performance, 8 cores; microseconds per cell per core vs. cells per core for Chester 4/4/0/0, 4/4/0/0 and Chester 4/4/2/2, 4/4/2/2.)
This fat-memory configuration is balanced between sockets and gives performance similar to the baseline case.

Memory Configuration Effects: S3D: Conclusions
The performance loss from imbalanced configurations is at most ~17%. The balanced (but unsupported) memory configuration has much worse performance.

Memory Configuration Effects: LMBENCH: Chester 4-4-0-0, 4-4-0-0
(Chart: latency in ns vs. memory used in MBytes for Chester 4/4/0/0, 4/4/0/0, from ~0.0005 MB up to ~14 GB.)
LMBENCH measures array load latency as a function of array length. Run on 1 core with the baseline case 4-4-0-0, 4-4-0-0. The L1/L2/L3 cache effects are clearly visible, and latency is higher when some of the array is off-socket.
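For reference, LMBENCH's load-latency test relies on pointer chasing: dependent loads through a randomized cyclic chain of pointers, so each load must wait for the previous one. The sketch below is a rough, self-contained illustration of that technique, not the LMBENCH source; at a ~64 MB footprint the chain no longer fits in cache, so the time per hop approximates main-memory load latency.

/* Rough illustration of the pointer-chasing technique behind LMBENCH's
 * lat_mem_rd (not the LMBENCH source): build a randomized cyclic chain of
 * pointers, then time dependent loads. Build with: cc -O2 latency.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    size_t n = (64UL << 20) / sizeof(void *);   /* ~64 MB: well past the caches */
    void **chain = malloc(n * sizeof *chain);
    size_t *perm = malloc(n * sizeof *perm);

    for (size_t i = 0; i < n; i++) perm[i] = i; /* shuffle the slot order ...   */
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)              /* ... and link into one cycle  */
        chain[perm[i]] = &chain[perm[(i + 1) % n]];

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    void **p = &chain[perm[0]];
    for (size_t i = 0; i < n; i++) p = *p;      /* dependent (serialized) loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg load latency: %.1f ns (end %p)\n", ns / n, (void *)p);
    free(perm);
    free(chain);
    return 0;
}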

Memory Configuration Effects: LMBENCH
(Chart: latency in ns vs. memory used in MBytes for Chester 4/4/0/0, 4/4/0/0; 2/2/2/0, 2/2/2/0; 2/2/2/2, 2/2/0/0 LO socket; 2/2/2/2, 2/2/0/0 HI socket; and 4/4/2/2, 4/4/2/2.)
Balanced and unbalanced memory configurations all show similar performance: the memory configuration has no significant impact on latency.

Memory Configuration Effects: LMBENCH
(Chart: detail of the previous graph at large memory footprints; latency in ns vs. memory used in MBytes, same five configurations.)
On-socket latency is ~86 ns; off-socket latency is ~102-108 ns depending on the configuration.

Conclusions
The impact of an unbalanced memory configuration on memory bandwidth is less than expected: ~20% at worst. It would not affect apps that don't use much memory, nor apps that are not memory-bound. The balanced (but unsupported) memory configuration performs very poorly: a half-empty memory bank appears to run at half speed. Memory latency is unaffected by any change in memory configuration. In some rare cases there could be an advantage to using on- and off-socket memory in parallel.