Performance of Variant Memory Configurations for Cray XT Systems

Presented by Wayne Joubert

Motivation
Design trends are leading to non-power-of-2 core counts for multicore processors due to layout constraints, e.g., AMD Istanbul (6 cores), Intel Dunnington (6 cores). This complicates memory configuration choices, which often carry multiple-of-2 restrictions on populating the memory slots of compute nodes. In 2009 NICS will upgrade from the Barcelona CPU (4 cores/socket) to Istanbul (6 cores/socket), to bring Kraken to a 1 PF system. The purpose of this study is to evaluate memory configurations for the new system, to determine how to maintain at least 1 GB of memory per processor core while also giving good memory performance.

Cray XT5 Compute Blade
Each blade has 4 compute nodes. Each node has 2 CPU sockets, which access a total of 8 DDR2 memory slots.

Cray XT5 Compute Node Architecture
The node's 2 CPU sockets are connected to the SeaStar2+ interconnect, and to each other via a HyperTransport link. Each CPU has its own on-chip memory controller to access its memory.

Cray XT5 Compute Node Memory
Each CPU socket directly accesses 2 banks of 2 memory slots each, for a total of 4 slots per socket. In this NUMA configuration each socket can also access the other socket's memory, but at lower bandwidth. Each bank should be either empty or fully populated with DIMMs of the same capacity. What is the best way to populate the slots?
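The NUMA behavior described above (each socket reaches its own DIMMs directly and the other socket's memory at lower bandwidth) can also be inspected from software. Below is a minimal sketch, not part of the original slides, that assumes libnuma is available on the compute node's Linux kernel; it prints the node count (one NUMA node per CPU socket here) and the relative access distances the kernel advertises.

/* Minimal sketch (not part of the slides) of inspecting the NUMA layout with
 * libnuma, assuming the library is available on the compute node's Linux
 * kernel. Reports the node count and relative access distances (10 = local).
 * Build with: cc numa_info.c -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        printf("NUMA not available on this node\n");
        return 1;
    }
    int nodes = numa_max_node() + 1;   /* one NUMA node per CPU socket here */
    printf("NUMA nodes: %d\n", nodes);
    for (int i = 0; i < nodes; i++)
        for (int j = 0; j < nodes; j++)
            printf("distance(%d,%d) = %d\n", i, j, numa_distance(i, j));
    return 0;
}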

Processor Types Considered
AMD Barcelona, 4 cores/die, 2.3 GHz. AMD Istanbul, 6 cores/die.

Benchmark Systems
For the experiments we use Chester, an ORNL Cray XT5 with 448 compute cores, a TDS (test and development system) that is a small version of JaguarPF. We swap memory configurations and run the experiments.

Memory Types
Chester: DDR2-800 4 GB DIMMs, DDR2-667 2 GB DIMMs, DDR2-533 8 GB DIMMs. Goal: evaluate the performance of the XT5 system for different configurations of memory DIMMs for each socket of a compute node.

Memory Configurations
Chester node configurations (DIMM capacities in GB for the 4 slots of each socket):
4-4-0-0, 4-4-0-0 (2 GB/core, balanced)
4-4-0-0, 2-2-0-0 (1.5 GB/core, unbalanced)
2-2-2-2, 2-2-0-0 (1.5 GB/core, unbalanced)
2-2-2-0, 2-2-2-0 (1.5 GB/core, balanced)
4-4-2-2, 4-4-2-2 (3 GB/core, balanced)
8-8-0-0, 8-8-0-0 (4 GB/core, balanced)
Our particular concern: what is the penalty of using off-socket memory in the unbalanced cases?

Codes for Benchmark Tests
1. STREAM benchmark: measures memory bandwidth for several kernels; we use the TRIAD kernel, z = y + a*x.
2. DAXPY kernel: y = y + a*x.
3. LMBENCH: measures memory latency.
4. S3D application code: a petascale combustion application that uses structured 3-D meshes; performance is typically memory-bound.
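For concreteness, here is a minimal sketch of the TRIAD and DAXPY kernels named in items 1 and 2 above, written in C with OpenMP and a simple bandwidth estimate. It is an illustration of the kernels, not the actual STREAM source; the array size N is an arbitrary choice and would be ramped up in the spillage experiments that follow.

/* Minimal sketch (not the STREAM source) of the TRIAD and DAXPY kernels, with
 * a simple bandwidth estimate. Each kernel touches 3 arrays' worth of data
 * per iteration (2 reads + 1 write). Build with: cc -O2 -fopenmp kernels.c */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1L << 26)              /* 64M doubles per array; raise this to push
                                     memory use past one socket's DIMMs */

int main(void) {
    double *x = malloc(N * sizeof *x), *y = malloc(N * sizeof *y),
           *z = malloc(N * sizeof *z);
    const double a = 3.0;

    /* parallel first-touch initialization: the thread that first writes a
       page determines which socket's memory it lands on (Linux first touch) */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; z[i] = 0.0; }

    double t = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++) z[i] = y[i] + a * x[i];        /* TRIAD */
    t = omp_get_wtime() - t;
    printf("TRIAD bandwidth: %.0f MB/s\n", 3.0 * N * sizeof(double) / t / 1e6);

    t = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++) y[i] = y[i] + a * x[i];        /* DAXPY */
    t = omp_get_wtime() - t;
    printf("DAXPY bandwidth: %.0f MB/s\n", 3.0 * N * sizeof(double) / t / 1e6);

    free(x); free(y); free(z);
    return 0;
}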

Experiments: Memory Spillage Effects
Execute the benchmark on one CPU socket. Ramp up the problem size / memory usage until memory spills off-socket. Measure the effects of using off-socket memory.

Memory Spillage Effects: STREAM
(Chart: Chester/800, STREAM Triad, total memory bandwidth; bandwidth in MB/sec vs. vector length per core in bytes, for runs on 1-8 cores.)
Run on 1-8 cores on Chester with the balanced configuration 4-4-0-0, 4-4-0-0. Peak memory bandwidth is 25.6 GB/sec theoretical, 21.2 GB/sec actual. We see up to a 15% decrease in performance when memory references spill off-socket. Note: STREAM puts related array entries z(i), x(i), y(i) all on the same memory page (Linux first-touch policy).
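The first-touch behavior noted above can be observed directly: Linux places each page on the NUMA node of the core that first writes it, so once the local socket's DIMMs fill up, newly touched pages land on the other socket. The sketch below is not from the slides and assumes libnuma's move_pages() query interface is available; it touches a large buffer from one core and asks the kernel where sample pages ended up. Running it under something like numactl --cpunodebind=0 keeps the touching core on socket 0.

/* Sketch (not from the slides) of observing the Linux first-touch policy with
 * the move_pages() query interface: touch a large buffer from one core, then
 * ask the kernel which NUMA node each sampled page landed on.
 * Build with: cc first_touch.c -lnuma */
#include <numaif.h>               /* move_pages() */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    size_t page  = (size_t)sysconf(_SC_PAGESIZE);
    size_t bytes = 1UL << 30;     /* 1 GB; raise this to force off-socket spill */
    char *buf = malloc(bytes);

    for (size_t i = 0; i < bytes; i += page)      /* first touch from this core */
        buf[i] = 0;

    for (size_t i = 0; i < bytes; i += bytes / 8) {
        void *p = buf + i;
        int status = -1;
        /* nodes == NULL means "query placement only", no page migration */
        move_pages(0, 1, &p, NULL, &status, 0);
        printf("offset %4zu MB -> NUMA node %d\n", i >> 20, status);
    }
    free(buf);
    return 0;
}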

Memory Spillage Effects: STREAM
(Chart: Chester/800, STREAM Triad, total memory bandwidth; bandwidth in MB/sec vs. vector length per core in bytes, for runs on 1-8 cores.)
Same experiment, but the STREAM memory initialization is changed to put z(i), x(i), y(i) for the same i on different pages, and potentially different sockets. There is a performance uptick: the cores can get higher performance by accessing on-socket and off-socket memory concurrently. This is not helpful for the typical use case of using all cores for computation.
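To make the idea behind this uptick concrete, the sketch below places the two arrays on different NUMA nodes explicitly with libnuma. This is an illustration only; the slides obtain the placement by changing STREAM's initialization, not by calling libnuma. With the arrays split this way, cores on one socket stream data through both memory controllers at once.

/* Sketch of the idea behind the uptick: place the two arrays on different
 * sockets explicitly (illustration only; not how the slides do it), so cores
 * on one socket draw bandwidth from both memory controllers concurrently.
 * Build with: cc -O2 onoff.c -lnuma */
#include <numa.h>
#include <stdio.h>

#define N (1L << 26)

int main(void) {
    if (numa_available() < 0) return 1;
    double *x = numa_alloc_onnode(N * sizeof *x, 0);   /* on-socket memory  */
    double *y = numa_alloc_onnode(N * sizeof *y, 1);   /* off-socket memory */
    const double a = 3.0;

    for (long i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    double sum = 0.0;
    for (long i = 0; i < N; i++) {       /* DAXPY-like sweep drawing bandwidth */
        y[i] = y[i] + a * x[i];          /* from both controllers at once      */
        sum += y[i];
    }
    printf("checksum %f\n", sum);

    numa_free(x, N * sizeof *x);
    numa_free(y, N * sizeof *y);
    return 0;
}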

Memory Spillage Effects: DAXPY
(Chart: bandwidth in GB/s vs. vector length, for runs on 1-8 cores.)
Memory for y(i), x(i) is placed on different pages. We see a similar uptick in bandwidth for off-socket memory references.

Memory Spillage Effects: S3D
(Chart: microseconds per cell per core vs. cells per core, for Chester/800 on 8 and 4 cores with the 4-4-0-0, 4-4-0-0 DDR2-800 configuration, and Chester/533 on 8 and 4 cores with the 8-8-0-0, 8-8-0-0 DDR2-533 configuration.)
S3D application code, varying the number of grid cells per core. The graph shows wallclock time in microseconds per grid cell per core. Runs use 1 socket or 2 sockets of the node; spillage effects are observed for the 1-socket case. The effects of memory spillage are minimal.

Memory Configuration Effects: S3D: Chester 4-4-0-0, 4-4-0-0
(Chart: S3D normalized performance, 8 cores; microseconds per cell per core vs. cells per core for Chester 4/4/0/0, 4/4/0/0.)
Now we compare different memory configurations, running on both sockets. The baseline case is 4-4-0-0, 4-4-0-0, balanced between sockets; all memory references are on-socket.

Memory Configuration Effects: S3D: Chester 4-4-0-0, 2-2-0-0
(Chart: S3D normalized performance, 8 cores; microseconds per cell per core vs. cells per core for Chester 4/4/0/0, 4/4/0/0 and Chester 4/4/0/0, 2/2/0/0; the region using cross-socket memory is marked.)
This configuration is unbalanced between sockets. For large memory cases, the socket with less memory takes memory from the other socket. Memory performance is slightly worse overall, and slightly worse again when the thin-memory socket uses off-socket memory.

Memory Configuration Effects: S3D: Chester 2-2-2-2, 2-2-0-0
(Chart: S3D normalized performance, 8 cores; microseconds per cell per core vs. cells per core for Chester 4/4/0/0, 4/4/0/0 and Chester 2/2/2/2, 2/2/0/0; annotations mark points 6% and 17% slower than the baseline in the region using cross-socket memory.)
This configuration is unbalanced between sockets. Memory performance is slightly worse overall, and slightly worse again when the thin-memory socket uses off-socket memory.

Memory Configuration Effects: S3D: Chester 2-2-2-0, 2-2-2-0
(Chart: S3D normalized performance, 8 cores; microseconds per cell per core vs. cells per core for Chester 4/4/0/0, 4/4/0/0 and Chester 2/2/2/0, 2/2/2/0; up to 39% slower than the baseline.)
This configuration is balanced between sockets but is an unsupported memory configuration: one bank is half-full. Memory performance is significantly worse, believed to be due to the way memory is striped across the DIMMs in the bank.

Memory Configuration Effects: S3D: Chester 4-4-2-2, 4-4-2-2
(Chart: S3D normalized performance, 8 cores; microseconds per cell per core vs. cells per core for Chester 4/4/0/0, 4/4/0/0 and Chester 4/4/2/2, 4/4/2/2.)
This fat-memory configuration is balanced between sockets and gives performance similar to the baseline case.

Memory Configuration Effects: S3D: Conclusions
The performance loss from imbalanced configurations is at most ~17%. The balanced (but unsupported) memory configuration has much worse performance.

Memory Configuration Effects: LMBENCH: Chester 4-4-0-0, 4-4-0-0
(Chart: latency in ns vs. memory used in MBytes for Chester 4/4/0/0, 4/4/0/0, from ~0.0005 MB up to ~14 GB.)
LMBENCH measures array load latency as a function of array length. Run on 1 core with the baseline case 4-4-0-0, 4-4-0-0. The L1/L2/L3 cache effects are clearly visible, and latency is higher when some of the array is off-socket.
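For reference, LMBENCH's load-latency test relies on pointer chasing: dependent loads through a randomized cyclic chain of pointers, so each load must wait for the previous one. The sketch below is a rough, self-contained illustration of that technique, not the LMBENCH source; at a ~64 MB footprint the chain no longer fits in cache, so the time per hop approximates main-memory load latency.

/* Rough illustration of the pointer-chasing technique behind LMBENCH's
 * lat_mem_rd (not the LMBENCH source): build a randomized cyclic chain of
 * pointers, then time dependent loads. Build with: cc -O2 latency.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    size_t n = (64UL << 20) / sizeof(void *);   /* ~64 MB: well past the caches */
    void **chain = malloc(n * sizeof *chain);
    size_t *perm = malloc(n * sizeof *perm);

    for (size_t i = 0; i < n; i++) perm[i] = i; /* shuffle the slot order ...   */
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)              /* ... and link into one cycle  */
        chain[perm[i]] = &chain[perm[(i + 1) % n]];

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    void **p = &chain[perm[0]];
    for (size_t i = 0; i < n; i++) p = *p;      /* dependent (serialized) loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg load latency: %.1f ns (end %p)\n", ns / n, (void *)p);
    free(perm);
    free(chain);
    return 0;
}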

Memory Configuration Effects: LMBENCH
(Chart: latency in ns vs. memory used in MBytes for Chester 4/4/0/0, 4/4/0/0; 2/2/2/0, 2/2/2/0; 2/2/2/2, 2/2/0/0 LO socket; 2/2/2/2, 2/2/0/0 HI socket; and 4/4/2/2, 4/4/2/2.)
Balanced and unbalanced memory configurations all show similar performance: the memory configuration has no significant impact on latency.

Memory Configuration Effects: LMBENCH
(Chart: detail of the previous graph at large memory footprints; latency in ns vs. memory used in MBytes, same five configurations.)
On-socket latency is ~86 ns; off-socket latency is ~102-108 ns depending on the configuration.

Conclusions
The impact of an unbalanced memory configuration on memory bandwidth is less than expected: ~20% at worst. It would not affect apps that don't use much memory, nor apps that are not memory-bound. The balanced (but unsupported) memory configuration performs very poorly: a half-empty memory bank appears to run at half speed. Memory latency is unaffected by any change in memory configuration. In some rare cases there could be an advantage to using on- and off-socket memory in parallel.