Scalable Single System Image SGI Altix 3700, 512p Architecture and Software Environment

1 Silicon Graphics, Inc. Scalable Single System Image SGI Altix 3700, 512p Architecture and Software Environment Presented by: Jean-Pierre Panziera Principal Engineer

2 Altix 3700 SSSI - Architecture and Software Environment: SMP and NUMA; Directory-based Shared Memory; SGI Altix 3700, scaling Hardware to 512p; Scaling Linux to 512p; Scaling Applications to 512p; next

3 Symmetric Multiprocessor (SMP) [Diagram: CPUs and memories (MEM) joined by an interconnection, shown as either a shared bus or a switch]

4 Non-Uniform Memory Access (NUMA) [Diagram: node 0, node 1, node 2, node 3, ... node n, each with a CPU and its own memory (MEM), joined by an interconnection]

5 Sharing data on a Cache-Coherent NUMA (ccNUMA) system [Diagram: Node m, Node k, Node n; CPU with registers (Reg), L2 and L3 caches on the FSB; node memory (Mem); nodes joined by the Interconnection Fabric]

6 MESI protocol: Cache Coherence on the CPU side (Node n: Reg, L2, L3, FSB). For each cache line (128 bytes) the coherency state is one of: Modified = Dirty Exclusive (DEX); Exclusive = Clean Exclusive (CEX); Shared = Shared (SHD); Invalid = Invalid (INV)
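The four states above can be read as a small state machine per cache line. The following is a minimal sketch of textbook MESI transitions, using the slide's state names; the event names and the `others_have_copy` flag are assumptions made for illustration, and the sketch does not claim to reproduce the Itanium 2 or SHub behaviour in detail.

```python
from enum import Enum


class State(Enum):
    DEX = "Modified (dirty exclusive)"
    CEX = "Exclusive (clean exclusive)"
    SHD = "Shared"
    INV = "Invalid"


def next_state(state, event, others_have_copy=False):
    """Return the next state of one cache line in one cache (textbook MESI)."""
    if event == "local_read":
        if state is State.INV:
            # A miss fills the line: shared if another cache holds it,
            # otherwise clean exclusive.
            return State.SHD if others_have_copy else State.CEX
        return state  # read hits do not change the state
    if event == "local_write":
        return State.DEX  # the writer ends up with the only, dirty copy
    if event == "remote_read":
        # A valid copy is downgraded to shared; an invalid one stays invalid.
        return State.INV if state is State.INV else State.SHD
    if event == "remote_write":
        return State.INV  # another cache takes ownership
    raise ValueError(f"unknown event: {event}")


line = State.INV
for ev in ("local_read", "local_write", "remote_read", "remote_write"):
    line = next_state(line, ev)
    print(ev, "->", line.name)
```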

7 Directory-based coherency: Cache Coherence on the Memory side (Node k: Mem, dir). Directory information is in memory DIMMs. Directory takes only 3% of total memory. For each cache line, available information: Line state (Unowned, Exclusive, Shared, Busy, ...), 2 bits; Sharing Vector, list of nodes (256 max), 24 bits; Priority; Protection; ECC
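The 3% figure can be sanity-checked against the 128-byte cache line quoted earlier. The sketch below only does that arithmetic; the 32-bit entry size passed to `overhead` is an assumption chosen for illustration, not a number taken from the slides.

```python
LINE_BYTES = 128            # cache line size from the MESI slide
DIRECTORY_FRACTION = 0.03   # "3% of total memory" from this slide

line_bits = LINE_BYTES * 8                      # 1024 data bits per line
budget_bits = line_bits * DIRECTORY_FRACTION    # directory bits available per line


def overhead(entry_bits, line_bytes=LINE_BYTES):
    """Fraction of memory used if every line carries entry_bits of directory."""
    return entry_bits / (line_bytes * 8)


print(f"{budget_bits:.2f} directory bits per line at 3%")
# Assumption: a hypothetical 32-bit entry, which would hold the 2-bit and
# 24-bit fields quoted on the slide plus a few bits for the remaining fields.
print(f"{overhead(32):.3%} overhead for a 32-bit entry")
```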

8 Directory-based Cache Coherence: sequences of sharing messages [Diagram: load and store sequences between Node N, Node K and memory, showing the messages exchanged, the cache states (CEX, DEX, SHD, INV), the directory states (EXCL, SHD, BUSY) and the sharing-vector bits for nodes N and K]
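The directory side of such a sequence can be sketched as a record per cache line that tracks a state and the set of sharing nodes. This is a deliberately simplified model: it omits the transient BUSY state and the individual protocol messages, and the class and method names are hypothetical.

```python
class DirectoryEntry:
    """Per-cache-line directory record: line state plus the set of sharing nodes."""

    def __init__(self):
        self.state = "UNOWNED"
        self.sharers = set()

    def load(self, node):
        """Handle a read from `node`; return the directory state afterwards."""
        if self.state == "UNOWNED":
            # First reader gets the line exclusively.
            self.state, self.sharers = "EXCL", {node}
        elif not (self.state == "EXCL" and self.sharers == {node}):
            # Another node already holds the line: all holders become sharers.
            self.state = "SHD"
            self.sharers.add(node)
        return self.state

    def store(self, node):
        """Handle a write from `node`; return the nodes that must be invalidated."""
        to_invalidate = sorted(self.sharers - {node})
        self.state, self.sharers = "EXCL", {node}
        return to_invalidate


entry = DirectoryEntry()
print(entry.load("N"), sorted(entry.sharers))    # N reads an unowned line
print(entry.load("K"), sorted(entry.sharers))    # K reads the same line
print(entry.store("N"), entry.state, sorted(entry.sharers))  # N writes
```

In this model the store by node N is what forces an invalidation to be sent to node K, which is the cost that the later advice on limiting shared data tries to avoid.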

9 SGI Altix 3700 Hardware: CPU-Memory brick, SC-Brick (3U) [Diagram: Node 0 and Node 1, each with Itanium 2 processors on an FSB, a SHub and local memory (2x16 GB); Memory: 10.2 GB/s; NL4: 6.4 GB/s; NUMAlink 3/4: 3.2 / 6.4 GB/s; XIO: 2.4 GB/s]

10 SGI Altix: Processors in a Rack. NUMAlink 3 Routers: 2x1.6 = 3.2 GB/s. Bisection BW: 12.8 GB/s, 400 MB/s/p

11 Altix 3700 fat-tree topology for 512p/4TB system. Bisection BW: 400 MB/s/p. Image courtesy: NASA Ames
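The bisection figures on the rack and 512p slides are related by simple arithmetic. The sketch below works through it, assuming 1 GB/s = 1000 MB/s; the processor count per rack and the 512p bisection total are values implied by the quoted numbers, not values read off the slides.

```python
RACK_BISECTION_MB_S = 12_800   # 12.8 GB/s, from the rack slide
PER_PROCESSOR_MB_S = 400       # 400 MB/s/p, quoted on both slides
SYSTEM_PROCESSORS = 512        # size of the fat-tree system

# Processors per rack implied by 12.8 GB/s at 400 MB/s per processor.
implied_rack_processors = RACK_BISECTION_MB_S // PER_PROCESSOR_MB_S

# Bisection bandwidth implied for 512p if 400 MB/s/p is maintained.
implied_system_bisection_gb_s = SYSTEM_PROCESSORS * PER_PROCESSOR_MB_S / 1000

print(implied_rack_processors, "processors per rack (implied)")
print(implied_system_bisection_gb_s, "GB/s bisection for 512p (implied)")
```

The point of the fat-tree is visible in the constant per-processor figure: bisection bandwidth grows in proportion to the processor count rather than being diluted as the system scales.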

12 Linux for Altix 3000 / Linux 2.4 kernel. Layers: SGI Value-Added Enhancements; SGI Open Source Enhancements; Standard Linux Distribution; Boot/Driver CD. Differentiating Features and Functionality: CPU sets / Memory placement; MPT & Array Services; Hierarchical storage management tools; Partitioning; XVM, XSCSI; PCP. Enabling Features and Functionality: Latest bugfixes, other supported device drivers, etc.; Comprehensive system accounting (CSA); Job containers (PAGG); XFS. Base OS and Common Open Source Apps: Platform support, error handling, scaling, NUMA; O(1) Scheduler; Device Drivers; SGI XFS installer. Also listed: SCSL, CHASE, CXFS, dplace, runon

13 Linux for Altix 3000. NUMA support: discontiguous memory support; VM support; text replication; process-to-processor binding, local memory allocation. Partitioning support (shared memory clusters), e.g. configure 256p as 4x64p. Scalability enhancements. Fixes

14 O(1) Scheduler: scheduler improvements required. Standard Linux scheduler: poor cache usage, too many task migrations; heavy contention for runqueue_lock. O(1) scheduler: tasks stick to processors; no global runqueue_lock (multiple run queues); 6x improvement on some benchmarks. Still needs work: migration livelock with lots of idle processors (fixed); NUMA awareness (being worked on by us & community); fairness in overcommitted workload situations; no gang scheduling concept

15 CPU and Memory Allocation: required to get good NUMA performance. Want memory allocated locally if possible, to exploit local memory bandwidth. Want to bind processes to specific processors, to avoid cache damage caused by migration and to keep processes close to where their data is allocated. Want processors working on a single HPC application to be close to one another, to minimize memory latency when cross-processor memory references are required. Don't want to have to recompile the application

16 CPU and Memory Allocation. dplace command: dplace -c0-7 -s3 program. Assumes 9 processes will be created: 8 will be bound to CPUs 0-7; the 3rd process created will be skipped (not bound), e.g. for pthreads shepherd processes. Storage will be allocated locally (first-touch allocation rule). cpusets provide finer-grain control, e.g.: boot cpuset for interactive jobs and kernel data; compute cpuset(s) to isolate applications
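The placement described for `dplace -c0-7 -s3` can be modelled as a mapping from process creation order to CPU. The sketch below follows the slide's description only (the 3rd process created is left unbound); it is not dplace's implementation, the function names are hypothetical, and leaving processes unbound once the CPU list runs out is an assumption.

```python
def parse_cpu_list(spec):
    """Expand a CPU list such as "0-7" or "0-3,8" into a list of CPU numbers."""
    cpus = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def place(cpu_spec, skipped, n_processes):
    """Map creation order (1-based) to a CPU, or None for an unbound process."""
    cpus = iter(parse_cpu_list(cpu_spec))
    placement = {}
    for n in range(1, n_processes + 1):
        # The skipped process stays unbound; the others take the next CPU.
        placement[n] = None if n == skipped else next(cpus, None)
    return placement


for proc, cpu in place("0-7", skipped=3, n_processes=9).items():
    print(f"process {proc}: {'not bound' if cpu is None else f'CPU {cpu}'}")
```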

17 Stream Triad Benchmark Results (GB/s)
Systems charted: SGI Altix 3000 (512P), NEC SX-7 (32P), NEC SX-5 (16P), SGI Altix 3000 (256P), NEC SX-4 (32P), HP AlphaServer GS1280 (64P), Cray T3E (32P), SGI Altix 3000 (128P), NEC SX-6 (8P), SGI Altix 3000 (64P), Cray C90 (16P), SGI Origin 3800 (256P), HP Integrity SuperDome (64P), SGI Altix 3000 (32P), IBM eserver p690+ (32P), Sun Fire 15K (72P), SGI Origin 2000 (256p), Cray SV1 (32P), HP AlphaServer ES80 (8P), IBM eserver p670+ (16P), IBM eserver p690 Turbo (32P), SGI Altix 3000 (16P), Cray Y-MP (8P), HP SuperDome 750 (64P), IBM eserver p690 HPC (16P)
Chart values: 127,3 103,8 99,7 84,2 63,7 58,9 50,7 49,3 47,8 44,5 36,8 32,2 31,9 26,8 26,5 25, ,7 359,3 488,3 583,1 872,3 1007,83
Measures memory bandwidth performance: simple loops; embarrassingly parallel; easy for compiler to generate scalable code. No fancy flags: -O3
STREAM Triad:
!$OMP PARALLEL DO
      DO j = 1,n
         a(j) = b(j) + s*c(j)
      END DO
Source: SGI Altix 3000 Series Performance Report 1.6 (9/10/03), customer-provided data, & STREAM website (11/13/03)
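To make the bandwidth accounting behind the Triad figures concrete, here is a minimal single-threaded sketch of the same loop. It assumes 8-byte elements and counts two loads and one store per element, which is the usual STREAM convention; the array size and helper names are arbitrary choices, and the rate it prints reflects the Python interpreter, not memory hardware.

```python
import time
from array import array


def triad(a, b, c, s):
    """a(j) = b(j) + s*c(j) for every j, as in the STREAM Triad loop."""
    for j in range(len(a)):
        a[j] = b[j] + s * c[j]


def triad_bandwidth_gb_s(n, seconds):
    """GB/s for one pass: two 8-byte loads and one 8-byte store per element."""
    return 3 * 8 * n / seconds / 1e9


n = 200_000  # arbitrary size for illustration
a = array("d", [0.0]) * n
b = array("d", [1.0]) * n
c = array("d", [2.0]) * n

start = time.perf_counter()
triad(a, b, c, 3.0)
elapsed = time.perf_counter() - start
print(f"{triad_bandwidth_gb_s(n, elapsed):.3f} GB/s (interpreted Python, one thread)")
```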

18 MPI across partitions: Latency vs. Distance [Plot: MPI Point-to-Point Latency (from CPU 0); 2x64p 1.5 GHz Altix 3700 Supercluster, using MPT 1.10 on ProPack 2.4; 8-byte transfer size; y-axis: Time (usec), 0.5 to 2.5; x-axis: Destination CPU, across Host A and Host B]

19 IFS (ECMWF): a scalable MPI application [Plot: Total Time, Communication Time, Parallel Speedup and ideal speedup versus Number of processors; axes: Time and Speedup; efficiency 78%]
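Speedup and efficiency figures such as the 78% above are usually derived from run times as follows. This is a sketch of the standard definitions only; the timings in the example are hypothetical numbers picked to land near 78% and are not the IFS measurements.

```python
def speedup(t_serial, t_parallel):
    """Speedup relative to the reference run time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, processors):
    """Parallel efficiency: achieved speedup divided by the ideal speedup."""
    return speedup(t_serial, t_parallel) / processors


# Hypothetical timings, for illustration only: 1000 s on one processor,
# 20 s on 64 processors.
print(f"speedup    = {speedup(1000.0, 20.0):.1f}")
print(f"efficiency = {efficiency(1000.0, 20.0, 64):.0%}")
```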

20 Scaling OpenMP Cart3D (Aerodynamics) on SGI Altix at NASA Ames [Plot: CPUs Achieved and % efficiency versus CPUs Used] Source: NASA Ames Research Center

21 Efficient Shared Memory Parallelism: SMP, Pthreads / OpenMP. Highest possible level of parallelism (forget incremental). Limit global commons/arrays to what is necessary. Make sharing explicit. Use [thread]private local variables/arrays. Replicate global data to local space. Remove unnecessary synchronization. Remove big global locks: use individual locks for critical code or data. SMP enables Dynamic Load Balancing
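The advice on private variables and small critical sections can be illustrated with a reduction in which each thread accumulates privately and touches shared state once. This is a sketch of the pattern only: the function name is hypothetical, and because of CPython's global interpreter lock it shows the structure rather than any speedup.

```python
import threading


def parallel_sum(data, n_threads=4):
    """Sum `data` using per-thread private accumulators and one short lock."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        local = 0                 # thread-private accumulator: no sharing
        for x in chunk:
            local += x
        with lock:                # one brief critical section per thread
            total += local

    # Each thread gets its own strided slice of the input.
    chunks = [data[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


print(parallel_sum(list(range(1000))))
```

The alternative of taking the lock on every element would make the shared total a point of contention, and on a ccNUMA machine its cache line would bounce between nodes on each update.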

22 SGI Altix 3000 future evolutions: Faster Itanium 2 processors; NUMAlink 4 routers (2x NL3 BW); Higher processor density; Reduced Remote Latency; Faster/Denser Memory DIMMs; Faster FSB (front side bus); Larger Cache Coherence domains; Larger Shared Memory systems

23 Next generation: UV Architecture Vision. UV Petascale GAM: Globally Addressable Memory. Low Latency. High Bandwidth. O(10K) Ports. [Diagram: MPU, IO and APU (GPU) units attached to the UV GAM]

24 Altix 3700 SSSI - Architecture and Software environment Questions?
