xsim The Extreme-Scale Simulator

1 xsim The Extreme-Scale Simulator
Janko Strassburg, Severo Ochoa, BSC, 28 Feb 2014

2 Motivation
- Future exascale systems are predicted to have hundreds of thousands of nodes and thousands of processors per node
- Application scalability is limited by sequential parts, synchronizing communication and other bottlenecks
- Investigating the performance of parallel applications at scale is an important component of HPC hardware/software co-design:
  - Behaviour on future architectures
  - Performance impact of architecture choices

3 Overview
- Several existing simulators: JCAS, BigSim, Dimemas, MuPi
- Limitations (run time, number of concurrent threads executed, ...)
- Highly scalable solution: trade accuracy in exchange for scalability
  - Nodes are oversubscribed for the simulation
  - Highly accurate simulations are extremely slow and less scalable
- The Extreme-scale Simulator permits running an HPC application in a controlled environment with millions of concurrent execution threads, while observing its performance in a simulated extreme-scale HPC system using architectural models and virtual timing

4 Overview
- Parallel discrete event simulation (PDES) to emulate the behaviour of various architecture models
- Execution of real applications, algorithms or their models atop a simulated HPC environment for:
  - Performance evaluation, including identification of resource contention and underutilization issues
  - Investigation at extreme scale, beyond the capabilities of existing simulation efforts
- S. Boehm and C. Engelmann. xsim: The Extreme-Scale Simulator. HPCS 2011, Istanbul, Turkey, July 4-8, 2011.

5 Overview
- Combines highly oversubscribed execution, a virtual MPI, and a time-accurate PDES (parallel discrete event simulation)
- The PDES uses the native MPI and simulates virtual processors
- The virtual processors expose a virtual MPI to the application
- A multithreaded MPI implementation is needed (e.g. Open MPI built with --enable-mpi-thread-multiple); see the check sketched below
- (2010 IEEE Cluster Co-Design Workshop)
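
As a quick illustration of the thread-support requirement above, an application (or runtime layer) can request and verify MPI_THREAD_MULTIPLE through the standard MPI API. This is a generic sketch, not xsim code:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int provided;
        /* Request full multi-threading support from the native MPI library. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            fprintf(stderr, "MPI_THREAD_MULTIPLE not available (level %d)\n", provided);
            MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
        }
        MPI_Finalize();
        return 0;
    }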

6 Overview
- The simulator is a library
- It utilizes PMPI to intercept MPI calls and to hide the PDES (see the sketch below)
- Easy to use:
  - Replace the MPI header
  - Compile and link with the simulator library
  - Run the MPI program
- Support for C and Fortran MPI applications
- (2010 IEEE Cluster Co-Design Workshop)
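
The PMPI profiling interface mentioned above lets a library export its own MPI_* entry points and forward each call to the real implementation through the PMPI_* names. The following is a generic illustration of that mechanism, not xsim's actual source:

    #include <mpi.h>

    /* Intercepted MPI_Send: the interposition library provides this symbol,
     * performs its own bookkeeping (placeholder comment below), and then
     * forwards the call to the underlying MPI library via PMPI_Send. */
    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        /* ...e.g. advance virtual time and consult the network model... */
        return PMPI_Send(buf, count, datatype, dest, tag, comm);
    }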

7 Overview
- xsim is designed like a traditional performance tool:
  - Interposition library sitting between the MPI application and the MPI library
  - Uses simulated wall-clock time for measurement
  - Performance data extracted based on the processor and network models
  - MPI performance tool interface (PMPI)
- Supports:
  - Simulated MPI point-to-point communication (essential calls)
  - Simulated MPI data types, groups, communicators, collective communication (full)
  - 81 simulated MPI calls each for C and Fortran
  - ULFM MPI extensions

8 Comparison to Dimemas
- Online simulator; a change in the model requires a rerun
- Batch simulations through scripts, configuration files and command-line options
- Application model available
- Oversubscribes nodes: simulations larger than the underlying system are possible
- Support for multi-threaded MPI implementations
- Fault tolerance support through Open MPI ULFM
- Version with locks available, albeit significantly slower
  - No two calls in the MPI comm at the same time
  - Non-blocking calls -> blocking calls

9 Simulation Models
- Processor model
  - Based on the actual execution time on the underlying hardware
  - Scaled up for the simulated processor speed
  - Heterogeneous cores with differing speeds
- Support for various network architecture models
  - Analyze existing hardware conditions / experiment with differing architectures
  - Latency and bandwidth restrictions (see the cost-model sketch below)
  - Hierarchical combinations: network on chip, network on node
  - Sender/receiver contention simulation; full contention not supported for scalability reasons
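
For illustration, the simplest latency/bandwidth restriction of the kind listed above charges a fixed latency plus a bandwidth-proportional term per message. This sketch only assumes the general idea and ignores the hierarchical and contention aspects the simulator also models:

    #include <stddef.h>

    /* Virtual transfer time of one message under a plain latency/bandwidth
     * model: t = latency + bytes / bandwidth (seconds, bytes per second). */
    double message_time(double latency_s, double bandwidth_Bps, size_t bytes)
    {
        return latency_s + (double)bytes / bandwidth_Bps;
    }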

10 Simulation Models
- Application model
  - Similar to MPI trace replays: same timing and communication behaviour
  - Certain resources do not need to scale with the simulation (e.g. memory usage)
  - Advances the virtual time of the application between MPI calls (see the sketch below)
  - Executes MPI calls without actually sending data (no need for buffers)
- Operating system noise simulation
- File system model
  - Currently in development
  - Read/write delay, access time, congestion, ...
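
A rough sketch of the virtual-time bookkeeping described above: the real compute time elapsed since the previous MPI call is scaled by the processor-model factor and added to the rank's virtual clock. This is illustrative C, not xsim internals:

    #include <mpi.h>

    static double virtual_clock  = 0.0;  /* simulated time of this virtual rank  */
    static double last_real_time = 0.0;  /* real time at the previous intercepted
                                            call; set once at initialization     */

    /* Called on entry to every intercepted MPI call: charge the compute phase
     * since the last call to the virtual clock, scaled by the processor model
     * (a 0.5x processor makes the same computation take twice as long). */
    void advance_virtual_time(double speed_scale)
    {
        double now = MPI_Wtime();
        virtual_clock += (now - last_real_time) / speed_scale;
        last_real_time = now;
    }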

11 Network Models
- Unidirectional ring
- Star
- Tree
- Mesh

12 Network Models
- Torus
- Twisted torus

13 Network Models
- Twisted torus with toroidal jump
- Twisted torus with toroidal degree

14 General Usage of xsim
- Add the header file: #include "xsim-c.h" or #include "xsim-f.h"
- Recompile and link with the library
  - Library flag: -lxsim
  - Programming-language interface flags: -lxsim-c or -lxsim-f
- Run the application in the simulator:
  mpirun -np <real process count> <application> <application args> -xsim-np <virtual process count> <xsim args>
- A minimal example follows below
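
Putting the steps above together, a minimal C MPI program run under xsim could look as follows. The header name and library flags are taken from this slide; the exact build line is an assumption and may differ between xsim versions:

    #include "xsim-c.h"   /* replaces the usual MPI header */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello from virtual rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

    /* Possible build and run, following the flags listed above:
     *   mpicc hello.c -lxsim -lxsim-c -o hello
     *   mpirun -np 16 ./hello -xsim-np 1024
     */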

15 Examples: Hello World
- Scaling hello world from 1,000 to 100,000,000 cores
- Native system: 12-core, 2-processor, 39-node Gigabit Ethernet cluster
- Simulated system: 100,000,000-processor Gigabit Ethernet system
- xsim runs on up to 936 AMD Opteron cores and 2.5 TB RAM
- 468 or 936 cores needed for 100,000,000 simulated processes
- 100,000,000 x 8 kB = 800 GB of virtual MPI process stacks
- (Figure: simulated time and xsim run time on 936 cores)

16 Examples: Basic Network Model
- The model allows defining the network architecture, latency and bandwidth
- Basic star network
- The model can be set to 0 μs and ... Gbps as a baseline
- 50 μs and 1 Gbps roughly represented the native test environment:
  - 4 Intel dual-core 2.13 GHz nodes with 2 GB of memory each
  - Ubuntu Linux
  - Open MPI with multi-threading support
- (2010 IEEE Cluster Co-Design Workshop)

17 Example: Processor Model
- The model allows scaling the relative speed to a different processor
- Basic scaling model; can be set to 1.0x for baseline numbers
- MPI hello world scales to 1M+ VPs on 4 nodes with 4 GB total stack (4 kB/VP)
- Simulation (application): constant execution time; <1024 VPs: noisy clock
- Simulator: >256 VPs: output buffer issues
- (Figure: simulated time and xsim run time at 0 µs / ... Gbps / 1.0x)
- (2010 IEEE Cluster Co-Design Workshop)

18 Example: Basic PI Monte Carlo Solver
- Network model: star, 50 μs and 1 Gbps
- Processor model: 1x and 0.5x (32 kB stack/VP)
- Simulation: perfect scaling
- Simulator: <= 8 VPs: 0% overhead on the 8 processor cores; >= 4096 VPs: communication load dominates
- (Figure: simulated time and xsim run time at 50 µs / 1 Gbps for the 1.0x and 0.5x processor models)
- (2010 IEEE Cluster Co-Design Workshop)

19 Examples: NAS Parallel Benchmark
- Scaling CG and EP class B problems
- 1 to 128 simulated cores
- Native system: 4-core, 2-processor, 16-node

20 Examples: NAS Parallel Benchmark
- Scaling CG and EP class A problems
- CG: 1 to 4096 simulated cores; EP: 1 to ... cores
- Native system: 4-core, 2-processor, 16-node
- (Figure: CG.B and EP.B simulated time and total run time)

21 Examples: MCMI Core Scaling
- 960-core system
- 240 cores used for the simulation due to memory bandwidth restrictions

22 Examples: MCMI Problem Scaling
- Linear behaviour up to a 2000x2000 matrix size
- Slight degradation for larger problem sizes

23 Examples: MCMI MPI Message Count Scaling
- The simulator also gathers MPI statistics
- Linear increase in exchanged messages

25 Fault Tolerance Properties
- Fault tolerance is a property of a program, not of an API specification or an implementation.
- Within certain constraints, MPI can provide a useful context for writing application programs that exhibit significant degrees of fault tolerance.

26 Advanced Features: Resilience and Fault Tolerance
- xsim fully supports error handling within the simulated MPI:
  - Default MPI error handlers
  - User-defined MPI error handlers (see the sketch below)
  - MPI_Abort()
- A simulated abort terminates the simulation and provides:
  - Performance results
  - Source of the abort
  - Time of the abort
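
As an example of the user-defined error handlers supported above, the standard MPI API for creating and installing one looks like this; the handler body is purely illustrative:

    #include <mpi.h>
    #include <stdio.h>

    /* Custom communicator error handler: report the error instead of aborting. */
    static void report_error(MPI_Comm *comm, int *errcode, ...)
    {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(*errcode, msg, &len);
        fprintf(stderr, "MPI error reported: %s\n", msg);
    }

    void install_handler(void)
    {
        MPI_Errhandler eh;
        MPI_Comm_create_errhandler(report_error, &eh);
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, eh);
    }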

27 Advanced Features: Resilience and Fault Tolerance
- Developing and debugging of FT applications
  - Permits injection of MPI process failures
  - Propagation/detection/notification of failures within the simulation
  - Handles application-level checkpoint/restart
  - Observation of application behaviour and performance under failure is possible
- Support for the User-Level Failure Mitigation (ULFM) extension
- Investigate and develop Algorithm-Based Fault Tolerance (ABFT) applications
- xsim is the first performance tool to support ULFM and ABFT

28 User-Level Failure Mitigation (ULFM)
- Fault-tolerant MPI extension
- Proposed by the MPI 3.0 Fault Tolerance Working Group
- To be presented and voted upon in the MPI Forum in the coming months for integration into the upcoming MPI 3.1 standard
- Minimal set of changes necessary for applications and libraries to include fault-tolerance techniques and to construct further forms of fault tolerance (transactions, strongly consistent collectives, etc.)

29 User-Level Failure Mitigation (ULFM)
- Three main concepts:
  - Simplicity: the API is easy to use and understand
  - Flexibility: the API allows varied fault-tolerant models to be built as external libraries
  - Absence of deadlock: no MPI call (point-to-point or collective) can block indefinitely after a failure; calls must either succeed or raise an MPI error
- The default error handler needs to be changed to use ULFM: on at least MPI_COMM_WORLD, from MPI_ERRORS_ARE_FATAL to MPI_ERRORS_RETURN or a custom MPI error handler

30 User-Level Failure Mitigation (ULFM)
- Exceptions raised: MPI_ERR_PROC_FAILED, MPI_ERR_PROC_FAILED_PENDING, MPI_ERR_REVOKED
- Acknowledgement: MPI_Comm_failure_ack, MPI_Comm_failure_get_acked
- Handling: MPI_Comm_shrink, MPI_Comm_revoke, MPI_Comm_agree, MPI_Comm_iagree
- A typical recovery idiom is sketched below
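
A sketch of how the calls listed above are typically combined into a recovery path. Current ULFM implementations usually expose these with an MPIX_ prefix (e.g. MPIX_Comm_revoke); the names below follow the slide, and the surrounding logic is illustrative rather than prescriptive:

    #include <mpi.h>

    /* Send with failure handling: on a reported process failure, revoke the
     * communicator so every rank learns of it, then shrink to the survivors. */
    int send_with_recovery(const void *buf, int n, int peer, int tag,
                           MPI_Comm comm, MPI_Comm *survivors)
    {
        /* Errors must be reported to the caller instead of aborting (ULFM). */
        MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

        int rc = MPI_Send(buf, n, MPI_DOUBLE, peer, tag, comm);
        int err_class = MPI_SUCCESS;
        if (rc != MPI_SUCCESS)
            MPI_Error_class(rc, &err_class);

        if (err_class == MPI_ERR_PROC_FAILED || err_class == MPI_ERR_REVOKED) {
            MPI_Comm_revoke(comm);             /* propagate the failure      */
            MPI_Comm_shrink(comm, survivors);  /* communicator of live ranks */
            /* ...redistribute work and continue on *survivors... */
        }
        return rc;
    }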

31 ULFM in xsim
- MPI_Comm_revoke()
  - Linear broadcast of the failure notification through the simulated runtime
  - No matching receive: releases any waited-on send or receive request with MPI_ERR_PROC_FAILED / MPI_ERR_PROC_FAILED_PENDING
- MPI_Comm_shrink()
  - Two-phase commit protocol to establish the list of failed MPI ranks
  - Fault-tolerant linear reduction and broadcast operations
- MPI_Comm_agree()
  - Agreement on a single value (logical AND operation by live members)
  - Fault-tolerant linear MPI_Allreduce() implementation
- MPI_Comm_failure_ack() and MPI_Comm_failure_get_acked()
  - Failure registry per rank and per communicator with low memory overhead (bit arrays)

35 Conclusions
- The Extreme-scale Simulator (xsim) is a performance investigation toolkit
- Uses oversubscription to model systems larger than the underlying hardware
- Supports processor, network, application and noise models; a file system model is under development
- First performance toolkit to support MPI process failure injection, checkpoint/restart and ULFM
- Forecasting behaviour on varying systems is possible
- Time and resource savings via simulation
