Progress Towards Petascale Virtual Machines


1 Progress Towards Petascale Virtual Machines
Al Geist, Oak Ridge National Laboratory
EuroPVM-MPI 2003, Venice, Italy, September 30, 2003


2 Petascale Virtual Machine: Another Kind of PVM
This talk will describe:
- DOE Genomes to Life Project
- PVM use today in the Genome Integrated Supercomputer Toolkit (GIST) for fault tolerance and high availability in a dynamic environment
- Harness Project (next generation of PVM) and its features to help scale to petascale systems
  - Distributed peer-to-peer control
  - H2O, the self-adapting core of Harness
  - FT-MPI, fault tolerant MPI
- Latest superscalable algorithms with natural fault tolerance for petascale environments


3 DOE Genomes to Life Program
Understanding the Essential Processes of Living Systems
Follow-on to Human Genome Program:
- Determined the entire DNA sequence for humans
- 24 chromosomes in 6 ft of DNA
- 3 billion nucleotides code for 35,000 genes
- Only % difference between people
- Instructions to build a human fit on a DVD (3 GB)
Genomes to Life Program goal is to read the instructions, starting with simple single-cell organisms (microbes):
- Molecular Machines
- Regulatory Pathways
- Multi-cell Communities
$100M effort
Develop new computational methods to understand complex biological systems

4 Molecular Machines Fill Cells
Many interlinked proteins form interacting machines
From The Machinery of Life, David S. Goodsell, Springer-Verlag, New York,

5 Regulatory Networks Control the Machines
Gene regulation controls what genes are expressed
- And -
Proteome changes over time and due to environmental conditions

6 GTL will Require Petascale Systems
[Figure: computing capability (1 TF, 10 TF, 100 TF, 1000 TF; TF = teraflops) plotted against biological complexity, with a "Current U.S. Computing" marker. Applications shown: constrained rigid docking; genome-scale protein threading; comparative genomics; constraint-based flexible docking; community metabolic, regulatory, and signaling simulations; molecular machine classical simulation; molecule-based cell simulation; cell, pathway, and network simulation; protein machine interactions; cell-based community simulation.]


7 Biology for the 21st Century
GTL is going to rely on high-performance computing and data analysis to process high-throughput experimental data.
The new computational biology environments will be conceptually integrated, knowledge-enabling environments that couple diverse sets of distributed data, advanced informatics methods, experiments, modeling, and simulation.
[Figure: cycle linking raw data, data analysis, experiment, models, modeling, and simulation across genomes, protein structure, pathways, and regulatory elements, with an example signaling pathway diagram (EGFR, ERK, Grb-2, Shc, endosomes, lysosomes, Golgi).]


8 Genome Integrated Supercomputer Toolkit
GIST is a framework for large-scale biological application deployment:
- provides a transparent and high-performance interface to biological applications
- provides transparent access to distributed data sets
- utilizes PVM to launch and manage jobs across a wide diversity of supercomputers
- highly fault tolerant and adapts to dynamic changes in the environment using PVM
- next step: deploy across ORNL, ANL, PNNL, SNL as a multi-site Bio-Grid
- thousands of users for execution of genome analysis and simulation
[Figure: XML Web portal and protein analysis engine (genomes, pathways, raw data) connected through PVM across heterogeneous supercomputers: P4 Cluster 64 proc, Cray X1 256 proc, IBM p proc, SGI Altix 256 proc.]

9 The GIST Developers really want Harness
They ask us regularly about the next generation of PVM, called Harness, because they want the increased adaptability and fault tolerance that Harness promises.
Harness is being developed by the same team that developed PVM:
- Vaidy Sunderam, Emory University
- Al Geist, Oak Ridge National Lab
- Jack Dongarra, University of Tennessee and ORNL

10 Harness II Design Goals
Harness is a distributed virtual machine environment that goes beyond the features of PVM:
- Allows users to dynamically customize, adapt, and extend a virtual machine's features
  - to more closely match the needs of their application
  - to optimize the virtual machine for the underlying computer resources
- Is being designed to scale to petascale virtual machines
  - distributed control
  - minimized global state
  - no single point of failure
- Allows multiple virtual machines to join and split in temporary micro-grids

11 HARNESS II Architecture
[Figure: a virtual machine spanning Hosts A, B, C, and D that can merge and split with another VM. Operation within the VM uses distributed control. Each host runs a component-based HARNESS daemon, built on top of the H2O kernel with the DVM pluglet loaded, providing DVM, FT-MPI, process control, and user features. Customization and extension by dynamically adding pluglets.]

12 Symmetric Peer-to-Peer Distributed Control
Characteristics:
- No single point (or set of points) of failure for Harness. It survives as long as one member still lives.
- All members know the state of the virtual machine, and their knowledge is kept consistent with respect to the order of changes of state. (Important parallel programming requirement!)
- No member is more important than any other (at any instant), i.e. there isn't a pass-around control token.
For petascale systems, the control members can be a distributed subset of all the processors in the system.

13 Harness Distributed Control
Control is asynchronous and parallel:
- Supports fast host adding
- Fast host delete or recovery from fault
- Parallel recovery from multiple host failures
- Supports multiple simultaneous updates

14 HARNESS: Petascale Virtual Machine
Variable Distributed Control Loop Size
Size of the control loop S: 1 <= S <= (size of VM)
- For a small VM and ultimate fault tolerance, S = (size of VM).
- For a large VM, a random selection of a few hosts (e.g. S = 10) gives a balance of multi-point failure tolerance and performance.
- For S = 1, distributed control becomes a simple client/server model.
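The control loop sizing on this slide can be sketched in a few lines. This is an illustration only, not Harness code: the function name `choose_control_loop` and the host names are hypothetical, and uniform random sampling is an assumption about how "a random selection of a few hosts" might be made.

```python
import random

def choose_control_loop(hosts, s, rng=None):
    """Pick S hosts to act as control members (illustrative sketch only)."""
    if not 1 <= s <= len(hosts):
        raise ValueError("control loop size must satisfy 1 <= S <= size of VM")
    rng = rng or random.Random()
    # Uniform sampling without replacement is an assumption,
    # not a documented Harness policy.
    return rng.sample(hosts, s)

hosts = [f"host{i:04d}" for i in range(1000)]   # hypothetical large VM
loop = choose_control_loop(hosts, 10, random.Random(42))
```

With S = 10 the control state lives on ten hosts, so control survives until all ten are lost. With S = 1 the single selected host plays the role of the server in a client/server model.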

15 H2O kernel - Overview
- H2O is a multithreaded, lightweight kernel that is dynamically configured by loading pluglets.
- Resources are provided as services through pluglets.
- Services may be deployed by any authorized party: provider, client, or third-party reseller.
- H2O is stateless and resource independent.
- In Harness, the DVM service, which includes distributed control of services, must be installed on the host.
- Pluglets can provide multiple programming models: FT-MPI, Java RMI, OGSA, PVM, active objects, P2P.
- Java and C implementations are being developed.
[Figure: clients reach pluglets through functional interfaces (e.g. [Suspendible]); pluglets run on the kernel.]

16 H2O kernel RMIX Communication
H2O is built on top of a flexible P2P communication layer called RMIX:
- Provides interoperability between kernels and other web services
- Adopts common RMI semantics
- Designed for easy porting between protocols
- Dynamic protocol negotiation
- Scalable P2P design
[Figure: Java, Web Services, RPC, and SOAP clients communicating with H2O kernels through the RMIX networking layer over RPC, IIOP, JRMP, and SOAP.]

17 H2O can support a wide range of distributed computing models
Flexibility beyond the PVM/MPI model:
- Grid Web portal, like Genome Channel, Biology workbench
- Web service
- Internet computing, like SETI@home, Entropia, United Devices
- Cluster computing, like PVM, Harness, LAM/MPI
Registration and Discovery: UDDI, JNDI, LDAP, DNS, GIS, phone
[Figure: providers, clients, resellers, and developers publish, find, and deploy pluglets (A, B, C) through repositories, including native code and legacy applications.]


18 Harness Fault Tolerant MPI Plug-in
- FT-MPI is built in layers, with tuned collectives, tuned derived data type handling, and good point-to-point bandwidth.
- Works with MPE profiling and tools such as JUMPSHOT from ANL.
- Application performance on par with MPICH-2.
- FT-MPI available at SC2003.
[Figure: each MPI application links libftmpi and starts through a startup plugin on H2O, alongside a Name Service and Ftmpi_notifier.]

19 Harness Fault Tolerant MPI Plug-in
FT-MPI is a system-level fault tolerant implementation of the full MPI 1.2 specification.
- Process failures are detected and passed back to the user's application using MPI objects.
- The user's application decides how best to reconfigure the system and continue.
Recovery options for affected communicators:
- ABORT: just do as other implementations, i.e. checkpoint/restart
- BLANK: leave hole
- SHRINK: re-order processes to make a contiguous communicator
- REBUILD: re-spawn lost processes and add them to MPI_COMM_WORLD
[Figure: communicator options, with failed ranks marked X.]
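The effect of the four recovery options on rank numbering can be illustrated with a toy model. This sketch does not use or imitate the FT-MPI API: the `recover` function, the process names, and the `respawn` callback are hypothetical, and placing a respawned process back into the failed rank's slot under REBUILD is an assumption.

```python
def recover(comm, failed, mode, respawn=None):
    """Return the rank -> process list of a communicator after failures.

    comm    : list of process names, indexed by rank
    failed  : set of process names that died
    mode    : "ABORT", "BLANK", "SHRINK", or "REBUILD"
    respawn : hypothetical callback that creates a replacement for a rank
    """
    if mode == "ABORT":
        raise RuntimeError("process failure: abort, fall back to checkpoint/restart")
    if mode == "BLANK":
        # Survivors keep their ranks; failed ranks become holes.
        return [None if p in failed else p for p in comm]
    if mode == "SHRINK":
        # Survivors are renumbered so the ranks are contiguous again.
        return [p for p in comm if p not in failed]
    if mode == "REBUILD":
        # Assumption: each replacement takes over the failed process's rank.
        return [respawn(rank) if p in failed else p for rank, p in enumerate(comm)]
    raise ValueError(f"unknown recovery mode: {mode}")

comm = ["p0", "p1", "p2", "p3", "p4"]
failed = {"p1", "p3"}
blank = recover(comm, failed, "BLANK")
shrink = recover(comm, failed, "SHRINK")
rebuild = recover(comm, failed, "REBUILD", respawn=lambda rank: f"p{rank}'")
```

In this model, BLANK leaves `p4` at rank 4 with holes at ranks 1 and 3, while SHRINK moves `p4` down to rank 2. An application that indexes data by rank has to account for that difference when it chooses a mode.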

20 Large-scale Fault Tolerance
Taking fault tolerance beyond checkpoint/restart.
- Developing fault tolerant algorithms is not trivial. Anything beyond simple checkpoint/restart is beyond most scientists.
- Many recovery issues must be addressed.
- Doing a restart of 90,000 tasks because of the failure of 1 task may be a very inefficient use of resources.
- When and what are the recovery options for large-scale simulations?

21 Fault Tolerance: a petascale perspective
Future systems are being designed with 100,000 processors. The time before some failure will be measured in minutes. Checkpointing and restarting this large a system could take longer than the time to the next failure!
What to do? Autonomic? Self-healing?
Development of algorithms that can be naturally fault tolerant, i.e. a failure anywhere can be ignored and the algorithm still gets the right answer:
- No monitoring
- No notification
- No recovery
Is this possible? YES!
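The "measured in minutes" claim can be checked with back-of-the-envelope arithmetic. The sketch below assumes independent processors with exponentially distributed lifetimes, so that failure rates add. The 5-year per-processor mean time between failures and the 30-minute checkpoint/restart cycle are assumed figures for illustration, not numbers from the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60

def system_mtbf_minutes(node_mtbf_years, n_nodes):
    """Mean time between failures of the whole system, in minutes.

    Assumes independent nodes with exponentially distributed lifetimes,
    so the system MTBF is the node MTBF divided by the node count.
    """
    return node_mtbf_years * MINUTES_PER_YEAR / n_nodes

def checkpointing_keeps_up(node_mtbf_years, n_nodes, checkpoint_restart_minutes):
    """True if one checkpoint/restart cycle fits inside the expected time to the next failure."""
    return checkpoint_restart_minutes < system_mtbf_minutes(node_mtbf_years, n_nodes)

# 5 years per processor is an assumed figure, chosen only for illustration.
mtbf = system_mtbf_minutes(5, 100_000)
```

Under these assumptions, a 100,000-processor system sees a failure about every 26 minutes, so a checkpoint/restart cycle that takes 30 minutes would not finish before the next expected failure. The same cycle on 1,000 processors fits comfortably.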

22 Progress on Super-scalable algorithms
Demonstrated that scale invariance and natural fault tolerance can exist for local and global algorithms.
- Finite Difference (Christian Engelman): demonstrated natural fault tolerance with chaotic relaxation, meshless, finite difference solution of Laplace and Poisson problems.
- Global information (Kasidit Chancio): demonstrated natural fault tolerance in the global max problem with random, directed graphs.
- Gridless Multigrid (Ryan Adams): combines the fast convergence of multigrid with the natural fault tolerance property. Hierarchical implementation of the finite difference above. Three different asynchronous updates explored.
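The global max result can be illustrated with a small simulation. This is a minimal sketch and not the algorithm from the talk: it swaps in a simple pull-gossip scheme in which every live node repeatedly asks randomly chosen nodes for their best-known value. The function name `gossip_max`, the failure schedule, and all parameters are hypothetical. A dead node simply never answers, so there is no monitoring, notification, or recovery step.

```python
import random

def gossip_max(values, fail_at, rounds=200, pulls=2, seed=0):
    """Pull-gossip for a global max; returns {node: value} for surviving nodes.

    values  : initial value held by each node
    fail_at : {round: [node ids that die at the start of that round]}
    """
    rng = random.Random(seed)
    n = len(values)
    best = list(values)          # best value each node has seen so far
    alive = [True] * n
    for r in range(rounds):
        for node in fail_at.get(r, ()):
            alive[node] = False
        for node in range(n):
            if not alive[node]:
                continue
            for _ in range(pulls):
                peer = rng.randrange(n)   # random directed edge node -> peer
                if alive[peer]:           # a dead peer simply never answers
                    best[node] = max(best[node], best[peer])
    return {i: best[i] for i in range(n) if alive[i]}

values = [(i * 37) % 101 for i in range(40)]      # 40 distinct values
fail_at = {1: [3, 7], 2: [11], 5: [0, 20, 39]}    # hypothetical failures
survivors = gossip_max(values, fail_at)
```

In this sketch the survivors end up agreeing on one value that is at least the largest value any survivor started with. If the node holding the overall maximum dies before passing it on, the agreed value is the maximum over what remains, which is the sense in which a failure "can be ignored" here.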

23 Further Information
- Genomes to Life
- Harness
- Naturally Fault Tolerant Algorithms
Questions?
