Failover procedure for Grid core services

Kai Neuffer
COD-15, Lyon
www.eu-egee.org
EGEE and gLite are registered trademarks

Overview

List of Grid core services
- Top level BDII
- Central LFC
- VOMS server
- WMS-LB/RB
- FTS?
- Metadata servers (AMGA, 3D, etc.)
- MyProxy

Site Grid services
- CE
- Site BDII
- Local LFC
- MON-Box
- UIs/VOBOX
- Local metadata servers

Failover levels

- Central service failover without shared data dependence
  - BDII, WMS, ...
- Central service failover with shared data dependence
  - LFC, VOMS, ...
- On-site service failover (could be combined with load balancing)
  - All Grid services and also site services (PBS, DNS, etc.)

Failover scheme: Independent central services

[Diagram: BDII 1 at Site 1 and BDII 2 at Site 2, linked by "Service recovery"]
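Because the two BDII instances share no data, a client can simply fall back to the second one when the first is unreachable. The following is a minimal sketch of that idea, not part of the original slides: the host names are hypothetical, and port 2170 is assumed as the usual BDII LDAP port.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple

Endpoint = Tuple[str, int]


def tcp_probe(endpoint: Endpoint, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def first_reachable(
    endpoints: Iterable[Endpoint],
    probe: Callable[[Endpoint], bool] = tcp_probe,
) -> Optional[Endpoint]:
    """Return the first endpoint that answers the probe, in list order."""
    for endpoint in endpoints:
        if probe(endpoint):
            return endpoint
    return None


# Hypothetical replicas; the order expresses preference.
BDII_REPLICAS = [
    ("bdii1.site1.example.org", 2170),
    ("bdii2.site2.example.org", 2170),
]
```

The probe is passed in as a parameter so that the selection logic can be exercised without network access.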

Dependent services 1

Failover scheme with service redirection: DNS alias, etc.

[Diagram: a "Virtual LFC" alias points to LFC 1 at Site 1 or LFC 2 at Site 2. Each LFC has its own DB backend (DB Backend 1, DB Backend 2), kept consistent by DB synchronization.]
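The redirection step can be sketched as an alias that is repointed when its current target fails a health check. This is an illustrative in-memory model only: the class, the host names and the health check are assumptions, and a real deployment would change the DNS record itself.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ServiceAlias:
    """In-memory stand-in for a DNS alias such as the "Virtual LFC"."""

    name: str
    targets: List[str]  # real hosts, in order of preference
    current: str = ""

    def __post_init__(self) -> None:
        if not self.current:
            self.current = self.targets[0]

    def redirect_if_needed(self, is_healthy: Callable[[str], bool]) -> bool:
        """Repoint the alias to the first healthy target.

        Returns True if the alias was changed, False otherwise.
        """
        if is_healthy(self.current):
            return False
        for target in self.targets:
            if is_healthy(target):
                self.current = target
                return True
        return False  # nothing healthy: leave the alias untouched
```

Clients keep using the alias name throughout, so the switch from LFC 1 to LFC 2 is invisible to them as long as the DB backends are synchronized.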

Dependent services 2

Failover scheme with service recovery

[Diagram: LFC 1 at Site 1 and LFC 2 at Site 2, linked by "Service recovery". Each LFC has its own DB backend (DB Backend 1, DB Backend 2), kept consistent by DB synchronization.]
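With shared data, recovery on the second site only makes sense if its DB backend is reasonably up to date. The sketch below shows one possible decision rule; the lag threshold and the action names are assumptions made for illustration and do not come from the slides.

```python
def recovery_action(
    primary_up: bool,
    standby_lag_s: float,
    max_lag_s: float = 60.0,  # assumed tolerance, tune per service
) -> str:
    """Decide what to do with the secondary service instance.

    primary_up    -- whether the primary service (e.g. LFC 1) answers
    standby_lag_s -- how far DB Backend 2 lags behind DB Backend 1, in seconds
    """
    if primary_up:
        return "none"
    if standby_lag_s <= max_lag_s:
        return "start_standby"
    # The replica is stale: starting the service could expose old data,
    # so the decision is left to an operator.
    return "needs_operator"
```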

Site service failover 1: traditional cluster

[Diagram: cluster IP failover. Node 1 (Real IP 1) and Node 2 (Real IP 2) monitor each other via heartbeat; the virtual IP is held by whichever node is active. Both nodes attach to shared storage: SAN, iSCSI or DRBD with a cluster filesystem (GFS2, OCFS). Shared storage is not necessary for BDII, LFC.]

The service runs on one node and is started on the other in case of a hardware failure.
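The heartbeat logic on the standby node can be sketched as follows. This is a simplified model, not the behaviour of any particular cluster software: the class name, the `dead_time` parameter and the explicit `now` timestamps are assumptions chosen to keep the example deterministic.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Runs on the standby node and watches the active peer."""

    dead_time: float  # seconds of silence before the peer is declared dead
    last_seen: float = 0.0
    holds_virtual_ip: bool = False

    def beat(self, now: float) -> None:
        """Record a heartbeat received from the peer at time `now`."""
        self.last_seen = now

    def check(self, now: float) -> bool:
        """Take over the virtual IP if the peer has been silent too long.

        Returns True only on the call in which the takeover happens.
        """
        if not self.holds_virtual_ip and now - self.last_seen > self.dead_time:
            # A real cluster would bring up the virtual IP and start the service here.
            self.holds_virtual_ip = True
            return True
        return False
```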

Site service failover with VM

[Diagram: VM cluster with HA. Node 1 (Dom0): DomU 1 active, DomU 2 not active. Node 2 (Dom0): DomU 1 not active, DomU 2 active. More nodes and VMs are possible. Shared files: images of DomU 1 and DomU 2, Xen configuration of DomU 1 and DomU 2, held on GFS2 / CLVS over SAN, iSCSI or DRBD partitions.]

The service runs on a VM on one node and live-migrates to another in case of hardware failure.
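Since every node sees the same VM images and configuration, any surviving node can take over a VM from a failed one. A minimal sketch of such a reassignment, assuming a simple "least loaded survivor" rule that is not stated in the slides:

```python
from typing import Dict, List


def reassign_vms(
    placement: Dict[str, str], nodes: List[str], failed: str
) -> Dict[str, str]:
    """Return a new VM-to-node placement after `failed` has gone down.

    VMs on the failed node go to the surviving node that currently runs
    the fewest VMs; ties are broken by node name.
    """
    survivors = [n for n in nodes if n != failed]
    if not survivors:
        raise RuntimeError("no surviving node to host the VMs")

    load = {n: 0 for n in survivors}
    for node in placement.values():
        if node in load:
            load[node] += 1

    new_placement = dict(placement)
    for vm in sorted(placement):
        if placement[vm] == failed:
            target = min(survivors, key=lambda n: (load[n], n))
            new_placement[vm] = target
            load[target] += 1
    return new_placement
```

With the two-node layout from the slide, a failure of node 1 leaves both DomU 1 and DomU 2 on node 2.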

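Load balancing with failover

[Diagram: cluster IP failover. Two load-balancing machines, LVS 1 (Real IP 1) and LVS 2 (Real IP 2), monitor each other via heartbeat and share a virtual IP. The active LVS distributes requests to the service nodes (node 1, node 2), which may or may not be clustered themselves.]

The service is load balanced by the redundant LVS server.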
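The combination of load balancing and failover at the service-node level can be illustrated with a round-robin scheduler that skips nodes marked as down. This is a toy model under stated assumptions: it does not reproduce LVS itself, and the class and method names are invented for the example.

```python
from typing import Iterable, List, Set


class RoundRobinBalancer:
    """Distributes requests over service nodes, skipping unhealthy ones."""

    def __init__(self, nodes: Iterable[str]) -> None:
        self.nodes: List[str] = list(nodes)
        self.healthy: Set[str] = set(self.nodes)
        self._next = 0  # index of the next candidate node

    def mark_down(self, node: str) -> None:
        self.healthy.discard(node)

    def mark_up(self, node: str) -> None:
        if node in self.nodes:
            self.healthy.add(node)

    def pick(self) -> str:
        """Return the next healthy node in round-robin order."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy service node available")
```

The balancer itself remains a single point of failure, which is why the slide pairs two LVS machines behind a virtual IP.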

Conclusions 1

- Service recovery should be implemented for all Grid services where it is possible
  - Failover is reached by installing a secondary service server
  - Not possible for all Grid services
- For some important VO services, decentralized hosting could be of interest (LFC, VOMS, ...)
  - Not dependent on a single site
  - Technically complicated
  - Higher costs (Oracle licenses, etc.)
- Site service clustering enables failover at the site
  - Service runs as on a single machine, but with failover
  - Higher costs, depending on the storage solution
  - Each Grid service has to be treated differently
  - Some Grid services are not clusterizable

Conclusions 2

- Service-independent failover with virtual machines
  - Theoretically, all services could be made failover-capable
  - No hardware dependency on the Grid middleware OS
  - Easy maintenance of the services (live migration)
  - Loss of performance on all disk access
  - Higher hardware requirements to get the same performance
  - Higher costs, depending on the shared storage environment
- Service load balancing and failover
  - Enables load balancing with failover, depending on the service
  - Two additional clustered machines needed
  - More complex network structure