Fault tolerance based on the Publish/Subscribe Paradigm for the BonjourGrid Middleware

University of Paris XIII, Institut Galilée, Laboratoire d'Informatique de Paris Nord (LIPN)
University of Tunis, École Supérieure des Sciences et Techniques de Tunis, Unité de Recherche UTIC

Fault tolerance based on the Publish/Subscribe Paradigm for the BonjourGrid Middleware
Heithem ABBES, Christophe CERIN, Mohamed JEMNI and Walid SAAD
Grid October 2010

Outline
- Introduction
- Objectives
- Design of BonjourGrid
- Integration of Boinc and Condor
- Fault tolerance approach
- Experimentation and validation
- Conclusion and future work

Introduction (1/3)
- P2P systems have brought large improvements to file sharing over the Internet.
- Examples: Gnutella, Kazaa and Freenet
- Decentralized architecture
- No coordination between machines

Introduction (2/3)
- Grid computing: an infrastructure that offers computing power to users' applications.
- Coordination between machines during application execution.
- Centralized or hierarchical architectures (Globus, gLite, Condor).
- Drawbacks:
  - No scalability
  - Complicated installation procedure
  - Complicated configuration phase for an ordinary user

Introduction (3/3)
- Desktop Grids led the community to build computing systems based on volunteer machines.
- Current systems use the Master/Worker model: United Devices, BOINC, PlanetLab, XtremWeb
- Application domains:
  - Global climate prediction (BOINC)
  - Search for extraterrestrial intelligence (SETI@Home)
  - Cosmic ray studies (XtremWeb)
- These systems demonstrate the potential of Desktop Grids, but:
  - They scale poorly because of centralized control
  - They rely on permanent administrative staff to keep the master running

Objectives of BonjourGrid
- Design a fault-tolerant, multi-platform system with multiple coordinators, built on existing desktop grid middleware
- Reduce the centralization factor: no static coordinator
- Benefit from existing decentralized service discovery tools (Publish/Subscribe)
- Create coordinators on demand, automatically and without administrator intervention
- Each coordinator selects the machines that participate in the execution of a given application

Design of BonjourGrid
[Figure label: Coordinator]
- 1 Computing Element (CE) = 1 coordinator + N workers
- 1 instance = 1 CE managed by a middleware
- BonjourGrid controls and orchestrates multiple instances
- Introduces the concept of meta-grids

Design of BonjourGrid
[Figure: elements labelled A, B, C and D]
- A computing element for each user
- No static coordinator
- Each user can specify a middleware for their computing element

Components of BonjourGrid
BonjourGrid is based on:
- A resource discovery protocol
  - Fully decentralized
- A computing element
  - Executes and handles the various tasks of an application (Condor, Boinc, XtremWeb)
- A global coordination protocol
  - Manages and controls all resources, services and computing elements
  - Does not depend on any specific machine or centralized element

Discovery protocol
- Based on the Bonjour protocol
  - Works over multicast IP networks
  - Apple's implementation of the Zeroconf protocol
- Structured around three functionalities:
  - Dynamic allocation of IP addresses without DHCP
  - Resolution of names and IP addresses without DNS
  - Service discovery without a directory server
- Motivations:
  - Industrial protocol backed by Apple
  - Versions available for three operating systems (Windows, Linux, Mac OS)
  - Linux and Mac OS distributions integrate Bonjour
  - Network evolution (from 10 Gb/s to 10 × x Gb/s) => low risk of network congestion for multicast protocols

Computing element (CE)
- Each coordinator dynamically creates its CE
- CE = coordinator + set of workers
- CE functionalities:
  - Allocates workers
  - Submits and runs tasks on workers
  - Schedules tasks and collects results
- Computing systems: XtremWeb, Condor or Boinc
- 1 specific CE for each user

Coordination protocol
- Each machine is in one of three states: Idle, Worker or Coordinator.
- A machine announces its state by publishing the service specific to that state:
  - IdleService for the idle state
  - WorkerService for the worker state
  - CoordinatorService for the coordinator state
- When its state changes, a machine deactivates the old service and then publishes the service that advertises the new state.
- Every machine can discover the machines that are in a given state:
  - A machine launches a discovery on one particular service instead of permanently receiving all new events.
  - This restricts communication between machines.
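
The state announcements of the coordination protocol can be sketched in a few lines of Python. This is only an illustration: the `Registry` class is a hypothetical in-memory stand-in for the Bonjour publish/subscribe layer, and the `Machine` class and host names are assumptions made for the example. Only the three service names come from the slides.

```python
from enum import Enum


class State(Enum):
    # Service names as given on the slide, one per machine state.
    IDLE = "IdleService"
    WORKER = "WorkerService"
    COORDINATOR = "CoordinatorService"


class Registry:
    """Hypothetical in-memory stand-in for the publish/subscribe layer."""

    def __init__(self):
        self._services = {}  # service name -> set of host names

    def publish(self, service, host):
        self._services.setdefault(service, set()).add(host)

    def unpublish(self, service, host):
        self._services.get(service, set()).discard(host)

    def browse(self, service):
        # Discovery targets one particular service, not every event.
        return sorted(self._services.get(service, set()))


class Machine:
    def __init__(self, host, registry):
        self.host = host
        self.registry = registry
        self.state = State.IDLE
        registry.publish(self.state.value, host)

    def change_state(self, new_state):
        # Deactivate the old service, then advertise the new state.
        self.registry.unpublish(self.state.value, self.host)
        self.state = new_state
        self.registry.publish(new_state.value, self.host)


registry = Registry()
machines = [Machine(f"node{i}", registry) for i in range(4)]
machines[0].change_state(State.COORDINATOR)
machines[1].change_state(State.WORKER)
print(registry.browse("IdleService"))         # ['node2', 'node3']
print(registry.browse("CoordinatorService"))  # ['node0']
```

A coordinator looking for workers would only browse `IdleService`, which is how the protocol keeps communication between machines limited.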

Layered architecture
[Figure: layers, listed from top to bottom]
- Deployment of a computing system
- Establishment of CE network
- Resources discovery
- Connection to BonjourGrid
- Publish/Subscribe
Other labels in the figure: Boinc, Condor, XtremWeb; Resources characteristics

Integration of Boinc in BonjourGrid
[Figure: integration diagram; the box labels were lost in extraction. Surviving data labels: Account, mail, Certificate; URL, ProjectName, NbreWorker; URL, ProjectName; ServiceType, HostName; IP, Hostname, CPU, Memory]

Integration of Condor in BonjourGrid
[Figure: integration diagram; the box labels were lost in extraction. Surviving data labels: Host/IP; Access levels security, Mail and Networks parameters; IP, PoolName, ManagerName, CollectorName, DomainName, NbreWorker; IP, PoolName, ManagerName, CollectorName, DomainName; ServiceType, HostName; IP, Hostname, CPU, Memory]

Fault tolerance approach
- Each computing system is responsible for:
  - Controlling and monitoring the execution of application tasks
  - The fault tolerance of its workers within a computing element
- The failure of workers is not the responsibility of BonjourGrid
- The failure of coordinators is handled by BonjourGrid

Fault tolerance approach
- Solution:
  - Dynamically create backup coordinators for each application
  - Provide k backups for each application, using passive replication (k is a configuration setting that must be fixed before the CE is constructed)
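
A minimal sketch of what passive replication with k backups could look like. The `ComputingElement` class, its method names and the dictionary used as coordinator state are all hypothetical; the slides only state that k backup coordinators are created per application and that replication is passive.

```python
class ComputingElement:
    """Hypothetical model of a CE with k passive backup coordinators."""

    def __init__(self, coordinator, idle_machines, k):
        # k is fixed before the CE is constructed, as on the slide.
        if len(idle_machines) < k:
            raise ValueError("not enough idle machines for k backups")
        self.coordinator = coordinator
        self.backups = list(idle_machines[:k])  # first backup takes over first
        self.snapshot = None                    # last saved coordinator state

    def save_snapshot(self, state):
        # Passive replication: only the primary does work,
        # the backups merely hold a copy of its state.
        self.snapshot = dict(state)

    def on_coordinator_failure(self):
        # Promote the next backup and hand it the last saved state.
        if not self.backups:
            raise RuntimeError("no backup coordinator left")
        self.coordinator = self.backups.pop(0)
        return self.coordinator, self.snapshot


ce = ComputingElement("node0", ["node1", "node2", "node3"], k=2)
ce.save_snapshot({"done": 5, "pending": 3})
new_coordinator, restored = ce.on_coordinator_failure()
print(new_coordinator, restored)  # node1 {'done': 5, 'pending': 3}
```

With k = 2 the element survives two successive coordinator failures; a third one raises an error in this sketch.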

Fault tolerance approach
[Figure legend: Idle, Worker, Coordinator]
- Construction of a computing element and 2 backups of the coordinator
- The workers go back to the idle state when the coordinator is disabled
- Problem: the coordinator may not yet have completed the application

Fault tolerance approach
- Solution: a Status field distinguishes between:
  - Stop due to failure: Status = 0 (application is in execution)
  - Stop following the end of the application: Status = 1 (application finished)
[Figure legend: Status, Idle, Worker, Coordinator]
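
The Status field can be read as a simple decision rule on the worker side. The sketch below assumes that a worker remembers the last Status value advertised by its coordinator; the function name and the returned action labels are hypothetical, and only the two Status values come from the slide.

```python
APP_RUNNING = 0   # Status = 0: application is in execution
APP_FINISHED = 1  # Status = 1: application finished


def worker_reaction(last_status):
    """Decide what a worker does once its coordinator's service disappears.

    The returned labels are hypothetical names chosen for this sketch.
    """
    if last_status == APP_FINISHED:
        return "go_idle"          # normal end: the CE is released
    if last_status == APP_RUNNING:
        return "wait_for_backup"  # failure: a backup coordinator should take over
    raise ValueError(f"unknown status: {last_status!r}")


print(worker_reaction(APP_RUNNING))   # wait_for_backup
print(worker_reaction(APP_FINISHED))  # go_idle
```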

Experiments
- System evaluation based on: a set of specific applications? a specific arrival pattern (Poisson's law)?
- Workload model very close to reality: Feitelson and Lublin
- Inputs of the workload model:
  - Number of nodes (system size)
  - Arrival times of applications
  - Maximum number of parallel tasks
  - Task execution times
- Workload table columns: Application ID, Arrival time (s), Execution time (s), Number of parallel tasks
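
To make the workload columns concrete, here is a small generator for the simpler option the slide mentions, a Poisson arrival pattern. It does not reproduce the Feitelson and Lublin model: the uniform distributions for execution time and task count, and all function and field names, are placeholder assumptions for illustration only.

```python
import random


def poisson_arrivals(n_apps, rate, seed=0):
    """Arrival times (s) of a Poisson process with `rate` arrivals per second."""
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    for _ in range(n_apps):
        t += rng.expovariate(rate)  # exponential inter-arrival times
        arrivals.append(t)
    return arrivals


def make_workload(n_apps, rate, max_tasks, max_exec_time, seed=0):
    """Build rows with the four columns of the workload table."""
    rng = random.Random(seed + 1)
    return [
        {
            "app_id": i,
            "arrival_time": arrival,
            # Placeholder distributions, not the Feitelson-Lublin model.
            "execution_time": rng.uniform(1, max_exec_time),
            "parallel_tasks": rng.randint(2, max_tasks),
        }
        for i, arrival in enumerate(poisson_arrivals(n_apps, rate, seed))
    ]


workload = make_workload(n_apps=5, rate=0.1, max_tasks=128, max_exec_time=500)
for row in workload:
    print(row)
```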

Experiments
- Emulation of a set of users and a set of applications
- 1 CE is dynamically created for each application
- Emulator parameters:
  - List of machines
  - List of applications
  - Workload model
- The emulator submits each application following the arrival pattern in the workload
- It looks for a free machine on which a coordinator starts and initiates the execution of the application tasks
- The CE is released when the application tasks finish

Experiments
- Measure: turnaround time = end time of an application - submission time
- Analyze the delay caused by decentralization
- Analyze the behavior of BonjourGrid with Boinc and with Condor
- Setup:
  - BonjourGrid: N machines (dynamic infrastructure)
  - Boinc or Condor: 1 coordinator + N-1 workers (static infrastructure)
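
The measurement can be written out as follows. The function names are hypothetical and the timings are invented purely to show the arithmetic; they are not results from the experiments.

```python
def turnaround(submission_time, end_time):
    """Turnaround time of one application, in seconds."""
    return end_time - submission_time


def decentralization_delay(bonjourgrid, baseline):
    """Per-application difference of turnaround times.

    A positive value means BonjourGrid was slower than the static setup.
    """
    return {app: bonjourgrid[app] - baseline[app]
            for app in bonjourgrid if app in baseline}


# Invented timings (seconds), for illustration only.
bonjourgrid = {1: turnaround(0, 150), 2: turnaround(10, 90)}
baseline = {1: turnaround(0, 120), 2: turnaround(10, 95)}
print(decentralization_delay(bonjourgrid, baseline))  # {1: 30, 2: -5}
```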

Experiments - Boinc
- Setup:
  - 130 applications (2 to 128 parallel tasks)
  - 200 machines on Grid5000 (Orsay node)
  - Execution times vary from 1 to 500 seconds
- Results:
  - With BonjourGrid, 60% of applications incur a delay of 24 to 1277 s
  - BonjourGrid gives shorter execution times than Boinc when the number of tasks is large
[Figure: turnaround time of BOINC and of BonjourGrid for each application, with the number of parallel tasks per application; time (s) on a base-2 logarithmic scale]

Experiments - Condor
- Setup:
  - 130 applications (2 to 128 parallel tasks)
  - 200 machines on Grid5000 (Orsay node)
  - Execution times vary from 1 to 500 seconds
- Results:
  - With BonjourGrid, 35% of applications incur a delay of around 30 s
  - BonjourGrid causes larger delays for applications that are preceded by applications with a large number (>100) of tasks
[Figure: turnaround time of Condor and of BonjourGrid for each application, with the number of parallel tasks per application; time (s) on a base-2 logarithmic scale]

Experiments - Fault tolerance
- Virtual machines are used to save the state of coordinators (Xen virtualization system)
- 10 applications, with 2 to 128 parallel tasks
- 50 machines on the Nancy node (Grid5000)
- Fault scenarios: faults are injected during the execution of applications
- Recovery time:
  - Time to recover the coordinator
  - Time to re-establish the connection with the workers

Fault tolerance framework
[Figure: main coordinator, backup coordinator and workers; the remaining box labels were lost in extraction. Labelled steps: Save Snapshot (1), Migrate Snapshot (2), Restore Snapshot (3), Establishment of connection (4)]

Experiments - Fault tolerance with BOINC
- Average delay of 197 s
- Almost stable delay, which does not depend on the number of tasks
- Boinc allows work to continue after a coordinator failure
[Figure: turnaround time of BOINC and of BOINC-FT for each application, with the number of parallel tasks per application and the level of fault injection; time (s) on a base-2 logarithmic scale]

Experiments - Fault tolerance with Condor
- Average delay of 238 s
- Condor recovers the tasks that have not completed their execution
[Figure: turnaround time of CONDOR and of CONDOR-FT for each application, with the number of parallel tasks per application and the level of fault injection; time (s) on a base-2 logarithmic scale]

Conclusion
- BonjourGrid: a novel approach for building collaborative and decentralized Desktop Grid systems
  - A Publish/Subscribe protocol to orchestrate the participants
  - A computing system (Boinc, Condor or XtremWeb) for the execution level of an application
- BonjourGrid exercises distributed control over resources and does not depend on a central element
- BonjourGrid implements a fault-tolerance mechanism for coordinators
- BonjourGrid favors collaborative execution and meta-grid orchestration

Future work
- Minimize the amount of information transferred between coordinators
- Include reservation rules based on history traces of previous executions
- Integrate economic models
- Add a new layer for security

Our background is coordination, our future will be coordination of clouds!

Thanks. Any questions?
