D5.1: ENTICE knowledge base model and reasoning [M6] 30/07/2015


decentralized repositories for transparent and efficient virtual machine operations

D5.1 (M6)

Responsible author(s): Vlado Stankovski and Salman Taherizadeh
Co-author(s): Uroš Paščinski, Jernej Trnkoczy

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No

Revision history

Administration and summary

Project acronym: ENTICE
Document identifier: ENTICE D5.1
Title: ENTICE ontology
Leading partner: UL
Report version: 1.0
Classification: Public
Nature: Report
Author(s) and contributors: Vlado Stankovski et al.
Status: Submitted

The ENTICE Consortium has addressed all comments received, making changes as necessary. Changes to this document are detailed in the change log table below.

Edited by | Status | Changes made
Salman Taherizadeh, Vlado Stankovski | Draft | Initial outline
Vlado Stankovski, Salman Taherizadeh, Uroš Paščinski, Jernej Trnkoczy | Draft | Meetings
Vlado Stankovski | Draft | First version of ENTICE ontology
Salman Taherizadeh, Vlado Stankovski | Draft | Ontology update
Salman Taherizadeh | Draft | Adding textual content and improving ENTICE ontology
Uroš Paščinski | Draft | Developing and integrating initial OWL representation of ontology in Protégé
Salman Taherizadeh | Draft | Improving the content, adding references
Salman Taherizadeh | Draft | Proofreading
Zsolt Nemeth, Gabor Kecskemeti | Draft | Review process
Salman Taherizadeh | Draft | Addressing review comments
Salman Taherizadeh, Vlado Stankovski, Thomas Fahringer | Final | Final version of D5.1

Notice that other documents may supersede this document. A list of the latest public ENTICE deliverables can be found on the ENTICE Web page.

Copyright

This document is © ENTICE Consortium.

Citation

Vlado Stankovski, Salman Taherizadeh, Uroš Paščinski, Jernej Trnkoczy, and Thomas Fahringer. (2015). D5.1: ENTICE knowledge base model and reasoning. ENTICE Consortium.

Acknowledgements

The work presented in this document has been conducted in the context of the EU Horizon 2020 programme. ENTICE is a 36-month project that started on February 1st, 2015 and is funded by the European Commission. The partners in the project are UNIVERSITAET INNSBRUCK (UIBK), MAGYAR TUDOMANYOS AKADEMIA SZAMITASTECHNIKAI ES AUTOMATIZALASI KUTATOINTEZET (SZTAKI), UNIVERZA V LJUBLJANI (UL), Flexiant Limited, WELLNESS TELECOM SL (WTELECOM) and Deimos CASTILLA LA MANCHA SL (Elecnor Deimos Satellite Systems). The content of this document is the result of extensive discussions within the ENTICE Consortium as a whole.

More information

Public ENTICE reports and other information pertaining to the project are available through the ENTICE public Web site.

Contents

Revision history
Copyright
Citation
Acknowledgements
More information
List of figures
List of tables
1. Introduction
2. Comparison among RDF database tools
2.1 An intelligent context-aware decision-support system oriented towards healthcare support
2.2 A comparative analysis of Linked Data tools to support architectural knowledge
2.3 Evaluation of the performance of open-source RDBMS and triple-stores for storing medical data over a web service
2.4 Yet another triple-store benchmark? Practical experiences with real-world data
2.5 Automating Cloud service level agreements using semantic technologies
2.6 Distributed RDF query processing and reasoning for big data / linked data
2.7 The Berlin SPARQL Benchmark
3. Different triple stores from the reasoning point of view
3.1 Jena Fuseki
3.2 Virtuoso
3.3 AllegroGraph
3.4 Sesame
3.5 OWLIM
4. Possibility of replicated data storage for the ENTICE project
5. Ontology definition for the ENTICE project
6. Ontology development for the ENTICE project
7. Ontology reasoning with OWL
8. Conclusion
9. Bibliography and references
10. Abbreviations/Glossary
Appendix: The first version of the ENTICE ontology

List of figures

Figure 1: Average write performance
Figure 2: Average read performance
Figure 3: Initially identified entities and their relationships

List of tables

Table 1: Features of triple-stores
Table 2: DiskImage class
Table 3: Data class
Table 4: Fragment class
Table 5: Repository class
Table 6: Comment class
Table 7: Functionality class
Table 8: Implementation class
Table 9: DiskImageSLA class
Table 10: Country class
Table 11: Delivery class
Table 12: User class
Table 13: Quality class
Table 14: GeoLocation class
Table 15: OperatingSystem class

1. Introduction

The purpose of the ontology is to unambiguously and formally define the relevant notions of the ENTICE environment. This formal specification of the domain's terminology describes and organizes the fundamentals for a generally accepted and shared understanding of the ENTICE project among all project partners. Various ontology languages are available for representing ontologies, such as Developing Ontology-Grounded Methods and Applications (DOGMA) [Maniraj2010], Frame Logic (F-Logic) [Kifer1995], and the Web Ontology Language (OWL) [Aref2005]. Several tools are also available for developing ontologies, such as OntoEdit [Sure2002], WebODE [Corcho2002] and Protégé [Protégé2015]. During the ontology development stage, Protégé was used as the ontology editor to design the ENTICE ontology and export OWL files for the knowledge base (KB), since it is a widely used, well-known, well-documented, mature, and stable application specifically designed for building OWL DL-based knowledge representations [Knublauch2004]. Moreover, reasoning in knowledge bases and ontologies is one of the reasons why a specification needs to be formal: through reasoning mechanisms such as those of OWL, we can derive facts that are not expressed explicitly in the ontology or in the knowledge base. The considerable growth of semantic applications is leading ontology management systems to take advantage of reasoning capabilities. The main objective of the ENTICE knowledge base is to provide a broad, flexible information management solution over which ontology-based knowledge management and reasoning can be applied. It needs to offer a common interface through which other modules in ENTICE can interact in order to retrieve and store knowledge base data.
One way to achieve these goals is to implement the knowledge base using triple-stores, as they are able to ingest and index diverse types of data (structured, semi-structured and unstructured) with very little a priori knowledge, provide flexibility with respect to schema changes and mapping, offer several types of APIs (Application Programming Interfaces), and serve as a basis for reasoning and for inferring new knowledge from existing facts. A triple is a data entity composed of three parts (subject-predicate-object), such as "JumlaUbuntu hasVersion 14.04", which means "the version of a particular VM image named JumlaUbuntu is 14.04"; a triple-store is a purpose-built database for the storage and retrieval of triples through semantic queries. However, since there are many triple-store engines with various features, it is not easy to find the one best suited to the exact requirements of ENTICE: it should be free, available, able to easily handle several million triples, and able to support reasoning. Although more RDF (Resource Description Framework) stores are available, in this deliverable we focus only on Apache Jena Fuseki [Apache2015], OWLIM [OWLIM2015], OpenLink Virtuoso [OpenLink2015], Sesame [Sesame2015] and AllegroGraph [AllegroGraph2015], due to the restrictions of the ENTICE setup.
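The triple model can be made concrete with a minimal sketch. The predicate name `hasVersion` follows the example above; the tiny in-memory store is a toy illustration, not any particular engine's API:

```python
# A triple is simply a (subject, predicate, object) tuple; conceptually,
# a triple-store is a searchable collection of such tuples.
triples = {
    ("JumlaUbuntu", "hasVersion", "14.04"),
    ("JumlaUbuntu", "rdf:type", "DiskImage"),
}

def objects(subject, predicate):
    """Return all objects stored for a given subject and predicate."""
    return {o for (s, p, o) in triples if s == subject and p == predicate}

print(objects("JumlaUbuntu", "hasVersion"))  # {'14.04'}
```

Real triple-stores index such tuples (typically in several orders, e.g. SPO, POS, OSP) so that any position can be looked up efficiently.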

To achieve the objective of providing highly available and performance-intensive services in the ENTICE environment, it is essential to implement a trustworthy knowledge base that can serve all submitted requests while taking into account the limitations imposed by problems such as a failed link, an overloaded link, or server malfunctions. In other words, to attain a highly available, performance-optimized ENTICE solution, creating and maintaining multiple copies of the ENTICE database, possibly in different locations (which is called data replication), could be appropriate; ENTICE would thus be more efficient, with zero downtime, planned or unplanned. This deliverable presents research mainly on comparing and analysing different RDF (Resource Description Framework) data engines, describing the most widely used RDF data engines with respect to reasoning, the advantages of a replicated knowledge base for the ENTICE environment, defining and developing the ontology, and finally ontology reasoning in relation to the ENTICE project. The rest of the document is structured as follows: Section 2 discusses the comparison among RDF database tools, followed by an overview of different triple stores from the reasoning point of view and a description of the possibility of replicated data storage for the ENTICE project in Sections 3 and 4, respectively. Section 5 provides the fundamental definitions for the ENTICE ontology. Section 6 describes how the ontology was developed for the ENTICE project. Section 7 shows how reasoning with the Web Ontology Language (OWL) over RDF data can help to obtain and deduce more complete information and facts for a large project like ENTICE. Our concluding remarks are presented in Section 8. Bibliography and abbreviations can be found in Sections 9 and 10. Finally, the first version of the ENTICE ontology, developed as an OWL file, is given in the Appendix.

2. Comparison among RDF database tools

ENTICE will store information about its environment in an RDF-based knowledge base that will be used for interoperability, integration, reasoning and optimisation purposes. An RDF repository is the database where RDF data is stored. The repository also provides access to the stored data as an RDF graph, which can be queried via an RDF query language such as SPARQL (SPARQL Protocol and RDF Query Language). Since the atomic unit of data in RDF is a triple, RDF repositories are also called "triple-stores". There are several open-source and commercial RDF repositories that could be chosen for implementing the ENTICE knowledge base. Table 1 summarizes the features that are the most important characteristics for ENTICE:

Table 1: Features of triple-stores

Feature      | Virtuoso (Open Source) | Virtuoso (Commercial)  | OWLIM (GraphDB)       | Sesame       | AllegroGraph           | Fuseki
Open source  | Yes                    | No                     | No                    | Yes          | No                     | Yes
Free edition | Yes                    | No                     | Yes/No                | Yes          | Yes/No                 | Yes
Support      | Yes                    | Yes                    | Yes                   | Yes          | Yes                    | Yes
Reasoning    | Yes                    | Yes                    | Yes                   | Yes (Little) | Yes                    | Yes
Live backup  | No                     | Yes                    | Yes/No                | Kind of      | Yes                    | Yes
License      | GPLv2                  | Proprietary commercial | GNU LGPL / commercial | BSD-style    | Proprietary commercial | Apache 2.0

This section covers the evaluation of different RDF database technologies, based on several studies from the literature, from different points of view.

2.1 An intelligent context-aware decision-support system oriented towards healthcare support [Manate2014]

Semantic data is information that enables machines to understand the meaning of the information; the term covers the methods and frameworks that convey this meaning. Semantic data is considered one of the strongest approaches for expressing relations among different data sources. In this technology, RDF is the data model serving as the standard format for data. In this way, semantic annotation of data improves data aggregation from different sources, since the relations between the entities are already defined. Two topics are central to this work: Linked Data and the Semantic Web. Linked Data is the term used for mechanisms of connecting and exposing data on the Web from different sources. The Semantic Web is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation. Apache Jena is a Java-based open-source toolkit for building Semantic Web and Linked Data applications. This framework provides an API to extract data from, and write to, an RDF-based knowledge base, and it enables the exposure of RDF triples as an accessible SPARQL end-point.
This means that Apache Jena can use a triple-store, a purpose-built database for the storage and retrieval of triples through semantic queries. The Apache Jena Fuseki module exposes RDF triples as a SPARQL end-point accessible through a REST (Representational State Transfer) API. In this paper, the authors used a Jena Fuseki database server to store the RDF triples for their prototype, because it is a lightweight database server and is easy to install; in practice, however, when the system is used at a large scale and the ontology comprises a large number of triples, they encourage the use of another database server, Virtuoso, which is available under both commercial and open-source licenses.
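The REST interface mentioned above follows the SPARQL 1.1 Protocol, in which a query is passed as the `query` parameter of an HTTP request. The endpoint URL and dataset name below (`http://localhost:3030/entice`) are assumptions for illustration; the sketch only constructs the request, nothing is sent:

```python
from urllib.parse import urlencode

# Hypothetical Fuseki dataset endpoint (host and dataset name assumed).
FUSEKI_ENDPOINT = "http://localhost:3030/entice/sparql"

query = """
SELECT ?version WHERE {
  ?image <http://example.org/hasVersion> ?version .
}
"""

# Per the SPARQL protocol, the query text travels as the 'query' parameter.
request_url = FUSEKI_ENDPOINT + "?" + urlencode({"query": query})

print(request_url[:60])
```

An HTTP GET on `request_url` (e.g. with `urllib.request.urlopen`) would return the query results, typically as SPARQL JSON or XML depending on the `Accept` header.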

2.2 A comparative analysis of Linked Data tools to support architectural knowledge [Roda2014]

Architectural knowledge is knowledge about the components of a complex system and how they are related. The authors proposed the use of Linked Data techniques to define and manage architectural knowledge. Their results showed that, among the tools analysed, the best Linked Data tools for sharing and reusing this kind of knowledge are Sesame and Jena Fuseki. Jena Fuseki is able to efficiently store large numbers of RDF triples on disk. It has a query engine compatible with the latest SPARQL specification, a semantic query language for retrieving and manipulating data stored in an RDF-based knowledge base. Moreover, it provides a server that allows the RDF data to be published, so that it can be queried and used by other applications through a variety of protocols. Jena provides a set of tools and Java-based libraries for developing Semantic Web and Linked Data applications, platforms and servers. In particular, Jena includes an API for reading, processing and writing RDF data in different formats, since an RDF-based model can be represented in a variety of syntaxes, such as RDF/XML, N-Triples and Turtle. It also has an ontology API and a rule-based inference engine for handling and reasoning with OWL and RDFS (RDF Schema) as notations and ontology languages. Additionally, it provides constant classes for well-known ontology schemas (such as RDF, RDFS, RDFa, Dublin Core or OWL) and methods for reading and writing RDF as XML. Sesame is another open-source Java-based framework for storing and querying RDF data, similar to Jena. The Sesame platform is fully extensible and configurable with respect to storage mechanisms, inference engines, RDF file formats, query result formats and query languages.
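Of the serializations mentioned above, N-Triples is simple enough to illustrate directly: one triple per line, URIs in angle brackets, literals quoted, each line terminated by a dot. The URIs below are invented for the example, and the converter is a rough sketch (it ignores datatypes, language tags and escaping):

```python
def to_ntriples(subject, predicate, obj):
    """Serialize one triple as an N-Triples line. URIs go in angle
    brackets; anything else is treated as a plain string literal."""
    if obj.startswith("http://"):
        o = f"<{obj}>"      # object is a URI reference
    else:
        o = f'"{obj}"'      # object is a literal
    return f"<{subject}> <{predicate}> {o} ."

line = to_ntriples("http://example.org/JumlaUbuntu",
                   "http://example.org/hasVersion", "14.04")
print(line)
# <http://example.org/JumlaUbuntu> <http://example.org/hasVersion> "14.04" .
```

Turtle and RDF/XML encode the same statements, only with prefixes and nesting that make them more compact or more XML-toolchain-friendly.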
The core of the Sesame framework is the RDF Model API, which defines how the building blocks of RDF (statements, URIs, blank nodes, literals, graphs and models) are represented. Sesame also provides the Repository API, which describes a central access point to Sesame repositories. Its purpose is to give developers friendly access to RDF repositories, offering various methods for querying and updating the data in an easy way. Additionally, it supports the use of SPARQL for querying disk-based and memory-based RDF stores and RDF Schema (RDFS) inference engines, along with explicit support for the most popular RDF file formats (such as RDF/XML, RDFa, N-Triples and so on) and query result formats.

2.3 Evaluation of the performance of open-source RDBMS and triple-stores for storing medical data over a web service [Kilintzis2014]

For developing production-grade systems, the performance of the current generation of triple-stores is a key concern. A clear benefit observed while developing the test systems is that triple-stores with a SPARQL endpoint are easy to implement and to interchange, since the SPARQL standard is unambiguous and is followed strictly by the triple-store endpoints.

The authors identified that no significant variation was observed in the execution time of successive requests of the same type in each of the three systems: MySQL (RDBMS), Fuseki and Virtuoso. Regarding the average write performance, measured in milliseconds, the Fuseki-based backend implementation is the slowest of the three, while the Virtuoso-based implementation was significantly faster. In other words, the Fuseki-based implementation suffered from write performance issues when compared to either the Virtuoso- or the MySQL-based solutions (see Figure 1). Even taking into account the hardware used to evaluate its performance, it is too slow to be considered as an alternative for any production-grade system.

Figure 1: Average write performance (Virtuoso, RDBMS, and Fuseki). Reproduced with permission from [Kilintzis2014].

Additionally, as can be seen in Figure 2, regarding the average read performance, the Virtuoso-based system in some cases performed better than the Fuseki-based implementation.

Figure 2: Average read performance (Virtuoso, RDBMS, and Fuseki). Reproduced with permission from [Kilintzis2014].
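A measurement of this kind can be sketched as follows. The in-memory set stands in for whichever store is being benchmarked, so the numbers produced here are of course not comparable to the paper's results; the point is only the shape of an "average write time in milliseconds" measurement:

```python
import time

def avg_write_ms(store_write, n=1000):
    """Average duration of n write operations, in milliseconds."""
    start = time.perf_counter()
    for i in range(n):
        store_write(("subject%d" % i, "hasVersion", "14.04"))
    return (time.perf_counter() - start) / n * 1000.0

store = set()  # stand-in for a real triple-store's write interface
print("avg write: %.4f ms" % avg_write_ms(store.add))
```

Against a real store, `store_write` would be the engine's insert call (for Fuseki, e.g. an HTTP update request), and the same harness could be rerun unchanged against each backend.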

Production-grade is a phrase used to describe more robust and rugged hardware and software, designed for intensive business and enterprise computing environments.

2.4 Yet another triple-store benchmark? Practical experiences with real-world data [Voigt2012]

The authors note that although quite a number of RDF triple-store benchmarks have already been conducted and published, it is still not easy to choose the appropriate store for a given Semantic Web project. A fundamental reason is the absence of comprehensive performance tests with data having real-world characteristics, which is the main contribution of this paper. Although more RDF stores are available, the authors focused on Apache Jena, the BigData RDF Database, OWLIM Lite, and OpenLink Virtuoso, according to their project's requirements: availability of a free version, handling of up to 100 million triples, support for RDFS reasoning as well as SPARQL 1.1, and being built for the Java runtime environment. The paper reports a comparison of average query execution times (in ms) in a multi-client scenario. OWLIM Lite scales best and Virtuoso scales worst; Fuseki and BigData run shoulder to shoulder, as both scale roughly linearly with the number of clients. The Apache Jena project comes with a number of sub-systems. For their purposes the authors needed a fast triple store as well as a SPARQL endpoint; thus, they relied on the Fuseki server in the version that includes TDB, a high-performance RDF store.
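The multi-client setup described above can be sketched roughly as below. The dummy query function stands in for a real SPARQL call, so only the shape of the measurement (N concurrent clients, average per-query time) is meaningful:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(_):
    """Stand-in for one SPARQL query against the store under test."""
    t0 = time.perf_counter()
    sum(range(10_000))  # dummy workload instead of a real query
    return (time.perf_counter() - t0) * 1000.0  # elapsed ms

def avg_query_ms(n_clients):
    """Average per-query time with n_clients concurrent clients."""
    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        times = list(pool.map(run_query, range(n_clients)))
    return sum(times) / len(times)

for clients in (1, 4, 16):
    print(clients, "clients: %.3f ms" % avg_query_ms(clients))
```

Plotting `avg_query_ms` against the number of clients is what yields the "scales linearly with the number of clients" style of observation reported in the paper.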
GraphDB (formerly OWLIM) is available in three versions: Lite, Standard and Enterprise. OWLIM-Lite is not open-source software. BlazeGraph, previously called BigData, is a high-performance, scalable, standards-based, open-source graph database. This Java-based platform supports the RDF data model and the SPARQL language for querying the RDF database. The BigData open-source framework has been under continuous development for nearly 10 years. It is distributed under a dual-licensing model: GPL (General Public License) and commercial licensing.

2.5 Automating Cloud service level agreements using semantic technologies [Joshi2015]

Cloud-related legal documents, such as terms of service or customer agreement documents, are currently managed as text files. As a result, cumbersome manual effort is needed to monitor the measures and metrics agreed upon in these SLAs (Service Level Agreements). The authors have

focused on largely automating this process using Semantic Web technologies. The Semantic Web deals primarily with data instead of documents. It enables data to be annotated with machine-interoperable meta-data, enabling automated retrieval of the data and its usage in correct contexts. In this work, the Jena Apache Fuseki server graph store was used to store the Cloud SLAs. Moreover, from the authors' point of view, Fuseki can easily be queried for continuous SLA monitoring, since the data is in a machine-readable format. They chose Fuseki because it is an Apache product that provides all its features as open source; although Virtuoso has an open-source version, it does not have as many features as the commercial product.

2.6 Distributed RDF query processing and reasoning for big data / linked data [Perasani2014]

RDF, as a model of entities and relationships, describes an approach that uses relationships to connect different entities, or to connect an entity to its value. RDF does not have pre-defined schemas for representing data, and it has the ability to merge two different data sources without defined schemas. Hence, it can be used to merge unstructured, semi-structured and structured data and to share the result across the network. Producers can produce RDF data and share it on the network, whereas consumers can crawl it for use in their applications. This improves the reusability of existing information without having to create new information. SPARQL is an RDF query language used to retrieve information stored in the RDF format. A simple SPARQL query consists of triple patterns, which represent subject, predicate, and object; complex SPARQL queries also contain conjunctions, disjunctions, and optional patterns. The authors evaluated the existing in-memory ARQ engine (a SPARQL processor for Jena) provided by the Jena framework and found that it cannot handle large datasets, as it works entirely in memory.
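The notion of a triple pattern can be made concrete with a small sketch: variables (written here with a leading `?`, as in SPARQL) match anything and get bound, while constants must match exactly. This is a toy matcher over invented data, not ARQ:

```python
triples = {
    ("JumlaUbuntu", "hasVersion", "14.04"),
    ("JumlaUbuntu", "rdf:type", "DiskImage"),
    ("FedoraBase", "rdf:type", "DiskImage"),
}

def match(pattern):
    """Return variable bindings for one triple pattern,
    e.g. ('?img', 'rdf:type', 'DiskImage')."""
    results = []
    for triple in triples:
        binding = {}
        for part, value in zip(pattern, triple):
            if part.startswith("?"):   # variable: bind it
                binding[part] = value
            elif part != value:        # constant: must match exactly
                break
        else:
            results.append(binding)
    return results

print(match(("?img", "rdf:type", "DiskImage")))
```

A full SPARQL engine joins the binding sets of several such patterns (and handles unions, optionals and filters), but each basic graph pattern reduces to this kind of match.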
ARQ is a query engine for Jena that supports the SPARQL RDF query language.

2.7 The Berlin SPARQL Benchmark [Bizer2009]

The SPARQL query language for RDF is increasingly used as a standardized query API for providing access to datasets on the public Web and within enterprise settings. One mechanism for evaluating the performance of different triple-stores is to run an RDF benchmark, in which the results obtained for each triple-store engine over a set of given queries and datasets can be compared. The Berlin SPARQL Benchmark compares the performance of various triple-storage engines that expose SPARQL endpoints; such systems include native RDF stores, systems that map relational databases into RDF, and SPARQL wrappers around other kinds of data sources. In other words, in order to prevent synchronization problems between RDF-based databases and relational databases, it is preferable in many situations to have direct SPARQL access to this data without having to replicate it into RDF. In this situation,

this kind of access can be provided by SPARQL-to-SQL rewriters that translate incoming SPARQL queries on the fly into SQL queries against an application-specific relational schema. The benchmark is built around an electronic-commerce use case in which a set of products is offered by different vendors and consumers post reviews about the products. It also provides a dataset and tests that can be used for testing different SPARQL endpoints. The authors ran the benchmark against four popular RDF stores (Sesame, Virtuoso, Jena TDB, and Jena SDB) and two SPARQL-to-SQL rewriters (D2R Server and Virtuoso RDF Views) for three dataset sizes: one million triples, 25 million triples, and 100 million triples. Comparing the RDF stores, Sesame showed good performance for small datasets, while Virtuoso TS was faster for larger datasets, for which Jena TDB and SDB could not compete in terms of overall performance.

3. Different triple stores from the reasoning point of view

The properties, constraints, and relations among classes asserted by an ontology are collectively known as axioms. Together they represent the ontology of a particular domain and define what can be true and what must be true. Facts are data statements regarding individuals. A reasoner is a piece of software capable of inferring logical consequences from stated facts in accordance with the ontology's axioms, and of determining whether those axioms are complete and consistent [IBM2014]. As partially described in deliverable D2.3, reasoning with technologies such as RDFS and OWL, which allow rich semantics to be added to the data, can help the system automatically gather and use deeper-level new information from the ENTICE knowledge store. By reasoning, ENTICE is able to derive facts that are not expressed explicitly in the knowledge base.
In other words, reasoning means that the system can deduce new knowledge that is encoded implicitly in the existing information in the ENTICE database. Reasoning is the part of ENTICE that can infer new knowledge from existing facts available in the ENTICE knowledge base. The inputs of the reasoning system are data collected from all entities in the ENTICE environment, describing the system and its context, and the outputs can be a set of planned adaptation actions to change system conditions; for example, ENTICE can introduce new comprehensive post-optimising algorithms so that existing VM images can be automatically adapted to dynamic Cloud environments. In this section, some of the most popular RDF data stores and their reasoning capabilities are reviewed.

3.1 Jena Fuseki

Jena Fuseki [Apache2015] is a widely adopted platform covering most of the concerns related to the development of Semantic Web and ontology-based applications. It provides some valuable

reasoning features, such as consistency checking, ontology classification, concept validity, and query answering, through various internal reasoners (inference engines); inferencing is the method by which the system can automatically derive data from given data. To this end, a specification of the ontology model is available for each category of reasoning. Such specifications cover ontology languages such as OWL (Lite, DL, Full), RDFS, and DAML+OIL, with different kinds of reasoners (transitive class-hierarchy reasoners, or rule-based reasoners). All reasoners made available by the Jena API perform the reasoning in-memory; in other words, they require the data to be present in an in-memory model [Bioontology2015]. As memory sizes grow larger, in-memory RDF reasoning is attracting interest [Fernandez2014]. Because of the complex network of relations, classes, and individuals that an ontology typically expresses, we distinguish between its asserted model (knowledge that is stated explicitly by the ontology) and its inferred model, which is the entire set of knowledge deducible from the axioms embedded in the ontology [IBM2014]. Based on its reasoning capabilities, Jena differentiates between the asserted model (or base model, i.e. without inferences) and the inferred model, mainly because, depending on the application, it is not always useful to materialize and store inferred statements; these could be considered virtual statements, as they do not actually exist in the knowledge base.

3.2 Virtuoso

Virtuoso [OpenLink2015] is a semi-commercial (commercial and open-source products), general-purpose application and data container with extensive SPARQL and RDF support [Erling2007]. Virtuoso copes with scalability for semantic data storage and entailment production, where an entailment regime is a set of rules specifying when to derive what.
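An entailment rule of the kind just mentioned, and the asserted-vs-inferred distinction discussed above, can be sketched with toy forward chaining over two RDFS rules (`rdfs:subClassOf` transitivity and type propagation). The class names are invented, and this illustrates the idea only, not any engine's implementation:

```python
# Asserted model: explicit statements only.
asserted = {
    ("DiskImage", "rdfs:subClassOf", "Data"),
    ("Data", "rdfs:subClassOf", "Resource"),
    ("JumlaUbuntu", "rdf:type", "DiskImage"),
}

def infer(triples):
    """Materialize two RDFS entailment rules until a fixpoint:
    subClassOf transitivity, and rdf:type propagation along subClassOf."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for (s1, p1, o1) in inferred:
            for (s2, p2, o2) in inferred:
                if p1 == p2 == "rdfs:subClassOf" and o1 == s2:
                    new.add((s1, "rdfs:subClassOf", o2))
                if p1 == "rdf:type" and p2 == "rdfs:subClassOf" and o1 == s2:
                    new.add((s1, "rdf:type", o2))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred

inferred = infer(asserted)
# The inferred model contains 'virtual' statements absent from the
# asserted model, e.g. that JumlaUbuntu is also of type Resource.
print(("JumlaUbuntu", "rdf:type", "Resource") in inferred)  # True
```

Materializing the inferred model ahead of time (forward chaining) trades storage for query speed; answering such questions only when queried (backward chaining, as Virtuoso does) trades the other way.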
In its commercial version, Virtuoso addresses information distribution and heterogeneity through a unified data model set up on top of its Virtual Database Engine. Concerning inference scalability (the relationship between dataset size and loading speed), whereas typical technologies materialize deduced facts beforehand, Virtuoso performs inferencing on demand and at runtime as far as possible. In order to guarantee a very fast response within a certain time, incomplete results can be returned through partial evaluation of SPARQL queries. Virtuoso offers a real-time implementation of the owl:sameAs property as a reasoning capability (described in Section 7), and also provides transitivity through the rewriting of transitive SPARQL sub-queries: while one sub-query starts from the source, a second starts from the target, and the first intersection of paths produces a first partial result. Virtuoso is capable of backward-chaining reasoning, which offers support for a subset of RDFS and OWL semantics, including class hierarchies and property equivalence, such as rdfs:subClassOf and owl:sameAs, demonstrated in Section 7. Backward chaining, or backward reasoning, starts with a list of reasoning goals (or a hypothesis) and works backwards from the consequent to the antecedent to see if there is data available to support any of these consequents [Russell2009]. Virtuoso's reasoning engine

includes support for the following reasoning capabilities, explained in Section 7: owl:sameAs, rdfs:subClassOf, rdfs:subPropertyOf, owl:equivalentClass, owl:equivalentProperty, owl:InverseFunctionalProperty, owl:TransitiveProperty, owl:SymmetricProperty, and owl:inverseOf.

3.3 AllegroGraph

AllegroGraph [AllegroGraph2015] is a modern, high-performance, persistent graph database. It is a commercial product providing high-performance disk-based storage with standard HTTP (HyperText Transfer Protocol) and SPARQL interfaces. Multiple RDF triple stores can be assembled into a single virtual store, thus enabling the federation of distributed stores. Description-logic (OWL-DL) reasoners are good at handling complex ontologies. They tend to be complete, giving all the possible answers to a query; however, they can be quite unpredictable with respect to execution time when the number of triples grows beyond millions. AllegroGraph provides limited but effective and practical reasoning through its RDFS++ reasoner, which supports the following predicates: rdf:type, rdfs:subClassOf, rdfs:range, rdfs:domain, rdfs:subPropertyOf, owl:sameAs, owl:inverseOf, owl:TransitiveProperty, and owl:hasValue. This engine can be topped with a Prolog interface that allows rule-based reasoning. The Prolog engine can also be used on top of RDF data to allow more complex deductions based on domain-specific reasoning.

3.4 Sesame

Sesame [Sesame2015] is an open-source, community-supported framework for storing and querying RDF(S) data. It is able to connect to different storage mechanisms, such as standard file systems or even relational databases, through its SAIL (Storage and Inference Layer) API. Sesame aims to be a general platform for semantic software based on RDFS and therefore provides reasoning optimized for RDFS data. Sesame does not support incremental removals in the general case, and it has only limited support for custom rule reasoning on its own.
The repository can be queried via the standard query language (SPARQL) and provides reasoning capabilities through an RDFS inferencer and SAIL. Besides, Sesame supports a standard Web service interface that implements the SPARQL protocol, and it is extensible through a plugin architecture. In the context of the Sesame engine's development, a specific query language, SeRQL (Sesame RDF Query Language), was proposed. According to its authors, it offered some original features, such as support for graph transformation and basic set operators (union, intersection, and minus). SPARQL and SeRQL were developed as simultaneous initiatives, but some merging effort has since been made so that SPARQL benefits from features of SeRQL, and conversely. AliBaba is an expansion of the Sesame RDF repository that enables binding Java objects and classes to RDF triples and OWL classes.
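The owl:sameAs semantics supported by Virtuoso and AllegroGraph above can be illustrated with a toy closure computation: owl:sameAs is symmetric and transitive, so a reasoner must treat every name in a sameAs chain as denoting the same individual. The image names below are invented, and this is a sketch, not a real reasoner:

```python
def same_as_closure(pairs):
    """Compute the symmetric-transitive closure of owl:sameAs
    assertions, returned as a set of (a, b) pairs with a != b."""
    closure = set(pairs) | {(b, a) for (a, b) in pairs}  # symmetry
    changed = True
    while changed:                                       # transitivity
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and a != d and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# Hypothetical example: three names asserted to denote the same image.
asserted = {("UbuntuTrusty", "Ubuntu14.04"), ("Ubuntu14.04", "JumlaBase")}
closure = same_as_closure(asserted)
print(("UbuntuTrusty", "JumlaBase") in closure)  # True
```

A query for any one of the three names would then also return statements made about the other two, which is exactly the behaviour Virtuoso provides at query time.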

3.5 OWLIM

OWLIM [OWLIM2015] is a high-performance semantic Storage and Inference Layer (SAIL) for Sesame. It provides reasoning support and performance in a configurable way. OWLIM comes in two versions: SwiftOWLIM and BigOWLIM. SwiftOWLIM is free to download and use, while BigOWLIM, the enterprise edition, can be obtained upon request for evaluation, development and research purposes. Both versions offer a persistence strategy guaranteeing data consistency and preservation, and support reasoning with OWL dialects, covering most of the capabilities of RDFS, OWL-Horst, OWL2 RL, and OWL-Lite. SwiftOWLIM is better suited to prototyping, as it loads the full content of the knowledge base into main memory and performs fast reasoning and query evaluation. BigOWLIM, on the other hand, scales to billions of triples and implements query and reasoning optimizations, although its raw performance is lower than that of the in-memory SwiftOWLIM.

4. Possibility of replicated data storage for the ENTICE project

The storage of VM (Virtual Machine) images shows that the ENTICE environment is characterized by increasing heterogeneity, distribution and cooperation, in which a distributed knowledge base plays an important role. Knowledge usage in computer systems directly depends on the knowledge representation scheme. A distributed knowledge base is intended to improve reliability and performance: its components are replicated, which potentially enables load balancing and eliminates single points of failure. The failure of a single instance, or of a communication link that makes the knowledge base reachable, is not sufficient to bring down the entire system. Database replication increases data availability, since copies of the data stored at an unreachable or failed instance still exist at other operational locations.
While some RDF stores adopt a centralized mechanism using one server running very specialized hardware, the ENTICE project needs a storage system that is highly available and performant. Therefore, the goal could be to develop a distributed storage system for RDF data that can be at least partially replicated. With improvements in parallel computing, new methods can be devised that allow the storage and retrieval of RDF across a distributed environment in a replicated way. It is thus necessary to design and develop an effective model for querying RDF data that is available, efficient and possibly distributed, with reasoning capabilities. To this end, the main challenges to be addressed are the following:
- Precise definition of the environment for the RDF triple store,
- Live backup or automatic replication of RDF data,
- Automatic and dynamic distribution of query processing, and
- Dynamic and automatic updates in distinct RDF storages.
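The distributed query processing challenge above can be illustrated with SPARQL 1.1's federated query extension; the sketch below queries two replicated repositories in a single query (the endpoint URLs, prefix and property name are hypothetical):

```sparql
# Sketch: one federated query over two (hypothetical) ENTICE endpoints.
PREFIX entice: <http://example.org/entice#>

SELECT ?image ?title
WHERE {
  { SERVICE <http://repo1.example.org/sparql>
      { ?image entice:DiskImage_Title ?title . } }
  UNION
  { SERVICE <http://repo2.example.org/sparql>
      { ?image entice:DiskImage_Title ?title . } }
}
```

The SERVICE keyword delegates each sub-pattern to a remote endpoint, and the querying store merges the results, which matches the replicated-storage scenario discussed above.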

SPARQL supports a process called federated query, whereby a query is distributed to several endpoints, evaluated there, and the results are then combined. In addition, the SPARQL language offers four distinct query forms for reading RDF data from the databases. A SELECT query retrieves information stored in the database and returns the results in tabular format. A CONSTRUCT query retrieves information from the repositories and returns the results as RDF. An ASK query answers a simple true-or-false inquiry. A DESCRIBE query retrieves an RDF graph from the SPARQL endpoints [Perasani2014].

5. Ontology definition for the ENTICE project

Ontologies provide computer-understandable descriptions of the basic concepts in a practical domain and of the relationships among them. An OWL or RDF schema defines the ontology level, i.e. the set of classes and the relations that can exist between those classes. Instances of these classes and their properties are then defined using OWL syntax [Ghijsen2013]. With ENTICE, VM images and their storage will be optimised and stored in heterogeneous repositories. For the optimization, the necessary details of the ENTICE environment will be described with the ENTICE ontology and included in the knowledge base. A major result of the ENTICE project will be a knowledge management approach for the ENTICE environment. To complete the ENTICE solution, all entities, consisting mainly of the applications and the operated federated Cloud infrastructure, will be considered in the conceptual models.
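For illustration, the four SPARQL query forms mentioned above might look as follows against an ENTICE knowledge base (the prefix, property names and resource IRI are illustrative, not part of the actual ontology):

```sparql
PREFIX entice: <http://example.org/entice#>

# SELECT: results in tabular format
SELECT ?image ?title
WHERE { ?image entice:DiskImage_Title ?title . }

# CONSTRUCT: results returned as an RDF graph
CONSTRUCT { ?image entice:DiskImage_Title ?title . }
WHERE     { ?image entice:DiskImage_Title ?title . }

# ASK: a simple true/false answer
ASK { ?image entice:DiskImage_Encryption "Yes" . }

# DESCRIBE: an RDF graph describing the matched resource
DESCRIBE <http://example.org/entice#image42>
```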
To this end, an ontology will be developed that contains concepts of resources (e.g., software components, VM images, Cloud-based environmental settings), programming concepts (storage complexity, taxonomy of functional properties, etc.), virtual organisation concepts (e.g., privileges, credentials, ownership), resource negotiation-related concepts (SLAs), QoS (Quality of Service) and runtime environment concepts. The following tables show all classes and their properties to be created in the detailed ENTICE ontology. This content has been incorporated into the ontology, which is expressed in the OWL and RDFS languages.

Table 2: DiskImage class
Class Name: DiskImage
Description: A virtual machine (VM) image file consists of a pre-configured operating system (OS) environment and an application. The purpose of a virtual appliance is to simplify the delivery and operation of an application. To this end, only the necessary operating system components are included. Similarly to VM images, the ENTICE environment can be designed to also support containers, such as Docker container images.

Data Properties:
- DiskImage_id (unsignedInt): This property allows a unique number to be generated when a new individual as a DiskImage instance (VM image or Container image) is inserted into the database. The unsignedInt data type can hold integer values in the range of 0 to 4,294,967,295.
- DiskImage_Type (string, "VMI" or "CI"): Some applications require containers for their execution. Similarly to VMIs (Virtual Machine Images), the ENTICE environment can be developed to support container images, such as those of CoreOS/Rocket and Docker.
- DiskImage_Description (string): This property is used to display the description of the VM image.
- DiskImage_Title (string): The title is a short, keyword-relevant description of the VM image.
- DiskImage_Version (string): During the Cloud application lifecycle, VMIs may be upgraded (e.g. operating system or software component changes) several times. As the existing Cloud application may be long-running, an upgrade to a new VMI must happen during its execution. The ENTICE environment must provide the latest needed VMI at all geographic locations where the Cloud application is currently running.
- DiskImage_Predecessor (DiskImage_id / Fragment_id / Data_id): Uploading an image means introducing previously non-existing images into the repository, adding some structurally and/or semantically new content. Uploading an image is not an individual use case but part of others: it can either be part of introducing a new VMI (adding an entirely new image) or updating an unoptimized image (modifying some existing images and uploading the new image).

- DiskImage_FileFormat (string, few values): Virtual machine images come in different formats, including ISO, DMG, FVD, IMG, NDIF, QCOW, UDIF, VDI, VHD, VMDK, WIM, and so on.
- DiskImage_Picture (image): A big part of having a comprehensible interface is having nice pictures for VM images.
- DiskImage_Encryption (boolean, "Yes" or "No"): The contents of some VMIs and data could be business-critical and must not be available under any circumstances to intruders. For this purpose, the repository must support content delivery through encrypted channels. Moreover, the analysis methods for VMIs of the ENTICE environment should not compromise the privacy and security of the Cloud application owners.
- DiskImage_IRI (anyURI): VMIs and image fragments need to be properly indexed. The semantic model will contain information about the actual VMIs and their functionality, the geographic location, the URI (Uniform Resource Identifier), and other details for the search facility.
- DiskImage_SLA (SLA_id): Specifies a contract between the user and ENTICE for a particular VM image.
- DiskImage_Price (decimal, currency): A VM image can have a price value.
- DiskImage_Owner (User_id): Both the application developer and the service provider (ENTICE users) can introduce a new image into the ENTICE environment. For example, the application developer may release a new application independent of the previously uploaded VMIs. After this process the new image should be uploaded and undergo an optimization process. This property indicates which user a DiskImage belongs to, since each user has a unique User_id.

- DiskImage_Functionality (Functionality_id): The application developer can add generic functional requirements at any time to new or existing VMIs.
- DiskImage_Quality (Quality_id): Each VM image has its own quality information, such as image size, image rating, number of downloads and so on.
- DiskImage_OperatingSystem (OperatingSystem_id): A virtual machine image must include an operating system (OS). The ENTICE solution should be agnostic to the operating systems applied by the application providers and must support, as a minimum, the OSs necessary to run all pilot cases (WP3).
- DiskImage_NeedsDataFile (boolean, "Yes" or "No"): In the case of the Earth Observation Data (EOD) processing and distribution pilot case, we need to deliver satellite images together with the VM image containing a function that processes them.
- DiskImage_GenerationTime (datetime): Specifies when a particular VM image was generated.
- DiskImage_Obfuscation (boolean, "Yes" or "No"): This property is mainly an internal requirement for the DEIMOS use case. All binaries with proprietary components (the gs4eo product) deployed on the machines are generated with an obfuscation mechanism to conceal their purpose (security through obscurity) and their logic, in order to prevent tampering and deter reverse engineering by potential competitors. This does not include externally linked libraries with open-source or even third-party proprietary code.

Table 3: Data class
Class Name: Data
Description: Some applications need the delivery of various content alongside the VMI. For example, in the case of the Earth Observation Data (EOD)

processing and distribution pilot case, we need to deliver satellite images together with the image containing a function that processes them, using the ENTICE environment as a content delivery mechanism at geographic locations.

Data Properties:
- Data_id (unsignedInt): This property allows a unique number to be generated when a new individual as a Data instance is inserted into the database.
- Data_Description (string): The ENTICE environment must in some cases support the delivery of data in the form of files or folders containing satellite images. In such cases, ENTICE must provide an appropriate storage and data delivery mechanism alongside the VMIs. Data will have to be properly indexed, so that it can be searched for and delivered to certain geographic locations.
- Data_ReferenceImage (DiskImage_id): This property defines which DiskImage a particular Data item belongs to.
- Data_Picture (image): Just like DiskImage_Picture.
- Data_Title (string): Just like DiskImage_Title.
- Data_Version (string): Just like DiskImage_Version.
- Data_Encryption (boolean, "Yes" or "No"): Just like DiskImage_Encryption.
- Data_StorageSize (unsignedInt): Just like DiskImage_storage_size.
- Data_Price (decimal, currency): Just like DiskImage_Price.
- Data_IRI (anyURI): The external interface shall allow for the download of data needed by the application through an API, which allows the delivery of a pointer (i.e. a URI) so that the Cloud application can access and use the data (files or folders) while running at the Cloud provider.
- Data_FileFormat (string): Data files can have different types of contents. File types can be determined by their extensions.
- Data_GenerationTime (datetime): Just like DiskImage_GenerationTime.

- Data_NumberOfDownloads (unsignedInt): Some of the data will be frequently accessed and some not, as demand differs. Therefore, we need a mechanism that allows moving data from one hierarchy level to another, supported by adequate technology for fast, medium and low-speed content delivery. ENTICE also needs to collect information about the usage of Data and store it in the knowledge base for later statistics.
- Data_UserComments (Comment_id): The ENTICE repository should be able to support user complaints.
- Data_Obfuscation (boolean, "Yes" or "No"): Just like DiskImage_Obfuscation.

Table 4: Fragment class
Class Name: Fragment
Description: ENTICE will deliver technologies that decompose user VM images into smaller reusable parts called Fragments. In this way, it will reduce the storage space by storing the common parts of multiple images only once, and it will also lower costs by ensuring that users only pay for the VM image parts that they cannot reuse from past images.

Data Properties:
- Fragment_id (unsignedInt): This property allows a unique number to be generated when a new individual as a Fragment instance is inserted into the database.
- Fragment_ReferenceImage (DiskImage_id): The ENTICE environment will allow the splitting of VM images into different fragments so that frequently-shared image components are stored only once (e.g. a particular flavour of Linux used by two different images), thus reducing the overall storage space in the repository. This property implies

that a particular fragment belongs to a given VM image.
- Fragment_Repository (Repository_id): VMIs and image fragments need to be properly indexed. The semantic model will contain information about the actual VMIs and their functionality, the geographic location, the URI, and other details for the search facility.
- Fragment_IRI (anyURI): Just like Fragment_geographic_location.
- Fragment_Size (unsignedInt): Just like DiskImage_storage_size.
- Fragment_HashValue1 (string): A VM image will be fragmented into pieces called Fragments (e.g. a particular flavour of Linux used by two different images). When a new fragment is generated, ENTICE computes a hash value which acts as a unique identifier of this fragment. Any new fragment with the same hash value is considered the same and there is no need to store it. In order to allow the identification of even smaller parts (i.e. prospective new fragments), we need to have hashes for much more content. "Fragment_HashValue1" is the hash value generated by a hash function for the first part of a particular fragment.
- Fragment_HashValue2 (string): Just like Fragment_HashValue1.
- Fragment_HashValue3 (string): Just like Fragment_HashValue1.
- Fragment_HashValue4 (string): Just like Fragment_HashValue1.
- Fragment_HashValue5 (string): Just like Fragment_HashValue1.
- Fragment_HashValue6 (string): Just like Fragment_HashValue1.
- Fragment_HashValue7 (string): Just like Fragment_HashValue1.
- Fragment_HashValue8 (string): Just like Fragment_HashValue1.
- Fragment_HashValue9 (string): Just like Fragment_HashValue1.
- Fragment_HashValue10 (string): Just like Fragment_HashValue1.
- Fragment_HashValueX (string): These predictions are useful for starting the early stages of the ENTICE knowledge base modelling. In the next stages of the ENTICE project, we expect to define the exact

mechanism for computing these hash values and how many hash values will need to be stored.

Table 5: Repository class
Class Name: Repository
Description: A repository of VM images enables the ENTICE solution to store and index the VM images together.

Data Properties:
- Repository_id (unsignedInt): This property allows a unique number to be generated when a new individual as a Repository instance is inserted into the database.
- Repository_Country (Country_id): This property indicates in which country a repository's infrastructure is located. Each country has a unique Country_id.
- Repository_GeoLocation (GeoLocation_id): There could be more than one repository in a particular country, and the geolocation of a repository may influence the download time. Each GeoLocation has a unique GeoLocation_id.
- Repository_InterfaceEndPoint (anyURI): Each repository can be easily accessed through its unique URL.
- Repository_OperationalCost (decimal, currency): ENTICE will research heuristics for the multi-objective distribution and placement of VM images across a decentralised ENTICE repository that optimises multiple conflicting objectives, including performance-related goals, operational costs, and storage space.
- Repository_PriorityLevel1Cost (decimal, currency): As fast storage costs more and slow storage costs less, it is necessary to provide mechanisms for storing VM images at three speed levels, namely fast, medium and low-speed. Level 1 means fast storage.

- Repository_PriorityLevel2Cost (decimal, currency): As fast storage costs more and slow storage costs less, it is necessary to provide mechanisms for storing VM images at three speed levels, namely fast, medium and low-speed. Level 2 means medium-speed storage.
- Repository_PriorityLevel3Cost (decimal, currency): As fast storage costs more and slow storage costs less, it is necessary to provide mechanisms for storing VM images at three speed levels, namely fast, medium and low-speed. Level 3 means low-speed storage.
- Repository_Space (double): In the ENTICE environment, dynamic information about the underlying federated Cloud infrastructure resources, such as storage space, needs to be stored.
- Repository_SupportedFormat (string, few values): Each repository may be able to support multiple VM image formats or only one format. Each repository should thus be able to support the conversion of VM/C images from one format into various other formats for portability into various environments. This is a likely scenario with heterogeneous infrastructure behind the Cloud-based ENTICE environment.

Table 6: Comment class
Class Name: Comment
Description: The ENTICE repository should be able to support user complaints.

Data Properties:
- Comment_id (unsignedInt): This property allows a unique number to be generated when a new individual as a

Comment instance is inserted into the database.
- Comment_User (User_id): This property implies which user the comment belongs to, since each user has a unique User_id.
- Comment_Content (string): The ENTICE repository should be able to support user complaints. Users should be able to put comments on DiskImages.
- Comment_Time (datetime): Specifies when a particular comment was sent.

Table 7: Functionality class
Class Name: Functionality
Description: To support the overall lifecycle of applications deployed as VM images and to facilitate automatic optimization, setup and management in the ENTICE environment, functional and non-functional descriptions of the software components need to be stored, including information on their computational, memory, storage, and communication complexity, and all the dependencies, libraries and environmental variables necessary for deployment on a VM.

Data Properties:
- Functionality_id (unsignedInt): This property allows a unique number to be generated when a new individual as a Functionality instance is inserted into the database.
- Functionality_Classification (classification_int): A classification system, such as the Universal Decimal Classification (UDC), or any other taxonomy could be used to classify the various functionalities. The users will then be able to search according to various clearly defined taxonomical concepts.
- Functionality_Tag (string): Tags may additionally be used to describe a VM image's functionalities. By using tags, it may be possible to induce various

functionality domains for which the users frequently upload or search for VMIs.
- Functionality_Name (string): Similarly to practices adopted by YouTube, functionalities must have clearly defined names corresponding to the core aspects covered.
- Functionality_Description (string): This is a longer, freestyle textual description of the functionality, as provided by the person who is uploading the disk image.
- Functionality_InputDescription (string): This is a freestyle description of the input data needed by the image, e.g. a data disk that should be mounted to the image.
- Functionality_OutputDescription (string): This is a freestyle description of the output data of the disk image when it is running.
- Functionality_Implementation (Implementation_id): This is a description of the programming language, algorithm and libraries used to implement the functionality, and it may be used as part of the search criteria.
- Functionality_Domain (string): This is the application domain for the given functionality, e.g. used in civil engineering, medicine, etc.

Table 8: Implementation class
Class Name: Implementation
Description: Each function implemented inside a DiskImage is implemented in some programming language, and may implement a complex computational algorithm, as in the case of the DEIMOS EOD use case. Obviously, the implementation can influence the quality of the provided function. Some users may prefer specific implementations, which is why it is necessary to store metadata about this aspect in the knowledge base.

Data Properties:

- Implementation_id (unsignedInt): This property allows a unique number to be generated when a new individual as an Implementation instance is inserted into the database.
- Implementation_Description (string): This is a textual description of how the functionality has been implemented.
- Implementation_Programminglanguage (string): This specifies the programming language that has been used to implement the functionality, and potentially also libraries and other important information related to the implementation.
- Implementation_Algorithm (string): Some functionalities may be complex, e.g. they may require satellite images, as in the case of the DEIMOS use case, where the output may be calibrated satellite images. Various algorithms may be used to implement the same functionality, so here the developer may specify the exact algorithm used.

Table 9: DiskImageSLA class
Class Name: DiskImageSLA
Description: To support the overall lifecycle of applications deployed as VM images and to facilitate automatic optimization, setup and management in federated Cloud environments, information on the potentially new and resulting Service Level Agreements (SLAs) needs to be stored.

Data Properties:
- DiskImageSLA_id (unsignedInt): This property allows a unique number to be generated when a

new individual as a DiskImageSLA instance is inserted into the database.
- DiskImageSLA_ReferenceImage (DiskImage_id): This property defines which DiskImage a particular DiskImageSLA belongs to.
- DiskImageSLA_AgreedAvailabilityRepository (Repository_id): The user may wish to specify, for every uploaded disk image, exactly in which repositories of the ENTICE environment it must be stored.
- DiskImageSLA_AgreedAvailabilityCountry (Country_id): The user may wish to specify countries or wider regions where a particular disk image has to be highly available. This means that the image will have to be stored in one or more repositories in the designated countries.
- DiskImageSLA_AgreedRestriction (Country_id): Due to privacy and security reasons and national regulations, some users may wish to restrict the availability of a disk image in certain countries.
- DiskImageSLA_AgreedPriorityLevel (unsignedInt): Since the repository of VMIs has to provide fast delivery, and fast storage may be expensive, it is necessary to provide mechanisms for storing images at three hierarchical levels, supported by adequate technology for fast, medium and low-speed content delivery.
- DiskImageSLA_AgreedQoSOrder (unsignedInt): Users will be able to specify one or multiple objectives to optimise the distribution of their VM images and fragments. The objectives are of three types of conflicting metrics: QoS-related, operational cost, and storage. The user can specify what is

the most important objective, or will have to choose between different solutions representing compromises between the different objectives.
- DiskImageSLA_SecuredDelivery (boolean, "Yes" or "No"): The contents of some VMIs and data could be business-critical and must not be available under any circumstances to intruders. For this purpose, the repository must support content delivery through encrypted channels.

Table 10: Country class
Class Name: Country
Description: The various countries of the world can be defined.

Data Properties:
- Country_id (unsignedInt): This property allows a unique number to be generated when a new individual as a Country instance is inserted into the database.
- Country_Name (string): This property gives the name of the country.

Table 11: Delivery class
Class Name: Delivery
Description: This class specifies the overall disk image delivery event. It collects important information needed for reasoning, e.g. to check whether the Quality of Service and Service Level Agreement are met by the services of the distributed ENTICE repository.

Data Properties:
- Delivery_id (unsignedInt): This property allows a unique number to be generated when a new individual as a Delivery instance is inserted into the database.

- Delivery_User (User_id): Specifies the user or the Cloud application that requested the delivery of a particular disk image implementing a particular functionality.
- Delivery_Functionality (Functionality_id): Specifies the functionality that was requested.
- Delivery_RequestTime (datetime): Specifies the time when the delivery request was received by the ENTICE environment.
- Delivery_TargetRepository (Repository_id): Specifies the target repository to which, according to the internal reasoning method of the ENTICE repository, the disk image must be delivered.
- Delivery_DeliveryTime (datetime): Specifies the delivery time for a disk image, i.e. the moment when the download of a specific disk image successfully completed.
- Delivery_DeliveredDiskImage (DiskImage_id): Specifies the disk image which was delivered upon the request of the user or the Cloud application.

Table 12: User class
Class Name: User
Description: The user prepares an un-optimised VMI or requests a prepared VMI. The user might use ENTICE to optimize a VMI by uploading it to the ENTICE repository along with its functional descriptions and the necessary functionality testing process. Moreover, the external interface allows users to download VMIs through an API, which allows the delivery of a pointer (i.e. a URI).

Data Properties:
- User_id (unsignedInt): This property allows a unique number to be generated when a new individual as a User instance is inserted into the database.
- User_FullName (string): Specifies the name of the individual user (such as an application developer) or company (such as a service provider).
- User_UserName (string): Specifies the name of the user account used

to log in and authenticate.
- User_Password (string): Specifies the password of the user account used to log in and authenticate.
- User_Email (string): Specifies the email address of the user, used to automatically send and receive notifications and transactions.
- User_PhoneNumber (string): Specifies the phone number of the user.

Table 13: Quality class
Class Name: Quality
Description: Estimating the optimization potential of a VMI is essential to make a trade-off between the time and resources devoted to the optimization and its benefits. If the anticipated benefits in terms of storage space and transfer times are too small, the optimization will not be initiated. Such a request for analysis can be issued by the VMI distribution agent and the application developer. Based on the results of the analysis, they may or may not initiate the image size optimization.

Data Properties:
- Quality_id (unsignedInt): This property allows a unique number to be generated when a new individual as a Quality instance is inserted into the database.
- Quality_RoughSize (unsignedInt): Specifies the initial size of the un-optimised uploaded VMI.
- Quality_OptimizedSize (unsignedInt): In ENTICE, the VMIs will be passed through several treatments that reduce their size in order to reduce the delivery time and storage needs for a particular functional requirement set.
- Quality_PercentStorageOptimised (unsignedInt, 0-100): The developer could decide on an optimisation termination condition which might limit the possible size reduction.
- Quality_FunctionalityTested (boolean, "Yes" or "No"): A Cloud application developer may use the ENTICE repository to optimize their own

prepared and un-optimized VMI. For this purpose, the developer needs to upload the VMI along with its functional description and all the necessary functionality testing mechanisms to the repository. The functionality testing mechanism should then assure that the optimized VMI preserves the original functionality.
- Quality_UserRating (unsignedInt, 0-5): There is a mechanism which enables users to rate VM images. Rating values range from 0 to 5 points.
- Quality_UserComments (Comment_id): The ENTICE repository should be able to support users' complaints and comments.
- Quality_IsUpdateNecessary (boolean, "Yes" or "No"): During the Cloud application lifecycle, VMIs may be upgraded (e.g. operating system or software component changes) several times. The application developer may request the update of an optimized VMI. Updating an optimized image involves deciding on the inclusion of an update by a service provider, i.e. depending on the decision of a service provider the update may or may not take place. Therefore, automatic updates may be allowed or disallowed by the developer of the image.
- Quality_IsOptimizationNecessary (boolean, "Yes" or "No"): Updating an optimized image involves deciding on the inclusion of an update by a service provider, i.e. depending on the decision of a service provider the update may or may not take place (e.g. a certain service is no longer used, hence its update is not necessary), and redistributing the optimized image fragments.
- Quality_NumberOfDownloads (unsignedInt): ENTICE also needs to collect information about the usage of the VMI repository and store it in the knowledge base for

later statistics. ENTICE will automatically discover user demand patterns by analyzing the metadata (e.g. the sequence and number of downloads of particular images or fragments) published by the provider-operated repositories (e.g. similar to Glance from OpenStack) and replicate the highly demanded images or fragments according to user demand.

Table 14: GeoLocation class
Class Name: GeoLocation
Description: An international formal class will be used.

Data Properties:
- GeoLocation_id (unsignedInt): This property allows a unique number to be generated when a new individual as a GeoLocation instance is inserted into the database. There are different pre-defined properties to be considered.

Table 15: OperatingSystem class
Class Name: OperatingSystem
Description: The ENTICE environment should support various types of VMI operating systems.

Data Properties:

- OperatingSystem_id (unsignedInt): This property allows a unique number to be generated when a new individual as an OperatingSystem instance is inserted into the database.
- OperatingSystem_Types (string): The operating systems of the VM images include ArchLinux, CentOS, Debian, Fedora, OpenSUSE, Ubuntu, AIX, Amiga, Android, AROS, Bada, BeOS, Blackberry OS, Brew, Chrome OS, COS, Danger Hiptop, DragonFly BSD, Fire OS, Firefox OS, FreeBSD, GNU OS, Haiku OS, HP-UX, Inferno OS, iOS, IRIX, ...
- OperatingSystem_FileSystem (string, few values): To make deployment flexible for the different needs of users and applications, it is desirable for ENTICE to support a variety of versions of VM images for different types of file systems.
- OperatingSystem_Architecture (string, few values): To make deployment flexible for the different needs of users and applications, it is desirable for ENTICE to support a variety of versions of VM images for different types of configurations, e.g. 32-bit/64-bit hardware.
- OperatingSystem_Version (string): To make deployment flexible for the different needs of users and applications, it is desirable for ENTICE to support a variety of versions of VM images for different versions of operating systems.
- OperatingSystem_MobileOS (boolean, "Yes" or "No"): Specifies whether the operating system is a mobile operating system, like Google Android or Windows Phone OS.

6. Ontology development for the ENTICE project

For the development of the ENTICE ontology, Protégé is used as the ontology editor. This tool is a free, open-source ontology editor created by the Stanford Medical Informatics (SMI) group at Stanford University. Protégé, which is easy to follow, use and configure, was chosen from among different ontology development frameworks, since it requires less a priori knowledge and presents fewer learning difficulties when representing the underlying knowledge. Additionally, Protégé is able to achieve interoperability with other knowledge representation systems [Noy2004]. Regardless of the tool used, the crucial and toughest part of ontology development is the identification of the entities, their properties and their relations in connection with the problem. Thus, a tool like Protégé becomes most useful after an initial ontology has already been defined. The initial ontology for ENTICE has been a collaborative effort of UL, with many valuable comments from UIBK and SZTAKI. As already seen in Section 5, several classes have so far been recognized, each representing an entity in ENTICE; they are shown in Figure 3, which is effectively a summary of the classes presented in Tables 2 to 15, discarding their data properties and focusing on the object properties instead. In Figure 3, nodes (circles) represent classes, while directed edges represent relationships between them. For now, the relations are simple, mimicking the Parent-Child relation from the ER (Entity-Relationship) model. Note that the different sizes and colors of nodes and edges carry no semantic meaning; they are purely stylistic. In the following, a thorough explanation of Figure 3 is given, with the focus on describing the relations between the classes. As the ontology is a work in progress, a few shortcomings of the current ontology are addressed as well, serving as a foundation for ongoing work.

DiskImage, VMImage and CImage

The central class with the largest number of connections is DiskImage, which is a superclass of the classes VMImage (abbreviated as VMI) and CImage (abbreviated as CI). While they are technically different, representing two different levels of virtualization, namely hardware-based and operating-system-based virtualization respectively, there is currently no distinction between them from the perspective of their relations with other classes.
On the one hand, it is desirable for any stakeholder of ENTICE to have a similar experience when dealing with either type of image, so treating them as a single class seems attractive; on the other hand, they are technically too different, so representing them as two separate classes, while still descending from a common parent, seems more suitable. Regarding the latter, some connections in Figure 3 might be misleading, as they visually appear to be connected to the container class only, although they hold the same relation with the virtual machine image class too.
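The common-parent design above can be sketched with plain (subject, predicate, object) triples, where `rdfs:subClassOf` is the standard RDF Schema subclass property. The `storedIn` property and the `Repository` class are illustrative assumptions here, standing in for any object property attached to the shared DiskImage parent:

```python
# Minimal sketch of the DiskImage class hierarchy, modelled as
# (subject, predicate, object) triples. "storedIn"/"Repository"
# are illustrative names, not taken from the deliverable.
triples = {
    ("VMImage", "rdfs:subClassOf", "DiskImage"),
    ("CImage", "rdfs:subClassOf", "DiskImage"),
    # An object property attached to the common parent applies to
    # both subclasses, which is why relations with other classes
    # currently make no distinction between VMI and CI.
    ("DiskImage", "storedIn", "Repository"),
}

def subclasses_of(cls, graph):
    """Return all direct subclasses of `cls` in the triple set."""
    return {s for (s, p, o) in graph
            if p == "rdfs:subClassOf" and o == cls}

print(sorted(subclasses_of("DiskImage", triples)))  # ['CImage', 'VMImage']
```

This mirrors the trade-off in the text: a single query against the parent class covers both image types, while the subclass distinction remains available where the technical differences matter.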

Figure 3: Initially identified entities and their relationships

DiskImage class in relation with itself

The DiskImage class, and thus both derived classes, VMI and CI, is related to itself. This relation occurs in two cases: (1) when an existing image is updated, or (2) when an existing image is optimized. In either case the result, whether of the update or of the size-reduction optimization, is a new image, related to the old one. The relation should be maintained for several reasons. Depending on the SLA between image owners and image users, some users may not want to receive updates the moment they become available, but might need some time before deciding to incorporate them. Another example is when an application developer decides to optimize his or her image with the ENTICE size-optimization facility but might not be confident
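The self-relation described above can be sketched as a lineage chain. The property name `derivedFrom` and the image names are illustrative (the deliverable does not fix them); the point is that every update or optimization yields a new image linked back to its predecessor, so a user can keep deploying an older image until they choose to adopt the new one:

```python
# Sketch of the DiskImage self-relation: each update or size-reduction
# optimization yields a NEW image linked back to the image it was
# derived from. Property and image names are illustrative.
derived_from = {
    "ubuntu-base-v2": "ubuntu-base-v1",       # result of an update
    "ubuntu-base-v2-opt": "ubuntu-base-v2",   # result of a size optimization
}

def lineage(image, relation):
    """Follow the self-relation back to the original image."""
    chain = [image]
    while chain[-1] in relation:
        chain.append(relation[chain[-1]])
    return chain

# A user who has not yet adopted the optimized image can keep
# deploying "ubuntu-base-v1" while the relation records the history.
print(lineage("ubuntu-base-v2-opt", derived_from))
# ['ubuntu-base-v2-opt', 'ubuntu-base-v2', 'ubuntu-base-v1']
```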


More information

Semantic Web. MPRI : Web Data Management. Antoine Amarilli Friday, January 11th 1/29

Semantic Web. MPRI : Web Data Management. Antoine Amarilli Friday, January 11th 1/29 Semantic Web MPRI 2.26.2: Web Data Management Antoine Amarilli Friday, January 11th 1/29 Motivation Information on the Web is not structured 2/29 Motivation Information on the Web is not structured This

More information

The NEPOMUK project. Dr. Ansgar Bernardi DFKI GmbH Kaiserslautern, Germany

The NEPOMUK project. Dr. Ansgar Bernardi DFKI GmbH Kaiserslautern, Germany The NEPOMUK project Dr. Ansgar Bernardi DFKI GmbH Kaiserslautern, Germany ansgar.bernardi@dfki.de Integrated Project n 27705 Priority 2.4.7 Semantic knowledge based systems NEPOMUK is a three-year Integrated

More information

XML ALONE IS NOT SUFFICIENT FOR EFFECTIVE WEBEDI

XML ALONE IS NOT SUFFICIENT FOR EFFECTIVE WEBEDI Chapter 18 XML ALONE IS NOT SUFFICIENT FOR EFFECTIVE WEBEDI Fábio Ghignatti Beckenkamp and Wolfgang Pree Abstract: Key words: WebEDI relies on the Internet infrastructure for exchanging documents among

More information

Semantic Annotation, Search and Analysis

Semantic Annotation, Search and Analysis Semantic Annotation, Search and Analysis Borislav Popov, Ontotext Ontology A machine readable conceptual model a common vocabulary for sharing information machine-interpretable definitions of concepts in

More information

Fuseki Server Installation

Fuseki Server Installation Fuseki Server Installation Related task of the project (Task # and full name): Author: Prepared by: Approved by: Task 43 Ontology standard and Metadata Sachin Deshmukh Sachin Deshmukh Richard Kaye Page:

More information

Chapter 3 Research Method

Chapter 3 Research Method Chapter 3 Research Method 3.1 A Ontology-Based Method As we mention in section 2.3.6, we need a common approach to build up our ontologies for different B2B standards. In this chapter, we present a ontology-based

More information

OWL as a Target for Information Extraction Systems

OWL as a Target for Information Extraction Systems OWL as a Target for Information Extraction Systems Clay Fink, Tim Finin, James Mayfield and Christine Piatko Johns Hopkins University Applied Physics Laboratory and the Human Language Technology Center

More information

Ontology mutation testing

Ontology mutation testing Ontology mutation testing February 3, 2016 Cesare Bartolini Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg Outline 1 Mutation testing 2 Mutant generation 3

More information

CHAPTER 7. Observations, Conclusions and Future Directions Observations 7.2. Limitations of the Model 7.3. Conclusions 7.4.

CHAPTER 7. Observations, Conclusions and Future Directions Observations 7.2. Limitations of the Model 7.3. Conclusions 7.4. CHAPTER 7 Observations, Conclusions and Future Directions 7.1. Observations 7.2. Limitations of the Model 7.3. Conclusions 7.4. Future work Domain-specific Ontology for Student s Information in Academic

More information

Linked Data and RDF. COMP60421 Sean Bechhofer

Linked Data and RDF. COMP60421 Sean Bechhofer Linked Data and RDF COMP60421 Sean Bechhofer sean.bechhofer@manchester.ac.uk Building a Semantic Web Annotation Associating metadata with resources Integration Integrating information sources Inference

More information

Europeana RDF Store Report

Europeana RDF Store Report Europeana RDF Store Report The results of qualitative and quantitative study of existing RDF stores in the context of Europeana co-funded by the European Union The project is co-funded by the European

More information

OWL 2 The Next Generation. Ian Horrocks Information Systems Group Oxford University Computing Laboratory

OWL 2 The Next Generation. Ian Horrocks Information Systems Group Oxford University Computing Laboratory OWL 2 The Next Generation Ian Horrocks Information Systems Group Oxford University Computing Laboratory What is an Ontology? What is an Ontology? A model of (some aspect

More information