OSDBQ: Ontology Supported RDBMS Querying

Cihan Aksoy 1, Erdem Alparslan 1, Selçuk Bozdağ 2, İhsan Çulhacı 3

1 The Scientific and Technological Research Council of Turkey, Gebze/Kocaeli, Turkey
2 Komtaş Information Management Inc., Ostim/Ankara, Turkey
3 Turkish Court of Accounts, Söğütözü/Ankara, Turkey
{caksoy, ealparslan}@uekae.tubitak.gov.tr

Abstract. Handling and retrieving data is of essential importance once the data size exceeds a certain amount. Storing transactional data in relational form and querying it with Structured Query Language (SQL) is often preferred because of its tabular structure. Data may also be stored in ontological form if it includes numerous semantic relations; this form is better suited to inferring information from those relations. When transactional data contain many semantic relations, as in our problem, it is not easy to decide how to store and query them. We introduce a new querying mechanism, Ontology Supported RDBMS Querying (OSDBQ), which combines the strengths of both storing and querying methods. This paper describes OSDBQ and presents a comparison between three querying methods. The results show that OSDBQ performs better than the other two methods most of the time.

Keywords: Ontology, relational database management systems, inference, transactional data querying, semantic matching.

1 Introduction

The semantic web introduces a web of data that enables machines to interpret the meaning of information on the web [1]. This approach became necessary because the web turned into an information dump over the last decade, making it hard to retrieve the information that actually matches a search criterion. The technology aims to tag resources semantically in order to cope with this dirty information. One of the most important parts of the semantic web is the ontology layer. An ontology is a hierarchy of concepts representing the meaning of an information field [2].
It is possible to obtain ontological data from these concepts by instantiating them. The data items are therefore related to each other in the same way as the concepts. This allows us to infer hidden or deep information that can only be reached by following those relationships. On the other hand, the data grows to between two and three times its size because the relationships are stored as well. There are several ontology representations, such as OWL [3], RDF [4] and N-Triple [5]. A widely known query language for RDF is SPARQL [8].

Commonly used relational tables carry no notion of meaning; they only keep the data in tabular form. Data is retrieved from these tables by querying with SQL. Since they hold only the data, they are smaller than an ontological data store that holds the same amount of data. That is why querying a relational table returns results faster. Relational tables are designed for tabular data, whereas ontologies are well suited to hierarchical data in which semantic relationships exist.

Transactional data, which consist of sales, deliveries, invoices, claims and other monetary and non-monetary interactions, may contain many relationships. They are usually kept in tabular form. Modeling transactional data is challenging when the data size is larger than a certain amount: we have to consider querying performance, data size, and the extraction of hidden or deep information.

In this paper, we propose a new querying method that compensates for the shortcomings of relational querying by drawing on the strengths of ontologies. Our approach keeps transactional data in tabular form and queries it with SQL. While querying, however, it infers supplementary information from ontologies and inserts the obtained information into parameterized SQL queries.

The rest of the paper is organized as follows. In Section 2, we give details about the data models and querying methods used, introduce the new querying method and explain its working principle. In Section 3, we apply these methods to sample data and present the results. Finally, in Section 4, we discuss the limitations of our approach and explain what can be done to address them in future work.

2 Applied Methods

2.1 Querying Ontological Data via SPARQL

For the first data model, we used BSBM Tools [10], a data generator provided by the Berlin SPARQL Benchmark (BSBM) [11], to generate data in N-Triple format as our ontological data. Ontological data includes not only the concepts that belong to a certain domain but also the instances of these concepts. These instances correspond to the tuples of a tabular data model. Moreover, the semantic relations between concepts and between instances are also part of this kind of data.
That is why ontological data takes up more space on disk than relational data. In order to have a querying interface, we put the generated data in TDB Store [6] by using the tdbloader component of the Jena API [7]. We observed that the data size on disk decreases slightly thanks to the efficient storage mechanism of TDB Store. After storing the data, we sent queries in the SPARQL query language from a programmatic interface provided by the Jena API. We also used Jena's reasoner because the SPARQL queries require inferences. This is well suited to ontological data, since they hold relationships inside. The indexing and caching capabilities of TDB Store shorten the query time on each repeated execution of the same query, as seen in the results.
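The kind of inference the reasoner contributes in this method can be illustrated with a minimal sketch. It is plain Python, not Jena or SPARQL, and it only follows subclass links over a handful of triples. The triples and the function names superclasses and instances_of are hypothetical and chosen for illustration; they are not taken from the generated BSBM data or from the implementation used in the experiments.

```python
from collections import defaultdict

SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

# Hypothetical triples in the spirit of BSBM; not taken from the generated data.
TRIPLES = [
    ("ProductType2", SUBCLASS, "ProductType1"),
    ("ProductType3", SUBCLASS, "ProductType2"),
    ("Product1", TYPE, "ProductType3"),
    ("Product2", TYPE, "ProductType2"),
    ("Product3", TYPE, "ProductType1"),
]


def superclasses(triples, cls):
    """Return every class reachable from cls by following subClassOf edges."""
    parents = defaultdict(set)
    for s, p, o in triples:
        if p == SUBCLASS:
            parents[s].add(o)
    seen, stack = set(), [cls]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def instances_of(triples, cls):
    """Instances typed with cls directly or with any subclass of cls."""
    found = set()
    for s, p, o in triples:
        if p == TYPE and (o == cls or cls in superclasses(triples, o)):
            found.add(s)
    return found


if __name__ == "__main__":
    print(sorted(instances_of(TRIPLES, "ProductType2")))
```

Without the subclass step, a lookup for ProductType2 would miss Product1, which is typed only with the more specific ProductType3. This is the reason the queries in this method depend on a reasoner.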

2.2 Querying Relational Data via SQL

To prepare a relational database as the environment for the second method, we again used BSBM Tools, this time to generate an SQL dump containing all the commands needed to create the database tables. We executed these commands in a MySQL database, then extracted the data and loaded it into PostgreSQL. As we describe in Section 3.2, certain characteristics are identical in all types of generated data for the same scale factor. For example, the relationships inside the data, the number of instances and the number of concepts are the same in all types of generated data, whereas the data values may differ.

After we loaded the data into the database, we noticed that there are many relation tables, so the data represent the same relationships that are found in the ontological data. Instead of using the relationships of the tables directly, we applied the approach of Das et al., Supporting Ontology-based Semantic Matching in RDBMS [9], by moving these relationships into separate tables that represent the ontology. Therefore, as shown in Figure 1, the database can be viewed as consisting of two main groups of tables: one holds the ontologies while the other holds the data.

Fig. 1. Das et al.'s architecture.

In this approach, as mentioned before, there are two groups of tables. The first group, called system defined tables, stores the ontological structure and semantic relations. We can infer semantic relations between individuals from the system defined tables. The second group, called user tables, stores the bulk of the individuals, or tuples, in tabular form. After semantic rules are inferred from the system defined tables, the resulting rich SQL statement can be executed on the user tables. In this way, semantically related data can be queried in an RDBMS without using an inference engine.

After generating the database, we executed the same queries from the same interface, changing only the query language.
In this method, we transformed the SPARQL queries into SQL queries. Moreover, we inferred relationships from the ontology-related tables by using stored procedures, as explained in the approach of Das et al. For example, the operator ONT_RELATED returns all parents hierarchically attached to a certain concept or instance. The goal of this method is to represent the ontology in a relational database in order to benefit from the performance of this kind of database.
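The idea of keeping the hierarchy in its own table and resolving parents inside the database can be sketched as follows. This is only an illustration under stated assumptions: it uses SQLite from the Python standard library instead of PostgreSQL, a recursive common table expression stands in for the ONT_RELATED operator and the stored procedures, and the table names type_hierarchy and product, the column ptype and both function names are hypothetical. It is not the implementation of Das et al.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "System defined" side: the type hierarchy stored as child/parent rows.
    CREATE TABLE type_hierarchy (child TEXT, parent TEXT);
    INSERT INTO type_hierarchy VALUES ('Type3', 'Type2'), ('Type2', 'Type1');
    -- "User table" side: plain transactional tuples.
    CREATE TABLE product (name TEXT, ptype TEXT);
    INSERT INTO product VALUES ('P1', 'Type3'), ('P2', 'Type2'), ('P3', 'Type1');
""")

# All parents hierarchically attached to a given concept.
ANCESTORS_SQL = """
    WITH RECURSIVE ancestors(ptype) AS (
        SELECT parent FROM type_hierarchy WHERE child = ?
        UNION
        SELECT h.parent FROM type_hierarchy h
        JOIN ancestors a ON h.child = a.ptype
    )
    SELECT ptype FROM ancestors ORDER BY ptype
"""

# Products whose type is the given type or any type below it.
PRODUCTS_SQL = """
    WITH RECURSIVE subtree(ptype) AS (
        SELECT ?
        UNION
        SELECT h.child FROM type_hierarchy h
        JOIN subtree s ON h.parent = s.ptype
    )
    SELECT name FROM product
    WHERE ptype IN (SELECT ptype FROM subtree)
    ORDER BY name
"""


def related_parents(connection, concept):
    """Return every ancestor of concept found in the hierarchy table."""
    return [row[0] for row in connection.execute(ANCESTORS_SQL, (concept,))]


def products_under(connection, root_type):
    """Return products typed with root_type or one of its descendants."""
    return [row[0] for row in connection.execute(PRODUCTS_SQL, (root_type,))]
```

Both the hierarchy lookup and the data lookup run inside the same database here, which is the defining property of the second method: no external inference engine is involved.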

2.3 Ontology Supported Relational Data Querying

As the third method, we propose to keep the semantic relations between individuals separate from the tabular transactional data. As mentioned before, the first approach infers and queries both semantic relations and transactional data from an RDF store, while the second approach queries both semantic relations and transactional bulk data from an RDBMS data store. In this approach we propose to infer semantic relations from the RDF store and then query the transactional bulk data from a conventional RDBMS. In other words, we realize the second approach with the system-defined tables replaced by an RDF store.

Fig. 2. OSDBQ architecture.

To realize this approach, we used the same RDF store that we generated for the first method. Unlike in the first method, this time the RDF store is used for querying the domain ontology and inferring semantic relations, not for querying the transactional bulk data. The related ontology file obtained from the RDF store is loaded into memory in order to avoid the time lost opening files during inference. Using the Jena API's reasoner, the necessary semantic relations are inferred and then passed to the second part of this approach. The second part, which queries the bulk transactional data, takes the inference results produced by Jena as parameters and sends the parameterized SQL query to the RDBMS data store. Transactional data can therefore be queried in ontological form and with RDBMS performance, as seen in Figure 2.

Figure 3 depicts the basic flow of our proposed Ontology Supported Relational Database Querying architecture. A complex user query, which may require both semantic inference and transactional data querying, is submitted to the system by the user. If the complex query needs inference, the system loads the required ontology RDF files into memory and performs the semantic inferences using ontology objects.
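The two-part flow described here, inference in memory followed by a parameterized SQL query, can be sketched in a few lines. This is a hedged illustration and not the authors' implementation: a Python dictionary stands in for the ontology that Jena holds in memory, SQLite stands in for the RDBMS, and the names PARENT_OF, infer_subtypes, query_products, demo_connection and the product table layout are all hypothetical.

```python
import sqlite3

# Hypothetical in-memory ontology: child type -> parent type (assumed acyclic).
PARENT_OF = {"Type3": "Type2", "Type2": "Type1", "Type4": "Type1"}


def infer_subtypes(root):
    """Return root plus every type whose chain of parents reaches root."""
    result = {root}
    for candidate in PARENT_OF:
        node = candidate
        while node in PARENT_OF:
            node = PARENT_OF[node]
            if node == root:
                result.add(candidate)
                break
    return sorted(result)


def query_products(conn, root_type, needs_inference=True):
    """Infer extra type parameters if needed, then run one parameterized query."""
    types = infer_subtypes(root_type) if needs_inference else [root_type]
    placeholders = ", ".join("?" for _ in types)
    sql = (
        "SELECT name FROM product "
        f"WHERE ptype IN ({placeholders}) ORDER BY name"
    )
    # The inferred values are bound as parameters, never spliced into the SQL.
    return [row[0] for row in conn.execute(sql, types)]


def demo_connection():
    """Build a tiny transactional table for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE product (name TEXT, ptype TEXT)")
    conn.executemany(
        "INSERT INTO product VALUES (?, ?)",
        [("P1", "Type3"), ("P2", "Type2"), ("P3", "Type1"), ("P4", "Type4")],
    )
    return conn
```

The database in this sketch holds no hierarchy at all. Everything it needs to know about the ontology arrives as values bound into the WHERE clause, which is what distinguishes this method from the second one.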
These inferences basically prepare rich SQL parameters for transactional data querying. In other words, some of the parameters used in the WHERE clauses of the transactional SQL queries are prepared by the inference engine loaded into memory. The system is therefore able to run the SQL queries on the transactional data using the inferred parameters obtained from the inference engine running in memory.

Fig. 3. Flowchart of OSDBQ architecture. (Flowchart nodes: Complex user query; Needs inference? (Yes / No); Load ontology into the memory; Infer additional query parameters on the fly; SQL querying on transactional bulk data; Return rich query results.)

3 Application to Sample Data

It is very important for organizations to decide properly how data will be stored and retrieved when data sizes are huge. The size of the data is not the only consideration; the characteristics of the data are also taken into account when making a decision, since data may be designed as hierarchical, relational, etc. To choose the right architecture, organizations need comparison and benchmarking studies. In this section, we first give some information about the environment and utilities used during the tests, second we describe the dataset to which we applied the proposed methods, third we explain the queries, and finally we present the comparison results of the three methods.

3.1 Experimental Setup

We ran the experiments on an HP Workstation (processor: 2 x Intel Pentium 3 Xeon, 2833 MHz; memory: 8GB DDR2 667; hard disk: 320GB 7200Rpm SATA2) with the Ubuntu 10.04 LTS 64-bit operating system. The following utilities were also used:
Jena TDB Version 0.8.9 as RDF storage and query component
PostgreSQL Version 1.12.2 as database

To measure the performance of these methods, we prepared datasets of two different sizes for each method so that the results would be more consistent. One dataset includes 10000 products and related tables, whereas the other has 100000 products. For the same reason, we also repeated the execution of each query 4 times to account for effects that can slightly change the real result, such as caching mechanisms in data stores and in databases.
All methods were executed on the same machine in order to avoid network latency. The results were recorded in milliseconds.
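A measurement loop of the kind described in this section could look like the following minimal sketch. It assumes that a query is wrapped in a zero-argument callable; the function name time_query and its parameters are hypothetical, and the harness actually used in the experiments is not specified in the paper.

```python
import time


def time_query(run_query, repetitions=4):
    """Run run_query repeatedly and return each elapsed time in milliseconds."""
    timings_ms = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


if __name__ == "__main__":
    # Stand-in workload; a real run would execute a SPARQL or SQL query here.
    print(time_query(lambda: sum(range(100_000))))
```

Keeping every repetition instead of a single average makes the first, cold run visible separately from the later runs, which matters when caching changes the timings between executions.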

3.2 Dataset

The Berlin SPARQL Benchmark dataset [11] is used because it is relevant to transactional data. It is built around an e-commerce use case, where a set of products is offered by different vendors and different consumers have posted reviews about the products. It is possible to generate an arbitrary amount of data, with the number of products as the scale factor. The data generation is deterministic, so that different representations of the same dataset can be created.

The dataset is composed of instances of these classes: Product, ProductType, ProductFeature, Producer, Vendor, Offer, Review, Reviewer and ReviewingSite.

All products have between 3 and 5 textual properties. Each property consists of 5 to 15 words that are randomly selected from a dictionary. Products also contain between 3 and 5 numeric properties whose values range between 1 and 2000. All products have a product type from the type hierarchy. The depth and width of the product type hierarchy depend on the number of products. This hierarchy is included in the dataset even if the data store or database does not support RDFS inference. All products have a variable number of product features depending on their position in the product hierarchy. All products are offered by vendors. Offers contain the price and the number of days for delivery, and they are valid for a certain date interval. Reviews are published by reviewers. Reviewers have a name, a mailbox checksum, and a country in which they live. Reviews have a title and a review text of between 50 and 300 words. They also have four random ratings.

Table 1 shows the number of instances of each class in the BSBM dataset depending on our choice of product number. For brevity, we call the dataset generated with 10000 products the small dataset and the one generated with 100000 products the big dataset. Since an ontology holds data as triples, we also give the number of triples for each data size.

Table 1. Number of instances in BSBM datasets for different scales.
Data Type                    Small dataset   Big dataset
Number of Product Feature    10519           47884
Number of Product Type       329             2011
Number of Producer           206             1988
Number of Vendor             105             995
Number of Offer              200000          2000000
Number of Review             100000          1000000
Total Number of Instances    311159          3052878
Total Number of Triples      3565060         35270984

3.3 Queries

We used 4 queries of sufficient depth to compare the methods. We pay attention to depth since we are looking for an ideal data storing and querying mechanism for transactional data. As mentioned before, the dataset is built around an e-commerce use case, and the queries correspond to the search and navigation pattern of a consumer looking for a product. In the first query, the consumer searches for products that have a specific type and features. In the second, the consumer asks for products that belong to certain types and have several features but do not have a specific other feature. In the third, the consumer looks for products that belong to certain types and match either one set of features or another. Inference is needed for these three queries since each product type belongs hierarchically to another product type, as described in the dataset section; thus products that do not match a given type directly may be returned as results. In the last query, a vendor wants to find out which product categories get the most attention from people in a certain country. This requires a similar inference: a product category that attracts attention increases the popularity of its parent product category.

3.4 Results and Interpretation

After preparing the environment, we started to execute the queries. First we ran the queries on the small dataset, which consists of 10000 products.

Table 2. Results of methods for 10000 products (ms).

Query       Test       1st method   2nd method   3rd method
1st query   1st test   3009         1169         1881
1st query   2nd test   269          35           12
1st query   3rd test   179          34           13
1st query   4th test   117          31           12
2nd query   1st test   1075         772          1675
2nd query   2nd test   146          58           14
2nd query   3rd test   147          50           13
2nd query   4th test   72           35           12
3rd query   1st test   1637         104          496
3rd query   2nd test   150          46           64
3rd query   3rd test   105          48           62
3rd query   4th test   115          48           61
4th query   1st test   2304         3153         1912
4th query   2nd test   960          28           28
4th query   3rd test   549          30           18
4th query   4th test   515          29           17

In Table 2, each column represents one of the data storing and querying methods described in the previous section. The rows show the runs of each query. Each query was performed four times. We obtained better results each time we executed the same query again. The results are sufficiently consistent after the first test. The 1st method, where ontologies are queried, generally gives worse results than the others. The 3rd method, where ontologies were partially used with tabular data, returns better results in most cases than the 2nd method, where only tabular data were queried. This suggests that supporting relational databases with ontologies and querying them using semantic reasoners may increase performance.

Table 3. Results of methods for 100000 products (ms).

Query       Test       1st method   2nd method   3rd method
1st query   1st test   77034        12728        10154
1st query   2nd test   591          286          298
1st query   3rd test   484          285          285
1st query   4th test   468          279          286
2nd query   1st test   514          4856         12631
2nd query   2nd test   448          568          211
2nd query   3rd test   441          565          212
2nd query   4th test   443          559          211
3rd query   1st test   1401         1434         3233
3rd query   2nd test   1367         912          756
3rd query   3rd test   1342         912          754
3rd query   4th test   1323         905          749

To check the consistency of the results from our small dataset, we performed the same process with data 10 times bigger. In Table 3, where columns and rows have the same meaning as in Table 2, the results confirmed that the 3rd method usually returns the best results among the three methods. On the other hand, although the 3rd method gives the best results, it may not always be applicable, as seen in our experiment. Since the ontology needed to execute the 4th query was larger than the available memory, we could not run the last query. This shows the scalability problem of the 3rd method.
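One way to summarize the comparison is to average the runs after the first test, since the text notes that results are consistent from the second test onward. The sketch below does this for the small dataset using the values of tests 2 to 4 from Table 2. Averaging the warm runs is an assumption made here for illustration and is not a step described in the paper; the names WARM_RUNS_MS, warm_means and fastest_method are hypothetical.

```python
# Timings of tests 2 to 4 copied from Table 2 (10000 products), in milliseconds.
WARM_RUNS_MS = {
    "1st query": {
        "1st method": [269, 179, 117],
        "2nd method": [35, 34, 31],
        "3rd method": [12, 13, 12],
    },
    "2nd query": {
        "1st method": [146, 147, 72],
        "2nd method": [58, 50, 35],
        "3rd method": [14, 13, 12],
    },
    "3rd query": {
        "1st method": [150, 105, 115],
        "2nd method": [46, 48, 48],
        "3rd method": [64, 62, 61],
    },
    "4th query": {
        "1st method": [960, 549, 515],
        "2nd method": [28, 30, 29],
        "3rd method": [28, 18, 17],
    },
}


def warm_means(table):
    """Mean warm-run time per query and method."""
    return {
        query: {method: sum(runs) / len(runs) for method, runs in methods.items()}
        for query, methods in table.items()
    }


def fastest_method(table):
    """Method with the lowest warm-run mean for each query."""
    return {
        query: min(means, key=means.get)
        for query, means in warm_means(table).items()
    }


if __name__ == "__main__":
    for query, method in fastest_method(WARM_RUNS_MS).items():
        print(query, "->", method)
```

On these numbers the 3rd method has the lowest warm-run mean for the 1st, 2nd and 4th queries, while the 2nd method is lowest for the 3rd query.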

4 Conclusion

In this paper, a new data storing and querying mechanism, Ontology Supported RDBMS Querying (OSDBQ), is introduced and compared with the two best-known data querying mechanisms. SQL is ideal for querying tabular data. SPARQL is preferred for querying and inferring over ontological data where relations exist. The OSDBQ approach is well suited to querying tabular data in which relations exist as they do in ontological data. Its performance relies not only on the strengths of the other two methods but also on the fact that the necessary inferences are made with ontologies held in memory. However, our approach is limited by the very thing it relies on: large ontologies may not fit into memory.

Since the data size increases when all relationships are included in the ontological data, the results of the 1st method are generally worse than the others. Most of the time the 3rd method returned better results than the 2nd method. We can therefore say that it should be used as long as the ontologies fit into memory.

This paper aims to develop a new querying method for handling semantically related transactional datasets. The new OSDBQ method may be applied to huge datasets and the large ontologies behind these datasets. Our future work will address the handling of large ontologies in memory. We will try to predict the necessary parts of the ontologies and bring only those parts into memory. A huge transactional and semantically relational audit dataset of the Turkish Court of Accounts will be adapted to the OSDBQ framework for analytical purposes. In this way, we will try to overcome the size problem of large ontologies. Ontology merging may be another challenge to address in order to improve performance by facilitating the prediction mentioned above.

References

1. Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Scientific American Magazine (2001)
2. Guarino, N.: Formal Ontology and Information Systems. In: 1st International Conference on Formal Ontology in Information Systems [FOIS], pp. 2-5. Torino (1998)
3. McGuinness, D.L., Harmelen, F.: Owl Web Ontology Language, http://www.w3.org/tr/owl-features/
4. RDF, http://www.w3.org/rdf/
5. N-Triples, http://www.w3.org/2001/sw/rdfcore/ntriples/
6. TDB, http://openjena.org/wiki/tdb
7. Jena A Semantic Web Framework for Java, http://jena.sourceforge.net/
8. SPARQL Query Language for RDF, http://www.w3.org/tr/rdf-sparql-query/
9. Das, S., Chong, E., Eadon, G., Srinivasan, J.: Supporting Ontology-based Semantic Matching in RDBMS. VLDB, pp. 1054-1065, Toronto (2004)
10. Schultz, A.: BSBM Tools, http://sourceforge.net/projects/bsbmtools/
11. Bizer, C., Schultz, A.: The Berlin Sparql Benchmark. In: Int. J. Semantic Web. Inf. Syst., pp. 1-24. (2009)