Object Migration in a Distributed, Heterogeneous SQL Database Network

Linköping University
Department of Computer and Information Science
Master's thesis, 30 ECTS
Computer Engineering (Datateknik)
2018
LIU-IDA/LITH-EX-A--18/008--SE

Object Migration in a Distributed, Heterogeneous SQL Database Network
Datamigrering i ett heterogent nätverk av SQL-databaser

Joakim Ericsson

Supervisor: Tomas Szabo
Examiner: Olaf Hartig

Linköpings universitet, SE Linköping

Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years starting from the date of publication, barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page:

© Joakim Ericsson

Abstract

There are many different database management systems (DBMSs) on the market today. They all have different strengths and weaknesses. What if all of these different DBMSs could be used together in a heterogeneous network? The purpose of this thesis is to explore ways of connecting the many different DBMSs together. This thesis will explore suitable architectures, features, and performance of such a network. This is all done in the context of Ericsson's wireless communication network. This has not been done in this context before, and a big part of the thesis is exploring whether it is even possible. The result of this thesis shows that it is not possible to find a solution that can fulfill the requirements of such a network in this context.

Acknowledgments

Thanks to my family, who have encouraged and supported me through all my years of education. It has not always been easy, but looking back it was entirely worth it. This thesis marks a cornerstone of my formal education, but it is only the start of lifelong learning.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
List of Listings
List of Acronyms
1 Introduction
Motivation
Old System
Aim
Initial System Description
Requirements
Heterogeneous Databases
SQL Database
Data Migration
Performance Requirements
Research Questions
Delimitations
Method
Pre-study
System Design and Implementation
Evaluation
Theory
Relational Database Management Systems and SQL
SQL
ACID
OLAP versus OLTP
Distributed Database Management Systems
ODBC, JDBC, and OLE DB
Microsoft Linked Servers
Distributed Query Engines
PrestoDB
Apache Drill
3.7 Multistore and Polystore Systems
CloudMdsQL and CoherentPaaS
BigDAWG
Data Warehousing
Distributed Database
Database Schema
Distribution Schema
System Architecture
Centralized Approach
Distributed Approach
Possible Architectures
CellApp Architectures
Architecture A
Architecture B
Architecture C
Machine Learning Application
Architecture D
Architecture E
Results
Performance Measurements
Test Setup
Mock Application
Database Management Systems (DBMSs) Tested
Test Parameters
Result using ODBC
ODBC Using a High-performance Machine
Using JDBC
Introducing a Network Delay
Introducing a Distributed Query Engine
Comparing Relational Database Management Systems (RDBMSs) to a non SQL (NoSQL) Alternative
The Machine Learning Application
Data Migration
Final System
Discussion
Literature Study
Performance Measurements
Uniform Structured Query Language (SQL) Syntax
Data Migration
Method
Final System
Future Work
Societal and Ethical Considerations
Conclusion
Bibliography
A Protobuf Model

List of Figures

1.1 Data format
Initial sketch of the proposed system architecture
SQL database schema
States table distribution schema
Neighbors table distribution schema
Architecture A - Simple direct connection
Architecture B - Simple middleware
Architecture C - Simple
Architecture D - Middleware
Architecture E - Database access component
Test setup
MySQL Memory engine read and write operations
SQLite read and write operations
SQLite spike read and write operations
SQLite main memory read and write operations
Test setup
Test setup
Redis read and write operations
Proposed solution

List of Tables

6.1 Test parameters
Test machine
ODBC - Initial measurements of query read operations
Test machine
ODBC high-performance machine - Measurements of query read operations
JDBC - Initial measurements of query read operations
ODBC - Over network
Distributed query engines - Over network
Performance measurements non SQL (NoSQL)

List of Listings

3.1 PrestoDB distributed query example
Drill distributed query example
Mock CellApp pseudo code
Read request
Write request
GET operation
SET operation
SQL data migration
Delete functionality
A.1 Protobuf model

List of Acronyms

5G 5th generation mobile network.
ACID atomicity, consistency, isolation, and durability.
ANSI American National Standards Institute.
API application programming interface.
DBMS database management system.
DDBMS distributed database management system.
ISTC Intel Science and Technology Center for Big Data.
JDBC java database connectivity.
NoSQL non SQL.
ODBC open database connectivity.
OLAP online analytical processing.
OLE DB object linking and embedding, database.
OLTP online transaction processing.
RDBMS relational database management system.
RTT round-trip time.
SQL structured query language.

1 Introduction

The field of distributed systems is growing rapidly. An increasing number of systems are distributed in some way to enhance performance and/or stability, but the distribution often increases the complexity of the system.

1.1 Motivation

The 5th generation mobile network (5G) architecture is a highly distributed system with many different requirements, such as the need to route huge amounts of data in real time. To keep this system up and running, a lot of configuration data must be stored and shared between the applications in the system. This configuration data is used to calculate handovers between cells. It is therefore important to maintain a common configuration throughout the entire network.

Distributed databases are also a growing field, with new techniques developing at a rapid pace. Having a large number of databases distributed over large geographical areas leads to interesting problems that need to be solved. There exist two distinctly different architectural approaches: homogeneous and heterogeneous.

In a heterogeneous distributed database system, each site may run different database software. Even the operating system or hardware may differ. This has the advantage of making the entire system easier to expand, since it is compatible with a variety of configurations. This flexibility does, however, come with some disadvantages: communication between the different types of databases is not as straightforward as it would have been if the system were homogeneous. Data must be translated when it is exchanged between databases of different types. This increases the complexity of the system and can prove to be both a technical and an economic challenge. [1]

A homogeneous distributed database system consists of a network of similar machines running the same hardware and software. This is often simpler and less costly to implement, but at the same time, it puts more requirements on the system. [2]

The customer, in this case Ericsson, has a set of requirements that the system needs to conform to in order to be a useful replacement for their current system. To measure whether the system meets these requirements, some metrics must be used. These metrics represent an important part of the requirements. More specific information about the requirements and the metrics that will be studied can be found below in the method and theory sections.

1.2 Old System

Ericsson has thousands of base stations that currently hold a lot of configuration data using in-memory storage. This data is used to keep track of neighboring cell towers, optimize handovers, and store application state. For simplicity, these functions will be combined into a fictitious application called "CellApp" in this thesis. The data that should be stored for each CellApp uses the structure illustrated in Figure 1.1.

Figure 1.1: Data format

The attributes in Figure 1.1 are key-value pairs where the keys are strings and the values are of simple primitive types such as integers, booleans, and short strings. Each neighbor entry in the list consists of a small number (usually ten to fifty) of key-value attributes representing the state of one of the neighboring base stations. This data about neighbors is used for handover calculations.

1.3 Aim

This master's thesis will implement and evaluate ways of communicating with distributed heterogeneous structured query language (SQL) database systems. The solutions will be compared, using some selected metrics, to similar implementations in a heterogeneous designed network. The metrics used to compare the systems will be selected by examining the existing solution and architecture of the system. It may not be feasible to implement this system using a heterogeneous architecture. In that case, different heterogeneous solutions will be evaluated against each other.

1.4 Initial System Description

The problem with the current in-memory storage is that it is hard to change if you want to use another database or move the data to a remote database. Doing so would require rewriting the code, with specially written code for each of the different ways of storing data. This thesis aims to find solutions to this problem by moving towards a more abstracted approach to data storage for this application. This will make it easier to store data using different database systems, both locally and remotely (which is not possible at the moment), all using a uniform syntax. Ericsson's idea is that SQL could be a suitable abstraction to solve this. Therefore, SQL DBMSs will be an important focus. An initial sketch of the architecture wanted for this system can be seen in Figure 1.2.

[Figure 1.2: Initial sketch of the proposed system architecture. Diagram labels: Machine Learning Application, SQL middleware, Distributed Database (local SQLite, remote MySQL, cloud SQL), SQL middleware, CellApps.]

Another reason why this abstraction is wanted is that it would allow a machine learning application to read all the data in the entire network of databases and analyze it (see Figure 1.2). This application will then do some calculations to optimize handovers between CellApps and, based on these optimizations, make changes to configuration data in the heterogeneous database network. Such a change could, for example, be adding or removing neighbors for a specific CellApp. This machine learning program has significantly looser performance requirements than the CellApp, because the machine learning algorithm will only be run during some maintenance period, typically once a day. The CellApp, however, needs to run continuously in real time.

It is possible that the data of a CellApp needs to be migrated to another database. One case for this could be that too many CellApps are running on the same piece of hardware and some of them need to be moved to another machine for performance reasons. In this case, the data will also need to be moved into a remote or cloud database. This migration should preferably not interfere with the availability of the data. Data also needs to be consistent between databases after a migration, so that data is not corrupted, only partially migrated or, in the worst case, removed completely. The decision to start a migration of the data is made by human intervention.

Another aspect of the research is to check whether it is possible to move this data to a remote or cloud database, or whether this will add so large a performance penalty that the system becomes unusable. It will often be the case that multiple CellApps use the same DBMS.

Requirements

In this section, the requirements of the system are listed and explained. The requirements are tightly coupled to the research questions. The requirements come from the customer, Ericsson, and are what this thesis will focus on. It is not certain that all the requirements below are feasible to implement together, but the goal of the project is to find and evaluate such a system. This section will also try to explain why these requirements exist and why they are important in the context of this project and the system in question. The source of the information in this chapter is the customer, Ericsson, unless otherwise specified. The information was extracted during meetings with the technical supervisor of the thesis project, and from that the requirements were formed. The following requirements on the system exist; they are tightly coupled with the research questions listed in Section 1.5.

Heterogeneous Databases

The network is heterogeneous and distributed because the system will consist of many different machines that handle different loads, are located at different locations and, most importantly, run different DBMSs. This becomes even more complex when taking into account that not all of the DBMSs will have the same features or even run on the same system architecture. Some of the DBMSs could be running in the cloud and have access to thousands of gigabytes of data storage and a lot of processing power and RAM. These computers may even be able to scale up and down depending on the demand on the service. Other parts of the system may not have such resources and instead have the processing power and storage of an average desktop PC. All of this increases the complexity of connecting to and using the network of databases with a uniform syntax.

SQL Database

One of the requirements of the system is that it should be an SQL database system. SQL will be used because Ericsson considers it to be a good abstraction for communication in a heterogeneous database system. From this requirement it follows that relational databases are a good choice, because they almost always use SQL. The data stored in the system is relatively simple and will not use all the features that a relational SQL database provides. The data, as it looks today, is almost entirely key-value storage without any complex relations. Using SQL is also believed to be worthwhile for future development, should the stored data need a more advanced structure.
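To make the idea of keeping key-value configuration data behind an SQL interface more concrete, the following is a minimal sketch using Python's standard `sqlite3` module. The table and column names (`state`, `neighbor`, `cell_id`, `neighbor_id`) and the helper functions are assumptions made for illustration only; they are not the schema developed later in the thesis. Values are stored as text for simplicity, and `INSERT OR REPLACE` is SQLite-specific syntax, which is exactly the kind of dialect difference a uniform syntax would have to hide.

```python
import sqlite3

# Hypothetical schema: one row per attribute of a CellApp, and one row per
# attribute of each of its neighbors. All names are illustrative assumptions.
SCHEMA = """
CREATE TABLE state (
    cell_id TEXT NOT NULL,
    key     TEXT NOT NULL,
    value   TEXT,
    PRIMARY KEY (cell_id, key)
);
CREATE TABLE neighbor (
    cell_id     TEXT NOT NULL,
    neighbor_id TEXT NOT NULL,
    key         TEXT NOT NULL,
    value       TEXT,
    PRIMARY KEY (cell_id, neighbor_id, key)
);
"""


def open_store():
    """Create an in-memory SQLite database holding the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def set_neighbor_attr(conn, cell_id, neighbor_id, key, value):
    """Insert or overwrite one neighbor attribute (value stored as text)."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO neighbor VALUES (?, ?, ?, ?)",
            (cell_id, neighbor_id, key, str(value)),
        )


def get_neighbor(conn, cell_id, neighbor_id):
    """Return all attributes of one neighbor as a dict of key -> value."""
    rows = conn.execute(
        "SELECT key, value FROM neighbor WHERE cell_id = ? AND neighbor_id = ?",
        (cell_id, neighbor_id),
    )
    return dict(rows)
```

A relational layout like this uses only a small part of what an RDBMS offers, which matches the observation above that the data is almost entirely key-value storage.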

Data Migration

Data in this system should be able to be moved from one DBMS to another; this will be called a migration of data. Whenever such a migration takes place, the data should remain available to the applications. Hence, data migration should impose as little loss of availability as possible. The data must also remain consistent across migrations, so that data is not corrupted or only partially migrated.

Performance Requirements

These are the performance requirements for the system. There are two types of applications that are part of the system: the CellApps and a machine learning application. These two applications have different data access patterns and different performance requirements.

CellApp typical usage: query data from 100 different neighbors per second, and only a few write operations every minute. Each of the neighbors that are queried contains 20 attributes. One read query will be issued every 10 ms and should not overrun that time.

Machine learning application: reads all the data periodically (daily) and makes updates to selected attributes. The time frame for this is hours.

To benchmark this system, mock applications with the same access patterns as the real system will be developed. Based on these mock applications, I will measure the time it takes for typical queries in the system to run. More specifically, the time is measured from when the query is executed until the queried data is returned to the application.

1.5 Research Questions

The following questions will be considered in this master's thesis:

1. What are the important properties for a distributed network of heterogeneous SQL DBMSs in the given setting?
2. What frameworks and techniques can be used to communicate with a heterogeneous network of DBMSs using a uniform SQL syntax?
   a) How do these frameworks and techniques compare to each other?
   b) Is it feasible to build such a system, using the frameworks and techniques, that fulfills the functional and performance requirements of this project?
3. Will the system built, using the frameworks and techniques, enable data to migrate from one DBMS to another while keeping consistency and availability of the data?

1.6 Delimitations

The focus of the project will be to find already existing solutions, frameworks, and techniques that fulfill the requirements, not to design and implement such a system from scratch. Ericsson acknowledges that a system that achieves all the requirements could be built, but it would take a lot of time and resources to implement from scratch. They also believe that there are similar, already existing solutions and techniques that could be used to solve the problem. Finding these existing solutions and techniques will be the focus of this master's thesis.
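The measurement described under Performance Requirements (timing each read from the moment the query is executed until its rows are returned, against a 10 ms budget) can be sketched as follows. This is a hedged illustration, not the mock CellApp used in the thesis: the table layout, the function names, and the use of an in-memory SQLite database are assumptions, and the numbers it produces say nothing about the DBMSs and setups measured later.

```python
import sqlite3
import statistics
import time

BUDGET_S = 0.010  # one read query every 10 ms, per the stated requirement


def build_db(neighbors=100, attrs=20):
    """Fill an in-memory table with `neighbors` x `attrs` synthetic attributes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE neighbor (neighbor_id INTEGER, key TEXT, value TEXT, "
        "PRIMARY KEY (neighbor_id, key))"
    )
    conn.executemany(
        "INSERT INTO neighbor VALUES (?, ?, ?)",
        [(n, f"attr{a}", "x") for n in range(neighbors) for a in range(attrs)],
    )
    conn.commit()
    return conn


def time_reads(conn, neighbors=100):
    """Time one read query per neighbor; returns latencies in seconds."""
    latencies = []
    for n in range(neighbors):
        start = time.perf_counter()
        conn.execute(
            "SELECT key, value FROM neighbor WHERE neighbor_id = ?", (n,)
        ).fetchall()  # include fetching the rows in the measured time
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies, budget=BUDGET_S):
    """Median, maximum, and number of queries that overran the budget."""
    return {
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
        "overruns": sum(1 for t in latencies if t > budget),
    }


if __name__ == "__main__":
    print(summarize(time_reads(build_db())))
```

Reporting the maximum and the number of overruns alongside the median matters here, because the requirement is that no single query overruns its 10 ms slot, not that the average is low.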

For the machine learning application, only solutions, frameworks, and techniques that can query data in place, at the different DBMSs, will be considered. This is opposed to copying data from the different DBMSs to a single location and then running queries on the collected data. This delimitation was made to limit the scope of the thesis.

One delimitation is that the frameworks and techniques must be able to run on Linux, since this is the operating system intended for production. Another delimitation is that only open source and other "free to use" projects will be considered. This is because it is, in most cases, a hassle to get trial subscriptions for paid products, while open source projects can simply be downloaded and run immediately. While these paid alternatives will not be tested in this thesis, some of them will be evaluated in the theory section and compared, conceptually, with the open source alternatives found.
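As a closing illustration for this chapter, the data migration requirement (data must not end up corrupted, partially migrated, or lost) can be sketched as a copy, verify, then delete sequence. This is a simplified sketch under stated assumptions, not the migration method evaluated in the thesis: the `neighbor` table, the `migrate_cell` function, and the use of two SQLite connections are hypothetical. It also does not handle writes that arrive at the source during the migration, and there is no atomicity across the two databases; a crash between the copy and the delete leaves the data in both places rather than in neither.

```python
import sqlite3

INSERT = "INSERT INTO neighbor VALUES (?, ?, ?, ?)"
SELECT = (
    "SELECT cell_id, neighbor_id, key, value FROM neighbor "
    "WHERE cell_id = ? ORDER BY neighbor_id, key"
)


def make_db():
    """Create an in-memory database with a hypothetical neighbor table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE neighbor (cell_id TEXT, neighbor_id TEXT, key TEXT, "
        "value TEXT, PRIMARY KEY (cell_id, neighbor_id, key))"
    )
    return conn


def migrate_cell(source, target, cell_id):
    """Move all rows of one CellApp from source to target; returns row count."""
    rows = source.execute(SELECT, (cell_id,)).fetchall()

    # Step 1: copy inside one transaction on the target, so a failure
    # part-way through rolls back and leaves no partial copy behind.
    with target:
        target.executemany(INSERT, rows)

    # Step 2: verify the copy before touching the source.
    if target.execute(SELECT, (cell_id,)).fetchall() != rows:
        raise RuntimeError("verification failed; source left untouched")

    # Step 3: only now remove the rows from the source.
    with source:
        source.execute("DELETE FROM neighbor WHERE cell_id = ?", (cell_id,))
    return len(rows)
```

Deleting from the source only after verification means that any failure leaves the original data in place, which trades some duplicated storage during the migration for protection against loss.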

2 Method

This chapter aims to provide a method for building a system consisting of a distributed database, to store the data in the system, and a middleware to handle communication with the many DBMSs that the distributed database consists of.

To develop this system and find answers to the research questions, the first step was to extract the requirements for the system from the customer. The requirements are by design coupled with the research questions. After the requirements were collected, a preliminary design of the system architecture could be established. The requirements were then used to perform a pre-study of techniques believed to solve the problem. These techniques were then tested and benchmarked against the requirements in order to find the most suitable solution for this project.

To benchmark the different techniques, they were first integrated into the system. Following this, tests were run with synthetic data to measure the different metrics. Synthetic data was used because the system in question is not in production yet and no real data is available at this moment. The synthetic test data is based on data from a legacy system. Therefore, results and measurements made in this thesis can be expected to be similar to the results in production with real data.

The method chapter is divided into three parts. The first part is the pre-study, where a literature study was performed to find suitable techniques for the problem. The second part is the system design and implementation phase. Here the techniques found in the pre-study were developed into possible architectures for the system. A test environment was then set up in order to test these techniques with different architectures. In the last part, the different techniques and architectures were evaluated using the measurements collected.

2.1 Pre-study

The pre-study assessed alternative techniques and architectures for building the system by means of a literature study. The focus was on finding systems that satisfy the requirements stated in the Requirements section. For this evaluation, the techniques brought up in the theory section of this master's thesis were studied with the theory as a foundation. The techniques that, according to the theory, are believed to be able to fulfill all of the requirements were researched further and implemented in the design phase of the project. In the case that no system is found that can satisfy all the requirements, a system that fulfills most of the requirements will be used.

The result of the pre-study is a theory chapter with techniques that could be used to build the distributed database and the system around it. Based on these techniques, alternative architectures of the system could be developed. These architectures and techniques will be implemented and tested in the next phase of the method.

2.2 System Design and Implementation

In the design and implementation phase, the different techniques and architectures were implemented in a realistic environment. Because the final system will be massively distributed, a less complex version of the environment was used to run the tests and benchmarks in this project. The test environment tries to capture the most important features of the real environment, but it is possible that the results in the test environment differ from those in the real environment.

In this step the database schema for the databases was developed. All the databases in the system combine into a distributed database. A distribution schema for the entire distributed database was also developed.

2.3 Evaluation

Here the data collected in the performance measurements and the data collected in the literature study were evaluated. The performance measurements were collected by creating a test bench for the CellApps to emulate real-world conditions. This test was then run in a realistic environment. The measurements from the test were collected and evaluated against each other. The candidate for the final system architecture was chosen based on the measured performance and on how well the architecture fit the other requirements.

3 Theory

3.1 Relational Database Management Systems and SQL

A database is a collection of data. In a relational database management system (RDBMS) the data is stored as rows in a table. Data in an RDBMS can be connected through relations and constraints. It is up to the RDBMS to ensure that all the data adheres to the constraints. One example of such a constraint could be that all rows in the table should have a unique primary key. [3]

A traditional RDBMS has to provide many different features and meet many requirements. Many of these features are supported in almost all DBMSs, both non-relational and relational. This chapter will highlight some techniques that are used in RDBMSs.

SQL

Structured Query Language, SQL for short, is a language used in RDBMSs. The language can be used to describe the structure of the database and the data stored. SQL can also be used to query the database for data. A query is a request to read from the database. An SQL statement is a request to either read from or write to the database. [4]

SQL is a standard published by the American National Standards Institute (ANSI), which progressively publishes new additions to the SQL standard every couple of years. [5] Even though many RDBMSs use SQL, they do not always follow the standard completely. Different RDBMSs often support different data types and other features. This means that a query that is supported in one RDBMS does not necessarily work in another RDBMS, although both should be able to support roughly the same functionality. [4]

ACID

ACID is an acronym that stands for atomicity, consistency, isolation, and durability. These are properties of a transaction in an RDBMS. A transaction in SQL is one or multiple SQL statements that are always run in full. Either all statements are executed completely or none of the statements should have any effect on the database. An RDBMS ensures that every transaction is run in accordance with the ACID properties. [6] Most of the ACID concepts do not only concern database theory but are general concepts used in computer science. [7]

Atomicity: This property guarantees that a transaction is always executed in full. It can never happen that a transaction partially updates the database. This includes unexpected situations where, for example, the DBMS crashes due to loss of power or due to other failures or bugs.

Consistency: This property makes sure that every transaction that is run in the database will result in a valid state of the database. To be in a valid state means that all data that is written follows the rules and constraints in the database.

Isolation: Isolation is used to ensure database integrity when running transactions concurrently in the database. This means that the result of concurrent transactions should be reproducible by running the same transactions in some sequential order.

Durability: Durability means that a committed transaction will remain committed indefinitely, even if the program or the computer running the database crashes. To implement this, the DBMS must make sure to always save committed transactions to permanent storage. Durability will therefore not be achieved if the data is only stored in regular RAM.

3.2 OLAP versus OLTP

There are two main classes of use cases when it comes to database systems: online analytical processing (OLAP) and online transaction processing (OLTP).

OLTP: An OLTP system is characterized as a system that performs small and fast transactions against a database. This means that queries need to be fast. Often, in an OLTP system, the data read or modified is directly facing a user or application. [8] A typical example of this kind of OLTP system is a database storing user login credentials for a website. The queries will, in this case, be exceedingly small and need to be fast. If the user wants to log in, a simple query to read the data for one user will be executed. If the user wants to change their password, a simple update statement will be executed.

OLAP: An OLAP system, on the other hand, is, as the name suggests, more suitable for analytics. Queries in an OLAP system can be complex and require data from many sources, with different database schemas, to resolve. These queries can take a long time to execute compared to queries in an OLTP system. These systems often query a lot of historical data, whereas OLTP systems often only query current data. Because the queries are used for analytics, these systems are often read-only by design. These systems often do not face an end user the same way that OLTP systems do. [8] An example of this kind of OLAP system could be a big collection of databases. Data from these databases will be aggregated and queried using complex and long-running functions. Based on this analysis, some changes or decisions will then be made to optimize the system.
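The difference between the two access patterns can be illustrated with a small sketch that follows the login example above. The schema (`account`, `login_event`), the function names, and the rows are made up for illustration; a real OLAP workload would span far more data and sources than a single in-memory SQLite database.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with made-up accounts and logins."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE account (
            username TEXT PRIMARY KEY,
            password_hash TEXT NOT NULL
        );
        CREATE TABLE login_event (username TEXT, day TEXT, success INTEGER);
        """
    )
    conn.executemany(
        "INSERT INTO account VALUES (?, ?)",
        [("alice", "hash-a"), ("bob", "hash-b")],
    )
    conn.executemany(
        "INSERT INTO login_event VALUES (?, ?, ?)",
        [
            ("alice", "2018-01-01", 1),
            ("alice", "2018-01-01", 0),
            ("bob", "2018-01-01", 1),
            ("bob", "2018-01-02", 0),
        ],
    )
    conn.commit()
    return conn


def oltp_password_hash(conn, username):
    """OLTP-style access: a point lookup of one row by primary key."""
    row = conn.execute(
        "SELECT password_hash FROM account WHERE username = ?", (username,)
    ).fetchone()
    return row[0] if row else None


def olap_failures_per_day(conn):
    """OLAP-style access: scan and aggregate the whole event history."""
    return conn.execute(
        "SELECT day, COUNT(*) AS attempts, SUM(1 - success) AS failures "
        "FROM login_event GROUP BY day ORDER BY day"
    ).fetchall()
```

The first query touches a single row through an index and stays fast as the table grows, while the second reads every event, so its cost grows with the amount of history kept.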

21 3.3. Distributed Database Management Systems Most systems today support either OLTP or OLAP, not both. However, there has been some research of combining them to a system that can handle them both, but this increases the complexity of the system. [9] 3.3 Distributed Database Management Systems A distributed database management system (DDBMS) is a management system for a collection of distributed databases. Similar to a DBMS that manages a local database, the DDBMS manage a distributed collection of databases. It hides the complexity of the distributed system to the user. [2] Although the system developed in this thesis will not directly be a DDBMS it will be similar in many ways, therefore some conceptual background of these systems will be required. Before a decision is made to implement a distributed database system for storage of data, there are several factors to consider. A distributed database system has both advantages and disadvantages compared to a regular centralized system. A distributed system will probably be more expensive in terms of hardware required, but runtime costs may benefit from the possibility to fine-tune each machine separately from the rest to achieve maximum performance. When the data is distributed between several locations it has both security and integrity related benefits. The protection of data is improved when the data is not located at the same site. [10] A distributed system may be heterogeneous in a few different ways. The different components of a heterogeneous distributed system may differ in hardware, software, or communication protocols. Two different systems can have different data models which in turn have differences in data structures, constraints, and query language. The difference in structure is easier to deal with if the two representations have the same data content. If not, the difference in content may require significant work to make the systems compatible with each other. 
[11]

Data is fetched from a database by executing queries. In a conventional, non-distributed database system, all data requested by a query can be provided by the single database. This may not be the case in a distributed system: to retrieve all the data of interest, the user may have to execute several queries on different databases and then combine the results into a single dataset. To make the distributed system convenient for the end user, a distributed query manager can be used. The query manager makes the distributed system behave like its non-distributed counterpart by taking a single query provided by the user and combining data from the respective sources into a single result set. How difficult it is to create an efficient distributed query manager depends on the differences between the databases in the system. [12]

When a transaction is executed in a distributed environment, it may write data located in multiple databases at once. In this case, synchronization problems arise, because a distributed system has no native synchronization or atomic operations. This can lead to data becoming corrupt because of race conditions, or to deadlocks caused by partially updated data. In some cases this kind of synchronization is not needed, but where it is, two properties need to be taken into consideration. The first is local synchronization, which makes sure that queries running concurrently on the same database are synchronized and run in order. The other is global synchronization, which makes sure that the
entire network of databases keeps a consistent state. The latter is harder to achieve and can add overhead. [12]

For any database system, there has to be a way of performing various administrative tasks. These include authorization of users and management of semantic integrity rules. In a heterogeneous and distributed database system, the method chosen to perform these tasks depends on the degree of centralization.

When authorizing users, there can be an advantage in granting permissions from a centralized system. Thomas et al. [12] bring up the example of giving a user access to the average salary of the employees at a company while denying access to the salary of individual employees, where the salaries and the employees are stored in different databases in a distributed system. If the authorization of users is centralized, it is straightforward to create a database view that captures the queries and operations needed to obtain the average salary. The user can then be granted access to this view by the centralized authorization system. With a decentralized authorization system, there is no way to grant the user access to the average salary without also giving them access to the salaries of individual employees.

If the database management system of the distributed database system supports semantic integrity rules for the stored data, these can be handled either centrally, through a global schema, or locally at every database. The global approach has a significant advantage in this case, since it makes it possible to add constraints that depend on data stored in different databases. [12]

3.4 ODBC, JDBC, and OLE DB

There are many different DBMSs, and they often use different interfaces. Open Database Connectivity (ODBC) aims to provide a uniform interface to the different DBMSs using SQL, acting as a kind of middleware.
There are three ways in which ODBC tries to standardize this:
- Uniform communication middleware: connections to different DBMSs can be handled in the same way.
- Datatype standard: a standard for which datatypes can be used and how they are mapped to the target DBMS.
- Application programming interface (API): used to execute queries with the same SQL syntax regardless of the underlying target DBMS.

To be able to use ODBC with a specific DBMS, a driver for that DBMS is required. This driver is often provided by the vendor of the DBMS. [13]

For the Java environment, there is Java Database Connectivity (JDBC), originally developed by Sun Microsystems and now maintained by Oracle. It aims to solve the same problem as ODBC and includes the same features. There are bridges that connect ODBC to JDBC and vice versa, which means that one can almost always be used instead of the other. [13]

Object Linking and Embedding, Database (OLE DB) is, like ODBC, developed by Microsoft. The main difference is that while ODBC works with most DBMSs, OLE DB also supports connecting to many other kinds of sources, such as files on disk and other non-DBMS sources. OLE DB can also connect to regular DBMSs by using ODBC as middleware. [14]
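The idea of a uniform interface can be illustrated with a small sketch. It does not use ODBC itself. It uses Python's DB-API 2.0, which plays a comparable role in the Python environment, with two in-memory SQLite databases standing in for two different DBMSs. The table name salaries and the helper functions average_salary and make_demo_db are hypothetical and chosen only for illustration. With an ODBC driver installed, the connection could instead be obtained from a package such as pyodbc, while the querying function would remain unchanged.

```python
import sqlite3


def average_salary(conn):
    """Run one SQL statement through any DB-API 2.0 connection."""
    cur = conn.cursor()
    cur.execute("SELECT AVG(salary) FROM salaries")
    (avg,) = cur.fetchone()
    cur.close()
    return avg


def make_demo_db(salaries):
    """Create an in-memory SQLite database standing in for one DBMS."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE salaries (employee_id INTEGER, salary REAL)")
    # enumerate() yields (employee_id, salary) pairs for the two placeholders.
    conn.executemany("INSERT INTO salaries VALUES (?, ?)", enumerate(salaries))
    conn.commit()
    return conn


if __name__ == "__main__":
    # The same function is applied to two independent connections.
    for db in (make_demo_db([100.0, 200.0, 300.0]), make_demo_db([50.0, 150.0])):
        print(average_salary(db))
```

The point of the sketch is that average_salary only depends on the connection interface, not on which database sits behind it. Details such as the parameter placeholder style can still differ between drivers, so the uniformity is not complete in practice.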

3.5 Microsoft Linked Servers

Linked servers is a technology developed by Microsoft and shipped with Microsoft's SQL Server. It makes it possible to connect to and access data from other data sources. Linked servers use ODBC and OLE DB as middleware for this connection and can therefore access data from a variety of sources. [15]

Using linked servers, it is possible to perform distributed queries, that is, to query data from many heterogeneous sources in a single query. As mentioned in Section 3.3, a distributed query raises several problems that a local one does not. Microsoft addresses this with what they call distributed transactions, which try to emulate the transparency of a query on a local database. Distributed transactions enforce the ACID properties by means of the two-phase commit protocol, at the cost of some overhead. [16]

This technology seems to be a good fit for the work in this thesis, but unfortunately it is Windows-exclusive software and therefore cannot be used in a Linux environment, which is a requirement for this thesis. [15]

3.6 Distributed Query Engines

A query engine is a component that can query data from a DBMS or some other data store. The query engine may also support distributed queries that join data over multiple heterogeneous sources. The query engine can itself be distributed, a distributed query engine, meaning that it can scale and run on multiple machines at the same time to enhance performance. [17]

PrestoDB

PrestoDB brands itself as a distributed query engine [18]. It is mainly designed for big data analytics and is able to query multiple heterogeneous databases using SQL. It does this by providing connectors to many of the most common DBMSs.
[19] Among the supported connectors relevant to this project are:
- MySQL
- PostgreSQL
- Redis
and many more. [18]

These connectors provide the PrestoDB core with the DBMS-specific information it needs. PrestoDB is built in a distributed manner to enhance performance when analyzing large quantities of data. It is designed around one coordinator server, which is the central component of PrestoDB. A query executed in PrestoDB first reaches the coordinator. The coordinator divides the query into tasks, which it then schedules on the workers. The workers handle the reading or writing of data in the databases using the connectors. [18]

PrestoDB was originally developed by Facebook to query the internal data of the company from many different sources at once. Today PrestoDB is an open source project licensed under the Apache license. [18] PrestoDB is backed and used by a lot of big
companies. One example is Airbnb, which developed a web user interface for PrestoDB named Airpal. [20]

PrestoDB is designed for analytics and can be considered an OLAP system. This means that it has limitations regarding which data-modifying SQL statements are supported. The supported SQL statements depend on the connector used but are mainly the same for all RDBMSs. For the MySQL connector, the INSERT statement is supported, but neither the DELETE nor the UPDATE statement is. This limitation could cause problems if the system is to be used to update data. [18]

Listing 3.1 illustrates a simple distributed analytics query. In this example, two databases are queried using a single query. The databases are called mysql2 and postgresql. As the names suggest, mysql2 is a MySQL DBMS while postgresql is a PostgreSQL DBMS. This means that the listing is not only an example of a distributed query but also an example of querying different, heterogeneous DBMSs. The test and public identifiers reference databases (schemas in PrestoDB's terminology), and neighbors references a table. The query in Listing 3.1 finds rows where field4 differs between the two databases.

Listing 3.1: PrestoDB distributed query example
SELECT mysql.cellappid, post.cellappid, mysql.field4, post.field4
FROM postgresql.public.neighbors as post, mysql2.test.neighbors as mysql
WHERE mysql.field4 != post.field4
AND mysql.cellappid = post.cellappid
AND mysql.neighborid = post.neighborid;

Apache Drill

Apache Drill is, like PrestoDB, a distributed query engine. Drill has its origins in the Google project Dremel; the techniques from Dremel then became an Apache project under the name Drill. [17] Drill supports a variety of data sources, including both RDBMSs and NoSQL systems.
It supports full ANSI SQL. [17] It is built using storage plugins that handle connections to specific data stores. Drill is written in Java and uses JDBC to connect to RDBMSs. By using JDBC, Drill should be able to connect to most sources that support the interface. Actively supported and tested data sources relevant to this project include:
- MySQL
- PostgreSQL

Like PrestoDB, Drill is mainly designed for data analytics and therefore does not contain all the functionality for modifying and updating data that is available in ANSI SQL. Operations like INSERT, UPDATE, and DELETE are not yet supported. [21]

Listing 3.2 illustrates a distributed query using Drill. The query does the same kind of analytics as the PrestoDB example in Listing 3.1. The queries for PrestoDB and Drill look similar, but there are some small differences. One difference is that in Drill the JOIN must be stated explicitly, while in the PrestoDB example this was not
needed since it is implicit. Drill also did not support the != operator, as PrestoDB did. Other than this, both queries gave the same result.

Listing 3.2: Drill distributed query example
SELECT mysql.cellappid, post.cellappid, mysql.field4, post.field4
FROM mysql2.test.neighbors as mysql
JOIN postgresql.public.neighbors as post
ON mysql.cellappid = post.cellappid
AND mysql.neighborid = post.neighborid
AND NOT (mysql.field4 = post.field4);

3.7 Multistore and Polystore Systems

In the book Data Management and Analytics for Medicine and Healthcare by E. Begoli, F. Wang, and G. Luo [22], both multistore and polystore systems are defined as systems that combine heterogeneous DBMSs behind a single uniform interface or language, such as SQL. Multistore and polystore systems are, like the distributed query engines, a way to query data from multiple heterogeneous data sources in a single query. [23]

CloudMdsQL and CoherentPaaS

CloudMdsQL is a data query language whose syntax is comparable to SQL but contains some additional features. It is developed within CoherentPaaS, a project funded by the EU that aims to solve the problem of querying heterogeneous data sources. [24] It takes a different approach from the systems previously described. Instead of providing a uniform syntax for all data stores, it enables the user to write native queries for each database. It also provides a way of combining these native queries, using an SQL-like syntax, into larger distributed queries that stretch over many data sources. [23] By using this technique, CloudMdsQL can query a wide range of data sources, including relational database systems, NoSQL systems, and even graph-based databases.
Out of the box, CloudMdsQL supports one of each to prove the concept: Derby as a relational database, MongoDB as a NoSQL database, and Sparksee as a graph database. With some work, other databases can be added and used with the system. [23]

CloudMdsQL should at this time be seen as an interesting research project on the subject, but it is not yet a system that should be used in production, according to V. Giannakouris et al. [19]. Even though the language itself is well documented, it is not well documented how to run the project or which parts need to be extended to make it work with databases other than those provided with the project.

BigDAWG

Much like CloudMdsQL, BigDAWG is mainly a research proof of concept and is not yet production ready. [19] BigDAWG stands for Big Data Analytics Working Group and was originally developed by the Intel Science and Technology Center for Big Data (ISTC). BigDAWG is built around an idea expressed by M. Stonebraker and U. Cetintemel [25] as:
"No one size fits all"

This idea claims that there is no single DBMS that can be used in all circumstances. Instead, DBMSs need to be chosen based on the domain of the data. BigDAWG is able to query data from multiple sources and different types of DBMS. Like CloudMdsQL, BigDAWG comes with support for a number of different DBMSs in order to prove the concept. It also contains a sample dataset, consisting of publicly available intensive-care records from the Beth Israel Deaconess Medical Center (the MIMIC II dataset), that can be used to try out the different functionalities. [26]

3.8 Data Warehousing

Data warehousing takes a somewhat different approach from the ones mentioned previously. In data warehousing, all the data from the different sources is collected and unified into a single data store. This is mainly used in big corporations with large collections of data. The data may be spread out over different locations and even stored in heterogeneous databases. The idea of a data warehouse is to collect all this information in order to run analytics on it. [27]

According to "The data warehouse toolkit: The complete guide to dimensional modeling", a book written by R. Kimball and M. Ross [27], there are several components relevant to a data warehouse system:
- Source systems: These are the original data sources. This is the part of the system that faces the users or applications. These systems are often of OLTP type and need to be able to resolve simple queries fast. They often contain only the current state of the system and little to no historical data.
- Staging area: This is the stage where data extracted from the source systems is transformed to fit the database schema of the data warehouse.
- Presentation area: In this step, the data is structured in a suitable way and stored in the data warehouse system. From here it can be analyzed using different queries and other tools.
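The three components listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular data warehouse product works: in-memory SQLite databases stand in for the source systems and for the warehouse, and all table names, column names, and functions (orders, sales, fact_sales, make_source, stage, load, build_demo_warehouse) are hypothetical.

```python
import sqlite3


def make_source(ddl, insert_sql, rows):
    """Create an in-memory database standing in for one OLTP source system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    conn.commit()
    return conn


def stage(source_a, source_b):
    """Staging area: extract rows and transform them to the warehouse schema."""
    staged = []
    # Source A (assumed) stores amounts in whole currency units.
    for customer, amount in source_a.execute("SELECT customer, amount FROM orders"):
        staged.append(("a", customer.strip().lower(), float(amount)))
    # Source B (assumed) stores amounts in cents under different column names.
    for client, cents in source_b.execute("SELECT client_name, total_cents FROM sales"):
        staged.append(("b", client.strip().lower(), cents / 100.0))
    return staged


def load(staged):
    """Presentation area: store the unified rows where they can be analyzed."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute(
        "CREATE TABLE fact_sales (source TEXT, customer TEXT, amount REAL)"
    )
    warehouse.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", staged)
    warehouse.commit()
    return warehouse


def build_demo_warehouse():
    """Wire two heterogeneous demo sources through staging into the warehouse."""
    source_a = make_source(
        "CREATE TABLE orders (customer TEXT, amount INTEGER)",
        "INSERT INTO orders VALUES (?, ?)",
        [("Alice", 10), ("Bob", 20)],
    )
    source_b = make_source(
        "CREATE TABLE sales (client_name TEXT, total_cents INTEGER)",
        "INSERT INTO sales VALUES (?, ?)",
        [(" alice ", 550), ("Carol", 1200)],
    )
    return load(stage(source_a, source_b))
```

Once loaded, analytics run against the warehouse alone, for example an aggregate such as SELECT customer, SUM(amount) FROM fact_sales GROUP BY customer. The sketch also makes the cost of the approach visible: every row is copied out of its source system before it can be queried.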
Data warehousing does not fit directly into the work done in this thesis, since it requires data to be copied into the data warehouse before it can be queried. This project is looking for solutions where the data can be queried in place, without being copied first.
