
CHOOSING A DATABASE-AS-A-SERVICE
AN OVERVIEW OF OFFERINGS BY MAJOR PUBLIC CLOUD SERVICE PROVIDERS

Warner Chaves, Principal Consultant, Microsoft Certified Master, Microsoft MVP

With contributors:
Danil Zburivsky, Director of Big Data and Data Science
Vladimir Stoyak, Principal Consultant for Big Data, Certified Google Cloud Platform Qualified Developer
Derek Downey, Practice Advocate, Open Source Databases
Manoj Kukreja, Big Data and IT Security Specialist, CISSP, CCAH and OCP

When it comes to running your data in the public cloud, there is a range of Database-as-a-Service (DBaaS) offerings from all three major public cloud providers, and knowing which is best for your use case can be challenging. This paper provides a high-level overview of the main DBaaS offerings from Amazon, Microsoft, and Google. After reading this white paper, you'll have a high-level understanding of the most popular data repository and data analytics offerings from each vendor, you'll know the key differences among them, and you'll know which ones are best suited to each use case. With this information, you can direct your more detailed research to a manageable number of options.

This white paper does not discuss private cloud providers or colocation environments, streaming, data orchestration, or Infrastructure-as-a-Service (IaaS) offerings. It is targeted at IT professionals with a good understanding of databases, as well as business people who want an overview of data platforms in the cloud.

WHAT IS A DBAAS OFFERING?

A DBaaS is a database running in the public cloud. Three things define a DBaaS:

- The service provider installs and maintains the database software, including backups and other common database administration tasks. The service provider also owns and manages the operating system, hypervisors, and bare-metal hardware.
- Application owners pay according to their usage of the service.
- Usage of the service must be flexible: users can scale up or down on demand and can create and destroy environments on demand. These operations should be possible through code, with no provider intervention.

FOUR CATEGORIES OF DBAAS OFFERINGS

To keep things simple, we've created four categories of DBaaS offerings. Your vehicles of choice are:

- The Corollas: the classic RDBMS services in the cloud: Amazon Relational Database Service (RDS), Microsoft Azure SQL Database, and Google Cloud SQL.
- The Formula One offerings: special-purpose offerings that ingest and query data very quickly but might not offer all the amenities of the Corollas. Options include Amazon DynamoDB, Microsoft Azure DocumentDB, Google Cloud Datastore, and Google Cloud Bigtable.
- The 18-wheelers: data warehouses for structured data in the cloud, including Amazon Redshift, Microsoft Azure SQL Data Warehouse, and Google BigQuery.
- The container ships: Hadoop-based big-data systems that can carry anything, including Amazon Elastic MapReduce (EMR), Microsoft Azure HDInsight, and Google Cloud Dataproc. This category also includes the further automated offering of Azure Data Lake.

The rest of this white paper discusses each category and the Amazon, Microsoft, and Google offerings within each category. We describe each offering, explain what it is well suited for, provide expert tips or additional relevant information, and give high-level pricing information.

COROLLAS

With the Corollas, just like with the car, you know what you're getting and you know what to expect. This type of classic RDBMS service gets you from point A to point B reliably. It's not the flashiest or newest thing on the block, but it gets the job done.

AMAZON RDS

Amazon Relational Database Service (RDS) is the granddaddy of DBaaS offerings available on the Internet. RDS is an automation layer that Amazon has built on top of MySQL, MariaDB, Oracle, PostgreSQL, and SQL Server. Amazon has also developed its own MySQL fork called Amazon Aurora, which also lives inside RDS.

RDS is an easy way to transition into DBaaS because the service mimics the on-premises experience very closely. You simply provision an RDS instance, which maps very closely to the virtual machine models that Amazon offers. Amazon then installs the bits, manages patches and backups, and can also manage high availability, so you do not need to plan and execute these tasks yourself.

RDS is very good for lift-and-shift cloud migrations. It makes it easy for existing staff to take advantage of the service because it mimics the on-premises experience, be it physical or virtual.

The storage is very flexible, which is both a pro and a con. The pro is that you have a lot of control over storage. The con is that there are so many storage options that you need the knowledge to choose the best one for your use case. Amazon has general storage, provisioned IOPS (input/output operations per second) storage, and two categories of magnetic storage. The storage method you choose will depend on your particular use cases.

Be aware that Amazon does not make every patch version of all products available on RDS. Instead, Amazon makes only some major service packs or Oracle patch levels available. As a result, the exact patch level that you have on premises might not map to a patch level on RDS. In this situation, do not move to a patch level that is below the one you have, because that may result in product regressions. Instead, wait until Amazon has deployed a patch level higher than what you have. At that point, it should be fairly safe to start testing if you want to migrate to RDS.

The hourly rate for RDS depends on:

- whether you bring your own license or Amazon is leasing you the license;
- how much compute power you choose: the number of cores, and the amount of memory and temporary disk you want on the instance;
- the storage you require; and
- whether you pre-purchased with Reserved Instances.
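Because everything in RDS is exposed through APIs, provisioning an instance takes only a small amount of code. The following is a minimal sketch using the boto3 Python SDK; the instance identifier, class, credentials, and storage settings are illustrative assumptions, not recommendations.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Ask RDS to build a Multi-AZ MySQL instance; Amazon handles installation,
# patching, backups, and failover.
rds.create_db_instance(
    DBInstanceIdentifier="demo-mysql",        # hypothetical name
    Engine="mysql",
    DBInstanceClass="db.m4.large",            # compute: cores and memory
    AllocatedStorage=100,                     # GB of storage
    StorageType="gp2",                        # one of several storage options
    MasterUsername="admin",
    MasterUserPassword="change-me",           # placeholder credential
    MultiAZ=True,
    BackupRetentionPeriod=7,
)

# Block until the instance is ready to accept connections.
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="demo-mysql")
```

Tearing the instance down again is a single delete_db_instance call, which is what makes the create-and-destroy-on-demand model practical.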

MICROSOFT AZURE SQL DATABASE

Microsoft Azure SQL Database is a cloud-first SQL Server fork. The term cloud-first means that Microsoft now tests and deploys its code continuously with Azure SQL Database, and the code and lessons learned are then implemented in the retail SQL Server product, whether that product runs on premises or on a virtual machine.

Even if you don't have any investment in SQL Server, Azure SQL Database is an excellent DBaaS platform because of the investments made to support its elastic capabilities and the ease of scaling horizontally. As you need more capacity, you just add more databases. It's also easy to manage the databases by pooling resources, performing elastic queries, and performing elastic job executions. You could deploy your own code to do something similar in Amazon RDS, but in Azure SQL Database, Microsoft has already built it for you.

In addition, Azure SQL Database makes it easy to build an elastic application on a relational service. This capability supports the Software-as-a-Service (SaaS) model, wherein you have many clients and each has a database. The SaaS provider gets a data layer that is easier to manage and scale than if they were running on their own infrastructure.

Unlike Amazon RDS, Azure SQL Database does not map exactly to a retail database product, such as Oracle, SQL Server, or open-source MySQL. It is closely related to SQL Server, but it's not licensed or sold in a similar way. As a result, Azure SQL Database does not have any licensing component.

At the same time, Azure SQL Database does not give you much control over the hardware. With Amazon RDS, you need to select CPUs, memory, and your storage layout; Azure SQL Database does all this for you. The only thing you need to choose is the service tier, and your choice determines how much power your database has. There are three service tiers: Basic, Standard, and Premium, each with sub-tiers to increase or decrease performance. If you have many databases in Azure SQL Database, you can also choose the elastic database pool pricing option to increase your savings by sharing resources.
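Changing a database's service tier is an online operation that can be issued in T-SQL. A minimal sketch follows, using pyodbc from Python; the server, database, and target tier names are hypothetical, and the assumption is that the statement is run while connected to the logical server's master database.

```python
import pyodbc

# Connect to the logical server's master database (placeholder server and credentials).
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=master;"
    "UID=sqladmin;PWD=change-me",
    autocommit=True,  # ALTER DATABASE cannot run inside a user transaction
)

# Move the database to the Standard S3 tier; billing follows the new tier from this point on.
conn.cursor().execute(
    "ALTER DATABASE [appdb] MODIFY (EDITION = 'Standard', SERVICE_OBJECTIVE = 'S3');"
)
```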

Azure SQL Database is a good choice if you already have Transact-SQL (T-SQL) skills in-house. If you have a large investment in SQL Server, Azure SQL Database is the most natural way to take advantage of DBaaS offerings in the cloud. It's also a very good web-scale relational service in its own right because of all the investments made to support the SaaS model.

You do need to ensure that you do proper SQL tuning so you can choose the right service tier for your needs. In the past, it was more difficult to scale up because all equipment was on premises. Now, it's very easy to increase the power of the service and therefore pay more money. However, just because scaling up is easy does not mean it's always what you need to do. If you perform the proper SQL tuning, you will not need to pay more for raw power.

Azure SQL Database has a simple pricing model. You pay an hourly rate for the service tier your database is running on: Basic, Standard, or Premium. Each tier has a different size limit for the database and provides more performance as you go up.

GOOGLE CLOUD SQL

Google Cloud SQL is a managed MySQL database service that is very similar to Amazon RDS for MySQL and Amazon Aurora. You select an instance and deploy it without needing to install any software. Cloud SQL automates all your backups, replication, patches, and updates anywhere in the world while ensuring greater than 99.95 percent availability. Automatic failover ensures your database will be available when you need it. Cloud SQL Second Generation introduces per-minute, pay-per-use billing, automatic sustained-use discounts, and instance tiers to fit any budget.

Cloud SQL does have restrictions on:

- anything related to loading or dumping the database to a local file system;
- installing plugins;
- creating user-defined functions;
- the performance schema;
- SUPER privileges; and
- storage engines: InnoDB is the only engine supported for Second Generation instances.
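Because Cloud SQL is standard MySQL under the hood, existing drivers and tools work unchanged. Here is a minimal sketch connecting with the PyMySQL driver; the IP address, credentials, and database name are placeholders (in practice you would whitelist your client or use the Cloud SQL proxy).

```python
import pymysql

# Connect to the Cloud SQL instance exactly as you would to any MySQL server.
conn = pymysql.connect(
    host="203.0.113.10",      # placeholder instance IP
    user="app_user",
    password="change-me",
    database="appdb",
)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT VERSION()")   # prints the MySQL version string of the instance
        print(cur.fetchone())
finally:
    conn.close()
```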

Pricing for Cloud SQL Second Generation is made up of three components: instance pricing, storage pricing, and network pricing. The instance charge is based on the machine type you choose; storage and network are billed separately.

FORMULA ONE OFFERINGS

The Formula One DBaaS offerings are fit-for-purpose offerings. They do not have all the functionality of the mature RDBMS products, but they do a limited number of things very well. A Formula One car is built purely for speed. It does not have a cup holder, heated seats, or satellite radio. However, it's fit for purpose, and that purpose is to go fast. (Admittedly, you might miss some of the amenities that you are used to in a regular car.)

Similarly, the Formula One DBaaS offerings are built for a purpose, and that purpose is to ingest and query data very quickly. Think of them as NoSQL in the cloud. The NoSQL movement was popularized by large web companies such as Google and Facebook as a way to differentiate their database platforms from the classic RDBMS offerings. NoSQL products usually handle horizontal scalability with more ease, have more relaxed restrictions on schema (if any), and forego some of the ACID requirements as a trade-off for more speed.

AMAZON DYNAMODB

Amazon DynamoDB is a very popular service offered through Amazon Web Services (AWS). It's essentially a NoSQL document/key-value table store. All you need to define is the table and either its hash key or its hash key plus a range (sort) key. The schema is otherwise completely flexible and up to you. DynamoDB is best suited for applications with known query patterns that don't require complex transactions and that ingest large volumes of data.

DynamoDB is built for scale-out growth of high-ingest applications because the Amazon scale-out architecture guarantees that you will not run out of space. You don't need to worry about the scale-out; you just need to know that this is how Amazon has architected the service. For example, when you specify a partition key for records, all records with the same key are placed on the same nodes that Amazon builds transparently behind the scenes for your data.

This offering does not have an optimizer, so it does not support ad hoc SQL querying the way a relational product does. It's more like a set of denormalized, materialized views based on the indexes that you have created on your data. Querying is not done with SQL; it is performed through a different type of specification. Amazon provides SDKs in many languages, including Java, .NET, and Python, and you use these SDKs to develop queries.
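To make the key/sort-key model and SDK-based querying concrete, here is a small sketch with the boto3 Python SDK. The table name, attributes, and throughput values are illustrative assumptions only.

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

# Define only the table and its keys; every other attribute is schema-free.
table = dynamodb.create_table(
    TableName="SensorReadings",                                   # hypothetical table
    KeySchema=[
        {"AttributeName": "device_id", "KeyType": "HASH"},        # partition (hash) key
        {"AttributeName": "ts", "KeyType": "RANGE"},              # sort (range) key
    ],
    AttributeDefinitions=[
        {"AttributeName": "device_id", "AttributeType": "S"},
        {"AttributeName": "ts", "AttributeType": "N"},
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
table.wait_until_exists()

# Ingest an item; nested lists and maps are allowed because there is no fixed schema.
table.put_item(Item={"device_id": "sensor-42", "ts": 1472688000,
                     "temp_c": 21, "tags": ["lab", "floor-2"]})

# Query by key-condition expression rather than SQL.
resp = table.query(KeyConditionExpression=Key("device_id").eq("sensor-42"))
print(resp["Items"])
```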

Developing queries this way does require a bit of learning, but it's not a major time investment.

Although DynamoDB does not have a fixed schema, it does support complex schemas. For example, fields are denormalized: some fields can be lists, others can be maps or sets. The service also exposes a stream-based API, so if you need to replicate data changes from DynamoDB to another system, you can do so through that API.

Because this service does not support ad hoc querying, your schema can have a huge impact on what your application is allowed to do. DynamoDB also allows only a finite number of indexes: five global secondary indexes per table. Keep the indexing limits and the lack of an optimizer in mind, and ensure that your schema will be able to support your future application requirements.

The cost of DynamoDB is based on storage (how much data you have) and the I/O rate (your number of requests for read units and write units). If you use streams, you also pay for the stream read rate.

MICROSOFT AZURE DOCUMENTDB

Microsoft Azure DocumentDB is a NoSQL document database that is basically a repository for JSON (JavaScript Object Notation) documents. JSON documents have no schema restrictions: they can contain almost any type of field, and they can also have nested fields. This DBaaS is NoSQL and denormalized, with built-in support for partitioned collections, so you can specify a field in the JSON documents and Azure DocumentDB will partition the documents based on that field.

Azure DocumentDB also has built-in geo-replication support, so you can have, for example, an Azure DocumentDB collection reading and writing on the east coast of the United States and a replica of that collection serving reads in the central United States. If there's an issue with the collection on the east coast, you can fail over to the other geo-region for very high availability.

Azure DocumentDB is a good choice for JSON-based storage, and it's very easy to set up and start storing documents. Retrieval is also easy because the database supports full-blown SQL-style queries, so you don't need to learn a new query language.

If you don't specify any indexes, the system applies automatic indexing policies. However, keep in mind that indexing consumes storage: the more indexes you have, the more storage you consume and pay for.
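As an illustration of the JSON-document model and SQL-style retrieval, here is a minimal sketch assuming the pydocumentdb Python SDK that was current for DocumentDB; the account URI, key, database, and collection names are placeholders.

```python
import pydocumentdb.document_client as document_client

# Placeholder account endpoint and key.
client = document_client.DocumentClient(
    "https://my-account.documents.azure.com:443/",
    {"masterKey": "REDACTED"},
)
collection_link = "dbs/ordersdb/colls/orders"   # hypothetical database/collection

# Store a schema-free JSON document, including nested fields.
client.CreateDocument(collection_link, {
    "id": "order-1001",
    "customer": "acme",
    "items": [{"sku": "A-17", "qty": 2}, {"sku": "B-03", "qty": 1}],
})

# Query it back with familiar SQL-style syntax.
query = {"query": "SELECT c.id, c.items FROM c WHERE c.customer = 'acme'"}
for doc in client.QueryDocuments(collection_link, query):
    print(doc)
```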

The storage could be going to indexes that you never use, so make sure the automatic indexing policies work for your use case. If it doesn't make sense to index a field because you never search on it, you can disable that index through a custom policy. Also be aware that partitioned collections have limits: each partition key can hold no more than 10 GB of documents. If you need more than this amount per partition key, you will probably want to design with a very high-granularity partition key.

Azure DocumentDB offers some pre-defined tiers for billing based on common usage patterns. However, if you want to customize the system, you can easily select your individual compute power, referred to as request units, plus the amount of storage that you want for the collection.

GOOGLE CLOUD DATASTORE

Google Cloud Datastore is Google's version of a NoSQL cloud service, similar to Amazon DynamoDB and Microsoft Azure DocumentDB. From an architecture perspective, Cloud Datastore is similar to other key/value stores. The data model is organized around entities, which loosely resemble rows in a relational table. Entities can have multiple properties, but no rigid schema is imposed on them: two entities of the same kind don't need to have the same number or type of properties. An interesting feature of Cloud Datastore is built-in support for hierarchical data.

In addition to all the properties you would expect from a cloud NoSQL DBaaS, such as massive scalability, high availability, and flexible storage, Cloud Datastore also supports some unique capabilities, including out-of-the-box transaction support and encryption at rest.

Google also provides tight integration of Cloud Datastore with other Google Cloud Platform services. Applications running in Google App Engine can use Cloud Datastore as their default database, and you can load data from Cloud Datastore into Google BigQuery for analytics purposes.

There are multiple ways to access data in Cloud Datastore. There are client libraries for most popular programming languages as well as a REST interface. Google also provides GQL, a query language roughly modelled on SQL that can ease the transition from relational databases to the NoSQL world.
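A small sketch of the entity model using the google-cloud-datastore Python client library follows; the project ID, kind, and properties are illustrative assumptions.

```python
from google.cloud import datastore

client = datastore.Client(project="my-project")   # placeholder project ID

# Entities loosely resemble rows, but carry no rigid schema.
key = client.key("Task", "sample-task")            # hypothetical kind and name
task = datastore.Entity(key=key)
task.update({"description": "load nightly export", "done": False, "priority": 4})
client.put(task)

# Fetching an entity by key is the cheapest, most common operation.
print(client.get(key))

# Simple single-property queries work without extra configuration.
query = client.query(kind="Task")
query.add_filter("done", "=", False)
for entity in query.fetch():
    print(entity["description"], entity["priority"])
```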

Cloud Datastore automatically indexes every property of an entity, making simple single-property queries possible without any additional configuration. More complex multi-property indexes can be created by defining them in a special configuration file.

Similar to other cloud NoSQL services, Cloud Datastore is priced according to the amount of storage the database requires and the number of operations it performs. Google defines prices for reads, writes, and deletes per 100,000 entities. However, simple requests such as fetching an entity by its key (a very common operation) are free.

GOOGLE CLOUD BIGTABLE

Google Cloud Bigtable is Google's NoSQL big-data database service. It's the cloud version of the same database that powers many core Google services, including Search, Analytics, Maps, and Gmail. Bigtable is designed to handle massive workloads at consistent low latency and high throughput, so it's a great choice for both operational and analytical applications, including Internet of Things (IoT) use cases, user analytics, and financial data analysis.

This public cloud service gives you instant access to all the engineering effort that has gone into Bigtable at Google over the years. The Apache HBase-like database is flexible and robust, and lacks some of HBase's inherited issues, such as Java garbage collection (GC) stalls. In addition, Cloud Bigtable is completely managed, so you don't need to provision hardware, install software, or handle failures.

Cloud Bigtable does not have strong typing; it's basically a massive key/value table. As data comes in, it is treated as binary strings. This DBaaS also does not support querying through SQL: you have the key, and with it you can get the value. Cloud Bigtable is also built for very large tables, so it's not worth considering for anything less than about 1 terabyte of data.

Pricing for Cloud Bigtable is based on:

- the number of Cloud Bigtable nodes that you provision in your project for each hour (you are billed for a minimum of one hour);
- the amount of storage that your tables use over a one-month period; and
- the amount of network bandwidth used (some types of network egress traffic are subject to bandwidth charges).
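A minimal write/read sketch with the google-cloud-bigtable Python client follows; the instance, table, and column-family names are hypothetical, and values are written and read back as raw bytes, matching the untyped key/value model described above.

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=True)   # placeholder project
instance = client.instance("iot-instance")                    # hypothetical instance
table = instance.table("sensor-readings")                     # hypothetical table

# Everything is keyed bytes: design the row key so related readings sort together.
row_key = b"sensor-42#1472688000"
row = table.row(row_key)
row.set_cell("metrics", b"temp_c", b"21.5")   # column family "metrics"; value is a binary string
row.commit()

# Retrieval is by key; there is no SQL layer.
fetched = table.read_row(row_key)
print(fetched.cells["metrics"][b"temp_c"][0].value)
```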

18-WHEELERS

The 18-wheelers can handle the heavy load of structured data. These are basically data warehouses in the cloud: they store and easily query large amounts of structured data.

AMAZON REDSHIFT

Amazon Redshift is the granddaddy of the 18-wheeler DBaaS offerings. It is Amazon's modified PostgreSQL with columnar storage. Other columnar storage offerings include HPE Vertica, Microsoft SQL Server Parallel Data Warehouse (PDW) and SQL Server columnstore indexes, Oracle Exadata Database Machine (Exadata), and Oracle Database In-Memory. All these technologies achieve excellent compression ratios through columnar storage: instead of storing the data by rows, they store it by columns, which makes scans of the data very fast.

Redshift is a relational massively parallel processing (MPP) data warehouse, so there are multiple nodes rather than just one big machine. The service works with SQL queries and also allows you to write your own modules in Python. Because Redshift is scaled per node, if you need more power you add another node. This means you select compute and storage together, and the service is charged per node, per hour.

Redshift gives you a lot of control over specific node configurations, so you can choose how many cores and how much memory the nodes have. You can also decide whether to pay more for the fastest storage on the nodes through solid-state drives (SSDs), or save some money by using hard-drive-based storage attached to the nodes instead.

Redshift is a very good warehousing solution for all your data. If you have a big footprint on AWS, Redshift is definitely the warehousing solution that you want.

With Redshift, you do need to watch node count and configurations. The ideal configuration of your Redshift cluster depends on your workload and its patterns, so you need to decide whether it's better to have fewer nodes with really high specs or more nodes with less compute or memory. Based on that analysis, you then need to properly tune Redshift for your workload and warehouse design.

Also be aware of possible copy issues due to Amazon Simple Storage Service (S3) consistency. Amazon recommends that you use manifest files to specify exactly what you want to load, so that you are not in a situation where you simply read the file names off S3 and, because of eventual consistency, miss a file.
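The manifest-based load looks like the following sketch, which connects with psycopg2 (Redshift speaks the PostgreSQL protocol) and runs a COPY. The cluster endpoint, table, bucket, and IAM role are placeholders.

```python
import psycopg2

# Placeholder cluster endpoint and credentials; Redshift uses the PostgreSQL wire protocol.
conn = psycopg2.connect(
    host="demo-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="master", password="change-me",
)
conn.autocommit = True   # VACUUM cannot run inside a transaction block

with conn.cursor() as cur:
    # The manifest file lists every S3 object to load, so an eventually
    # consistent bucket listing cannot silently drop a file.
    cur.execute("""
        COPY sales
        FROM 's3://my-bucket/loads/sales.manifest'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        MANIFEST
        CSV;
    """)
    # Routine maintenance after updates/deletes keeps scans fast and statistics fresh.
    cur.execute("VACUUM sales;")
    cur.execute("ANALYZE sales;")
```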

Finally, Redshift does require regular maintenance to keep statistics and tables up to date. If you do any updates or deletes, the service provides an operation called VACUUM to keep the tables optimally organized for fast retrieval.

Redshift is billed by the hour per node. The cost of each node depends on the configuration of cores, amount of memory, and type of storage you select.

MICROSOFT AZURE SQL DATA WAREHOUSE

Microsoft Azure SQL Data Warehouse is Microsoft's response to Redshift. It's fully relational, with 100 percent SQL-style queries, and highly compatible with T-SQL for SQL Server. If you have SQL Server investments, it is very easy to adopt SQL Data Warehouse. Like Redshift, storage is columnar and the service is MPP: data is split into storage distributions when you load it, and the architecture is distributed, so a query is sent to all the different nodes to help resolve your questions.

Azure SQL Data Warehouse scales compute and storage independently. Unlike Redshift, where you always scale by a full node, Azure SQL Data Warehouse allows you to add just more compute if that is all you need, or to add more storage while keeping the same amount of compute. A very powerful capability is that you can pause compute completely. For example, if you don't have much load on your data warehouse during the weekend, you can decide to shut it down and pause it completely for maximum savings.

Azure SQL Data Warehouse is an excellent enterprise warehousing solution, particularly if you already have a lot of data on Azure services. If you have a pause-friendly workload, this service will provide very good savings.

Unlike Redshift, which gives you a lot of control over the configuration of the nodes, Azure SQL Data Warehouse gives you no control over hardware. It's 100 percent Platform-as-a-Service (PaaS): you simply select a compute unit, called a Data Warehousing Unit (DWU), and the number of DWUs gives you an idea of the power of the data warehouse.
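Scaling the DWU level up or down is a single T-SQL statement; a sketch using pyodbc follows, with a hypothetical server, database, and target DWU. (Pausing and resuming compute is done through the portal, PowerShell, or the REST API rather than T-SQL.)

```python
import pyodbc

# Connect to the logical server's master database (placeholder server and credentials).
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=mydwserver.database.windows.net;DATABASE=master;"
    "UID=sqladmin;PWD=change-me",
    autocommit=True,
)

# Scale the warehouse to 400 DWUs; compute is billed at the new level from here on.
conn.cursor().execute(
    "ALTER DATABASE [salesdw] MODIFY (SERVICE_OBJECTIVE = 'DW400');"
)
```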

Be aware that, at the time of publication, not all T-SQL data types are supported yet. For example, if you need to store spatial data, you can store it as binary, but you won't have full support for all the spatial functions. Before you start a full migration to Azure SQL Data Warehouse, carefully review which functionality is available. However, if your data warehouse uses only regular structured types, it's definitely wise to consider this service now.

Azure SQL Data Warehouse has two separate cost components: storage and compute. Compute is elastic and is billed by the hour based on the number of DWUs you provision.

GOOGLE BIGQUERY

Google BigQuery is a mix of an 18-wheeler and a container ship (a container ship being a big-data, Hadoop-style service). BigQuery is a hybrid because it is based on a structured schema but at the same time allows easy integration with Google Cloud Dataproc and schema-on-read over storage. The service supports regular tables with data stored inside the service as well as virtual tables where you put schema on read. The same goes for external tables: you can map BigQuery to other services inside Google, such as Google Cloud Storage, and then have those tables defined inside BigQuery for use in your analytic queries.

BigQuery is Google Cloud Platform's serverless analytics data warehouse, so you do not need to manage hardware, software, or the operating system. Google has replaced its original SQL dialect with a standards-compliant one that enables more advanced query planning and optimization. There are also new data types, additional support for timestamps, and extended JOIN support.

BigQuery also has a streaming interface, so instead of running an Extract, Transform and Load (ETL) process based mostly on fixed-schedule batch processing, you can have a streaming flow that inserts data directly into BigQuery, continuously, using an API or the Cloud Dataflow engine.

BigQuery is a very good one-stop shop if you have streaming data, relational data, and file-based data, because it can put schema on read.
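Since BigQuery bills by the data each query reads, it is worth estimating a query's scan size before running it. The sketch below uses the google-cloud-bigquery Python client to dry-run and then execute a query; the project, dataset, and table names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")   # placeholder project
sql = "SELECT channel, SUM(revenue) AS total FROM `my_dataset.sales` GROUP BY channel"

# Dry run: BigQuery reports how many bytes the query would scan, which is what you'd be billed for.
dry = client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False))
print("Estimated bytes scanned:", dry.total_bytes_processed)

# Run it for real; there is no cluster to size or start first.
for row in client.query(sql).result():
    print(row.channel, row.total)
```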

But watch out for high-compute queries, where Google estimates that a query takes too much compute to resolve at the regular per-query rate: as of August 2016, the limit is 1 terabyte, and above this limit the extra compute cost is $5 per terabyte. You might receive an error message that effectively says, "you need to run this with higher compute." The cost of that query will be higher, so you need to watch out for runaway costs.

Hadoop can be attached to BigQuery tables, but it does require a temporary data copy: the BigQuery Hadoop connector copies data temporarily to Google Cloud Storage (GCS) for Hadoop. Don't be surprised if you incur some GCS costs for this type of operation.

BigQuery is billed per storage and per query, with automatic lower pricing tiers after 90 days of data being idle. Costs are based on the amount of data you have and how much of it you read. If you are streaming data in, you pay extra for that. With BigQuery, you pay only for data read by queries. For example, if you have a 20-TB warehouse in BigQuery but you run only 1 to 10 queries per day, you pay for only those queries. You do not need to pay for provisioned compute and storage the way you do with Redshift. With Azure SQL Data Warehouse you also pay for compute, but at least you can pause it. BigQuery goes one step further by charging only for the specific queries that you run, so you don't even need to think about starting and pausing compute. You simply use compute on demand whenever you want to run a query.

CONTAINER SHIPS

The container ships are big-data systems that carry everything, in any shape or form. They are really Hadoop-as-a-Service, and this is very attractive because on-premises Hadoop deployments have a high cost to experiment: the high cost of curiosity. You need to build your Hadoop service and have enough storage and enough nodes before you can start your data exploration. If you instead do your data exploration in the cloud, you can let the cloud deploy all the power you need. If you need a very large cluster, you don't need to make any capital expenditures to get up and running, nor operational expenditures to manage the cluster. You simply create and destroy clusters as needed, and you pay for storage in the cloud. All the major cloud providers offer this type of service.

All of the container ships follow a similar pattern. You pick a machine model for your nodes, deploy the cluster at a given size (picking how many nodes you want), and then attach the ship to a storage service that it can read data from. The Amazon storage service is S3; the Microsoft Azure services are Azure Data Lake Store and Azure Blob Storage; Google uses Google Cloud Storage.
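A sketch of that create-run-destroy pattern, using the boto3 Python SDK to launch a transient Amazon EMR cluster (the service discussed next), is shown below. The release label, instance types, S3 paths, and IAM roles are illustrative assumptions.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a transient cluster that runs one Spark step and then tears itself down.
response = emr.run_job_flow(
    Name="nightly-spark-etl",                       # hypothetical cluster name
    ReleaseLabel="emr-5.0.0",                       # assumed EMR release
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 4,
        "KeepJobFlowAliveWhenNoSteps": False,       # destroy the cluster when the step finishes
    },
    Steps=[{
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"],  # job script read from S3
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])
```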

After the cluster is deployed, you use it as a Hadoop installation if you need to run MapReduce, Apache Spark, Apache Storm, or any other type of Hadoop-based service.

AMAZON ELASTIC MAPREDUCE

Amazon Elastic MapReduce (EMR) is a managed Hadoop framework that makes it easy, fast, and cost-effective to distribute and process vast amounts of your data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks such as Spark and Presto in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.

Amazon EMR releases are packaged using a system based on Apache Bigtop, an open-source project associated with the Hadoop ecosystem. In addition to Hadoop and Spark ecosystem projects, each Amazon EMR release provides components that enable cluster and resource management, interoperability with other AWS services, and additional configuration optimizations for installed software.

Amazon also provides the AWS Data Pipeline service, which allows you to automate recurring clusters by implementing an orchestration layer that automatically starts the cluster, submits the job, handles exceptions, and tears down the cluster when the job is done.

Amazon charges per hour for EMR. One way to minimize costs is to deploy some of the compute nodes on Spot Instances; this provides savings of up to 90 percent.

MICROSOFT AZURE HDINSIGHT

Microsoft Azure HDInsight is an Apache Hadoop distribution that deploys and provisions managed Hadoop clusters. The service can process unstructured or semi-structured data and has programming extensions for C#, Java, and .NET, so you can use your programming language of choice on Hadoop to create, configure, submit, and monitor Hadoop jobs.

HDInsight is tightly integrated with Excel, so you can visualize and analyze your Hadoop data in compelling new ways using a tool that's familiar to your business users. HDInsight incorporates R Server for Hadoop, a cloud implementation of one of the most popular programming languages for statistical computing and machine learning; it gives you the familiarity of R with the scalability and performance of Hadoop.

HDInsight also includes Apache HBase, a columnar NoSQL database that runs on top of the Hadoop Distributed File System (HDFS). This lets you do large-scale online transaction processing (OLTP) of non-relational data, enabling use cases such as interactive websites or having sensor data write to Azure Blob Storage.

HDInsight includes Apache Storm, an open-source stream analytics platform that can process real-time events at large scale. It also includes Apache Spark, an open-source project in the Apache ecosystem that can run large-scale data analytics applications in memory.

HDInsight is priced based on storage and the cost of the cluster. The cost of the cluster is an hourly rate per node.

GOOGLE CLOUD DATAPROC

Google Cloud Dataproc is a managed Apache Hadoop, Apache Spark, Apache Pig, and Apache Hive service that lets you use open-source data tools for batch processing, querying, streaming, and machine learning. Dataproc helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. From a networking perspective, Dataproc supports subnets, role-based access, and clusters with no public IP.

Similar to Amazon EMR, Dataproc releases are packaged using a system based on Apache Bigtop, an open-source project associated with the Hadoop ecosystem. Although some of the tools from the Hadoop ecosystem might not be enabled by default, it is very easy to add them to the deployment.

One advantage of Dataproc over EMR is how fast a cluster can be deployed: for most configurations, the time is less than 90 seconds. Also, after the first 10 minutes billing is by the minute, which makes Dataproc a great contender for building blocks of a more complex ETL pipeline.

Another advantage of Dataproc over other managed Hadoop services is its integration with Google Cloud Storage as an alternative to the Hadoop Distributed File System. This integration provides immediate consistency; by contrast, it usually takes 1 to 3 minutes before files become visible on, for example, S3. Immediate consistency in Dataproc means that the same storage can be accessed across multiple clusters in a consistent manner.

There is no global orchestration and scheduling service available from Google yet (similar to AWS Data Pipeline), so a custom Luigi, Oozie, or Airflow deployment will need to be set up and maintained.
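As a sketch of the ephemeral-cluster workflow on Dataproc, the following Python script shells out to the gcloud CLI to create a small cluster, submit a PySpark job that reads from Cloud Storage, and delete the cluster again; the cluster name, zone, machine types, and bucket path are placeholders, and exact flags may vary by gcloud version.

```python
import subprocess

def run(*args):
    # Thin wrapper that raises if a gcloud command fails.
    subprocess.run(["gcloud", "dataproc", *args], check=True)

# Create a small cluster (deploys in well under two minutes for most configurations).
run("clusters", "create", "etl-cluster",
    "--zone", "us-central1-a",
    "--num-workers", "2",
    "--worker-machine-type", "n1-standard-4")

# Submit a PySpark job whose script and input live in Cloud Storage.
run("jobs", "submit", "pyspark", "gs://my-bucket/jobs/transform.py",
    "--cluster", "etl-cluster")

# Destroy the cluster; the data stays in Cloud Storage and is billed separately.
run("clusters", "delete", "etl-cluster", "--quiet")
```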

Google is also still working on deeper integration between Dataproc and Stackdriver, Google's integrated monitoring, logging, and diagnostics tool. Integration at the job level should be available soon; in the meantime, the Dataproc user interface provides access to the required logs.

Pricing for Dataproc is based on storage and the cost of the cluster. The cost of the cluster is an hourly rate per node.

MICROSOFT AZURE DATA LAKE

Microsoft Azure Data Lake is Microsoft's one step up from Hadoop-as-a-Service. The Azure Data Lake service is separated into storage and analytics. The storage service has no limit on size, including no limit on the size of a file. The analytics service can run large data jobs on demand, very similar to how BigQuery runs queries on demand.

Because Azure Data Lake is a big-data type of repository, you can mix tables, you can mix files, and you can have external tables. Azure Data Lake does all this through the U-SQL language, which is a mix of SQL and C#. If you have DBAs in your company, or developers who know SQL and C#, it is easy for them to become productive very quickly with Azure Data Lake without needing to learn all the different pieces of the Hadoop ecosystem, such as Pig and Hive.

If you do need a full Hadoop cluster, for example to run Mahout algorithms on your data, you can attach an HDInsight cluster directly to Azure Data Lake and run from that. You also have the option of on-demand analytics through U-SQL. Analytics can be scaled dynamically to increase compute: you simply increase the number of analytics units, which are the nodes running your queries. Because analytics are performed per job, you can easily control your cost of using the service; each time you submit a job, there's a fixed cost.

Azure Data Lake is excellent for leveraging T-SQL and .NET skills to provide Platform-as-a-Service (PaaS) big-data analytics. The barrier to entry for doing big-data analytics is very low in terms of learning new skills.

Be aware that this service is still in public preview at the time of this writing. For this reason, it has a limit of 50 analytics units per job and 3 concurrent jobs per account. However, if you have a strong use case, reach out to Microsoft Support, because they can lift these restrictions.
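To give a flavor of U-SQL, here is a minimal hypothetical script that extracts a tab-separated file from the store, aggregates it with SQL-style syntax and C# typing, and writes the result back out. It is shown as a Python string for consistency with the other sketches; the file paths and columns are invented, and submitting it as a Data Lake Analytics job (via the portal, CLI, or SDK) is omitted.

```python
# A hypothetical U-SQL job: EXTRACT with C# types, SELECT/GROUP BY, OUTPUT.
USQL_SCRIPT = r"""
@searchlog =
    EXTRACT UserId int,
            Region string,
            Query  string
    FROM "/samples/data/SearchLog.tsv"
    USING Extractors.Tsv();

@by_region =
    SELECT Region, COUNT(*) AS Total
    FROM @searchlog
    GROUP BY Region;

OUTPUT @by_region
    TO "/output/queries_by_region.csv"
    USING Outputters.Csv();
"""

print(USQL_SCRIPT)  # in practice this script would be submitted as a Data Lake Analytics job
```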

Azure Data Lake has two cost components: storage and jobs. Your total storage cost depends on how much you store and the volume of data transfers. Jobs have a flat rate per job plus a charge for the number of analytics units, which govern how many compute resources the job can use.

SUMMARY

When it comes to choosing a DBaaS, you have a variety of options. The Corollas are the classic RDBMS services in the cloud: not flashy, but reliable. The Formula One offerings are built for purpose: they don't have all the functionality of the mature RDBMS products, but they ingest and query data very quickly. The 18-wheelers are data warehouses in the cloud that store and easily query large amounts of structured data. The container ships are big-data systems that carry everything; think of them as Hadoop-as-a-Service.

All of these offerings can improve delivery because the management tasks are automated. As a result, there's less chance of human error and less chance of quality issues during maintenance. All of the offerings also reduce time-to-market, enable faster ROI, and reduce capital expenditures.

Before you choose a service, you need to understand all of them and then closely consider your requirements. You don't want to deploy DocumentDB, then realize later that what you really needed was an RDBMS service. You don't want to choose Redshift, only to discover that you'd have been better served by BigQuery. Think about your relational data, your NoSQL unstructured data, and your big structured-data requirements for warehousing. Maybe you're also adopting big-data analytics. With the right public cloud service for your use case, you can leverage your data to gain insights, then use those insights to gain competitive advantages.

For more information about how Pythian can help you choose the right DBaaS for your needs, please visit: https://www.pythian.com/solutions/

ABOUT THE AUTHOR

Warner Chaves
@warchav

Warner Chaves is a principal consultant at Pythian, a Microsoft Certified Master, and a Microsoft MVP. Warner has been recognized by his colleagues for his ability to remain calm and collected under pressure. His transparency and candor enable him to develop meaningful relationships with his clients, where he welcomes the opportunity to be challenged. Originally from Costa Rica, Warner is fluent in English and Spanish.

CONTRIBUTORS

Danil Zburivsky
@zburivsky

Danil Zburivsky is Pythian's director of big data and data science. Danil leads a team of big data architects and data scientists who help customers worldwide achieve their most ambitious goals with large-scale data platforms. He is recognized for his expertise in architecting, building, and supporting large mission-critical data platforms using MySQL, Hadoop, and MongoDB. Danil is a popular speaker at industry events and has authored a book titled Hadoop Cluster Deployment.

Vladimir Stoyak

Vladimir Stoyak is a principal consultant for big data. Vladimir is a certified Google Cloud Platform Qualified Developer and Principal Consultant for Pythian's Big Data team. He has more than 20 years of expertise working with big data and machine learning technologies, including Hadoop, Kafka, Spark, Flink, HBase, and Cassandra. Throughout his career in IT, Vladimir has been involved in a number of startups. He was Director of Application Services for Fusepoint, which was recently acquired by CenturyLink. He also founded AlmaLOGIC Solutions Incorporated, an e-learning analytics company.

Derek Downey
@derek_downey

Derek Downey is the practice advocate for the Open Source Databases practice at Pythian, helping to align technical and business objectives for the company and for our clients. Derek loves automating MySQL, implementing visualization strategies, and creating repeatable training environments.

Manoj Kukreja
@mkukreja

Manoj Kukreja is a big data and IT security specialist whose qualifications include a degree in computer science, a master's degree in engineering, and CISSP, CCAH, and OCP designations. With more than twenty years of experience in the planning, creation, and deployment of complex and large-scale infrastructures, Manoj has worked for large public- and private-sector organizations, including US and Canadian government agencies. Manoj has expertise in NoSQL and big data technologies, including Hadoop, MySQL, MongoDB, and Oracle.

ABOUT PYTHIAN

Pythian is a global IT services company that helps businesses become more competitive by using technology to reach their business goals. We design, implement, and manage systems that directly contribute to revenue and business success. Our services deliver increased agility and business velocity through IT transformation, and high system availability and performance through operational excellence. Our highly skilled technical teams work as an integrated extension of our clients' organizations to deliver continuous transformation and uninterrupted operational excellence using our expertise in databases, cloud, DevOps, big data, advanced analytics, and infrastructure management.

Pythian, The Pythian Group, love your data, pythian.com, and Adminiscope are trademarks of The Pythian Group Inc. Other product and company names mentioned herein may be trademarks or registered trademarks of their respective owners. The information presented is subject to change without notice. Copyright <year> The Pythian Group Inc. All rights reserved.