Oracle Big Data SQL High Performance Data Virtualization Explained


Jean-Pierre Dijcks
Oracle
Redwood City, CA, USA

Keywords: Big Data SQL, SQL, Big Data, Hadoop, NoSQL Databases, Relational Databases, Security, Performance

Introduction

This technical session focuses on Oracle Big Data SQL.

Big Data SQL

The key goals of Big Data SQL are to expose data in its original format, stored in Hadoop and NoSQL databases, through high-performance Oracle SQL that is offloaded to storage-resident cells or agents. The architecture of Big Data SQL closely follows that of the Oracle Exadata Storage Server Software and is built on the same proven technology.

Because data in HDFS is stored in an undetermined format (schema on read), SQL queries require constructs that parse and interpret the data so it can be processed as rows and columns. For this, Big Data SQL leverages the standard Hadoop constructs, notably the InputFormat and SerDe Java classes, optionally through Hive metadata definitions. Big Data SQL then layers the Oracle Big Data SQL Agent on top of this generic Hadoop infrastructure, as can be seen in Figure 1.

Figure 1. Architecture leveraging core Hadoop and Oracle together

Because Big Data SQL is based on Exadata Storage Server Software, a number of benefits are instantly available. Big Data SQL can not only retrieve data but can also score Data Mining models at the individual agent, mapping model scoring to an individual HDFS node. Likewise, querying JSON documents stored in HDFS can be done directly with SQL and is executed on the agent itself.

How does Hadoop on the Bottom work?

The general Hadoop constructs (like InputFormat) mentioned above work in concert to impose schema on read on the data stored in HDFS. The Hive Catalog captures the schema definitions and enables Hive to act as a SQL engine. This metadata is what is instrumental in leveraging these Hadoop constructs for Big Data SQL.

Figure 2. How Hive Metadata Enables SQL Access

As can be seen in Figure 2, the InputFormat (a Java class) is used to parse the file into chunks of data, on which the RecordReader then imposes a record format (start of record to end of record). Once that is done, a SerDe imposes attribute (or column) definitions on these records. Note that all of these steps are part of query execution and happen on the fly; the metadata definition (e.g. a table) is captured in the Hive Metastore.

How does Oracle on Top work?

With the picture of Figure 2 in mind, it is easy to see how an Oracle SQL query can leverage these constructs. To enable data-local processing, however, an additional component is required. First, let's draw the picture of what Big Data SQL does when it reads data with the constructs Hive uses; Figure 3 shows this.
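The on-the-fly parsing described above can be sketched in a few lines. This is a toy Python analogue of the InputFormat/RecordReader/SerDe pipeline, not the real Java APIs; the file contents, split size, and column names are illustrative only.

```python
# Toy analogue of Hadoop's schema-on-read pipeline: an "InputFormat" splits
# raw bytes into chunks, a "RecordReader" imposes record boundaries, and a
# "SerDe" imposes column definitions -- all at query time, never at load time.

RAW_HDFS_FILE = b"1001,alice,42\n1002,bob,17\n1003,carol,99\n"

def input_format_split(data: bytes, split_size: int = 16):
    """Cut the file into byte-range splits, like an InputFormat."""
    return [data[i:i + split_size] for i in range(0, len(data), split_size)]

def record_reader(chunks):
    """Rejoin splits and impose record boundaries (newline-delimited here)."""
    return b"".join(chunks).decode().rstrip("\n").split("\n")

def serde(record: str):
    """Impose attribute (column) definitions on one record."""
    user_id, name, score = record.split(",")
    return {"user_id": int(user_id), "name": name, "score": int(score)}

# The "table" only exists while the query runs; no schema is stored with the file.
rows = [serde(r) for r in record_reader(input_format_split(RAW_HDFS_FILE))]
print(rows[0])  # {'user_id': 1001, 'name': 'alice', 'score': 42}
```

The point of the sketch is that the schema lives in metadata (the Hive Metastore in the real system), while the file itself stays in its original format.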

Figure 3. Big Data SQL and Hive Metadata

Partition Pruning and Predicate Pushdown

One of the most effective performance features ever invented is the partitioning of data. Partitioning has been implemented in Hive and can now be used by Big Data SQL to speed up IO by eliminating partitions from a query. Big Data SQL applies the predicates in a query to eliminate partitions from the data selected, very much as the Oracle Database does. Partition pruning is the first step taken, and it affects not just IO but also the number of splits eventually created.

Using the same predicates, much more IO can be avoided when the underlying files are pre-parsed. For example, Parquet files, which contain parsed data (like a database file), can be interrogated intelligently by combining the file metadata with the predicates. This way, database-like selectivity is achieved, driving up query performance. Big Data SQL pushes predicates down to Parquet, ORC, or HBase, for example, and combines this with partitioning to ensure excellent performance.

Storage Indexes

Storage Indexes (SI) provide the same IO-elimination benefits to Big Data SQL as they provide to SQL on Exadata. The big difference is that in Big Data SQL the SI works on an HDFS block (256MB of data on the BDA) and spans 32 columns instead of the usual 8. The SI is fully transparent to both the Oracle Database and the underlying HDFS environment. As with Exadata, the SI is a memory construct managed by the Big Data SQL software and invalidated automatically when the underlying files change.
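The IO elimination a storage index performs can be sketched as a min/max check per block. This is a toy Python model; the block contents, column names, and block count are illustrative, not the real 256MB block size or 32-column limit.

```python
# Toy sketch of a storage index: per-block min/max for a column, kept in
# memory, used to skip whole blocks whose value range cannot match a predicate.

blocks = [  # each "HDFS block" holds rows of (order_id, amount)
    [(1, 10), (2, 40), (3, 25)],
    [(4, 55), (5, 90), (6, 60)],
    [(7, 5), (8, 15), (9, 30)],
]

# Build the in-memory index: (min, max) of `amount` per block.
storage_index = [(min(a for _, a in b), max(a for _, a in b)) for b in blocks]

def scan(predicate_lo):
    """Return matching rows, reading only blocks the index cannot rule out."""
    blocks_read, hits = 0, []
    for b, (lo, hi) in zip(blocks, storage_index):
        if hi < predicate_lo:   # whole block outside WHERE amount >= :lo
            continue            # this block's IO is skipped entirely
        blocks_read += 1
        hits += [row for row in b if row[1] >= predicate_lo]
    return blocks_read, hits

print(scan(50))  # reads 1 of 3 blocks: (1, [(4, 55), (5, 90), (6, 60)])
```

As in the real feature, the index is built from the data itself, lives in memory, and is simply discarded and rebuilt when the underlying blocks change.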

Figure 4. Storage Indexes work on HDFS blocks and speed up IO by skipping blocks

The SI works on data exposed via Oracle external tables of both the ORACLE_HIVE and ORACLE_HDFS types. Fields are mapped to these external tables and the SI is attached to the Oracle columns, so that when a query references the column(s), the SI kicks in when appropriate. In the current version, the SI does not support tables defined with storage handlers (e.g. HBase or Oracle NoSQL Database).

Bloom Filters

To ensure that joins can be pushed down to the Hadoop nodes instead of requiring all data to be database resident, bloom filters are now also supported in Big Data SQL.

Figure 5. An example of Bloom Filtering
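The mechanism behind bloom filtering can be sketched in a few lines of Python. This is a minimal toy filter (filter size, hash count, and all table data are illustrative), showing only the core idea: hash the join keys of the small side into a bitmap, ship the bitmap to storage, and scan the large side keeping only rows that might match.

```python
# Toy bloom filter used to turn a join into a filtered scan.
import hashlib

M = 64  # bits in the filter (illustratively small)

def positions(key, k=3):
    """Derive k bit positions for a key from a hash of its text form."""
    h = hashlib.sha256(str(key).encode()).digest()
    return [h[i] % M for i in range(k)]

def build_filter(keys):
    """Set the bits for every join key on the small (dimension) side."""
    bits = 0
    for key in keys:
        for p in positions(key):
            bits |= 1 << p
    return bits

def maybe_member(bits, key):
    """False positives are possible; false negatives are not."""
    return all(bits >> p & 1 for p in positions(key))

dim_keys = [101, 205, 333]                        # small side of the join
bloom = build_filter(dim_keys)                    # "pushed to the cell"
fact_rows = [(101, "a"), (777, "b"), (333, "c")]  # scanned with local data

# Only rows whose key might be in the dimension table survive the scan;
# the database then performs the exact join on this much smaller set.
survivors = [r for r in fact_rows if maybe_member(bloom, r[0])]
```

Because the filter can report false positives but never false negatives, the exact join still runs afterwards, on a fraction of the rows.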

Bloom filters are used in Exadata, for example, to convert joins into scans, enabling the joining of data to take place in the storage tier. The same is implemented in Big Data SQL, where the bloom filters are pushed to the cell and execute a join on the Hadoop node with local data.

Smart Scan

Within the Big Data SQL Agent, functionality exists similar to that available in the Exadata Storage Server Software. Smart Scans apply the row filters and column projections from a given SQL query to the data streaming from the HDFS DataNodes, reducing the data that flows to the database to fulfill that query. The benefits of Smart Scan for Hadoop data are even more pronounced than for Oracle Database, as the tables are often very wide and very large. Because data is eliminated at the individual HDFS node, queries across large tables now complete within reasonable time limits, enabling data-warehouse-style queries to be spread across data stored in both HDFS and the Oracle Database.

Compound Benefits

The Smart Scan, Predicate Pushdown, Partitioning, and Storage Index features deliver compound benefits. Where Predicate Pushdown, Partitioning, and Storage Indexes reduce the IO done, Smart Scan adds row filtering and column projection when the IO reduction is not granular enough. This latter step remains important because it reduces the data transferred between the systems and acts as a safeguard that keeps the database from being flooded with data.

Update on Availability and Support for Big Data SQL

As Big Data SQL matures as a technology, its applicable platforms expand. The upcoming (at the time of writing) version of Big Data SQL adds a large number of supported configurations, as shown in Figure 6.

Figure 6. Expanded Deployment Options

Figure 6 shows the rounding out of deployment options, with support ranging from the SPARC SuperCluster to the Big Data Appliance, plus various hybrid options, including Oracle Exadata combined with a generic Hadoop cluster running either the Hortonworks or Cloudera distribution.
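Returning to the performance features, their compound effect can be sketched as a toy pipeline: partition pruning first eliminates whole directories, then Smart Scan filters rows and projects columns at the agent so only the requested values travel to the database. All table names, partition keys, and data below are illustrative.

```python
# Toy pipeline for: SELECT name FROM sales
#                   WHERE day = '2017-01-02' AND amount > 50

partitions = {  # each key is a Hive-style partition directory
    "2017-01-01": [{"name": "a", "amount": 10}, {"name": "b", "amount": 80}],
    "2017-01-02": [{"name": "c", "amount": 60}, {"name": "d", "amount": 20},
                   {"name": "e", "amount": 95}],
    "2017-01-03": [{"name": "f", "amount": 70}],
}

# 1. Partition pruning: the partition-key predicate eliminates whole
#    directories before any file IO happens.
pruned = partitions["2017-01-02"]

# 2. Smart Scan at the agent: filter rows and project columns before
#    anything travels to the database tier.
filtered = [row for row in pruned if row["amount"] > 50]
projected = [row["name"] for row in filtered]

rows_stored = sum(len(p) for p in partitions.values())
print(rows_stored, len(pruned), len(filtered), projected)
# 6 rows stored, 3 read after pruning, 2 after filtering; only `name` shipped.
```

Each stage shrinks the data the next stage must touch, which is why the features compound rather than merely add up.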

Virtual Machine to try out Big Data SQL and all other components

Of course, having all these new features in the platform is a lot of fun, but how do you get to try all of this? It's simple: Oracle packages all of these features into a virtual machine called the Oracle Big Data Lite VM. It is updated for each new version of the BDA software and picks up the components in the stack. The VM not only includes all of Oracle Data Integration, including the Big Data add-on, but now also includes Oracle Big Data Discovery. It is the perfect client for a BDA, and the perfect test bed for any of your investigations. You can find the VM here: http://www.oracle.com/technetwork/database/bigdata-appliance/oraclebigdatalite-2104726.html

Contact address:

Jean-Pierre Dijcks
Oracle
500 Oracle Parkway
MS 4op7
Redwood City, CA 94065

Phone: +1 650 607 5394
Email: Jean-Pierre.Dijcks@oracle.com
Internet: www.oracle.com/bigdata