Big Data Connectors: High Performance Integration for Hadoop and Oracle Database
Melli Annamalai, Sue Mavris, Rob Abbott
Program Agenda
- Big Data Connectors: Brief Overview
- Connecting Hadoop with Oracle Database
- Oracle Direct Connector for HDFS
- Oracle Loader for Hadoop
- Performance
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.
Oracle's Big Data Platform
[Diagram labels: Stream, Acquire, Organize & Discover, Analyze, Visualize & Decide]
Oracle's Big Data Platform
[Diagram labels: Hadoop, Oracle Database]
Oracle Big Data Connectors
- Oracle Direct Connector for HDFS
- Oracle Loader for Hadoop
- Oracle R Connector for Hadoop
- Oracle Data Integrator Application Adapters for Hadoop
Oracle Loader for Hadoop and Oracle Direct Connector for HDFS
- Access data resident on Hadoop from Oracle Database
- Load data from Hadoop into Oracle Database
- Analyze all data together: data processed on Hadoop along with data in Oracle Database
Oracle R Connector for Hadoop
R Analytics leveraging Hadoop and HDFS
- Linearly scale a robust set of R algorithms
- Leverage MapReduce for R calculations
- Compute-intensive parallelism for simulations
[Diagram labels: Oracle R Client, HDFS, Hadoop]
Oracle Data Integrator Application Adapters for Hadoop
Improving Productivity and Efficiency for Big Data
Benefits
- Consistent tooling across BI/DW, SOA, Integration, and Big Data
- Reduces the complexity of processing Hadoop data through graphical tooling
- Improves productivity when processing Big Data (structured and unstructured)
[Diagram labels: Transforms via MapReduce (Hive), Activates, Loads, Oracle Database]
Big Data Connectors
- Oracle Loader for Hadoop
- Oracle Direct Connector for HDFS
Loading and Accessing Data from Hadoop
[Diagram labels: Log Files, Input 1, Input 2, Shuffle/Sort, Oracle Database]
Example Use Case
- Business problem: Need insight into customer web activity (clickstream data)
  Connect Hadoop with Oracle Database: Aggregate raw data and load into database for analysis
- Business problem: Need to connect web activity with transactional activity
  Connect Hadoop with Oracle Database: Perform analysis on in-place data by running Oracle SQL queries
Usage Scenarios
- Bulk load large volumes of data. Example: historical data, daily uploads of data gathered during the day
- Loads at regular frequency. Example: 24/7 monitoring of log feeds
- Loads at irregular frequency. Example: monitoring of sensor feeds
- Access data files in place on HDFS
Oracle Direct Connector for HDFS
Accessing HDFS Data from Oracle Database
Features
- Access and analyze data in place on HDFS
- Access or load into the database in parallel using the external table mechanism
- Query and join data on HDFS with database-resident data
- Load into the database using SQL if required
- Automatic load balancing to maximize performance
[Diagram labels: HDFS, HDFS Client, External Table, SQL Query, Oracle Database]
Oracle Direct Connector for HDFS
External Tables
- Access data on HDFS via external tables
- External tables do not support DML operations, and indexes cannot be created on them
- Data files can be text files or Oracle Data Pump files (created by Oracle Loader for Hadoop)
- Parallelism is controlled by the external table definition
- Data files are grouped to distribute load evenly across parallel query (PQ) slaves
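The slide does not say how data files are grouped across PQ slaves. As a rough illustration only, the sketch below uses a common greedy heuristic (largest file first, always into the currently lightest group). The function name, file names, and sizes are hypothetical and are not taken from the connector.

```python
import heapq

def group_files_by_size(file_sizes, num_groups):
    """Greedy grouping: assign each file, largest first, to the lightest group so far."""
    # Heap entries are (total_size, group_index); the lightest group is always on top.
    heap = [(0, i) for i in range(num_groups)]
    heapq.heapify(heap)
    groups = [[] for _ in range(num_groups)]
    by_size_desc = sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in by_size_desc:
        total, idx = heapq.heappop(heap)
        groups[idx].append(name)
        heapq.heappush(heap, (total + size, idx))
    return groups

# Hypothetical HDFS data files and their sizes in MB.
sizes = {"part-0": 900, "part-1": 500, "part-2": 400, "part-3": 300, "part-4": 100}
groups = group_files_by_size(sizes, 2)
totals = [sum(sizes[f] for f in g) for g in groups]
print(groups, totals)
```

With two groups, the per-group totals end up close to each other even though individual file sizes differ widely, which is the property the slide describes.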
Oracle Direct Connector for HDFS
3 Simple Steps
1. Create external table
2. Run the Oracle Direct Connector for HDFS utility to publish HDFS content to the external table
3. Access and load into the database using SQL

    hadoop jar \
      $ODCH_HOME/jlib/orahdfs.jar \
      oracle.hadoop.hdfs.exttab.ExternalTable \
      -conf MyConf.xml \
      -publish
Performance Comparison: Fuse-DFS
[Chart: Load rate (TB/hour), Fuse-DFS vs. Oracle Direct Connector for HDFS]
[Chart: CPU usage (CPU seconds used per GB), Fuse-DFS vs. Oracle Direct Connector for HDFS]
Key Benefits
- Uniquely enables access to HDFS data files from Oracle Database
- Performance
  - 12 TB/hour from Oracle Big Data Appliance to Oracle Exadata
  - 5x to 20x faster than comparable third party products
- Easy to use for Oracle DBAs and Hadoop developers
- Developed and supported by Oracle
Oracle Loader for Hadoop
- Read target table metadata from the database
- Partition, sort, and convert into Oracle data types on Hadoop
- Connect to the database from reducer nodes, load into database partitions in parallel (JDBC or direct path)
Features
- Offloads data preprocessing from the database server to Hadoop
- Works with a range of input data formats
- Handles skew in input data to maximize performance
- Online and offline modes (offline: create Oracle Data Pump files on HDFS)
Input Formats
Oracle Loader for Hadoop
- Delimited text InputFormat
- Hive tables InputFormat
- Avro record InputFormat
- User written InputFormat
- (Planned) Regular expression InputFormat
- (Planned) Oracle NoSQL Database InputFormat
Automatically Handle Input Data Skew
Load Balancing across Reducers
- Distribute load evenly across reduce tasks
- All reducers do approximately the same amount of work
- Avoids slowdown because of unbalanced reducer loads
- Maximizes performance
- Data is sampled to determine optimal partitioning of map output keys
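The sampling idea can be illustrated with a generic quantile-based range partitioner. This is a minimal sketch under stated assumptions, not the loader's actual sampler: the key distribution, sample size, reducer count, and function names are all hypothetical. It compares sample-derived split points against naive equal-width key ranges on deliberately skewed data.

```python
import bisect
import random

def choose_split_points(sample_keys, num_reducers):
    """Pick num_reducers - 1 split points at evenly spaced quantiles of the sample."""
    ordered = sorted(sample_keys)
    return [ordered[(i * len(ordered)) // num_reducers] for i in range(1, num_reducers)]

def reducer_for(key, split_points):
    """Route a key to the reducer whose key range contains it."""
    return bisect.bisect_right(split_points, key)

random.seed(0)
NUM_REDUCERS = 4
# Hypothetical skewed keys: about 80% fall in the narrow range 0..99.
keys = [
    random.randint(0, 99) if random.random() < 0.8 else random.randint(100, 9999)
    for _ in range(20000)
]

# Sample a small fraction of the keys and derive split points from the sample.
sample = random.sample(keys, 1000)
splits = choose_split_points(sample, NUM_REDUCERS)

sampled_counts = [0] * NUM_REDUCERS
naive_counts = [0] * NUM_REDUCERS
for k in keys:
    sampled_counts[reducer_for(k, splits)] += 1
    naive_counts[k * NUM_REDUCERS // 10000] += 1  # equal-width key ranges

print("split points:", splits)
print("sampled partitioning:", sampled_counts)
print("equal-width partitioning:", naive_counts)
```

Equal-width ranges send most rows to the first reducer, while the sample-derived split points give each reducer a similar share. Range partitioning of this kind cannot split a single very hot key across reducers, so it is only one part of handling skew.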
Oracle Loader for Hadoop
2 Simple Steps
1. Create target table
2. Submit Oracle Loader for Hadoop job to the cluster

    hadoop jar \
      $OLH_HOME/jlib/oraloader.jar \
      oracle.hadoop.loader.OraLoader \
      -conf MyConf.xml
Performance Comparison: Third Party Products
[Chart: Load rate (TB/hour), comparable third party product vs. Oracle Loader for Hadoop]
[Chart: CPU usage (CPU seconds used per GB), comparable third party product vs. Oracle Loader for Hadoop]
Key Benefits
- Load directly from HDFS files and Hive tables into Oracle Database without intermediate staging files
- Performance
  - 10x faster than comparable third party products
- Offloads database server processing to Hadoop
  - Minimizes impact on performance SLAs of production applications
- Easy to use for Oracle DBAs and Hadoop developers
- Developed and supported by Oracle
Oracle Loader for Hadoop and Oracle Direct Connector for HDFS
[Diagram labels: Oracle Loader for Hadoop, Shuffle/Sort, Oracle Direct Connector for HDFS, HDFS Client, External Table, SQL Query, Oracle Database]
Performance Summary
- 12 TB/hour (66 billion rows)
- 5 to 20 times faster than third party products
- Lower database CPU usage in comparison
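To put the headline figure in more familiar units, the arithmetic below converts it to per-second rates. It assumes decimal terabytes and assumes that the 66 billion rows are the rows contained in the 12 TB loaded in one hour; the slide states neither explicitly.

```python
TB = 10**12  # assumption: decimal terabytes
GB = 10**9
SECONDS_PER_HOUR = 3600

bytes_per_hour = 12 * TB
rows_per_hour = 66 * 10**9  # assumption: rows contained in the 12 TB

gb_per_second = bytes_per_hour / SECONDS_PER_HOUR / GB
rows_per_second = rows_per_hour / SECONDS_PER_HOUR
avg_row_bytes = bytes_per_hour / rows_per_hour

print(f"{gb_per_second:.2f} GB/s")
print(f"{rows_per_second / 1e6:.1f} million rows/s")
print(f"{avg_row_bytes:.0f} bytes per row on average")
```

Under those assumptions the rate works out to roughly 3.3 GB per second, about 18 million rows per second, and an average row of around 180 bytes.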
Summary
- High performance connectors for load and access of data from a Hadoop cluster
- Fast and efficient connectors support a range of use cases
- Simple to set up, easy to use for developers
- Developed and supported by Oracle
Q & A