Part 1 Configuring Oracle Big Data SQL

Oracle Big Data, Data Science, Advanced Analytics & Oracle NoSQL Database

Securely analyze data across the big data platform, whether that data resides in Oracle Database 12c, in Hadoop, or in a combination of these sources. You will be able to leverage your existing Oracle skill sets and applications to gain these insights, applying Oracle's rich SQL dialect and security policies across the data platform and greatly simplifying the ability to gain insights from all your data.

There are two parts to Big Data SQL:
o Enhanced Oracle external tables
o Oracle Big Data SQL Server, which applies Smart Scan over data stored in Hadoop in order to achieve fast performance (Big Data SQL Server is available on the Oracle Big Data Appliance only, not on the VM/OVA)

Part 1 Configuring Oracle Big Data SQL

Copy/download the bigdatasql_hol_otn_setup.sql and bigdatasql_hol.sql files. Run the bigdatasql_hol_otn_setup.sql script in SQL Developer; when prompted for a connection, select the moviedemo connection and click OK. This completes the setup for this tutorial; the bigdatasql_hol.sql script contains the DEMO steps.

The virtual environment for this tutorial is mostly preconfigured for Oracle Big Data SQL. Six simple tasks are required to configure it:
1. Create the Common Directory and a Cluster Directory on the Exadata server. DONE.
2. Create and populate the bigdata.properties file in the Common Directory. DONE.
3. Copy the Hadoop configuration files into the Cluster Directory. DONE.
4. Create corresponding Oracle directory objects that reference these configuration directories.
5. Install Oracle Big Data SQL on the BDA using Mammoth, the BDA's installation and configuration utility. DONE.
6. Install a CDH client on each Exadata server. DONE.
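Tasks 1-3 amount to a handful of file-system operations. The sketch below mirrors them against a scratch directory instead of the real /u01 path; the file contents are illustrative placeholders, not the lab's actual configuration.

```python
from pathlib import Path
import tempfile

# Sketch of configuration tasks 1-3, run against a scratch directory rather
# than the real clusterwide path (/u01/bigdatasql_config). Illustrative only.
base = Path(tempfile.mkdtemp())

# Task 1: the Common Directory, with a Cluster Directory named after the
# cluster (case sensitive) as a subdirectory.
common = base / "bigdatasql_config"
cluster = common / "bigdatalite"
cluster.mkdir(parents=True)

# Task 2: create and populate bigdata.properties in the Common Directory.
# A single illustrative property is shown here.
(common / "bigdata.properties").write_text("bigdata.cluster.default=bigdatalite\n")

# Task 3: copy the Hadoop client configuration files into the Cluster
# Directory (in the lab these were downloaded via Cloudera Manager).
(cluster / "hive-site.xml").write_text("<configuration/>\n")

print(sorted(p.name for p in common.rglob("*")))
# ['bigdata.properties', 'bigdatalite', 'hive-site.xml']
```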

Common Directory

The Common Directory contains a few subdirectories and an important file named bigdata.properties. This file stores configuration information that is common to all BDA clusters; specifically, it contains property-value pairs used to configure the JVM and to identify a default cluster. For Exadata, the Common Directory must be on a clusterwide file system: it is critical that all Exadata Database nodes access exactly the same configuration information.

cd /u01/bigdatasql_config/
cat bigdata.properties

Notes: The properties, which are not specific to a Hadoop cluster, include items such as the location of the Java VM, classpaths and the LD_LIBRARY_PATH. The last line of the file specifies the default cluster property, in this case bigdatalite. As you will see later, the default cluster simplifies the definition of Oracle tables that access data in Hadoop.

Cluster Directory

The Cluster Directory contains the configuration files required to connect to a specific BDA cluster. It must be a subdirectory of the Common Directory, and the name of the directory is important: it is the name that you will use to identify the cluster. In our hands-on lab there is a single cluster, bigdatalite; the bigdatalite subdirectory contains the configuration files for that cluster. The name of the cluster must match the name of the subdirectory (and it is case sensitive!).

cd /u01/bigdatasql_config/bigdatalite
ls

Notes: These are the files required to connect Oracle Database to HDFS and to Hive. Although not required, in our example these files were retrieved by using Cloudera Manager. The screenshot below shows the home page for a Cloudera Manager cluster. In our example, we selected View Client URLs from the actions menu and then downloaded the configuration files for both YARN and Hive into the Cluster Directory.
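To make the description concrete, a bigdata.properties file along these lines could be expected. The property names and paths below are illustrative placeholders patterned on the description above (JVM location, classpaths, LD_LIBRARY_PATH, default cluster); use cat bigdata.properties to see the lab's actual file.

```properties
# Illustrative bigdata.properties (placeholder paths, not the lab's values)
java.libjvm.file=/usr/java/latest/jre/lib/amd64/server/libjvm.so
java.classpath.oracle=/u01/bigdatasql_config/jlib/*
java.classpath.hadoop=/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*
java.classpath.hive=/usr/lib/hive/lib/*
LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native
# Last line: the default cluster, used when a table does not name one
bigdata.cluster.default=bigdatalite
```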

Create the Corresponding Oracle Directory Objects (Task #4)

ORACLE_BIGDATA_CONFIG: the Oracle directory object that references the Common Directory.
ORA_BIGDATA_CL_bigdatalite: the Oracle directory object that references the Cluster Directory. The naming convention for this directory object is as follows:
o Begins with ORA_BIGDATA_CL_
o Followed by the cluster name (i.e. "bigdatalite"). This name is case sensitive and is limited to 15 characters.
o Must match the physical directory name in the file system (repeat: it's case sensitive!).

SQL> create or replace directory ORACLE_BIGDATA_CONFIG as '/u01/bigdatasql_config';
SQL> create or replace directory "ORA_BIGDATA_CL_bigdatalite" as '';

Notice that no location is specified for the Cluster Directory. The directory is expected to be a subdirectory of ORACLE_BIGDATA_CONFIG, named after the cluster as identified by the Oracle directory object.

Recommended practice: In addition to the Oracle directory objects, you should also create the Big Data SQL Multithreaded Agent (MTA). (Already done as pre-configuration.) This agent bridges the metadata between Oracle Database and Hadoop. Technically, the MTA allows the external process to be multithreaded instead of launching a JVM for every process (which can be quite slow).

SQL> create public database link BDSQL$_bigdatalite using 'extproc_connection_data';
SQL> create public database link BDSQL$_DEFAULT_CLUSTER using 'extproc_connection_data';
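The naming rules above can be sketched as a small checker. This is an illustrative helper, not part of the lab scripts; the prefix and the 15-character limit come from the convention just described.

```python
# Sketch: the cluster-directory naming rules described above, as a checker.
PREFIX = "ORA_BIGDATA_CL_"

def cluster_directory_object_name(cluster_name: str) -> str:
    """Build the Oracle directory object name for a Cluster Directory.

    The cluster name is case sensitive, must match the physical
    subdirectory of the Common Directory, and is limited to 15 characters.
    """
    if len(cluster_name) > 15:
        raise ValueError("cluster name is limited to 15 characters")
    return PREFIX + cluster_name

print(cluster_directory_object_name("bigdatalite"))  # ORA_BIGDATA_CL_bigdatalite
```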

Part 2 Create Oracle Table Over Application Log

The movie application streamed data into HDFS, specifically into the following directory: /user/oracle/moviework/applog_json

Execute the following commands to review the log file stored in HDFS:

hadoop fs -ls /user/oracle/moviework/applog_json
hadoop fs -tail /user/oracle/moviework/applog_json/movieapp_log_json.log

The JSON log captures every click that happened on the web site.

Create the Oracle table:

SQL> CREATE TABLE movielog
  (click VARCHAR2(4000))
  ORGANIZATION EXTERNAL
  (TYPE ORACLE_HDFS
   DEFAULT DIRECTORY DEFAULT_DIR
   LOCATION ('/user/oracle/moviework/applog_json/'))
  REJECT LIMIT UNLIMITED;

SQL> SELECT * FROM movielog WHERE rownum < 20;

SQL> CREATE TABLE movielog_plus
  (click VARCHAR2(40))
  ORGANIZATION EXTERNAL
  (TYPE ORACLE_HDFS
   DEFAULT DIRECTORY DEFAULT_DIR
   ACCESS PARAMETERS (
     com.oracle.bigdata.cluster=bigdatalite
     com.oracle.bigdata.overflow={"action":"truncate"}
   )
   LOCATION ('/user/oracle/moviework/applog_json/'))
  REJECT LIMIT UNLIMITED;

The click column has been changed to a VARCHAR2(40). Clearly, this is going to be a problem: the length of a JSON document exceeds that size. There are numerous ways to handle this situation, including:
o Generate an error and then either reject the record, set its value to null, or replace it with an alternate value.
o Simply truncate the data.

Here, we are truncating the data, and we have applied the truncate action to all columns in the table; you can also specify the individual column(s) to truncate. A cluster, bigdatalite, has been specified; this cluster will be used instead of the default (which in this case happens to be the same). Currently, a given session may only connect to a single cluster.

SQL> SELECT * FROM movielog_plus WHERE rownum < 20;

Oracle Database 12c includes native JSON support, which allows queries to easily extract attribute data from JSON documents.
Run the following query in SQL Developer:

SQL> SELECT m.click.custid, m.click.movieid, m.click.genreid, m.click.time
FROM movielog m
WHERE rownum < 20;

The column specification in the select list is a full path to the JSON attribute. The specification starts with the table alias "m" (note: the alias is required!), followed by the column name "click", and then a case-sensitive JSON path (e.g. "genreid").
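Conceptually, the dot-notation query does per row what this sketch does for one record. The field names (custid, movieid, genreid, time) come from the lab's queries; the values are made up.

```python
import json

# Each row of movielog holds one JSON document in the "click" column; the
# path after the column name selects an attribute. Hypothetical record:
click = '{"custid":1185972,"movieid":11547,"genreid":45,"time":"2012-07-01:00:10:20","rating":4}'

doc = json.loads(click)
row = (doc["custid"], doc["movieid"], doc["genreid"], doc["time"])
print(row)  # (1185972, 11547, 45, '2012-07-01:00:10:20')
```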

Combine Data from Oracle Database and Hadoop

Combine the "click" data with data sourced from the movie dimension table in Oracle Database:

SQL> SELECT f.click.custid, m.title, m.year, m.gross, f.click.rating
FROM movielog f, movie m
WHERE f.click.movieid = m.movie_id
AND f.click.rating > 4;

Create a view to simplify queries against the JSON data:

SQL> CREATE OR REPLACE VIEW movielog_v AS
SELECT CAST(m.click.custid AS NUMBER) custid,
       CAST(m.click.movieid AS NUMBER) movieid,
       CAST(m.click.activity AS NUMBER) activity,
       CAST(m.click.genreid AS NUMBER) genreid,
       CAST(m.click.recommended AS VARCHAR2(1)) recommended,
       CAST(m.click.time AS VARCHAR2(20)) time,
       CAST(m.click.rating AS NUMBER) rating,
       CAST(m.click.price AS NUMBER) price
FROM movielog m;

Oracle SQL for MoviePlex: compare average ratings for the top 10 grossing movies:

SQL> SELECT m.title, m.year, m.gross, round(avg(f.rating), 1)
FROM movielog_v f, movie m
WHERE f.movieid = m.movie_id
GROUP BY m.title, m.year, m.gross
ORDER BY m.gross desc
FETCH FIRST 10 ROWS ONLY;

Part 3 Leverage the Hive Metastore to Access Data in Hadoop

Hive enables SQL access to data stored in Hadoop and NoSQL stores. There are two parts to Hive: the Hive execution engine and the Hive Metastore.

The Hive execution engine launches MapReduce job(s) based on the SQL that has been issued. MapReduce is a batch processing framework; it is not intended for interactive query and analysis, but it is extremely useful for querying massive data sets using the well-understood SQL language. Importantly, no coding (Java, Pig, etc.) is required. The SQL supported by Hive is still limited (roughly SQL-92), but improvements are being made over time.

The Hive Metastore has become the standard metadata repository for data stored in Hadoop. It contains the definitions of tables (table name, columns and data types), the location of the data files (e.g. a directory in HDFS), and the routines required to parse that data (e.g. StorageHandlers, InputFormats and SerDes, i.e. serializer/deserializers).
The same metadata can be shared across multiple products (e.g. Hive, Oracle Big Data SQL, Impala, Pig, Stinger, etc.).

Review the tables stored in Hive from the CLI:

hive> show tables;
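The kind of information the Metastore holds for a table can be pictured as a record like the following. The entry is modeled on this lab's movieapp_log_json table; the values are illustrative, not dumped from a real metastore.

```python
# The details the Hive Metastore records for a table, per the text above:
# table definition, data location, and the parsing routine (SerDe).
metastore_entry = {
    "table": "movieapp_log_json",
    "columns": [("custid", "int"), ("movieid", "int"), ("genreid", "int"),
                ("time", "string"), ("rating", "int")],
    "location": "hdfs://bigdatalite.localdomain:8020/user/oracle/moviework/applog_json",
    "serde": "org.apache.hive.hcatalog.data.JsonSerDe",
}

# Any engine (Hive, Big Data SQL, Impala, ...) can read the same entry to
# find the files and the class that parses them.
print(metastore_entry["location"].rsplit("/", 1)[-1])  # applog_json
```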

The movielog table is equivalent to the external table that was defined in Oracle Database in the previous exercise. Review the definition of the table by executing the following commands at the hive> prompt:

hive> show create table movielog;
hive> select * from movielog limit 10;

Because there are no columns in the select list and no filters applied, the query simply scans the file and returns the results; no MapReduce job is executed.

The second table queries that same file, but this time using a SerDe that translates the attributes into columns. Review the definition of the table by executing the following command:

hive> show create table movieapp_log_json;

There are columns defined for each field in the JSON document, making it much easier to understand and query the data. A Java class, org.apache.hive.hcatalog.data.JsonSerDe, is used to deserialize the JSON file.
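A sketch of the two schemas over one and the same log line (the line itself is hypothetical, in the lab's JSON format):

```python
import json

# The two Hive tables read the same file through different schemas:
# movielog exposes each line as a single string column, while
# movieapp_log_json uses a JSON SerDe to expose one column per attribute.
line = '{"custid":1135508,"movieid":240,"genreid":9,"rating":5}'

as_movielog = {"click": line}            # one VARCHAR-like column
as_movieapp_log_json = json.loads(line)  # one column per JSON field

print(as_movielog["click"] == line)      # True
print(as_movieapp_log_json["rating"])    # 5
```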

This is an illustration of Hadoop's schema-on-read paradigm: a file is stored in HDFS, but no schema is associated with it until that file is read. Our examples use two different schemas to read the same data; these schemas are encapsulated by the Hive tables movielog and movieapp_log_json.

The Hive query execution engine converts the HiveQL query into a MapReduce job. The author of the query does not need to worry about the underlying implementation; Hive handles this automatically.

hive> select * from movieapp_log_json where rating > 4;
hive> exit;

Leverage Hive Metadata When Creating Oracle Tables

Create a table over the Hive movieapp_log_json table using the following DDL:

SQL> CREATE TABLE movieapp_log_json
  (custid INTEGER,
   movieid INTEGER,
   genreid INTEGER,
   time VARCHAR2(20),
   recommended VARCHAR2(4),
   activity NUMBER,
   rating INTEGER,
   price NUMBER)
  ORGANIZATION EXTERNAL
  (TYPE ORACLE_HIVE
   DEFAULT DIRECTORY DEFAULT_DIR)
  REJECT LIMIT UNLIMITED;

SQL> SELECT * FROM movieapp_log_json WHERE rating > 4;

The ORACLE_HIVE access driver type invokes Oracle Big Data SQL at query compilation time to retrieve the metadata details from the Hive Metastore; the defaults can be overridden using ACCESS PARAMETERS. The metadata includes the location of the data and the classes required to process it (e.g. StorageHandlers, InputFormats and SerDes). The query scanned the files found in the /user/oracle/moviework/applog_json directory and then used the Hive SerDe to parse each JSON document. In a true Oracle Big Data Appliance environment, the input splits would be processed in parallel across the nodes of the cluster by the Big Data SQL Server, the data would be filtered locally using Smart Scan, and only the filtered results (rows and columns) would be returned to Oracle Database.
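The Smart Scan behavior just described can be sketched as follows: a simplified, single-process stand-in for the real distributed scan, over hypothetical rows.

```python
# Sketch of the Smart Scan idea: each node filters its own input split
# locally and ships back only the surviving rows and requested columns,
# rather than the whole file. Data below is hypothetical.
split = [
    {"custid": 1, "movieid": 240, "rating": 5, "price": 3.99},
    {"custid": 2, "movieid": 11,  "rating": 3, "price": 0.00},
    {"custid": 3, "movieid": 502, "rating": 5, "price": 1.99},
]

def smart_scan(rows, predicate, columns):
    """Apply the WHERE clause and the projection before returning results."""
    return [{c: r[c] for c in columns} for r in rows if predicate(r)]

# SELECT custid, movieid FROM ... WHERE rating > 4
result = smart_scan(split, lambda r: r["rating"] > 4, ["custid", "movieid"])
print(result)  # [{'custid': 1, 'movieid': 240}, {'custid': 3, 'movieid': 502}]
```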

The second Hive table is over the same movie log content, except the data is in Avro format rather than JSON text. Create an Oracle table over that Avro-based Hive table using the following command. Because the Oracle table name does not match the Hive table name, an ACCESS PARAMETER is specified that references the Hive table (default.movieapp_log_avro):

SQL> CREATE TABLE mylogdata
  (custid INTEGER,
   movieid INTEGER,
   genreid INTEGER,
   time VARCHAR2(20),
   recommended VARCHAR2(4),
   activity NUMBER,
   rating INTEGER,
   price NUMBER)
  ORGANIZATION EXTERNAL
  (TYPE ORACLE_HIVE
   DEFAULT DIRECTORY DEFAULT_DIR
   ACCESS PARAMETERS (
     com.oracle.bigdata.tablename=default.movieapp_log_avro
   ))
  REJECT LIMIT UNLIMITED;

SQL> SELECT custid, movieid, time FROM mylogdata;

To illustrate how Oracle Big Data SQL uses the Hive Metastore at query compilation to determine query execution parameters, simply change the definition of the Hive table movieapp_log_json. In Hive, alter the table's LOCATION so that it points to a directory containing only two records. The Oracle query then runs without any changes to the Oracle table movieapp_log_json:

hive> ALTER TABLE movieapp_log_json SET LOCATION "hdfs://bigdatalite.localdomain:8020/user/oracle/moviework/two_recs";
hive> SELECT * FROM movieapp_log_json;

SQL> SELECT * FROM movieapp_log_json;

Reset the Hive table and then confirm that there are more than two rows by executing the following commands:

hive> ALTER TABLE movieapp_log_json SET LOCATION "hdfs://bigdatalite.localdomain:8020/user/oracle/moviework/applog_json";
hive> select * from movieapp_log_json limit 10;
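What this demo shows can be sketched in a few lines: the table definition on the Oracle side stays fixed, while the data location is resolved from a metastore lookup at query time. All names and file contents below are hypothetical stand-ins.

```python
# Sketch of the ALTER TABLE ... SET LOCATION demo: the table definition never
# changes; the location is looked up in the (mock) metastore on every query,
# as Big Data SQL resolves it from the Hive Metastore at compilation time.
files = {
    "/user/oracle/moviework/applog_json": ["rec1", "rec2", "rec3"],
    "/user/oracle/moviework/two_recs": ["rec1", "rec2"],
}
metastore = {"movieapp_log_json": "/user/oracle/moviework/applog_json"}

def query(table):
    # Location resolved fresh for each query.
    return files[metastore[table]]

print(len(query("movieapp_log_json")))   # 3
metastore["movieapp_log_json"] = "/user/oracle/moviework/two_recs"
print(len(query("movieapp_log_json")))   # 2
```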

Part 4 Applying Oracle Database Security Policies Over Data in Hadoop

Oracle Database security features, including strong authentication, row-level access, data redaction, data masking, auditing and more, can be utilized to ensure that data remains safe even when it resides in Hadoop/HDFS. For example, to protect personally identifiable information, including the customer last name and customer id, an Oracle Data Redaction policy has already been set up on the customer table that obscures these two fields. This was accomplished by using the DBMS_REDACT PL/SQL package:

SQL> BEGIN
  DBMS_REDACT.ADD_POLICY(
    object_schema       => 'MOVIEDEMO',
    object_name         => 'CUSTOMER',
    column_name         => 'CUST_ID',
    policy_name         => 'customer_redaction',
    function_type       => DBMS_REDACT.PARTIAL,
    function_parameters => '9,1,7',
    expression          => '1=1');
END;
/

This creates a policy called customer_redaction:
o It is applied to the cust_id column of the moviedemo.customer table.
o It performs a partial redaction, i.e. the redaction is not necessarily applied to all characters in the field; here it replaces the first 7 characters with the number "9".
o The policy always applies, because the expression describing when it applies is specified as "1=1".

SQL> BEGIN
  DBMS_REDACT.ALTER_POLICY(
    object_schema       => 'MOVIEDEMO',
    object_name         => 'CUSTOMER',
    action              => DBMS_REDACT.ADD_COLUMN,
    column_name         => 'LAST_NAME',
    policy_name         => 'customer_redaction',
    function_type       => DBMS_REDACT.PARTIAL,
    function_parameters => 'VVVVVVVVVVVVVVVVVVVVVVVVV,VVVVVVVVVVVVVVVVVVVVVVVVV,*,3,25',
    expression          => '1=1');
END;
/

This updates the customer_redaction policy, redacting a second column in the same table: it replaces characters 3 to 25 of the LAST_NAME column with an '*'. The fact that the data is redacted is transparent to application code:

SQL> SELECT cust_id, last_name FROM customer;
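The effect of the two partial-redaction policies can be sketched in a few lines. This is a stand-in for the observable behavior, not the DBMS_REDACT API itself; the sample values are hypothetical.

```python
def redact_partial(value: str, fill: str, start: int, end: int) -> str:
    """Mimic partial redaction: replace characters start..end (1-based,
    inclusive) with the fill character, as in the policies above."""
    s = str(value)
    end = min(end, len(s))
    if start > len(s):
        return s
    return s[:start - 1] + fill * (end - start + 1) + s[end:]

# cust_id policy: replace the first 7 characters with "9"
print(redact_partial("1287402", "9", 1, 7))     # 9999999
# last_name policy: replace characters 3 to 25 with "*"
print(redact_partial("Robertson", "*", 3, 25))  # Ro*******
```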

Apply Redaction Policies to Data Stored in Hadoop

Apply an equivalent redaction policy to two of our Oracle Big Data SQL tables, with the following effects: the first procedure redacts data sourced from JSON in HDFS; the second redacts Avro data sourced from Hive. Both policies redact the custid attribute.

SQL> BEGIN
  -- JSON file in HDFS
  DBMS_REDACT.ADD_POLICY(
    object_schema       => 'MOVIEDEMO',
    object_name         => 'MOVIELOG_V',
    column_name         => 'CUSTID',
    policy_name         => 'movielog_v_redaction',
    function_type       => DBMS_REDACT.PARTIAL,
    function_parameters => '9,1,7',
    expression          => '1=1');
  -- Avro data from Hive
  DBMS_REDACT.ADD_POLICY(
    object_schema       => 'MOVIEDEMO',
    object_name         => 'MYLOGDATA',
    column_name         => 'CUSTID',
    policy_name         => 'mylogdata_redaction',
    function_type       => DBMS_REDACT.PARTIAL,
    function_parameters => '9,1,7',
    expression          => '1=1');
END;
/

Review the redacted data from the Avro source:

SQL> SELECT * FROM mylogdata WHERE rownum < 20;

Join the redacted HDFS data to the customer table by executing the following SELECT statement:

SQL> SELECT f.custid, c.last_name, f.movieid, f.time
FROM customer c, movielog_v f
WHERE c.cust_id = f.custid;

Part 5 Using Oracle Analytic SQL Across All Your Data

Oracle Big Data SQL allows you to utilize Oracle's rich SQL dialect to query all your data, regardless of where that data resides. Deepen Oracle MoviePlex's understanding of its customers by performing an RFM analysis:
o Recency: when was the last time the customer accessed the site?
o Frequency: what is the level of activity for that customer on the site?
o Monetary: how much money has the customer spent?

SQL analytic functions will be applied to data residing in both the application logs on Hadoop and the sales data in Oracle Database tables. A combined RFM score of 551 indicates that the customer is in the highest tier in terms of recent visits (R=5) and activity on the site (F=5), but in the lowest tier in terms of spend (M=1).

Apply Oracle NTILE functions across all data. The customer_sales subquery selects from the Oracle Database fact table movie_sales to categorize customers based on sales. The click_data subquery performs a similar task for web site activity stored in the application logs, categorizing customers based on their activity and recent visits. These two subqueries are then joined to produce the complete RFM score.
SQL> WITH customer_sales AS (
  -- Sales and customer attributes
  SELECT m.cust_id, c.last_name, c.first_name, c.country, c.gender, c.age, c.income_level,
         NTILE(5) OVER (ORDER BY SUM(sales)) AS rfm_monetary
  FROM movie_sales m, customer c
  WHERE c.cust_id = m.cust_id
  GROUP BY m.cust_id, c.last_name, c.first_name, c.country, c.gender, c.age, c.income_level
),
click_data AS (
  -- Clicks from the application log
  SELECT custid,
         NTILE(5) OVER (ORDER BY MAX(time)) AS rfm_recency,
         NTILE(5) OVER (ORDER BY COUNT(1)) AS rfm_frequency
  FROM movielog_v
  GROUP BY custid
)
SELECT c.cust_id, c.last_name, c.first_name,
       cd.rfm_recency, cd.rfm_frequency, c.rfm_monetary,
       cd.rfm_recency*100 + cd.rfm_frequency*10 + c.rfm_monetary AS rfm_combined,
       c.country, c.gender, c.age, c.income_level
FROM customer_sales c, click_data cd
WHERE c.cust_id = cd.custid;

We want to target customers who we may be losing to the competition. Therefore, amend the query to find important customers (high monetary score) who have not visited the site recently (low recency score):

SQL> WITH customer_sales AS (
  -- Sales and customer attributes
  SELECT m.cust_id, c.last_name, c.first_name, c.country, c.gender, c.age, c.income_level,
         NTILE(5) OVER (ORDER BY SUM(sales)) AS rfm_monetary
  FROM movie_sales m, customer c
  WHERE c.cust_id = m.cust_id
  GROUP BY m.cust_id, c.last_name, c.first_name, c.country, c.gender, c.age, c.income_level
),
click_data AS (
  -- Clicks from the application log
  SELECT custid,
         NTILE(5) OVER (ORDER BY MAX(time)) AS rfm_recency,
         NTILE(5) OVER (ORDER BY COUNT(1)) AS rfm_frequency
  FROM movielog_v
  GROUP BY custid
)
SELECT c.cust_id, c.last_name, c.first_name,
       cd.rfm_recency, cd.rfm_frequency, c.rfm_monetary,
       cd.rfm_recency*100 + cd.rfm_frequency*10 + c.rfm_monetary AS rfm_combined,
       c.country, c.gender, c.age, c.income_level
FROM customer_sales c, click_data cd
WHERE c.cust_id = cd.custid
  AND c.rfm_monetary >= 4
  AND cd.rfm_recency <= 2
ORDER BY c.rfm_monetary DESC, cd.rfm_recency DESC;

Pattern Matching and Advanced Analytics with PIVOT Tables:
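The scoring logic of the two queries can be sketched with a simplified NTILE (ascending order, even split, ignoring ties; Oracle's NTILE distributes remainders slightly differently) over hypothetical per-customer inputs:

```python
def ntile(values, n=5):
    """Assign each value its NTILE bucket (1..n), ordering ascending."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    buckets = [0] * len(values)
    for rank, i in enumerate(order):
        buckets[i] = rank * n // len(values) + 1
    return buckets

# Hypothetical per-customer inputs (same customer at each index position).
spend      = [10.0, 250.0, 900.0, 35.0, 480.0]  # total sales      -> Monetary
last_visit = [5, 30, 2, 200, 190]               # day of last visit -> Recency
clicks     = [4, 120, 90, 300, 15]              # log activity      -> Frequency

monetary  = ntile(spend)        # [1, 3, 5, 2, 4]
recency   = ntile(last_visit)   # [2, 3, 1, 5, 4]
frequency = ntile(clicks)       # [1, 4, 3, 5, 2]

combined = [r * 100 + f * 10 + m
            for r, f, m in zip(recency, frequency, monetary)]

# High spenders we have not seen recently: rfm_monetary >= 4, rfm_recency <= 2
at_risk = [i for i in range(len(combined))
           if monetary[i] >= 4 and recency[i] <= 2]
print(combined)  # [211, 343, 135, 552, 424]; customer 2 scores 135: R=1, F=3, M=5
print(at_risk)   # [2]
```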


More information

Stages of Data Processing

Stages of Data Processing Data processing can be understood as the conversion of raw data into a meaningful and desired form. Basically, producing information that can be understood by the end user. So then, the question arises,

More information

Actual4Test. Actual4test - actual test exam dumps-pass for IT exams

Actual4Test.   Actual4test - actual test exam dumps-pass for IT exams Actual4Test http://www.actual4test.com Actual4test - actual test exam dumps-pass for IT exams Exam : 1z1-449 Title : Oracle Big Data 2017 Implementation Essentials Vendor : Oracle Version : DEMO Get Latest

More information

Big Data Hadoop Course Content

Big Data Hadoop Course Content Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux

More information

April Copyright 2013 Cloudera Inc. All rights reserved.

April Copyright 2013 Cloudera Inc. All rights reserved. Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here April 2014 Analytic Workloads on

More information

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI) CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI) The Certificate in Software Development Life Cycle in BIGDATA, Business Intelligence and Tableau program

More information

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here 2013-11-12 Copyright 2013 Cloudera

More information

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018 Big Data com Hadoop Impala, Hive e Spark VIII Sessão - SQL Bahia 03/03/2018 Diógenes Pires Connect with PASS Sign up for a free membership today at: pass.org #sqlpass Internet Live http://www.internetlivestats.com/

More information

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic) Hive and Shark Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 1 / 45 Motivation MapReduce is hard to

More information

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development:: Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized

More information

Securing the Oracle BDA - 1

Securing the Oracle BDA - 1 Hello and welcome to this online, self-paced course titled Administering and Managing the Oracle Big Data Appliance (BDA). This course contains several lessons. This lesson is titled Securing the Oracle

More information

Quick Deployment Step- by- step instructions to deploy Oracle Big Data Lite Virtual Machine

Quick Deployment Step- by- step instructions to deploy Oracle Big Data Lite Virtual Machine Quick Deployment Step- by- step instructions to deploy Oracle Big Data Lite Virtual Machine Version 4.1.0 Please note: This appliance is for testing and educational purposes only; it is unsupported and

More information

Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine

Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine Version 4.11 Last Updated: 1/10/2018 Please note: This appliance is for testing and educational purposes only;

More information

Oracle Big Data Appliance

Oracle Big Data Appliance Oracle Big Data Appliance Software User's Guide Release 4 (4.4) E65665-12 July 2016 Describes the Oracle Big Data Appliance software available to administrators and software developers. Oracle Big Data

More information

Data Lake Based Systems that Work

Data Lake Based Systems that Work Data Lake Based Systems that Work There are many article and blogs about what works and what does not work when trying to build out a data lake and reporting system. At DesignMind, we have developed a

More information

An Introduction to Big Data Formats

An Introduction to Big Data Formats Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION

More information

Oracle Database: SQL and PL/SQL Fundamentals NEW

Oracle Database: SQL and PL/SQL Fundamentals NEW Oracle University Contact Us: 001-855-844-3881 & 001-800-514-06-97 Oracle Database: SQL and PL/SQL Fundamentals NEW Duration: 5 Days What you will learn This Oracle Database: SQL and PL/SQL Fundamentals

More information

Hadoop Online Training

Hadoop Online Training Hadoop Online Training IQ training facility offers Hadoop Online Training. Our Hadoop trainers come with vast work experience and teaching skills. Our Hadoop training online is regarded as the one of the

More information

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. HCatalog

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. HCatalog About the Tutorial HCatalog is a table storage management tool for Hadoop that exposes the tabular data of Hive metastore to other Hadoop applications. It enables users with different data processing tools

More information

Oracle Big Data. A NA LYT ICS A ND MA NAG E MENT.

Oracle Big Data. A NA LYT ICS A ND MA NAG E MENT. Oracle Big Data. A NALYTICS A ND MANAG E MENT. Oracle Big Data: Redundância. Compatível com ecossistema Hadoop, HIVE, HBASE, SPARK. Integração com Cloudera Manager. Possibilidade de Utilização da Linguagem

More information

Hadoop is supplemented by an ecosystem of open source projects IBM Corporation. How to Analyze Large Data Sets in Hadoop

Hadoop is supplemented by an ecosystem of open source projects IBM Corporation. How to Analyze Large Data Sets in Hadoop Hadoop Open Source Projects Hadoop is supplemented by an ecosystem of open source projects Oozie 25 How to Analyze Large Data Sets in Hadoop Although the Hadoop framework is implemented in Java, MapReduce

More information

BIG DATA ANALYTICS USING HADOOP TOOLS APACHE HIVE VS APACHE PIG

BIG DATA ANALYTICS USING HADOOP TOOLS APACHE HIVE VS APACHE PIG BIG DATA ANALYTICS USING HADOOP TOOLS APACHE HIVE VS APACHE PIG Prof R.Angelin Preethi #1 and Prof J.Elavarasi *2 # Department of Computer Science, Kamban College of Arts and Science for Women, TamilNadu,

More information

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou The Hadoop Ecosystem EECS 4415 Big Data Systems Tilemachos Pechlivanoglou tipech@eecs.yorku.ca A lot of tools designed to work with Hadoop 2 HDFS, MapReduce Hadoop Distributed File System Core Hadoop component

More information

Oracle Big Data Connectors & Big Data SQL

Oracle Big Data Connectors & Big Data SQL Oracle Big Data Connectors & Big Data SQL Dr. Nadine Schöne, Detlef E. Schröder, Gavin Dupré DOAG Big Data Days 20./21.09.2018, Dresden Safe Harbor Statement The following is intended to outline our general

More information

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance

More information

Safe Harbor Statement

Safe Harbor Statement Safe Harbor Statement The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment

More information

Introducing Oracle R Enterprise 1.4 -

Introducing Oracle R Enterprise 1.4 - Hello, and welcome to this online, self-paced lesson entitled Introducing Oracle R Enterprise. This session is part of an eight-lesson tutorial series on Oracle R Enterprise. My name is Brian Pottle. I

More information

This is a brief tutorial that explains how to make use of Sqoop in Hadoop ecosystem.

This is a brief tutorial that explains how to make use of Sqoop in Hadoop ecosystem. About the Tutorial Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL, Oracle to Hadoop HDFS, and

More information

MySQL for Developers Ed 3

MySQL for Developers Ed 3 Oracle University Contact Us: 1.800.529.0165 MySQL for Developers Ed 3 Duration: 5 Days What you will learn This MySQL for Developers training teaches developers how to plan, design and implement applications

More information

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism Big Data and Hadoop with Azure HDInsight Andrew Brust Senior Director, Technical Product Marketing and Evangelism Datameer Level: Intermediate Meet Andrew Senior Director, Technical Product Marketing and

More information

Introduction to the Oracle Big Data Appliance - 1

Introduction to the Oracle Big Data Appliance - 1 Hello and welcome to this online, self-paced course titled Administering and Managing the Oracle Big Data Appliance (BDA). This course contains several lessons. This lesson is titled Introduction to the

More information

Techno Expert Solutions An institute for specialized studies!

Techno Expert Solutions An institute for specialized studies! Course Content of Big Data Hadoop( Intermediate+ Advance) Pre-requistes: knowledge of Core Java/ Oracle: Basic of Unix S.no Topics Date Status Introduction to Big Data & Hadoop Importance of Data& Data

More information

Eine für Alle - Oracle DB für Big Data, In-memory und Exadata Dr.-Ing. Holger Friedrich

Eine für Alle - Oracle DB für Big Data, In-memory und Exadata Dr.-Ing. Holger Friedrich Eine für Alle - Oracle DB für Big Data, In-memory und Exadata Dr.-Ing. Holger Friedrich Agenda Introduction Old Times Exadata Big Data Oracle In-Memory Headquarters Conclusions 2 sumit AG Consulting and

More information

CIS 612 Advanced Topics in Database Big Data Project Lawrence Ni, Priya Patil, James Tench

CIS 612 Advanced Topics in Database Big Data Project Lawrence Ni, Priya Patil, James Tench CIS 612 Advanced Topics in Database Big Data Project Lawrence Ni, Priya Patil, James Tench Abstract Implementing a Hadoop-based system for processing big data and doing analytics is a topic which has been

More information

IT Certification Exams Provider! Weofferfreeupdateserviceforoneyear! h ps://

IT Certification Exams Provider! Weofferfreeupdateserviceforoneyear! h ps:// IT Certification Exams Provider! Weofferfreeupdateserviceforoneyear! h ps://www.certqueen.com Exam : 1Z1-449 Title : Oracle Big Data 2017 Implementation Essentials Version : DEMO 1 / 4 1.You need to place

More information

Hadoop & Big Data Analytics Complete Practical & Real-time Training

Hadoop & Big Data Analytics Complete Practical & Real-time Training An ISO Certified Training Institute A Unit of Sequelgate Innovative Technologies Pvt. Ltd. www.sqlschool.com Hadoop & Big Data Analytics Complete Practical & Real-time Training Mode : Instructor Led LIVE

More information

Workload Experience Manager

Workload Experience Manager Workload Experience Manager Important Notice 2010-2018 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, and any other product or service names or slogans contained in this document are

More information

Big Data and Hadoop. Course Curriculum: Your 10 Module Learning Plan. About Edureka

Big Data and Hadoop. Course Curriculum: Your 10 Module Learning Plan. About Edureka Course Curriculum: Your 10 Module Learning Plan Big Data and Hadoop About Edureka Edureka is a leading e-learning platform providing live instructor-led interactive online training. We cater to professionals

More information

Oracle R Advanced Analytics for Hadoop Release Notes. Oracle R Advanced Analytics for Hadoop Release Notes

Oracle R Advanced Analytics for Hadoop Release Notes. Oracle R Advanced Analytics for Hadoop Release Notes Oracle R Advanced Analytics for Hadoop 2.7.1 Release Notes i Oracle R Advanced Analytics for Hadoop 2.7.1 Release Notes Oracle R Advanced Analytics for Hadoop 2.7.1 Release Notes ii REVISION HISTORY NUMBER

More information

Hortonworks Data Platform

Hortonworks Data Platform Hortonworks Data Platform Workflow Management (August 31, 2017) docs.hortonworks.com Hortonworks Data Platform: Workflow Management Copyright 2012-2017 Hortonworks, Inc. Some rights reserved. The Hortonworks

More information

Hadoop An Overview. - Socrates CCDH

Hadoop An Overview. - Socrates CCDH Hadoop An Overview - Socrates CCDH What is Big Data? Volume Not Gigabyte. Terabyte, Petabyte, Exabyte, Zettabyte - Due to handheld gadgets,and HD format images and videos - In total data, 90% of them collected

More information

Hadoop. Introduction to BIGDATA and HADOOP

Hadoop. Introduction to BIGDATA and HADOOP Hadoop Introduction to BIGDATA and HADOOP What is Big Data? What is Hadoop? Relation between Big Data and Hadoop What is the need of going ahead with Hadoop? Scenarios to apt Hadoop Technology in REAL

More information

microsoft

microsoft 70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series

More information

Spatial Analytics Built for Big Data Platforms

Spatial Analytics Built for Big Data Platforms Spatial Analytics Built for Big Platforms Roberto Infante Software Development Manager, Spatial and Graph 1 Copyright 2011, Oracle and/or its affiliates. All rights Global Digital Growth The Internet of

More information

Big Data Architect.

Big Data Architect. Big Data Architect www.austech.edu.au WHAT IS BIG DATA ARCHITECT? A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional

More information

Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk

Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk Raanan Dagan and Rohit Pujari September 25, 2017 Washington, DC Forward-Looking Statements During the course of this presentation, we may

More information

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case

More information

Oracle Big Data Appliance X7-2

Oracle Big Data Appliance X7-2 Oracle Big Data Appliance X7-2 Oracle Big Data Appliance is a flexible, high-performance, secure platform for running diverse workloads on Hadoop, Kafka and NoSQL. With Oracle Big Data SQL, Oracle Big

More information

Strategies for Incremental Updates on Hive

Strategies for Incremental Updates on Hive Strategies for Incremental Updates on Hive Copyright Informatica LLC 2017. Informatica, the Informatica logo, and Big Data Management are trademarks or registered trademarks of Informatica LLC in the United

More information

IBM Big SQL Partner Application Verification Quick Guide

IBM Big SQL Partner Application Verification Quick Guide IBM Big SQL Partner Application Verification Quick Guide VERSION: 1.6 DATE: Sept 13, 2017 EDITORS: R. Wozniak D. Rangarao Table of Contents 1 Overview of the Application Verification Process... 3 2 Platform

More information

Integrating Big Data with Oracle Data Integrator 12c ( )

Integrating Big Data with Oracle Data Integrator 12c ( ) [1]Oracle Fusion Middleware Integrating Big Data with Oracle Data Integrator 12c (12.2.1.1) E73982-01 May 2016 Oracle Fusion Middleware Integrating Big Data with Oracle Data Integrator, 12c (12.2.1.1)

More information

MySQL for Developers Ed 3

MySQL for Developers Ed 3 Oracle University Contact Us: 0845 777 7711 MySQL for Developers Ed 3 Duration: 5 Days What you will learn This MySQL for Developers training teaches developers how to plan, design and implement applications

More information

Introduction to BigData, Hadoop:-

Introduction to BigData, Hadoop:- Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,

More information

Big Data Analytics using Apache Hadoop and Spark with Scala

Big Data Analytics using Apache Hadoop and Spark with Scala Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important

More information

Oracle NoSQL Database Enterprise Edition, Version 18.1

Oracle NoSQL Database Enterprise Edition, Version 18.1 Oracle NoSQL Database Enterprise Edition, Version 18.1 Oracle NoSQL Database is a scalable, distributed NoSQL database, designed to provide highly reliable, flexible and available data management across

More information

Hadoop Overview. Lars George Director EMEA Services

Hadoop Overview. Lars George Director EMEA Services Hadoop Overview Lars George Director EMEA Services 1 About Me Director EMEA Services @ Cloudera Consulting on Hadoop projects (everywhere) Apache Committer HBase and Whirr O Reilly Author HBase The Definitive

More information

Accessing Hadoop Data Using Hive

Accessing Hadoop Data Using Hive An IBM Proof of Technology Accessing Hadoop Data Using Hive Unit 3: Hive DML in action An IBM Proof of Technology Catalog Number Copyright IBM Corporation, 2015 US Government Users Restricted Rights -

More information

Oracle Database 18c and Autonomous Database

Oracle Database 18c and Autonomous Database Oracle Database 18c and Autonomous Database Maria Colgan Oracle Database Product Management March 2018 @SQLMaria Safe Harbor Statement The following is intended to outline our general product direction.

More information

Impala Intro. MingLi xunzhang

Impala Intro. MingLi xunzhang Impala Intro MingLi xunzhang Overview MPP SQL Query Engine for Hadoop Environment Designed for great performance BI Connected(ODBC/JDBC, Kerberos, LDAP, ANSI SQL) Hadoop Components HDFS, HBase, Metastore,

More information

HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation)

HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation) HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation) Introduction to BIGDATA and HADOOP What is Big Data? What is Hadoop? Relation between Big

More information

sqoop Easy, parallel database import/export Aaron Kimball Cloudera Inc. June 8, 2010

sqoop Easy, parallel database import/export Aaron Kimball Cloudera Inc. June 8, 2010 sqoop Easy, parallel database import/export Aaron Kimball Cloudera Inc. June 8, 2010 Your database Holds a lot of really valuable data! Many structured tables of several hundred GB Provides fast access

More information