Narration Script for the ODI Adapter for Hadoop eStudy


MODULE 1: Overview of Oracle Big Data

1. Title
Hello, and welcome to this Oracle self-study course entitled Oracle Data Integrator Application Adapter for Hadoop. My name is Richard Green. I am a curriculum developer at Oracle, and I have been helping educate customers on Oracle products for a number of years. I will be your tour guide for the next hour of interactive lectures and demos. The aim of this course is to introduce you to the Oracle Data Integrator Application Adapter for Hadoop, one of the Oracle Big Data Connectors. After completing this course, you should be able to appreciate the features of this adapter, including an understanding of the business and technical benefits of this product. Because it focuses on the ODI Application Adapter for Hadoop only, this course is not intended as an overall introduction to all of the features of Oracle Data Integrator. A case study involving Simon Howell, a data integration project manager at a fictitious company, brings out the business benefits of the ODI Application Adapter for Hadoop.

2. Using the Player
Before we begin, now might be a good time to take a look at some of the features of this Flash-based course player. Feel free to skip this slide and start the lecture if you've attended similar Oracle self-study courses in the past. To your left, you will find a hierarchical course outline. This course enables and even encourages you to go at your own pace, which means you are free to skip over topics you already feel confident about, jump right to a feature that really interests you, or go back and review topics that were already covered. Simply click a course section to expand its contents, and then select an individual slide. However, note that by default we will automatically walk you through the entire course without requiring you to use the outline. Also to your left is a panel containing any additional reference notes for the current slide. Feel free to read these reference notes at the conclusion of the course, or, if you prefer, you can pause and read them as we go along. Standard Flash player controls are also found at the bottom of the player, including pause, previous, and next buttons. There is also an interactive progress bar to fast-forward or rewind the current slide. Interactive slides may have additional controls and buttons, along with instructions on how to use them. Various handouts may be available from the Attachments button, including the audio narration scripts for this course. The course will now pause, so feel free to take some time and explore the interface. Then, when you're ready to continue, click the NEXT button below, or alternatively click the Module 1 slide in the course outline at left.

3. About This Course
Introduction: So, having been given a quick introduction to the goals of this course, you may still be asking yourself, "Am I in the right place?" To help you answer this question, you can access information here regarding the course's objectives, target audience, and prerequisites.
Overview: This course introduces you to the features of Oracle Data Integrator Application Adapter for Hadoop, ODIAAH for short. Using lecture materials and product demos, it provides an introduction to the architecture and capabilities of ODIAAH. This course does not serve as a general introduction to the broad set of features and functionality of Oracle Data Integrator, and it is most useful if you have prior hands-on experience with Oracle Data Integrator. For broad or detailed information about the entire range of Oracle data integration capabilities, refer to the documentation and Oracle University courses for data integration and data warehousing.
Course Objectives: After completing this course, you should be able to:
- Describe the Oracle approach to Big Data
- Explain the main features and capabilities of ODIAAH
- Explain how to use ODIAAH to load data from files into Hive
- Explain how to use ODIAAH to transform files in Hive
- Explain how to use ODIAAH to move data from Hive into Oracle
What are the Prerequisites? To get the most from this course, you should have thorough familiarity, preferably hands-on, with Oracle Data Integrator.

4. Course Road Map
This course consists of five modules:
1. The first module is an overview of the Oracle approach to big data.
2. The second module is an introduction to Oracle Data Integrator Application Adapter for Hadoop.
3. The third module shows how to use ODIAAH to load data from files into Hive.
4. The fourth module shows how to use ODIAAH to transform and validate files that are in Hive.
5. The fifth module shows how to use ODIAAH to load data from Hive into an Oracle database.

5. Module 1 Title Slide: Overview of Oracle Big Data
Let us now proceed with the first module, Overview of Oracle Big Data.

6. Module Topics
Before we dive into our examination of the Oracle Data Integrator Application Adapter for Hadoop, we need to step back and consider the so-called big data environment in which it operates. This module's topics include:
- What does the term big data mean?
- What are the Oracle Big Data products?
- An introduction to the Hadoop environment and MapReduce programming

7. Big Data Is About
Today the term big data draws a lot of attention, but there is a simple story behind the hype. For decades, companies have been making business decisions based on transactional data that is stored in relational databases. The usual information-gathering process involves extracting the data that resides in a database. That process restricts you to retrieving mostly structured and transactional data. The Oracle big data solution eliminates this restriction by tapping into diverse data sets, finding relationships in the data, and using the data for different business purposes. Beyond that critical data, however, is a potential treasure trove of nontraditional, less structured data, such as weblogs, social media, email, sensors, and photographs, which can be mined for useful information. Decreases in the cost of both storage and compute power have made it feasible to collect this data, which would have been thrown away only a few years ago. As a result, more and more companies want to include both nontraditional data and traditional enterprise data in their business intelligence analysis.

8. How Did Big Data Evolve?
As more and more people use the Internet and new technologies like smartphones, greater volumes of data are generated all over the world. Not only is this data voluminous, but it is also generated in various formats. Some examples of big data sources are: social networks, banking and financial services, e-commerce services, web-centric services, Internet search indexes, scientific searches, document searches, medical records, and weblogs. Greater volumes of data are being generated by:
- Traditional enterprise applications, such as transactional ERP data, web store transactions, and general ledger data
- Vast amounts of machine-generated/sensor data, including Call Detail Records, weblogs, smart meters, manufacturing sensors, equipment logs (often referred to as "digital exhaust"), and trading systems data
- An explosion of social data, including customer feedback streams, microblogging sites like Twitter, and social media platforms like Facebook
Because traditional databases cannot handle this data or process it instantly, a different approach to storing data became necessary.

9. Big Data: Infrastructure Requirements
Let's examine the three steps for converting data into decision-making information. First, capture or acquire raw data with the Hadoop Distributed File System (HDFS) and key-value stores (NoSQL databases). Second, use a programming paradigm called MapReduce to interpret and refine the data. Third, feed the refined and organized data into a relational database (SQL databases) to enable proper analysis, and then base business decisions on the data.

10. Oracle Integrated Software Solution
This diagram depicts, at a high level, Oracle's integrated software solution for conducting the Acquire, Organize, and Analyze & Decide phases of data conversion. Business decisions are derived from the analyzed data using your choice of tools. This Oracle solution handles data that can be classified as ranging from high information density to high data variety. This data might be organized in schemas, it might be schema-less, or it might be entirely unstructured. The software that is the focus of this self-study is Oracle Data Integrator Application Adapter for Hadoop.

11. Oracle Big Data Appliance: Hardware Components
An Oracle Big Data Appliance rack consists of these components:
- 18 Sun Fire X4270 M2 servers
- 1 Sun Rack II 1242 base
- 2 NM2-GW Sun Network QDR InfiniBand Gateway switches
- 1 NM2-36 Sun Datacenter InfiniBand Switch 36
- 1 Cisco Catalyst 4948 Ethernet switch
- 1 KVM switch
- 2 power distribution units

12. Oracle Big Data Appliance: Software Components
Oracle Big Data Appliance includes open-source components, which are packaged as system software with the appliance, and Oracle software, which is packaged as Big Data Connectors. Oracle Big Data Appliance is preinstalled and preconfigured for large-scale big data management. It uses Cloudera's Distribution including Apache Hadoop (CDH) and Oracle NoSQL Database as data management capabilities, and runs on Oracle Linux and the Oracle HotSpot JVM. It includes Cloudera Manager for cluster-wide administration and monitoring of CDH. For deep analysis of big data, it also includes an open-source distribution of the statistical environment R. In contrast to building a big data system from the ground up, the Big Data Appliance eliminates the time-consuming efforts of choosing and configuring hardware, determining the proper open-source components and versions, and integrating and tuning the overall configuration. The entire solution is preinstalled and preconfigured out of the box for high performance and high availability. The foundation software residing on Oracle Big Data Appliance includes:
- Oracle Linux
- Oracle Java VM
- The open-source Apache Hadoop distribution from Cloudera
- The open-source R distribution
The application software includes:
- Oracle NoSQL Database Enterprise Edition
- Oracle Loader for Hadoop
- Oracle Data Integrator Application Adapter for Hadoop
The focus of this self-study is Oracle Data Integrator Application Adapter for Hadoop (ODIAAH). Note that the position of blocks in the diagram is not meant to suggest that ODIAAH operates on top of Oracle Loader for Hadoop.

13. Oracle Big Data Connectors Software
Where Oracle Big Data Appliance makes it easy for organizations to acquire and organize new types of data, Oracle Big Data Connectors enables an integrated data set for analyzing all data. You can install Oracle Big Data Connectors on Oracle Big Data Appliance or on a generic Hadoop cluster. There are four Oracle Big Data Connectors:
- Oracle Loader for Hadoop uses MapReduce processing to load data efficiently into Oracle Database 11g.
- Oracle R Connector for Hadoop gives R users native, high-performance access to HDFS and the MapReduce programming framework.
- Oracle Direct Connector for Hadoop Distributed File System enables the Oracle Database SQL engine to access data seamlessly from HDFS; it allows Oracle Database to access big data on a Hadoop cluster without loading the data.
- Oracle Data Integrator Application Adapter for Hadoop enables Oracle Data Integrator to generate Hadoop MapReduce programs that extract, transform, and load big data into partitioned tables in Oracle Database, through an easy-to-use graphical interface. ODIAAH is the focus of this training.

14. Hadoop Architecture
With Hadoop MapReduce, you can easily develop applications that process massive volumes of data in parallel on large clusters of an engineered system, with fault tolerance and reliability. The framework sorts the output of the maps, which is then input to the reduce tasks. Job input and output are stored in a file system. The framework also schedules tasks, monitors them, and re-executes the failed tasks. In a single Oracle Big Data Appliance rack, Hadoop distributes the files and workload across the 18 servers in a cluster. The characteristics of the Hadoop architecture are:
- A distributed file system with redundant storage
- The Map/Reduce programming paradigm
- Highly scalable data processing
- A cost-effective model for high-volume, low-density data

15. What Is MapReduce?
MapReduce is a framework that enables distributed computation on large clusters. MapReduce is a set of code and infrastructure for parsing and building large data sets. A map function generates a key/value pair from the input data, and this data is then reduced by a function that combines all values associated with equivalent keys. The cluster has a single JobTracker, and each node in the cluster has a TaskTracker. The JobTracker schedules jobs. The jobs are broken down into component tasks that are submitted to, and executed by, the TaskTrackers. The JobTracker monitors and re-executes any failed tasks.

16. Example of a Word Count MapReduce Program
This diagram shows an example of a MapReduce program for performing a word count. The Mapper maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks that transform input data sets into intermediate data sets. The transformed intermediate data sets do not have to be of the same type as the input data sets. A given input pair may map to zero or many output pairs.

17. MapReduce Sessionization Example
Sessionization is a distinctive MapReduce use case in web data analysis. The diagram displays a scenario for dividing the available click streams into sessions within a specific range of time, and then finding different patterns in each session. You can use the session log for marketing strategies, gauging familiarity with a particular product, spotting irrelevant transactions, and so on. The MapReduce process for sessionization involves steps that map, shuffle and sort, and reduce. The input file is the session information log, which contains the userid, pageid, and time stamp details for each login. Based on this log, you can identify the number of sessions that are active at a particular time, including the duration of each session. The input files are mapped using the Mapper. The mapped pairs are sorted and shuffled, and redundancy is eliminated. These sorted pairs are further reduced to extract the required output. Note: You can repeat this MapReduce process based on the size of the input data.

18. What Is Hive?
Hive is a data warehouse infrastructure for Hadoop. It facilitates easy data summarization, ad hoc queries, and analysis of large data sets that are stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and to query the data using a SQL-like language called HiveQL. Hive is not designed for online transaction processing (OLTP) workloads and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of append-only data, such as web logs. Key benefits of Hive are scalability (scale out with more machines added dynamically to the Hadoop cluster), extensibility (with the MapReduce framework and custom scalar functions [UDFs], aggregations [UDAFs], and table functions [UDTFs]), fault tolerance, and loose coupling with its input formats.
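To make the HiveQL idea concrete, here is a minimal sketch (not from the slides; the table name, column name, and HDFS path are invented for illustration). It projects a relational structure onto text files already in HDFS and then expresses the word count from slide 16 as a single query, which Hive compiles into map and reduce tasks behind the scenes:

    -- Project a table structure onto raw text files already stored in HDFS
    CREATE EXTERNAL TABLE docs (line STRING)
    LOCATION '/user/demo/docs';

    -- The word count from slide 16, expressed in HiveQL;
    -- Hive turns this query into MapReduce jobs automatically
    SELECT word, COUNT(1) AS cnt
    FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
    GROUP BY word;

Running a query like this launches the same kind of word-count MapReduce job described earlier, without any hand-written Java programming.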

19. Review
In this module, you should have learned:
- What the term big data means
- What the Oracle Big Data products are
- The basics of the Hadoop environment and MapReduce programming

20. Quiz

MODULE 2: Oracle Data Integrator Application Adapter for Hadoop

21. Title Slide
Welcome to Module 2, Oracle Data Integrator Application Adapter for Hadoop. In this module, you are first introduced to the Oracle Data Integrator product. You then learn how the ODI Application Adapter for Hadoop can serve as an efficient alternative to manual MapReduce programming.

22. Module Topics
In this module, you learn:
- What is Oracle Data Integrator?
- What are the main components of ODI?
- The benefits of the ODI declarative design
- The flexibility and efficiency of ODI knowledge modules
- What is the ODI Application Adapter for Hadoop?
- What are the knowledge modules specifically designed for Hadoop?

23. Should We Switch from Manual MapReduce Programming?
Let's meet Simon Howell, who manages a data integration project for a fictitious company. Simon wants to know whether Oracle's ODI Application Adapter for Hadoop can replace his team's manual MapReduce programming. Simon needs to learn enough about ODIAAH to be able to assess whether it can provide a more efficient alternative to manual MapReduce programming.

24. What Is Oracle Data Integrator?
Oracle Data Integrator (ODI) is a tool for designing, deploying, and executing jobs for data movement and data transformation among data systems. ODI can read from and write to a number of different sources and targets. ODI provides predefined connectors and application adapters to connect to specific source applications, databases, and legacy systems. Oracle Data Integrator is a widely used data integration software toolset that provides:
- A declarative design approach to defining data transformation and integration processes
- Faster and simpler development and maintenance
- A unique ELT architecture (extract, then load, then transform)
- The most cost-effective solution
- A unified infrastructure to streamline data and application integration projects

25. ODI Components
Let's look at the components of ODI. The ODI repository is a database containing all of the ODI metadata. It is made up of two parts: the Master Repository and one or more Work Repositories. The Master Repository is shared among different projects and users. It contains security information related to users and their access rights, and topology information about the different technologies with which ODI can interface. By technology, we mean platforms that ODI can read from or write to, or platforms that ODI can use as an engine, such as Oracle, DB2, or flat files. The technology that is specified determines what kind of code ODI will generate. For example, the generated code can include outer joins if the specified technology supports outer joins. You can create a number of Work Repositories that sit on top of the Master Repository. This is where you define interfaces that map the movement and transformation of data from source to target. You also model your schema definitions in Work Repositories, and you organize your work as projects in Work Repositories. Your projects contain definitions of such things as business rules, packages, procedures, and knowledge module templates for connecting to specific source technologies. Work Repositories also store information about the success or failure of your job executions. The Agent is the program that schedules and runs all your interfaces of source-to-target mappings. The agent orchestrates when and where the code is executed. The interface code is compiled and runs as what is called a scenario. The agent takes the scenario from the repository and sends it to the machine on which it will run. The agent starts the job and gets log information, allowing you to see which task is taking place. The agent retrieves return codes, messages, and statistics (such as the number of rows processed and execution time), which it writes back to the Work Repository. ODI Studio is the graphical user interface in which you define everything. You can define your source-to-target interfaces, package them, manage executions, and so on. You import the application adapters, such as the Hadoop adapter, using ODI Studio. The application adapters are stored in the ODI Repository; they are not shipped as part of ODI Studio.

26. ODI Declarative Design
A powerful feature of ODI is its declarative design paradigm. You define what you want to do. You pass it through one or more of ODI's predefined templates, which describe how the job is done. These templates come either from the standard library of knowledge module templates that ships with ODI, or from one of the application adapters, like the adapter for Hadoop, which you can import into ODI. ODI's code generator automatically defines code specifically for that task, written for the particular source and target technologies, such as Hadoop, Oracle, or DB2. The screenshot shows the editor for an ODI interface, presenting a logical representation of source objects in the upper-left panel, target objects in the upper-right panel, and the properties of the target object in the panel at the bottom.

27. ODI Knowledge Modules
ODI ships with a comprehensive library of knowledge modules, which are like templates for generating technology-specific code. Knowledge modules (KMs) are at the core of the ODI architecture. They make all ODI processes modular, flexible, and extensible. They implement the actual data flows and define the templates for generating code across the multiple systems that are involved in each process. The red chevrons at the top of the slide show the six types of knowledge modules. For example, in ODI, you can define a flat file definition and define the target table in a database. You specify what you want to accomplish by mapping the flat file fields to the table columns. But at that point, you are not actually saying how you want to load the data into the table. You can now choose the Oracle SQL*Loader knowledge module, and the resulting code will push the data into the table using SQL*Loader. Or you could use the External Tables knowledge module, and the resulting code will push the data into the table using an external table. You choose either approach for how to perform the work, without changing the initial mapping of what you want done.

28. ODI Components
This is the ODI components diagram that we examined earlier. Now we add the Oracle Big Data Appliance symbol to indicate which ODI components are included in the Big Data Appliance, and which lie outside it. The ODI Repository is inside a MySQL instance. There is one Master Repository and one Work Repository. The Hive JDBC connection is predefined. ODI Studio is not included in the Big Data Appliance; if you use ODI Studio, you will need to define a connection to the ODI Repository. The application adapters are licensed with the Big Data Appliance. In ODIAAH, the agent runs the interfaces that you created and passes the work to the JobTracker. ODIAAH accesses Hive through Java Database Connectivity (JDBC). If you do a query through JDBC, such as an Insert, ODIAAH generates MapReduce code, which the agent sends to the JobTracker. The JobTracker, in turn, sends the MapReduce code to a TaskTracker.

29. ODI and Hadoop
Oracle Data Integrator Application Adapter for Hadoop is a Big Data Connector. ODI with ODIAAH is used to orchestrate the following data integration functions around the Hadoop environment:
- Moving data into Hadoop
- Transforming the data while it is inside Hadoop
- Extracting data from Hadoop
The diagram shows how ODIAAH interacts with the Hadoop environment. The HiveQL language provides a relational projection of files that are stored in HDFS, and the Hive-specific KMs in ODIAAH enable ODI to generate Hive code. Hive takes the query, generates a set of MapReduce programs, and executes them on the stored files, producing a new set of files stored in HDFS that have a relational presentation.

30. What Is ODI Application Adapter for Hadoop?
ODIAAH is a Big Data Connector that allows data integration developers to easily integrate and transform data within Hadoop by using Oracle Data Integrator. It has preconfigured, Hive-specific ODI KMs. The main functions of ODIAAH include:
- Loading data into Hadoop from local file systems and HDFS
- Performing validation and transformation of data within Hadoop
- Loading processed data from Hadoop into Oracle for further processing and report generation
The screenshot displays the KMs for Hadoop in the Projects section under the ODI Designer view.

31. ODIAAH Knowledge Modules for Hadoop
To facilitate the MapReduce implementation, ODI Application Adapter for Hadoop provides the following Hive-specific knowledge modules:
- RKM Hive can be used to reverse-engineer table and view definitions from Hive into ODI, so that they can be used in an ODI interface.
- IKM File to Hive loads data from local or HDFS files into Hive tables.
- The next two knowledge modules perform transformations within Hadoop. IKM Hive Control Append applies SQL-like transformations on the data, utilizing the Hive function library. IKM Hive Transform integrates data into a Hive target table after the data is transformed by a custom, user-defined script.
- CKM Hive can be used to define constraints to validate data being loaded into Hive. You can apply common constraints such as not null, foreign key, unique key, and primary key (see the sketch after this list).
- IKM File/Hive to Oracle (OLH) uses Oracle Loader for Hadoop to load data from Hadoop into an Oracle database. This KM will use the Delimited File InputFormat, the Hive InputFormat, or a user-defined InputFormat (Java class) to stream the data. Different output modes can be chosen (JDBC, Direct Path, Data Pump).
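As an illustration of the kind of check CKM Hive performs, here is a hedged HiveQL sketch of constraint validation, with invented table and column names (src_customer, e_customer, cust_id): rows that fail a not-null rule are routed to an error table instead of being silently loaded.

    -- Illustrative only: route rows violating a NOT NULL rule to an error table
    INSERT OVERWRITE TABLE e_customer   -- hypothetical error table
    SELECT *
    FROM src_customer                   -- hypothetical staging table
    WHERE cust_id IS NULL;

The actual checks are generated by the KM from the constraints you declare in ODI, so you describe the rule once and let the KM produce the validation code.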

32. Review
In this module, you should have learned:
- What Oracle Data Integrator is
- What the main components of ODI are
- The benefits of the ODI declarative design
- The flexibility and efficiency of ODI knowledge modules
- What the ODI Application Adapter for Hadoop is
- What knowledge modules are specifically designed for Hadoop

33. Quiz

MODULE 3: Loading Unstructured Data from File into Hive

34. Title Slide
Welcome to Module 3, Loading Unstructured Data from File into Hive. In this module, you learn how to use a Hive-specific knowledge module of ODI Application Adapter for Hadoop to move data from a file into a Hive server.

35. Module Topics
In this module, you learn:
- The capabilities of the File to Hive knowledge module
- The steps for defining a data source in Oracle Data Integrator
- How to define and test the ODI connection to a Hive server
- How to reverse-engineer data structures using the RKM Hive knowledge module
- And, through a demo, how to load data using the File to Hive knowledge module

36. Does ODIAAH Automate Loading Files into Hive?
Simon first wants to learn how ODIAAH automates the loading of unstructured data into Hadoop HDFS. Simon will be examining the ODIAAH File to Hive knowledge module for automating the task of loading data into Hadoop.

37. Demos in This Self-Study
The knowledge modules for Oracle Data Integrator Application Adapter for Hadoop were developed to simplify the processing of unstructured and structured data on Hadoop. The knowledge modules were developed to facilitate the following processes:
- Loading data into Hive (which is the subject of our first demo)
- Transforming and validating data in Hive
- Loading processed data from Hive into Oracle

38. Steps for Defining Data Sources
The Oracle Big Data Appliance provides a preconfigured ODI environment. However, beyond the predefined and preconfigured objects, you must still perform the steps shown on this slide to point to the location of your data sources.
1. Create the data server metadata, which describes the source or target data store. Data stores are used in the interface to specify the source and target.
2. Create the model metadata that contains the data stores and the association with the logical schema.
3. Define execution contexts to associate the logical and physical architecture. During execution, depending on the selected context, the logical schema is mapped to the appropriate physical schema, so that you can switch the context from development to test to deployment.
After defining your data sources, you perform the following steps to define and execute your data integration jobs:
4. Design the interface. The interface specifies the source, target, mappings, rules, and KMs. Executing the interface with a specified context migrates the data from the source physical location to the target physical location.
5. Design the package. The package enables you to create a process to execute interfaces, procedures, and other logic, as required.

39. Testing the ODI Connection to a Hive Server
ODI uses JDBC to connect to Hive. The panel on the right side of this screenshot shows the location that was specified for a JDBC driver and a JDBC URL. When you test the connection, you are prompted to select a physical agent. In this example, the local agent is selected. However, in production environments, you will probably choose a standalone agent that you defined and deployed to a remote data source. If you have the Oracle Big Data Appliance, these connection definitions pointing to the included Hive server will be predefined for you; you will only need to test the connections.

40. Import the ODIAAH Knowledge Modules
Licensed users of Oracle Data Integrator Application Adapter for Hadoop need to import the six Hive-specific knowledge modules into their ODI project.

41. Reverse-Engineer Hive Structures into ODI
You can use the RKM Hive knowledge module to reverse-engineer table and view definitions from Hive into ODI, so that they can be used in an ODI interface.

42. Loading Data into Hadoop
The ODI agent submits tasks to the Hive server via JDBC. The Hive server generates MapReduce jobs for local files outside of the Hadoop environment and/or HDFS files within the Hadoop environment. The MapReduce jobs create SQL-like relational representations of the data as Hive tables, even though the data remains stored in HDFS files.

43. Loading a Weblog File into Hive
One typical application of big data analysis involves loading massive amounts of weblog data along with a relational representation that can be queried. In this slide, the image at the upper-left corner shows an ODI package managing a series of five interfaces. The first interface loads a weblog file into Hive. The code box on the right shows the Hive code that ODI generates by way of the knowledge modules that are part of the ODI Application Adapter for Hadoop. It is a simple Hive query that creates a table, parses the source data fields, and loads the data into the Hadoop Distributed File System (HDFS). The result is shown at the bottom of the slide. Not only is the data loaded into the HDFS file system, but a Hive table representation of the data is also projected on top of the HDFS file.
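The generated code itself is not reproduced in this script, so here is a hedged HiveQL sketch of the kind of statements IKM File to Hive produces for a weblog load; the table name, columns, and file path below are invented for illustration.

    -- Illustrative sketch of IKM File to Hive output (names and path are hypothetical)
    CREATE TABLE IF NOT EXISTS weblog_raw (
      ip      STRING,
      ts      STRING,
      page    STRING,
      status  INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    -- Move the source file from HDFS into the table's storage location
    LOAD DATA INPATH '/user/demo/weblog.tsv' INTO TABLE weblog_raw;

After the load, the file's contents are queryable through the weblog_raw projection while remaining ordinary HDFS storage underneath.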

44. File to Hive (LOAD_DATA) Mapping
We now examine an ODI interface for loading data from a file to a Hive server, using the File to Hive knowledge module. This screenshot shows the Mapping view of the interface: a logical-level view with the source file represented on the left, showing all of its fields. On the right is the target datastore inside Hive, with a relational representation.

45. File to Hive (LOAD_DATA) Flow
This screenshot shows the Flow view of the same interface, showing the more physical aspects of how the mapping will be performed from source to target. Note that the IKM File to Hive knowledge module is chosen in the IKM Selector field. This choice of an ODIAAH knowledge module influences ODI's generation of Hive code. The right-hand side describes what this knowledge module performs, along with its requirements and restrictions. The left-hand side lists all of the options available within the knowledge module.

46. Demonstration of How to Load Data into Hive
Please click the link to run the first demo.

47. Review
In this module, you should have learned:
- The capabilities of the File to Hive knowledge module
- The steps for defining a data source in Oracle Data Integrator
- How to define and test the ODI connection to a Hive server
- How to reverse-engineer data structures using the RKM Hive knowledge module
- And, through a demo, how to load data using the File to Hive knowledge module

48. Quiz

MODULE 4: Transforming and Validating Data on Hive

49. Title Slide
Welcome to Module 4, Transforming and Validating Data on Hive. In this module, you learn how ODIAAH can automate the transformation and validation of data that is in HDFS.

50. Module Topics
In this module, you learn:
- How to use an ODIAAH knowledge module that utilizes the Hive function library to transform and validate data in Hive
- How to use an ODIAAH knowledge module that supports user-defined scripts to transform data in Hive
- And, through a demo, you see these Hive-specific knowledge modules being used in ODI to perform data transformation and validation using a predefined knowledge module

51. Can ODIAAH Transform and Validate the Data Loaded into Hive?
The previous module showed how ODIAAH automates the loading of data files into Hadoop. In this module, Simon learns how ODIAAH can automate the transformation and validation of data that is in HDFS.

52. Demos in This Self-Study
Previously, we learned how to use ODIAAH to load data from files into Hive. The next ODIAAH process that we examine is transforming and validating data in Hive.

53. Processing Data Inside Hadoop
There are two ODIAAH knowledge modules by which the ODI agent can interact with the Hive server to submit MapReduce jobs that transform the data once it is inside the Hadoop environment. The Hive Control Append knowledge module enables SQL-like transformations of the data, utilizing the standard Hive function library. The Hive Transform knowledge module enables you to pass in your own user-defined scripts to transform the data.

54. Hive Control Append KM Mapping
We now examine an ODI interface for transforming data within a Hive server, which is a Hive-to-Hive mapping, using the Hive Control Append knowledge module. This screenshot shows the Mapping view of the interface, with two datastores on the left and the target datastore on the right. In the target datastore, notice that there are some Hive functions available to use, such as case/when, concatenate, and cast. This example shows use of the case function: when the Customer dear field equals 0, the value is "Ms."; when the Customer dear field equals 1, the value is "Mr."

55. Hive Control Append KM Flow
This screenshot shows the Flow view of the same interface, showing the more physical aspects of the mapping from the two sources to the target, all within the same Hive environment. Note that the Hive Control Append knowledge module is chosen in the IKM Selector field.
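Expressed directly in HiveQL, the case/when rule from slide 54 would come out roughly as follows. This is a hedged sketch: the dear values (0 and 1) come from the narration, while the table and column names (src_customer, tgt_customer, cust_id, first_name, last_name) are invented for illustration.

    -- Illustrative sketch of the HiveQL that IKM Hive Control Append might generate
    INSERT OVERWRITE TABLE tgt_customer
    SELECT
      cust_id,
      CASE WHEN dear = 0 THEN 'Ms.'
           WHEN dear = 1 THEN 'Mr.'
      END AS title,                                 -- the case/when rule from the mapping
      CONCAT(first_name, ' ', last_name) AS full_name
    FROM src_customer;

In ODI you only type the CASE expression into the target column mapping; the surrounding INSERT statement is produced by the knowledge module.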

56. Can ODIAAH Also Use Customized Transformation Scripts?
Simon and his team have just learned how ODIAAH works with Hive to transform data that is stored in HDFS, using the standard SQL transformations and the library of Hive expressions. Next, the team learns how it can plug in a customized, user-defined transformation script that accepts a stream of input, transforms it, and outputs a stream.

57. Hive Transform KM Mapping
We now examine a different ODI interface for transforming data within a Hive server, using the Hive Transform knowledge module. This screenshot shows the Mapping view of the interface, with two datastores on the left and the target datastore on the right.

58. Hive Transform KM Flow
This screenshot shows the Flow view of the same interface, showing the more physical aspects of the mapping from a preprocessed weblog to a sessionized weblog. Note that the Hive Transform knowledge module is chosen in the IKM Selector field. Also note the TRANSFORM_SCRIPT_NAME and TRANSFORM_SCRIPT options for specifying your custom user-defined script for the transformation.
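This style of KM builds on Hive's TRANSFORM clause, which streams rows through an external script. The following is a minimal hedged sketch, assuming a hypothetical sessionization script named sessionize.py and invented table and column names:

    -- Illustrative sketch: stream rows through a user-defined script
    ADD FILE /tmp/sessionize.py;        -- hypothetical script, shipped to the cluster nodes

    INSERT OVERWRITE TABLE weblog_sessionized
    SELECT TRANSFORM (ip, ts, page)
           USING 'python sessionize.py'
           AS (session_id, ip, ts, page)
    FROM weblog_preprocessed;

The script reads tab-separated input rows on standard input and writes transformed rows to standard output, so it can be written in any language available on the cluster.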

59. Demonstration of How to Transform and Validate Data in Hive Using ODIAAH
Please click the link to run the second demonstration.

60. Review
In this module, you should have learned:
- How to use an ODIAAH knowledge module that utilizes the Hive function library to transform and validate data in Hive
- How to use an ODIAAH knowledge module that supports user-defined scripts to transform data in Hive
- And, through a demo, you saw these Hive-specific knowledge modules being used in ODI to perform data transformation and validation using a predefined knowledge module

61. Module 4 Quiz

MODULE 5: Loading Processed Data in Hive into Oracle

62. Title Slide
Welcome to Module 5, Loading Processed Data in Hive into Oracle. In this module, you learn how to use ODIAAH to move the processed data from HDFS into an Oracle database.

63. Module Topics
In this module, you learn:
- How to use an ODIAAH knowledge module for loading processed Hive data into an Oracle database
- And, through a demo, you see how this Hive-specific knowledge module moves Hive data into an Oracle database

64. How Do We Move the Processed Hive Data into an Oracle Database?
Previous modules showed how ODIAAH automated the loading and transformation of data into Hadoop. In this module, Simon and his team learn how to use ODIAAH to move the processed data from HDFS into an Oracle database.

65. Demos in This Self-Study
We have examined ODIAAH processes for loading data into Hive and transforming data that is in Hive. The final ODIAAH process that we examine is loading processed data in Hive into Oracle.

66. ODI Loading Data into Oracle Using OLH
To load data from Hadoop into Oracle, the ODI agent submits the jobs to the Oracle Loader for Hadoop JobClient, which then creates MapReduce jobs. The File/Hive to Oracle knowledge module uses Oracle Loader for Hadoop to load the data from Hadoop into an Oracle database.

67. File/Hive to Oracle (OLH) KM Mapping
We now examine a different ODI interface for moving the data from either an HDFS file or a Hive source into an Oracle environment, using the File/Hive to Oracle knowledge module. This knowledge module utilizes Oracle Loader for Hadoop to load the data into Oracle.

68. File/Hive to Oracle (OLH) KM Flow
This screenshot shows the Flow view of the same interface. It shows that the File/Hive to Oracle knowledge module has been selected. The knowledge module's options are listed on the left-hand side. The first option listed indicates that Data Pump will be used to copy the data over to Oracle. The description of the knowledge module is on the right-hand side.

69. Packaging the Jobs Together
You can tie your job executions together, such as a series of interfaces, by placing them in an ODI package.
1. In this package, the first interface takes a weblog and loads it into the Hadoop HDFS file system, using the File to Hive knowledge module.
2. The dates in the weblogs were not in a format that could be sorted, so the second interface transforms them to an ISO date format.
3. The third interface takes the preprocessed data and sessionizes it. That means we take these logs of web events and reconstruct who visited the site, what pages the visitors clicked and in what order, and how long they spent there. We take multiple mouse clicks and group them by specific IP address and specific timeframe.
4. Next, there is a lot of data in these weblogs, and we don't want to send all of it to Oracle, so the fourth interface filters out anything not necessary for the intended analysis (see the sketch after this list). For this step, and the previous two steps, the interfaces make use of knowledge modules that produce transformation code that runs in Hadoop.
5. Finally, the fifth interface uses the File/Hive to Oracle (OLH) knowledge module to move the data into Oracle tables.
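As a rough idea of the filtering step (step 4 above), here is a hedged HiveQL sketch with invented table and column names; in practice, the interface is built graphically in ODI and the code is generated by the KM.

    -- Illustrative sketch of the filtering interface (step 4); names are hypothetical
    INSERT OVERWRITE TABLE weblog_for_oracle
    SELECT session_id, ip, ts, page
    FROM weblog_sessionized
    WHERE page NOT LIKE '%.gif'        -- drop image requests
      AND page NOT LIKE '%.css';       -- drop stylesheet requests

Because this filtering runs as MapReduce inside Hadoop, only the reduced result set has to cross the wire into Oracle.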

70. Monitoring the Job Executions
The ODI Operator Navigator allows you to monitor the progress of your executions. In this example, the details of a session task are examined.

71. Demonstration
Please click the control to start the third and final demonstration.

72. Review
In this module, you should have learned:
- How to use an ODIAAH knowledge module for loading processed Hive data into an Oracle database
- And, through a demo, you saw how this Hive-specific knowledge module moved Hive data into an Oracle database

73. Quiz

74. What Have Simon Howell and His Team Decided to Do?
After careful investigation, Simon and his team have concluded that the ODI Application Adapter for Hadoop is indeed a more efficient alternative to manual MapReduce programming. They have decided to use ODIAAH in their first big data project!

75. Course Summary
In this self-study, you should have learned how to:
- Explain the concepts of Oracle's approach to Big Data
- Describe the main features and functions of Oracle Data Integrator Application Adapter for Hadoop
- Use ODIAAH to load unstructured data from a file (from either a local file system or an HDFS source) into Hive
- Use ODIAAH to transform and validate data that is in a Hive server
- Use ODIAAH to load processed data in Hive into an Oracle database

76. For Further Information About Oracle Application Adapter for Hadoop
For more information about Oracle Application Adapter for Hadoop and related technologies, you can use the links provided on this slide. Thank you for taking this self-study course on ODI Application Adapter for Hadoop!

End of ODIAAH self-study


More information

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera, How Apache Hadoop Complements Existing BI Systems Dr. Amr Awadallah Founder, CTO Cloudera, Inc. Twitter: @awadallah, @cloudera 2 The Problems with Current Data Systems BI Reports + Interactive Apps RDBMS

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

CISC 7610 Lecture 2b The beginnings of NoSQL

CISC 7610 Lecture 2b The beginnings of NoSQL CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone

More information

Exadata Database Machine Administration Workshop

Exadata Database Machine Administration Workshop Exadata Database Machine Administration Workshop Duration : 32 Hours This course introduces you to the Oracle Exadata Database Machine. You'll learn about the various Exadata Database Machine features

More information

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018 Big Data com Hadoop Impala, Hive e Spark VIII Sessão - SQL Bahia 03/03/2018 Diógenes Pires Connect with PASS Sign up for a free membership today at: pass.org #sqlpass Internet Live http://www.internetlivestats.com/

More information

Oracle Data Integration and OWB: New for 11gR2

Oracle Data Integration and OWB: New for 11gR2 Oracle Data Integration and OWB: New for 11gR2 C. Antonio Romero, Oracle Corporation, Redwood Shores, US Keywords: data integration, etl, real-time, data warehousing, Oracle Warehouse Builder, Oracle Data

More information

Apache Hive for Oracle DBAs. Luís Marques

Apache Hive for Oracle DBAs. Luís Marques Apache Hive for Oracle DBAs Luís Marques About me Oracle ACE Alumnus Long time open source supporter Founder of Redglue (www.redglue.eu) works for @redgluept as Lead Data Architect @drune After this talk,

More information

Big Data with Hadoop Ecosystem

Big Data with Hadoop Ecosystem Diógenes Pires Big Data with Hadoop Ecosystem Hands-on (HBase, MySql and Hive + Power BI) Internet Live http://www.internetlivestats.com/ Introduction Business Intelligence Business Intelligence Process

More information

Oracle R Technologies

Oracle R Technologies Oracle R Technologies R for the Enterprise Mark Hornick, Director, Oracle Advanced Analytics @MarkHornick mark.hornick@oracle.com Safe Harbor Statement The following is intended to outline our general

More information

Verarbeitung von Vektor- und Rasterdaten auf der Hadoop Plattform DOAG Spatial and Geodata Day 2016

Verarbeitung von Vektor- und Rasterdaten auf der Hadoop Plattform DOAG Spatial and Geodata Day 2016 Verarbeitung von Vektor- und Rasterdaten auf der Hadoop Plattform DOAG Spatial and Geodata Day 2016 Hans Viehmann Product Manager EMEA ORACLE Corporation 12. Mai 2016 Safe Harbor Statement The following

More information

Data Lake Based Systems that Work

Data Lake Based Systems that Work Data Lake Based Systems that Work There are many article and blogs about what works and what does not work when trying to build out a data lake and reporting system. At DesignMind, we have developed a

More information

Securing the Oracle BDA - 1

Securing the Oracle BDA - 1 Hello and welcome to this online, self-paced course titled Administering and Managing the Oracle Big Data Appliance (BDA). This course contains several lessons. This lesson is titled Securing the Oracle

More information

Hadoop An Overview. - Socrates CCDH

Hadoop An Overview. - Socrates CCDH Hadoop An Overview - Socrates CCDH What is Big Data? Volume Not Gigabyte. Terabyte, Petabyte, Exabyte, Zettabyte - Due to handheld gadgets,and HD format images and videos - In total data, 90% of them collected

More information

Speech 2 Part 2 Transcript: The role of DB2 in Web 2.0 and in the IOD World

Speech 2 Part 2 Transcript: The role of DB2 in Web 2.0 and in the IOD World Speech 2 Part 2 Transcript: The role of DB2 in Web 2.0 and in the IOD World Slide 1: Cover Welcome to the speech, The role of DB2 in Web 2.0 and in the Information on Demand World. This is the second speech

More information

Shark: Hive (SQL) on Spark

Shark: Hive (SQL) on Spark Shark: Hive (SQL) on Spark Reynold Xin UC Berkeley AMP Camp Aug 21, 2012 UC BERKELEY SELECT page_name, SUM(page_views) views FROM wikistats GROUP BY page_name ORDER BY views DESC LIMIT 10; Stage 0: Map-Shuffle-Reduce

More information

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance

More information

VMware vsphere Big Data Extensions Administrator's and User's Guide

VMware vsphere Big Data Extensions Administrator's and User's Guide VMware vsphere Big Data Extensions Administrator's and User's Guide vsphere Big Data Extensions 1.1 This document supports the version of each product listed and supports all subsequent versions until

More information

Talend Open Studio for Data Quality. User Guide 5.5.2

Talend Open Studio for Data Quality. User Guide 5.5.2 Talend Open Studio for Data Quality User Guide 5.5.2 Talend Open Studio for Data Quality Adapted for v5.5. Supersedes previous releases. Publication date: January 29, 2015 Copyleft This documentation is

More information

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals

More information

Oracle Big Data Appliance

Oracle Big Data Appliance Oracle Big Data Appliance Software User's Guide Release 4 (4.6) E77518-02 November 2016 Describes the Oracle Big Data Appliance software available to administrators and software developers. Oracle Big

More information

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Blended Learning Outline: Cloudera Data Analyst Training (171219a) Blended Learning Outline: Cloudera Data Analyst Training (171219a) Cloudera Univeristy s data analyst training course will teach you to apply traditional data analytics and business intelligence skills

More information

Course Description. Audience. Prerequisites. At Course Completion. : Course 40074A : Microsoft SQL Server 2014 for Oracle DBAs

Course Description. Audience. Prerequisites. At Course Completion. : Course 40074A : Microsoft SQL Server 2014 for Oracle DBAs Module Title Duration : Course 40074A : Microsoft SQL Server 2014 for Oracle DBAs : 4 days Course Description This four-day instructor-led course provides students with the knowledge and skills to capitalize

More information

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype? Big data hype? Big Data: Hype or Hallelujah? Data Base and Data Mining Group of 2 Google Flu trends On the Internet February 2010 detected flu outbreak two weeks ahead of CDC data Nowcasting http://www.internetlivestats.com/

More information

Microsoft Big Data and Hadoop

Microsoft Big Data and Hadoop Microsoft Big Data and Hadoop Lara Rubbelke @sqlgal Cindy Gross @sqlcindy 2 The world of data is changing The 4Vs of Big Data http://nosql.mypopescu.com/post/9621746531/a-definition-of-big-data 3 Common

More information

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours) Bigdata Fundamentals Day1: (2hours) 1. Understanding BigData. a. What is Big Data? b. Big-Data characteristics. c. Challenges with the traditional Data Base Systems and Distributed Systems. 2. Distributions:

More information

Data Storage Infrastructure at Facebook

Data Storage Infrastructure at Facebook Data Storage Infrastructure at Facebook Spring 2018 Cleveland State University CIS 601 Presentation Yi Dong Instructor: Dr. Chung Outline Strategy of data storage, processing, and log collection Data flow

More information

Hadoop. copyright 2011 Trainologic LTD

Hadoop. copyright 2011 Trainologic LTD Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides

More information

Hyperion Interactive Reporting Reports & Dashboards Essentials

Hyperion Interactive Reporting Reports & Dashboards Essentials Oracle University Contact Us: +27 (0)11 319-4111 Hyperion Interactive Reporting 11.1.1 Reports & Dashboards Essentials Duration: 5 Days What you will learn The first part of this course focuses on two

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Winter 215 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Due next Thursday evening Will send out reimbursement codes later

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

DATA SCIENCE USING SPARK: AN INTRODUCTION

DATA SCIENCE USING SPARK: AN INTRODUCTION DATA SCIENCE USING SPARK: AN INTRODUCTION TOPICS COVERED Introduction to Spark Getting Started with Spark Programming in Spark Data Science with Spark What next? 2 DATA SCIENCE PROCESS Exploratory Data

More information

Going beyond MapReduce

Going beyond MapReduce Going beyond MapReduce MapReduce provides a simple abstraction to write distributed programs running on large-scale systems on large amounts of data MapReduce is not suitable for everyone MapReduce abstraction

More information

Exadata Database Machine Administration Workshop NEW

Exadata Database Machine Administration Workshop NEW Exadata Database Machine Administration Workshop NEW What you will learn: This course introduces students to Oracle Exadata Database Machine. Students learn about the various Exadata Database Machine features

More information

Big Data Analytics using Apache Hadoop and Spark with Scala

Big Data Analytics using Apache Hadoop and Spark with Scala Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important

More information

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training:: Module Title Duration : Cloudera Data Analyst Training : 4 days Overview Take your knowledge to the next level Cloudera University s four-day data analyst training course will teach you to apply traditional

More information

Big Data Hadoop Course Content

Big Data Hadoop Course Content Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux

More information

Lecture 7 (03/12, 03/14): Hive and Impala Decisions, Operations & Information Technologies Robert H. Smith School of Business Spring, 2018

Lecture 7 (03/12, 03/14): Hive and Impala Decisions, Operations & Information Technologies Robert H. Smith School of Business Spring, 2018 Lecture 7 (03/12, 03/14): Hive and Impala Decisions, Operations & Information Technologies Robert H. Smith School of Business Spring, 2018 K. Zhang (pic source: mapr.com/blog) Copyright BUDT 2016 758 Where

More information

Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine

Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine Version 4.11 Last Updated: 1/10/2018 Please note: This appliance is for testing and educational purposes only;

More information

HADOOP FRAMEWORK FOR BIG DATA

HADOOP FRAMEWORK FOR BIG DATA HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further

More information

Oracle Big Data SQL brings SQL and Performance to Hadoop

Oracle Big Data SQL brings SQL and Performance to Hadoop Oracle Big Data SQL brings SQL and Performance to Hadoop Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data SQL, Hadoop, Big Data Appliance, SQL, Oracle, Performance, Smart Scan Introduction

More information

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com HI, I M SUJEE MANIYAM Founder / Principal @ ElephantScale

More information

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela

More information

April Copyright 2013 Cloudera Inc. All rights reserved.

April Copyright 2013 Cloudera Inc. All rights reserved. Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here April 2014 Analytic Workloads on

More information

Drawing the Big Picture

Drawing the Big Picture Drawing the Big Picture Multi-Platform Data Architectures, Queries, and Analytics Philip Russom TDWI Research Director for Data Management August 26, 2015 Sponsor 2 Speakers Philip Russom TDWI Research

More information