BIG DATA ANALYTICS WITH ORACLE
César Pérez

INTRODUCTION

Big data analytics is the process of examining big data to uncover hidden patterns, unknown correlations and other useful information that can be used to make better decisions. With big data analytics, data scientists and others can analyze huge volumes of data that conventional analytics and business intelligence solutions can't touch. Consider this: it's possible that your organization could accumulate (if it hasn't already) billions of rows of data with hundreds of millions of data combinations in multiple data stores and abundant formats. High-performance analytics is necessary to process that much data in order to figure out what's important and what isn't. Enter big data analytics.

Big data is now a reality: the volume, variety and velocity of data coming into your organization continue to reach unprecedented levels. This phenomenal growth means that not only must you understand big data in order to decipher the information that truly counts, but you also must understand the possibilities of what you can do with big data analytics. Using big data analytics you can extract only the relevant information from terabytes, petabytes and exabytes, and analyze it to transform your business decisions for the future. Becoming proactive with big data analytics isn't a one-time endeavor; it is more of a culture change, a new way of gaining ground by freeing your analysts and decision makers to meet the future with sound knowledge and insight. Business intelligence (BI), on the other hand, provides standard business reports, ad hoc reports, OLAP and even alerts and notifications based on analytics. This ad hoc analysis looks at the static past, which has its purpose in a limited number of situations.

Oracle supports big data implementations, including Hadoop. Through Oracle and Hadoop it is possible to work in all steps of the analytical process: identify/formulate the problem, prepare the data, explore the data, transform and select, build the model, validate the model, deploy the model, and evaluate/monitor results. This book presents the work possibilities that Oracle offers in the modern fields of Big Data, Business Intelligence and Analytics. The most important Oracle tools are presented for processing and analyzing large volumes of data in an orderly manner. In turn, these tools also allow you to extract the knowledge contained in the data.


INDEX

INTRODUCTION

BIG DATA CONCEPTS
1.1 DEFINING BIG DATA
1.2 THE IMPORTANCE OF BIG DATA
1.3 THE NEED FOR BIG DATA
1.4 KEY TECHNOLOGIES FOR EXTRACTING BUSINESS VALUE FROM BIG DATA
    Information Management for Big Data

HADOOP
2.1 BUILDING A BIG DATA PLATFORM
    Acquire Big Data
    Organize Big Data
    Analyze Big Data
    Solution Spectrum
2.2 HADOOP
2.3 HADOOP COMPONENTS
    Benefits of Hadoop
    Limitations of Hadoop
2.4 GET DATA INTO HADOOP
2.5 HADOOP USES
    Prime Business Applications for Hadoop
2.6 HADOOP CHALLENGES

ORACLE BIG DATA APPLIANCE
3.1 INTRODUCTION
3.2 ORACLE BIG DATA APPLIANCE BASIC CONFIGURATION
3.3 AUTO SERVICE REQUEST (ASR)
3.4 ORACLE ENGINEERED SYSTEMS FOR BIG DATA
3.5 SOFTWARE FOR BIG DATA
    Software Component Overview
3.6 ACQUIRING DATA FOR ANALYSIS
    Hadoop Distributed File System
    Apache Hive
    Oracle NoSQL Database
3.7 ORGANIZING BIG DATA
3.8 MAPREDUCE
3.9 ORACLE BIG DATA CONNECTORS
    Oracle SQL Connector for Hadoop Distributed File System
    Oracle Loader for Hadoop
    Oracle Data Integrator Application Adapter for Hadoop
    Oracle XQuery for Hadoop
3.10 ORACLE R ADVANCED ANALYTICS FOR HADOOP
3.11 ORACLE R SUPPORT FOR BIG DATA
3.12 ANALYZING AND VISUALIZING BIG DATA
3.13 ORACLE BUSINESS INTELLIGENCE FOUNDATION SUITE
    Enterprise BI Platform
    OLAP Analytics
    Scorecard and Strategy Management
    Mobile BI
    Enterprise Reporting
3.14 ORACLE BIG DATA LITE VIRTUAL MACHINE

ADMINISTERING ORACLE BIG DATA APPLIANCE
4.1 MONITORING MULTIPLE CLUSTERS USING ORACLE ENTERPRISE MANAGER
    Using the Enterprise Manager Web Interface
    Using the Enterprise Manager Command-Line Interface
4.2 MANAGING OPERATIONS USING CLOUDERA MANAGER
    Monitoring the Status of Oracle Big Data Appliance
    Performing Administrative Tasks
    Managing CDH Services With Cloudera Manager
4.3 USING HADOOP MONITORING UTILITIES
    Monitoring MapReduce Jobs
    Monitoring the Health of HDFS
4.4 USING CLOUDERA HUE TO INTERACT WITH HADOOP
4.5 ABOUT THE ORACLE BIG DATA APPLIANCE SOFTWARE
    Software Components
    Unconfigured Software
    Allocating Resources Among Services
4.6 STOPPING AND STARTING ORACLE BIG DATA APPLIANCE
    Prerequisites
    Stopping Oracle Big Data Appliance
    Starting Oracle Big Data Appliance
4.7 MANAGING ORACLE BIG DATA SQL
    Adding and Removing the Oracle Big Data SQL Service
    Allocating Resources to Oracle Big Data SQL
4.8 SWITCHING FROM YARN TO MAPREDUCE
4.9 SECURITY ON ORACLE BIG DATA APPLIANCE
    About Predefined Users and Groups
    About User Authentication
    About Fine-Grained Authorization
    About On-Disk Encryption
    Port Numbers Used on Oracle Big Data Appliance
    About Puppet Security
4.10 AUDITING ORACLE BIG DATA APPLIANCE
    About Oracle Audit Vault and Database Firewall
    Setting Up the Oracle Big Data Appliance Plug-in
    Monitoring Oracle Big Data Appliance
4.11 COLLECTING DIAGNOSTIC INFORMATION FOR ORACLE CUSTOMER SUPPORT
4.12 AUDITING DATA ACCESS ACROSS THE ENTERPRISE
    Configuration
    Capturing Activity
    Ad Hoc Reporting
    Summary

ORACLE BIG DATA SQL
5.1 INTRODUCTION
5.2 SQL ON HADOOP
5.3 SQL ON MORE THAN HADOOP
5.4 UNIFYING METADATA
5.5 OPTIMIZING PERFORMANCE
5.6 SMART SCAN FOR HADOOP
5.7 ORACLE SQL DEVELOPER & DATA MODELER SUPPORT FOR ORACLE BIG DATA SQL
    Setting up Connections to Hive
    Using the Hive Connection
    Create Big Data SQL-enabled Tables Using Oracle Data Modeler
    Edit the Table Definitions
    Query All Your Data
5.8 USING ORACLE BIG DATA SQL FOR DATA ACCESS
    About Oracle External Tables
    About the Access Drivers for Oracle Big Data SQL
    About Smart Scan Technology
    About Data Security with Oracle Big Data SQL
5.9 INSTALLING ORACLE BIG DATA SQL
    Prerequisites for Using Oracle Big Data SQL
    Performing the Installation
    Running the Post-Installation Script for Oracle Big Data SQL
    Running the bds-exa-install Script
    bds-exa-install Syntax
5.10 CREATING EXTERNAL TABLES FOR ACCESSING BIG DATA
    About the Basic CREATE TABLE Syntax
    Creating an External Table for a Hive Table
    Obtaining Information About a Hive Table
    Using the CREATE_EXTDDL_FOR_HIVE Function
    Developing a CREATE TABLE Statement for ORACLE_HIVE
    Creating an External Table for HDFS Files
    Using the Default Access Parameters with ORACLE_HDFS
    Overriding the Default ORACLE_HDFS Settings
    Accessing Avro Container Files
5.11 ABOUT THE EXTERNAL TABLE CLAUSE
    TYPE Clause
    DEFAULT DIRECTORY Clause
    LOCATION Clause
    REJECT LIMIT Clause
    ACCESS PARAMETERS Clause
5.12 ABOUT DATA TYPE CONVERSIONS
5.13 QUERYING EXTERNAL TABLES
5.14 ABOUT ORACLE BIG DATA SQL ON ORACLE EXADATA DATABASE MACHINE
    Starting and Stopping the Big Data SQL Agent
    About the Common Directory
    Common Configuration Properties
    bigdata.properties
    bigdata-log4j.properties
    About the Cluster Directory
    About Permissions

HIVE USER DEFINED FUNCTIONS (UDFS)
6.1 INTRODUCTION
    The Three Little UDFs
6.2 THREE LITTLE HIVE UDFS: EXAMPLE 1
    Introduction
    Extending UDF
6.3 THREE LITTLE HIVE UDFS: EXAMPLE 2
    Introduction
    Extending GenericUDTF
    Using the UDTF
6.4 THREE LITTLE HIVE UDFS: EXAMPLE 3
    Introduction
    Prefix Sum: Moving Average without State
    Orchestrating Partial Aggregation
    Aggregation Buffers: Connecting Algorithms with Execution
    Using the UDAF
    Summary

ORACLE NOSQL
7.1 INTRODUCTION
7.2 DATA MODEL
7.3 API
7.4 CREATE, REMOVE, UPDATE, AND DELETE
7.5 ITERATION
7.6 BULK OPERATION API
7.7 ADMINISTRATION
7.8 ARCHITECTURE
7.9 IMPLEMENTATION
    Storage Nodes
    Client Driver
7.10 PERFORMANCE
7.11 CONCLUSION

Chapter 1. BIG DATA CONCEPTS

1.1 DEFINING BIG DATA

Big data typically refers to the following types of data:

Traditional enterprise data includes customer information from CRM systems, transactional ERP data, web store transactions, and general ledger data.

Machine-generated/sensor data includes Call Detail Records (CDRs), weblogs, smart meters, manufacturing sensors, equipment logs (often referred to as digital exhaust), and trading systems data.

Social data includes customer feedback streams, micro-blogging sites like Twitter, and social media platforms like Facebook.

The McKinsey Global Institute estimates that data volume is growing 40% per year, and will grow 44x between 2009 and 2020. But while it's often the most visible parameter, volume of data is not the only characteristic that matters. In fact, there are four key characteristics that define big data (Figure 1-1):

Volume. Machine-generated data is produced in much larger quantities than non-traditional data. For instance, a single jet engine can generate 10 TB of data in 30 minutes. With more than 25,000 airline flights per day, the daily volume of just this single data source runs into the petabytes. Smart meters and heavy industrial equipment like oil refineries and drilling rigs generate similar data volumes, compounding the problem.

Velocity. Social media data streams, while not as massive as machine-generated data, produce a large influx of opinions and relationships valuable to customer relationship management. Even at 140 characters per tweet, the high velocity (or frequency) of Twitter data ensures large volumes (over 8 TB per day).

Variety. Traditional data formats tend to be relatively well defined by a data schema and change slowly. In contrast, non-traditional data formats exhibit a dizzying rate of change. As new services are added, new sensors deployed, or new marketing campaigns executed, new data types are needed to capture the resultant information.

Value. The economic value of different data varies significantly. Typically there is good information hidden amongst a larger body of non-traditional data; the challenge is identifying what is valuable and then transforming and extracting that data for analysis.

To make the most of big data, enterprises must evolve their IT infrastructures to handle these new high-volume, high-velocity, high-variety sources of data and integrate them with the pre-existing enterprise data to be analyzed.

Big data is a relative term describing a situation where the volume, velocity and variety of data exceed an organization's storage or compute capacity for accurate and timely decision making. Some of this data is held in transactional data stores, the byproduct of fast-growing online activity. Machine-to-machine interactions, such as metering, call detail records, environmental sensing and RFID systems, generate their own tidal waves of data. All these forms of data are expanding, and they are coupled with fast-growing streams of unstructured and semistructured data from social media. That's a lot of data, but it is the reality for many organizations. By some estimates, organizations in all sectors have at least 100 terabytes of data, many with more than a petabyte. Even scarier, many predict this number to double every six months going forward, said futurist Thornton May, speaking at a webinar.

Figure 1-1

1.2 THE IMPORTANCE OF BIG DATA

When big data is distilled and analyzed in combination with traditional enterprise data, enterprises can develop a more thorough and insightful understanding of their business, which can lead to enhanced productivity, a stronger competitive position and greater innovation, all of which can have a significant impact on the bottom line.

For example, in the delivery of healthcare services, management of chronic or long-term conditions is expensive. Use of in-home monitoring devices to measure vital signs and monitor progress is just one way that sensor data can be used to improve patient health and reduce both office visits and hospital admissions.

Manufacturing companies deploy sensors in their products to return a stream of telemetry. In the automotive industry, systems such as General Motors' OnStar or Renault's R-Link deliver communications, security and navigation services. Perhaps more importantly, this telemetry also reveals usage patterns, failure rates and other opportunities for product improvement that can reduce development and assembly costs.

The proliferation of smartphones and other GPS devices offers advertisers an opportunity to target consumers when they are in close proximity to a store, a coffee shop or a restaurant. This opens up new revenue for service providers and offers many businesses a chance to target new customers.

Retailers usually know who buys their products. Use of social media and web log files from their ecommerce sites can help them understand who didn't buy and why they chose not to, information not available to them today. This can enable much more effective micro customer segmentation and targeted marketing campaigns, as well as improve supply chain efficiencies through more accurate demand planning.

Finally, social media sites like Facebook and LinkedIn simply wouldn't exist without big data. Their business model requires a personalized experience on the web, which can only be delivered by capturing and using all the available data about a user or member.

1.3 THE NEED FOR BIG DATA

The term Big Data can be interpreted in many different ways. We defined Big Data as conforming to the volume, velocity, and variety attributes that characterize it. Note that Big Data solutions aren't a replacement for your existing warehouse solutions, and in our humble opinion, any vendor suggesting otherwise likely doesn't have the full gamut of experience or understanding of your investments in the traditional side of information management.

We think it's best to start out this section with a couple of key Big Data principles we want you to keep in mind, before outlining some considerations as to when you use Big Data technologies, namely:

Big Data solutions are ideal for analyzing not only raw structured data, but semistructured and unstructured data from a wide variety of sources.

Big Data solutions are ideal when all, or most, of the data needs to be analyzed versus a sample of the data, or when a sampling of data isn't nearly as effective as a larger set of data from which to derive analysis.

Big Data solutions are ideal for iterative and exploratory analysis when business measures on data are not predetermined.

When it comes to solving information management challenges using Big Data technologies, we suggest you consider the following:

Is the reciprocal of the traditional analysis paradigm appropriate for the business task at hand? Better yet, can you see a Big Data platform complementing what you currently have in place for analysis and achieving synergy with existing solutions for better business outcomes? For example, typically, data bound for the analytic warehouse has to be cleansed, documented, and trusted before it's neatly placed into a strict warehouse schema (and, of course, if it can't fit into a traditional row and column format, it can't even get to the warehouse in most cases). In contrast, a Big Data solution is not only going to leverage data not typically suitable for a traditional warehouse environment, and in massive volumes, but it's going to give up some of the formalities and strictness of the data. The benefit is that you can preserve the fidelity of the data and gain access to mountains of information for exploration and discovery of business insights before running it through the due diligence that you're accustomed to; the data can then be included as a participant in a cyclic system, enriching the models in the warehouse.

Big Data is well suited for solving information challenges that don't natively fit within a traditional relational database approach for handling the problem at hand.

It's important that you understand that conventional database technologies are an important, and relevant, part of an overall analytic solution. In fact, they become even more vital when used in conjunction with your Big Data platform. A good analogy here is your left and right hands; each offers individual strengths and optimizations for the task at hand. For example, if you've ever played baseball, you know that one hand is better at throwing and the other at catching. It's likely the case that each hand could try to do the task it isn't a natural fit for, but it's very awkward (try it; better yet, film yourself trying it and you will see what we mean). What's more, you don't see baseball players catching with one hand, stopping, taking off their gloves, and throwing with the same hand either. The left and right hands of a baseball player work in unison to deliver the best results. This is a loose analogy to traditional database and Big Data technologies: your information platform shouldn't go into the future without these two important entities working together, because the outcomes of a cohesive analytic ecosystem deliver premium results in the same way your coordinated hands do for baseball.

There exists a class of problems that don't natively belong in traditional databases, at least not at first. And there's data that we're not sure we want in the warehouse, because perhaps we don't know if it's rich in value, it's unstructured, or it's too voluminous. In many cases, we can't find out the value per byte of the data until after we spend the effort and money to put it into the warehouse; but we want to be sure that data is worth saving and has a high value per byte before investing in it.

Some organizations will need to rethink their data management strategies when they face hundreds of gigabytes of data for the first time. Others may be fine until they reach tens or hundreds of terabytes. But whenever an organization reaches the critical mass defined as big data for itself, change is inevitable. Organizations are moving away from viewing data integration as a standalone discipline to a mindset where data integration, data quality, metadata management and data governance are designed and used together. The traditional extract-transform-load (ETL) approach has been augmented with one that minimizes data movement and improves processing power. Organizations are also embracing a holistic, enterprise view that treats data as a core enterprise asset. Finally, many organizations are retreating from reactive data management in favor of a managed and ultimately more proactive and predictive approach to managing information.

The true value of big data lies not just in having it, but in harvesting it for fast, fact-based decisions that lead to real business value. For example, disasters such as the recent financial meltdown and mortgage crisis might have been prevented with risk computation

on historical data at a massive scale. Financial institutions were essentially taking bundles of thousands of loans and looking at them as one. We now have the computing power to assess the probability of risk at the individual level. Every sector can benefit from this type of analysis.

Big data provides gigantic statistical samples, which enhance analytic tool results. The general rule is that the larger the data sample, the more accurate the statistics and other products of the analysis. However, organizations have been limited to using subsets of their data, or they were constrained to simplistic analysis because the sheer volume of data overwhelmed their IT platforms. What good is it to collect and store terabytes of data if you can't analyze it in full context, or if you have to wait hours or days to get results to urgent questions? On the other hand, not all business questions are better served by bigger data. Now, you have choices to suit both scenarios:

Incorporate massive data volumes in analysis. If the business question is one that will get better answers by analyzing all the data, go for it. The game-changing technologies that extract real business value from big data, all of it, are here today. One approach is to apply high-performance analytics to analyze massive amounts of data using technologies such as grid computing, in-database processing and in-memory analytics.

Determine upfront which data is relevant. The traditional modus operandi has been to store everything; only when you query it do you discover what is relevant. Oracle provides the ability to apply analytics on the front end to determine data relevance based on enterprise context. This analysis can be used to determine which data should be included in analytical processes and which can be placed in low-cost storage for later availability if needed.

1.4 KEY TECHNOLOGIES FOR EXTRACTING BUSINESS VALUE FROM BIG DATA

Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery and/or analysis. Furthermore, this analysis is needed in real time or near-real time, and it must be affordable, secure and achievable.

Fortunately, a number of technology advancements have occurred or are under way that make it possible to benefit from big data and big data analytics. For starters, storage, server processing and memory capacity have become abundant and cheap. The cost of a gigabyte of storage has dropped from approximately $16 in February 2000 to less than $0.07 today. Storage and processing technologies have been designed specifically for large data volumes. Computing models such as parallel processing, clustering, virtualization, grid environments and cloud computing, coupled with high-speed connectivity, have redefined what is possible.

Here are three key technologies that can help you get a handle on big data and, even more importantly, extract meaningful business value from it:

Information management for big data. Manage data as a strategic, core asset, with ongoing process control for big data analytics.

High-performance analytics for big data. Gain rapid insights from big data and the ability to solve increasingly complex problems using more data.

Flexible deployment options for big data. Choose between options for on-premises or hosted, software-as-a-service (SaaS) approaches for big data and big data analytics.

Information Management for Big Data

Many organizations already struggle to manage their existing data. Big data will only add complexity to the issue. What data should be stored, and how long should we keep it? What data should be included in analytical processing, and how do we properly prepare it for analysis? What is the proper mix of traditional and emerging technologies? Big data will also intensify the need for data quality and governance, for embedding analytics into operational systems, and for addressing issues of security, privacy and regulatory compliance. Everything that was problematic before will just grow larger.

Oracle provides the management and governance capabilities that enable organizations to effectively manage the entire life cycle of big data analytics, from data to decision. These capabilities include data governance, metadata management, analytical model management, run-time management and deployment management. With Oracle, this governance is an ongoing process, not just a one-time project. Proven methodology-driven approaches help organizations build processes based on their specific data maturity model. Oracle technology and implementation services enable organizations to fully exploit and govern their information assets to achieve competitive differentiation and sustained business success. Three key components work together in this realm:

Unified data management capabilities, including data governance, data integration, data quality and metadata management.

Complete analytics management, including model management, model deployment, monitoring and governance of the analytics information asset.

Effective decision management capabilities to easily embed information and analytical results directly into business processes while managing the necessary business rules, workflow and event logic.

High-performance, scalable solutions slash the time and effort required to filter,

aggregate and structure big data. By combining data integration, data quality and master data management in a unified development and delivery environment, organizations can maximize each stage of the data management process. Oracle is unique in incorporating high-performance analytics and analytical intelligence into the data management process for highly efficient modeling and faster results. For instance, you can analyze all the information within an organization, such as email, product catalogs, wiki articles and blogs, extract important concepts from that information, and look at the links among them to identify and assign weights to millions of terms and concepts. This organizational context is then used to assess data as it streams into the organization, churns out of internal systems, or sits in offline data stores. This upfront analysis identifies the relevant data that should be pushed to the enterprise data warehouse or to high-performance analytics.

Chapter 2. HADOOP

2.1 BUILDING A BIG DATA PLATFORM

As with data warehousing, web stores or any IT platform, an infrastructure for big data has unique requirements. In considering all the components of a big data platform, it is important to remember that the end goal is to easily integrate your big data with your enterprise data so that you can conduct deep analytics on the combined data set. The requirements in a big data infrastructure span data acquisition, data organization and data analysis.

2.1.1 Acquire Big Data

The acquisition phase is one of the major changes in infrastructure from the days before big data. Because big data refers to data streams of higher velocity and higher variety, the infrastructure required to support the acquisition of big data must deliver low, predictable latency both in capturing data and in executing short, simple queries; be able to handle very high transaction volumes, often in a distributed environment; and support flexible, dynamic data structures.

NoSQL databases are frequently used to acquire and store big data. They are well suited for dynamic data structures and are highly scalable. The data stored in a NoSQL database is typically of a high variety because the systems are intended to simply capture all data without categorizing and parsing the data into a fixed schema. For example, NoSQL databases are often used to collect and store social media data. While customer-facing applications frequently change, underlying storage structures are kept simple. Instead of designing a schema with relationships between entities, these simple structures often just contain a major key to identify the data point and a content container holding the relevant data (such as a customer id and a customer profile), as sketched below. This simple and dynamic structure allows changes to take place without costly reorganizations at the storage layer (such as adding new fields to the customer profile).
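The following minimal Java sketch is not tied to any particular NoSQL product; the class and field names are illustrative. It shows the "major key plus opaque content container" idea: records are captured under an identifying key, and new attributes can be added later without reorganizing the store.

import java.util.HashMap;
import java.util.Map;

// Minimal illustration of the "major key + content container" structure described above.
// The store maps an identifying key (e.g. a customer id) to a profile container;
// new profile fields can be added without changing the storage structure.
public class KeyValueSketch {
    public static void main(String[] args) {
        Map<String, Map<String, String>> store = new HashMap<>();

        // Capture a data point: the major key identifies it, the container holds the payload.
        Map<String, String> profile = new HashMap<>();
        profile.put("name", "Alice Example");
        profile.put("segment", "mobile-shopper");
        store.put("customer:1234", profile);

        // Adding a new field later requires no schema change or storage reorganization.
        store.get("customer:1234").put("last_channel", "twitter");

        System.out.println(store.get("customer:1234"));
    }
}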

2.1.2 Organize Big Data

In classical data warehousing terms, organizing data is called data integration. Because there is such a high volume of big data, there is a tendency to organize data at its initial destination location, thus saving both time and money by not moving around large volumes of data. The infrastructure required for organizing big data must be able to process and manipulate data in the original storage location; support very high throughput (often in batch) to deal with large data processing steps; and handle a large variety of data formats, from unstructured to structured.

Hadoop is a new technology that allows large data volumes to be organized and processed while keeping the data on the original data storage cluster. The Hadoop Distributed File System (HDFS) is the long-term storage system for web logs, for example. These web logs are turned into browsing behavior (sessions) by running MapReduce programs on the cluster and generating aggregated results on the same cluster. These aggregated results are then loaded into a relational DBMS.

2.1.3 Analyze Big Data

Since data is not always moved during the organization phase, the analysis may also be done in a distributed environment, where some data will stay where it was originally stored and be transparently accessed from a data warehouse. The infrastructure required for analyzing big data must be able to support deeper analytics, such as statistical analysis and data mining, on a wider variety of data types stored in diverse systems; scale to extreme data volumes; deliver faster response times driven by changes in behavior; and automate decisions based on analytical models. Most importantly, the infrastructure must be able to integrate analysis on the combination of big data and traditional enterprise data. New insight comes not just from analyzing new data, but from analyzing it within the context of the old to provide new perspectives on old problems. For example, analyzing inventory data from a smart vending machine in combination with the events calendar for the venue in which the vending machine is located will dictate the optimal product mix and replenishment schedule for the vending machine.

2.1.4 Solution Spectrum

Many new technologies have emerged to address the IT infrastructure requirements outlined above. At last count, there were over 120 open source key-value databases for acquiring and storing big data, while Hadoop has emerged as the primary system for organizing big data, and relational databases maintain their footprint as a data warehouse and expand their reach into less structured data sets to analyze big data. These new systems have created a divided solution spectrum (Figure 2-1) comprised of:

Not Only SQL (NoSQL) solutions: developer-centric specialized systems.

SQL solutions: the world typically equated with the manageability, security and trusted nature of relational database management systems (RDBMS).

NoSQL systems are designed to capture all data without categorizing and parsing it upon entry into the system, and therefore the data is highly varied. SQL systems, on the other hand, typically place data in well-defined structures and impose metadata on the data captured to ensure consistency and validate data types.

Distributed file systems and transaction (key-value) stores are primarily used to capture data and are generally in line with the requirements discussed earlier in this chapter. To interpret and distill information from the data in these solutions, a programming paradigm called MapReduce is used. MapReduce programs are custom-written programs that run in parallel on the distributed data nodes.

The key-value stores or NoSQL databases are the OLTP databases of the big data world; they are optimized for very fast data capture and simple query patterns. NoSQL databases are able to provide very fast performance because the data that is captured is quickly stored with a single identifying key rather than being interpreted and cast into a schema. By doing so, a NoSQL database can rapidly store large numbers of transactions. However, due to the changing nature of the data in the NoSQL database, any data organization effort requires programming to interpret the storage logic used. This, combined with the lack of support for complex query patterns, makes it difficult for end users to distill value out of data in a NoSQL database.

To get the most from NoSQL solutions and turn them from specialized, developer-centric solutions into solutions for the enterprise, they must be combined with SQL solutions into a single proven infrastructure that meets the manageability and security requirements of

today's enterprises.

Figure 2-1

2.2 HADOOP

Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing. For starters, let's take a quick look at some of those terms and what they mean.

Open-source software. Open-source software differs from commercial software due to the broad and open network of developers that create and manage the programs. Traditionally, it's free to download, use and contribute to, though more and more commercial versions of Hadoop are becoming available.

Framework. In this case, it means everything you need to develop and run your software applications is provided: programs, tool sets, connections, etc.

Distributed. Data is divided and stored across multiple computers, and computations can be run in parallel across multiple connected machines.

Massive storage. The framework can store huge amounts of data by breaking the data into blocks and storing it on clusters of lower-cost commodity hardware.

Faster processing. How? Hadoop can process large amounts of data in parallel across clusters of tightly connected low-cost computers for quick results.

With the ability to economically store and process any kind of data (not just numerical or structured data), organizations of all sizes are taking cues from the corporate web giants that have used Hadoop to their advantage (Google, Yahoo, Etsy, eBay, Twitter, etc.), and they're asking "What can Hadoop do for me?"

Since its inception, Hadoop has become one of the most talked-about technologies. Why? One of the top reasons (and why it was invented) is its ability to handle huge amounts of data, any kind of data, quickly. With volumes and varieties of data growing each day, especially from social media and automated sensors, that's a key consideration for most organizations. Other reasons include:

Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.

Computing power. Its distributed computing model can quickly process very large volumes of data. The more computing nodes you use, the more processing power you have.

Scalability. You can easily grow your system simply by adding more nodes. Little administration is required.

Storage flexibility. Unlike traditional relational databases, you don't have to preprocess data before storing it. And that includes unstructured data like text, images and videos. You can store as much data as you want and decide how to use it later.

Inherent data protection and self-healing capabilities. Data and application

processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. And Hadoop automatically stores multiple copies of all data.

2.3 HADOOP COMPONENTS

Hadoop components have funny names, which is sort of understandable knowing that Hadoop was the name of a yellow toy elephant owned by the son of one of its inventors. Here's a quick rundown on names you may hear. Currently three core components are included with your basic download from the Apache Software Foundation (Figure 2-2):

HDFS: the Java-based distributed file system that can store all kinds of data without prior organization.

MapReduce: a software programming model for processing large sets of data in parallel.

YARN: a resource management framework for scheduling and handling resource requests from distributed applications.

Figure 2-2

Other components that have achieved top-level Apache project status and are available include:

Pig: a platform for manipulating data stored in HDFS. It consists of a compiler for MapReduce programs and a high-level language called Pig Latin. It provides a way to perform data extractions, transformations and loading, and basic analysis without having to write MapReduce programs.

Hive: a data warehousing and SQL-like query language that presents data in the form of tables. Hive programming is similar to database programming. (It was initially developed by Facebook.)

HBase: a nonrelational, distributed database that runs on top of Hadoop. HBase tables can serve as input and output for MapReduce jobs.

Zookeeper: an application that coordinates distributed processes.

Ambari: a web interface for managing, configuring and testing Hadoop services and components.

Flume: software that collects, aggregates and moves large amounts of streaming data into HDFS.

Sqoop: a connection and transfer mechanism that moves data between Hadoop and relational databases.

Oozie: a Hadoop job scheduler.

In addition, commercial software distributions of Hadoop are growing. Two of the most prominent (Cloudera and Hortonworks) are startups formed by the framework's inventors, and there are plenty of others entering the Hadoop sphere. With distributions from software vendors, you pay for their version of the framework and receive additional software components, tools, training, documentation and other services.

2.3.1 Benefits of Hadoop

There are several reasons that 88 percent of organizations consider Hadoop an opportunity:

It's inexpensive. Hadoop uses lower-cost commodity hardware to reliably store large quantities of data.

Hadoop provides flexibility to scale out by simply adding more nodes.

You can upload unstructured data without having to schematize it first. Dump any type of data into Hadoop and apply structure as needed for consuming applications.

If capacity is available, Hadoop will start multiple copies of the same task for the same block of data. If a node goes down, jobs are automatically redirected to other working servers.

2.3.2 Limitations of Hadoop

Management and high-availability capabilities for rationalizing Hadoop clusters with data center infrastructure are only now starting to emerge.

Data security is fragmented, but new tools and technologies are surfacing.

MapReduce is very batch-oriented and not suitable for iterative, multi-step analytics processing.

The Hadoop ecosystem does not have easy-to-use, full-featured tools for data integration, data cleansing, governance and metadata. Especially lacking are tools for data quality and standardization.

Skilled professionals with specialized Hadoop skills are in short supply and at a premium.

MapReduce is file-intensive. Because the nodes don't intercommunicate except through sorts and shuffles, iterative algorithms require multiple map-shuffle/sort-reduce phases to complete. This creates multiple files between MapReduce phases and is very inefficient for advanced analytic computing.

Hadoop definitely provides economical data storage. But the next step is to manage the data and use analytics to quickly identify previously unknown insights.

2.4 GET DATA INTO HADOOP

There are numerous ways to get data into Hadoop. Here are just a few:

You can load files to the file system using simple Java commands, and HDFS takes care of making multiple copies of data blocks and distributing those blocks over multiple nodes in the Hadoop system (see the sketch below). If you have a large number of files, a shell script that runs multiple put commands in parallel will speed up the process. You don't have to write MapReduce code.

Create a cron job to scan a directory for new files and put them in HDFS as they show up. This is useful for things like downloading email at regular intervals.

Mount HDFS as a file system and simply copy or write files there.

Use Sqoop to import structured data from a relational database to HDFS, Hive and HBase. It can also extract data from Hadoop and export it to relational databases and data warehouses.

Use Flume to continuously load data from logs into Hadoop.

Use third-party vendor connectors.
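As a concrete illustration of the first option, the following Java sketch uses the standard Hadoop FileSystem API to copy a local log file into HDFS. The namenode address and the paths are placeholders chosen for the example, not values from this book.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of the "simple Java commands" route into HDFS: copy a local file into the
// cluster and let HDFS handle block replication and distribution.
public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");  // placeholder namenode

        try (FileSystem fs = FileSystem.get(conf)) {
            Path local = new Path("/var/log/web/access.log");
            Path remote = new Path("/user/etl/weblogs/access.log");
            fs.copyFromLocalFile(local, remote);   // HDFS replicates the blocks automatically
            System.out.println("Loaded " + remote + ", exists = " + fs.exists(remote));
        }
    }
}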

2.5 HADOOP USES

Going beyond its original goal of searching millions (or billions) of web pages and returning relevant results, many organizations are looking to Hadoop as their next big data platform. Here are some of the more popular uses for the framework today:

Low-cost storage and active data archive. The modest cost of commodity hardware makes Hadoop useful for storing and combining big data such as transactional, social media, sensor, machine, scientific and clickstream data. The low-cost storage lets you keep information that is not currently critical but could become useful later for business analytics.

Staging area for a data warehouse and analytics store. One of the most prevalent uses is to stage large amounts of raw data for loading into an enterprise data warehouse (EDW) or an analytical store for activities such as advanced analytics, query and reporting. Organizations are looking at Hadoop to handle new types of data (e.g., unstructured), as well as to offload some historical data from their EDWs.

Data lake. Hadoop is often used to store large amounts of data without the constraints introduced by schemas commonly found in the SQL-based world. It is used as a low-cost compute-cycle platform that supports processing ETL and data quality jobs in parallel using hand-coded or commercial data management technologies. Refined results can then be passed to other systems (e.g., EDWs, analytic marts) as needed.

Sandbox for discovery and analysis. Because Hadoop was designed to deal with volumes of data in a variety of shapes and forms, it can enable analytics. Big data analytics on Hadoop can help run the current business more efficiently, uncover new opportunities and derive next-level competitive advantage. The sandbox setup provides a quick and perfect opportunity to innovate with minimal investment.

Certainly Hadoop provides an economical platform for storing and processing large and diverse data. The next logical step is to transform and manage the diverse data and use analytics to quickly identify undiscovered insights.

2.5.1 Prime Business Applications for Hadoop

Hadoop is providing a data storage and analytical processing environment for a variety of business uses, including:

Financial services: Insurance underwriting, fraud detection, risk mitigation and customer behavior analytics.

Retail: Location-based marketing, personalized recommendations and website optimization.

Telecommunications: Bandwidth allocation, network quality analysis and call detail records analysis.

Health and life sciences: Genomics data in medical trials and prescription adherence.

Manufacturing: Logistics and root cause analysis for production failover.

Oil and gas and other utilities: Predict asset failures, improve asset utilization and monitor equipment safety.

Government: Sentiment analysis, fraud detection and smart city initiatives.

2.6 HADOOP CHALLENGES

First of all, MapReduce is not a good match for all problems. It's good for simple requests for information and problems that can be broken up into independent units, but it is inefficient for iterative and interactive analytic tasks. MapReduce is file-intensive. Because the nodes don't intercommunicate except through sorts and shuffles, iterative algorithms require multiple map-shuffle/sort-reduce phases to complete. This creates multiple files between MapReduce phases and is very inefficient for advanced analytic computing.

Second, there's a talent gap. Because it is a relatively new technology, it is difficult to find entry-level programmers who have sufficient Java skills to be productive with MapReduce. This talent gap is one reason distribution providers are racing to put relational (SQL) technology on top of Hadoop: it is much easier to find programmers with SQL skills than MapReduce skills. And Hadoop administration seems part art and part science, requiring low-level knowledge of operating systems, hardware and Hadoop kernel settings.

Other challenges include fragmented data security, though new tools and technologies are surfacing. And Hadoop does not have easy-to-use, full-featured tools for data management, data cleansing, governance and metadata. Especially lacking are tools for data quality and standardization.

Chapter 3. ORACLE BIG DATA APPLIANCE

3.1 INTRODUCTION

Oracle Big Data Appliance is an engineered system that provides a high-performance, secure platform for running diverse workloads on Hadoop and NoSQL systems, while integrating tightly with Oracle Database and Oracle Exadata Database Machine.

Companies have been making business decisions for decades based on transactional data stored in relational databases. Beyond that critical data is a potential treasure trove of less structured data: weblogs, social media, email, sensors, and photographs that can be mined for useful information. Oracle offers a broad and integrated portfolio of products to help you acquire and organize these diverse data sources and analyze them alongside your existing data to find new insights and capitalize on hidden relationships. This chapter shows how Oracle helps you acquire, organize, and analyze your big data.

Oracle Big Data Appliance is an engineered system of hardware and software optimized to capture and analyze the massive volumes of unstructured data generated by social media feeds, email, web logs, photographs, smart meters, sensors, and similar devices. Oracle Big Data Appliance is engineered to work with Oracle Exadata Database Machine and Oracle Exalytics In-Memory Machine to provide the most advanced analysis of all data types, with enterprise-class performance, availability, supportability, and security. The Oracle Linux operating system and Cloudera's Distribution including Apache Hadoop (CDH) underlie all other software components installed on Oracle Big Data Appliance.

3.2 ORACLE BIG DATA APPLIANCE BASIC CONFIGURATION

The Oracle Big Data Appliance Configuration Generation Utility acquires information from you, such as IP addresses and software preferences, that is required for deploying Oracle Big Data Appliance. After guiding you through a series of pages, the utility generates a set of configuration files. These files help automate the deployment process and ensure that Oracle Big Data Appliance is configured to your specifications.

Choose the option that describes the type of hardware installation you are configuring:

One or more new Big Data Appliance racks being installed: You enter all new data for this choice.

One or more Big Data Appliance racks being added to an existing group of Big Data Appliances: This choice activates the Import button, so that you can select the BdaDeploy.json file that was used to configure the last rack in the group.

One or two in-rack expansion kits being added to a Big Data Appliance starter rack: This choice activates the Import button, so that you can select the BdaDeploy.json file that was last used to configure the rack (either the starter rack or one in-rack expansion kit).

An in-process configuration using a saved master.xml configuration file: This choice activates the Import button, so that you can select the master.xml file and continue the configuration.

The next figure shows the Customer Details page of the Oracle Big Data Appliance Configuration Generation Utility.


3.3 AUTO SERVICE REQUEST (ASR)

Auto Service Request (ASR) is designed to automatically open service requests when specific Oracle Big Data Appliance hardware faults occur. ASR detects faults in the most common server components, such as disks, fans, and power supplies, and automatically opens a service request when a fault occurs. ASR monitors only server components and does not detect all possible faults.

ASR is not a replacement for other monitoring mechanisms, such as SMTP and SNMP alerts, within the customer data center. It is a complementary mechanism that expedites and simplifies the delivery of replacement hardware. ASR should not be used for downtime events in high-priority systems. For high-priority events, contact Oracle Support Services directly.

When ASR detects a hardware problem, ASR Manager submits a service request to Oracle Support Services. In many cases, Oracle Support Services can begin work on resolving the issue before the administrator is even aware the problem exists. An email message is sent to both the My Oracle Support account and the technical contact for Oracle Big Data Appliance to notify them of the creation of the service request.

A service request may not be filed automatically on some occasions. This can happen because of the unreliable nature of the SNMP protocol or a loss of connectivity to ASR Manager. Oracle recommends that customers continue to monitor their systems for faults and call Oracle Support Services if they do not receive notice that a service request has been filed automatically.

The next figure shows the network connections between ASR and Oracle Big Data Appliance.


3.4 ORACLE ENGINEERED SYSTEMS FOR BIG DATA

Oracle Big Data Appliance is an engineered system comprising both hardware and software components. The hardware is optimized to run the enhanced big data software components. Oracle Big Data Appliance delivers:

A complete and optimized solution for big data

Single-vendor support for both hardware and software

An easy-to-deploy solution

Tight integration with Oracle Database and Oracle Exadata Database Machine

Oracle provides a big data platform that captures, organizes, and supports deep analytics on extremely large, complex data streams flowing into your enterprise from many data sources. You can choose the best storage and processing location for your data depending on its structure, workload characteristics, and end-user requirements.

Oracle Database enables all data to be accessed and analyzed by a large user community using identical methods. By adding Oracle Big Data Appliance in front of Oracle Database, you can bring new sources of information to an existing data warehouse. Oracle Big Data Appliance is the platform for acquiring and organizing big data so that the relevant portions with true business value can be analyzed in Oracle Database.

For maximum speed and efficiency, Oracle Big Data Appliance can be connected to Oracle Exadata Database Machine running Oracle Database. Oracle Exadata Database Machine provides outstanding performance in hosting data warehouses and transaction processing databases. Moreover, Oracle Exadata Database Machine can be connected to Oracle Exalytics In-Memory Machine for the best performance of business intelligence and planning applications. The InfiniBand connections between these engineered systems provide high parallelism, which enables high-speed data transfer for batch or query workloads. The next figure shows the relationships among these engineered systems.


3.5 SOFTWARE FOR BIG DATA

The Oracle Linux operating system and Cloudera's Distribution including Apache Hadoop (CDH) underlie all other software components installed on Oracle Big Data Appliance. CDH is an integrated stack of components that have been tested and packaged to work together.

CDH has a batch processing infrastructure that can store files and distribute work across a set of computers. Data is processed on the same computer where it is stored. In a single Oracle Big Data Appliance rack, CDH distributes the files and workload across 18 servers, which compose a cluster. Each server is a node in the cluster.

The software framework consists of these primary components:

File system: The Hadoop Distributed File System (HDFS) is a highly scalable file system that stores large files across multiple servers. It achieves reliability by replicating data across multiple servers without RAID technology. It runs on top of the Linux file system on Oracle Big Data Appliance.

MapReduce engine: The MapReduce engine provides a platform for the massively parallel execution of algorithms written in Java. Oracle Big Data Appliance 3.0 runs YARN by default.

Administrative framework: Cloudera Manager is a comprehensive administrative tool for CDH. In addition, you can use Oracle Enterprise Manager to monitor both the hardware and software on Oracle Big Data Appliance.

Apache projects: CDH includes Apache projects for MapReduce and HDFS, such as Hive, Pig, Oozie, ZooKeeper, HBase, Sqoop, and Spark.

Cloudera applications: Oracle Big Data Appliance installs all products included in Cloudera Enterprise Data Hub Edition, including Impala, Search, and Navigator.

CDH is written in Java, and Java is the language for application development. However, several CDH utilities and other software available on Oracle Big Data Appliance provide graphical, web-based, and other language interfaces for ease of use.

Software Component Overview

The major software components perform three basic tasks:

Acquire

Organize

Analyze and visualize

The best tool for each task depends on the density of the information and the degree of structure. The next figure shows the relationships among the tools, identifies the tasks that they perform, and illustrates the Oracle Big Data Appliance software structure.

3.6 ACQUIRING DATA FOR ANALYSIS

Databases used for online transaction processing (OLTP) are the traditional data sources for data warehouses. The Oracle solution enables you to analyze traditional data stores together with big data in the same Oracle data warehouse. Relational data continues to be an important source of business intelligence, although it runs on separate hardware from Oracle Big Data Appliance.

Oracle Big Data Appliance provides these facilities for capturing and storing big data:

Hadoop Distributed File System

Apache Hive

Oracle NoSQL Database

3.6.1 Hadoop Distributed File System

Cloudera's Distribution including Apache Hadoop (CDH) on Oracle Big Data Appliance uses the Hadoop Distributed File System (HDFS). HDFS stores extremely large files containing record-oriented data. On Oracle Big Data Appliance, HDFS splits large data files into chunks of 256 megabytes (MB) and replicates each chunk across three different nodes in the cluster. The size of the chunks and the number of replications are configurable, as the sketch below illustrates.

Chunking enables HDFS to store files that are larger than the physical storage of one server. It also allows the data to be processed in parallel across multiple computers with multiple processors, all working on data that is stored locally. Replication ensures the high availability of the data: if a server fails, the other servers automatically take over its workload. HDFS is typically used to store all types of big data.
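The following Java sketch shows where those settings live. The dfs.blocksize and dfs.replication properties and the FileSystem.create overload are standard Hadoop; the host name, path, and the 256 MB / three-replica values simply mirror the defaults described above and are otherwise illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch showing that the chunk (block) size and replication factor are configurable,
// both as client/cluster defaults and per file.
public class HdfsBlockSettings {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");        // placeholder
        conf.set("dfs.blocksize", String.valueOf(256L * 1024 * 1024));       // 256 MB chunks
        conf.set("dfs.replication", "3");                                    // three replicas

        try (FileSystem fs = FileSystem.get(conf)) {
            // Per-file override: explicit block size and replication for this file only.
            try (FSDataOutputStream out = fs.create(
                    new Path("/user/etl/big/file.dat"),
                    true, 4096, (short) 3, 256L * 1024 * 1024)) {
                out.writeUTF("example record");
            }
        }
    }
}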

3.6.2 Apache Hive

Hive is an open-source data warehouse that supports data summarization, ad hoc querying, and analysis of data stored in HDFS. It uses a SQL-like language called HiveQL. An interpreter generates MapReduce code from the HiveQL queries. By storing data in Hive, you can avoid writing MapReduce programs in Java.

Hive is a component of CDH and is always installed on Oracle Big Data Appliance. Oracle Big Data Connectors can access Hive tables.
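As one way an application might issue HiveQL without writing MapReduce code, the sketch below uses the HiveServer2 JDBC driver from Java. The connection URL, credentials, and the weblogs table are assumptions made for the example, not part of the Oracle Big Data Appliance documentation.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch of issuing HiveQL from Java through the HiveServer2 JDBC driver.
public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://bda-node03.example.com:10000/default", "oracle", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL is compiled into MapReduce (or YARN) jobs behind the scenes.
            String hql = "SELECT status, COUNT(*) AS hits "
                       + "FROM weblogs GROUP BY status ORDER BY hits DESC";
            try (ResultSet rs = stmt.executeQuery(hql)) {
                while (rs.next()) {
                    System.out.println(rs.getString("status") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }
}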

3.6.3 Oracle NoSQL Database

Oracle NoSQL Database is a distributed key-value database built on the proven storage technology of Berkeley DB Java Edition. Whereas HDFS stores unstructured data in very large files, Oracle NoSQL Database indexes the data and supports transactions. But unlike Oracle Database, which stores highly structured data, Oracle NoSQL Database has relaxed consistency rules, no schema structure, and only modest support for joins, particularly across storage nodes.

NoSQL databases, or "Not Only SQL" databases, have developed over the past decade specifically for storing big data. However, they vary widely in implementation. Oracle NoSQL Database has these characteristics:

Uses a system-defined, consistent hash index for data distribution

Supports high availability through replication

Provides single-record, single-operation transactions with relaxed consistency guarantees

Provides a Java API

Oracle NoSQL Database is designed to provide highly reliable, scalable, predictable, and available data storage. The key-value pairs are stored in shards or partitions (that is, subsets of data) based on a primary key. Data on each shard is replicated across multiple storage nodes to ensure high availability. Oracle NoSQL Database supports fast querying of the data, typically by key lookup. An intelligent driver links the NoSQL database with client applications and provides access to the requested key-value pair on the storage node with the lowest latency.

Oracle NoSQL Database includes hashing and balancing algorithms to ensure proper data distribution and optimal load balancing, replication management components to handle storage node failure and recovery, and an easy-to-use administrative interface to monitor the state of the database.

Oracle NoSQL Database is typically used to store customer profiles and similar data for identifying and analyzing big data. For example, you might log in to a website and see advertisements based on your stored customer profile (a record in Oracle NoSQL Database) and your recent activity on the site (web logs currently streaming into HDFS). A minimal example of the Java API appears below.

Oracle NoSQL Database is an optional component of Oracle Big Data Appliance and runs on a separate cluster from CDH.
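The sketch below illustrates the Java API mentioned above using the classic key-value calls from the oracle.kv package; the store name, helper host, and the customer key are placeholders chosen for the example.

import java.util.Arrays;
import oracle.kv.KVStore;
import oracle.kv.KVStoreConfig;
import oracle.kv.KVStoreFactory;
import oracle.kv.Key;
import oracle.kv.Value;
import oracle.kv.ValueVersion;

// Sketch: store a customer profile under a major key and read it back by key lookup.
public class NoSqlProfileExample {
    public static void main(String[] args) {
        KVStore store = KVStoreFactory.getStore(
                new KVStoreConfig("kvstore", "bda-node02.example.com:5000"));

        // The major path identifies the record; the minor path names the container.
        Key key = Key.createKey(Arrays.asList("customer", "1234"),
                                Arrays.asList("profile"));
        store.put(key, Value.createValue("segment=mobile-shopper".getBytes()));

        ValueVersion vv = store.get(key);   // single-key lookup served by the driver
        System.out.println(new String(vv.getValue().getValue()));

        store.close();
    }
}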


3.7 ORGANIZING BIG DATA

Oracle Big Data Appliance provides several ways of organizing, transforming, and reducing big data for analysis:

- MapReduce
- Oracle Big Data Connectors
- Oracle R Support for Big Data

3.8 MAPREDUCE

The MapReduce engine provides a platform for the massively parallel execution of algorithms written in Java. MapReduce uses a parallel programming model for processing data on a distributed system. It can process vast amounts of data quickly and can scale linearly. It is particularly effective as a mechanism for batch processing of unstructured and semistructured data. MapReduce abstracts lower-level operations into computations over a set of keys and values.

Although big data is often described as unstructured, incoming data always has some structure. However, it does not have a fixed, predefined structure when written to HDFS. Instead, MapReduce creates the desired structure as it reads the data for a particular job. The same data can have many different structures imposed by different MapReduce jobs.

A simplified description of a MapReduce job is the successive alternation of two phases: the Map phase and the Reduce phase. Each Map phase applies a transform function over each record in the input data to produce a set of records expressed as key-value pairs. The output from the Map phase is input to the Reduce phase. In the Reduce phase, the Map output records are sorted into key-value sets, so that all records in a set have the same key value. A reducer function is applied to all the records in a set, and a set of output records is produced as key-value pairs. The Map phase is logically run in parallel over each record, whereas the Reduce phase is run in parallel over all key values.

Oracle Big Data Appliance uses the Yet Another Resource Negotiator (YARN) implementation of MapReduce by default. You have the option of using classic MapReduce (MRv1) instead. You cannot use both implementations in the same cluster; you can activate either the MapReduce or the YARN service.
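The key-value flow described above can be sketched with Hadoop Streaming, which lets ordinary shell commands stand in for the map and reduce functions. The streaming jar path below assumes a CDH parcel install, and the input and output directories are hypothetical:

  # mapper.sh: emit one word per line; each word becomes a key.
  cat > mapper.sh <<'EOF'
  #!/bin/bash
  tr -s '[:space:]' '\n'
  EOF

  # reducer.sh: input arrives sorted and grouped by key after the shuffle,
  # so counting adjacent duplicates yields a per-word count.
  cat > reducer.sh <<'EOF'
  #!/bin/bash
  uniq -c
  EOF
  chmod +x mapper.sh reducer.sh

  # Run the word count as a MapReduce job on the cluster.
  hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
      -files mapper.sh,reducer.sh \
      -input  /user/oracle/weblogs \
      -output /user/oracle/wordcount \
      -mapper mapper.sh \
      -reducer reducer.sh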

3.9 ORACLE BIG DATA CONNECTORS

Oracle Big Data Connectors facilitate data access between data stored in CDH and Oracle Database. The connectors are licensed separately from Oracle Big Data Appliance and include:

- Oracle SQL Connector for Hadoop Distributed File System
- Oracle Loader for Hadoop
- Oracle XQuery for Hadoop
- Oracle R Advanced Analytics for Hadoop
- Oracle Data Integrator Application Adapter for Hadoop

3.9.1 Oracle SQL Connector for Hadoop Distributed File System

Oracle SQL Connector for Hadoop Distributed File System (Oracle SQL Connector for HDFS) provides read access to HDFS from an Oracle database using external tables. An external table is an Oracle Database object that identifies the location of data outside of the database. Oracle Database accesses the data by using the metadata provided when the external table was created. By querying the external tables, users can access data stored in HDFS as if that data were stored in tables in the database. External tables are often used to stage data to be transformed during a database load.

You can use Oracle SQL Connector for HDFS to:

- Access data stored in HDFS files
- Access Hive tables
- Access Data Pump files generated by Oracle Loader for Hadoop
- Load data extracted and transformed by Oracle Data Integrator
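The connector ships a command-line ExternalTable tool that generates the external table definition for you. The sketch below shows a typical invocation; the property names follow the oracle.hadoop.exttab prefix documented for the connector, but the exact set required depends on the source type, and the table, directory, and connection details here are hypothetical:

  # Create an Oracle external table over delimited text files in HDFS.
  hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
      oracle.hadoop.exttab.ExternalTable \
      -D oracle.hadoop.exttab.tableName=WEBLOGS_EXT \
      -D oracle.hadoop.exttab.sourceType=text \
      -D oracle.hadoop.exttab.dataPaths=/user/oracle/weblogs/* \
      -D oracle.hadoop.exttab.defaultDirectory=WEBLOG_DIR \
      -D oracle.hadoop.exttab.connection.url=jdbc:oracle:thin:@dbhost:1521/orcl \
      -D oracle.hadoop.exttab.connection.user=SCOTT \
      -createTable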

3.9.2 Oracle Loader for Hadoop

Oracle Loader for Hadoop is an efficient and high-performance loader for fast movement of data from a Hadoop cluster into a table in an Oracle database. It can read and load data from a wide variety of formats. Oracle Loader for Hadoop partitions the data and transforms it into a database-ready format in Hadoop. It optionally sorts records by primary key before loading the data or creating output files. The load runs as a MapReduce job on the Hadoop cluster.
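A load is submitted as an ordinary MapReduce job. The sketch below assumes the connector's standard jar location and a configuration file that names the input format, the target table, and the database connection; all names here are illustrative, so check the invocation against your connector release:

  # Run Oracle Loader for Hadoop as a MapReduce job on the cluster.
  # The XML file supplies the input format, target table, and connection details.
  hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
      -conf /home/oracle/olh_weblogs_conf.xml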

3.9.3 Oracle Data Integrator Application Adapter for Hadoop

Oracle Data Integrator (ODI) extracts, transforms, and loads data into Oracle Database from a wide range of sources. In ODI, a knowledge module (KM) is a code template dedicated to a specific task in the data integration process. You use Oracle Data Integrator Studio to load, select, and configure the KMs for your particular application. More than 150 KMs are available to help you acquire data from a wide range of third-party databases and other data repositories. You only need to load a few KMs for any particular job. Oracle Data Integrator Application Adapter for Hadoop contains the KMs specifically for use with big data.

3.9.4 Oracle XQuery for Hadoop

Oracle XQuery for Hadoop runs transformations expressed in the XQuery language by translating them into a series of MapReduce jobs, which are executed in parallel on the Hadoop cluster. The input data can be located in HDFS or Oracle NoSQL Database. Oracle XQuery for Hadoop can write the transformation results to HDFS, Oracle NoSQL Database, or Oracle Database.

3.10 ORACLE R ADVANCED ANALYTICS FOR HADOOP

Oracle R Advanced Analytics for Hadoop is a collection of R packages that provides:

- Interfaces to work with Hive tables, the Apache Hadoop compute infrastructure, the local R environment, and database tables
- Predictive analytic techniques written in R or Java as Hadoop MapReduce jobs that can be applied to data in HDFS files

Using simple R functions, you can copy data between R memory, the local file system, HDFS, and Hive. You can write mappers and reducers in R, schedule these R programs to execute as Hadoop MapReduce jobs, and return the results to any of those locations.

3.11 ORACLE R SUPPORT FOR BIG DATA

R is an open-source language and environment for statistical analysis and graphing. It provides linear and nonlinear modeling, standard statistical methods, time-series analysis, classification, clustering, and graphical data displays. Thousands of open-source packages are available in the Comprehensive R Archive Network (CRAN) for a spectrum of applications, such as bioinformatics, spatial statistics, and financial and marketing analysis. The popularity of R has increased as its functionality matured to rival that of costly proprietary statistical packages.

Analysts typically use R on a PC, which limits the amount of data and the processing power available for analysis. Oracle eliminates this restriction by extending the R platform to directly leverage Oracle Big Data Appliance. Oracle R Distribution is installed on all nodes of Oracle Big Data Appliance.

Oracle R Advanced Analytics for Hadoop provides R users with high-performance, native access to HDFS and the MapReduce programming framework, which enables R programs to run as MapReduce jobs on vast amounts of data. Oracle R Advanced Analytics for Hadoop is included in the Oracle Big Data Connectors.

Oracle R Enterprise is a component of the Oracle Advanced Analytics option to Oracle Database. It provides:

- Transparent access to database data for data preparation and statistical analysis from R
- Execution of R scripts at the database server, accessible from both R and SQL
- A wide range of predictive and data mining in-database algorithms

Oracle R Enterprise enables you to store the results of your analysis of big data in an Oracle database, where they can be accessed for display in dashboards and applications.

Both Oracle R Advanced Analytics for Hadoop and Oracle R Enterprise make Oracle Database and the Hadoop computational infrastructure available to statistical users without requiring them to learn the native programming languages of either one.

3.12 ANALYZING AND VISUALIZING BIG DATA

After big data is transformed and loaded into Oracle Database, you can use the full spectrum of Oracle business intelligence solutions and decision support products to further analyze and visualize all your data.

3.13 ORACLE BUSINESS INTELLIGENCE FOUNDATION SUITE

Oracle Business Intelligence Foundation Suite is a comprehensive, modern, and market-leading BI platform. It provides the industry's best-in-class platform for ad hoc query and analysis, dashboards, enterprise reporting, mobile analytics, scorecards, multidimensional OLAP, and predictive analytics on an architecturally integrated business intelligence foundation. This enabling technology for custom and packaged business intelligence applications helps organizations drive innovation and optimize processes while delivering extreme performance.

Oracle Business Intelligence Foundation Suite includes the following capabilities:

- Enterprise BI Platform: Transform IT from a cost center to a business asset by standardizing on a single, scalable BI platform that empowers business users to easily create their own reports with information relevant to them.

- OLAP Analytics: The industry-leading multidimensional online analytical processing (OLAP) server, designed to help business users forecast likely business performance levels and deliver what-if analyses for varying conditions.

- Scorecard and Strategy Management: Define strategic goals and objectives that can be cascaded to every level of the enterprise, enabling employees to understand their impact on achieving success and align their actions accordingly.

- Mobile BI: Business doesn't stop just because you're on the go. Make sure critical information reaches you wherever you are.

- Enterprise Reporting: Provides a single, web-based platform for authoring, managing, and delivering interactive reports, dashboards, and all types of highly formatted documents.

3.14 ORACLE BIG DATA LITE VIRTUAL MACHINE

Oracle Big Data Appliance Version 2.5 was released recently. This release includes some great new features, including a continued security focus (on-disk encryption and automated configuration of Sentry for data authorization) and updates to Cloudera's Distribution including Apache Hadoop and Cloudera Manager.

With each BDA release, we have a new release of Oracle Big Data Lite Virtual Machine. Oracle Big Data Lite provides an integrated environment to help you get started with the Oracle Big Data platform. Many Oracle Big Data platform components have been installed and configured, allowing you to begin using the system right away. The following components are included on Oracle Big Data Lite Virtual Machine v2.5:

- Oracle Enterprise Linux 6.4
- Oracle Database 12c Release 1 Enterprise Edition
- Cloudera's Distribution including Apache Hadoop (CDH 4.6)
- Cloudera Manager
- Cloudera Enterprise Technology, including:
  - Cloudera RTQ (Impala 1.2.3)
  - Cloudera RTS (Search 1.2)
- Oracle Big Data Connectors 2.5:
  - Oracle SQL Connector for HDFS
  - Oracle Loader for Hadoop
  - Oracle Data Integrator 11g
  - Oracle R Advanced Analytics for Hadoop
  - Oracle XQuery for Hadoop
- Oracle NoSQL Database Enterprise Edition 12cR1 (2.1.54)
- Oracle JDeveloper 11g
- Oracle SQL Developer 4.0
- Oracle Data Integrator 12cR1
- Oracle R Distribution

Oracle Big Data Lite Virtual Machine is an Oracle VM VirtualBox appliance that contains many key components of Oracle's big data platform, including Oracle Database 12c Enterprise Edition, Oracle Advanced Analytics, Oracle NoSQL Database, Cloudera's Distribution including Apache Hadoop, Oracle Data Integrator 12c, Oracle Big Data Connectors, and more. It has been configured to run on a developer-class computer; all Big Data Lite needs is a couple of cores and about 5 GB of memory to run (this means that your computer should have at least 8 GB of total memory).

With Big Data Lite, you can develop your big data applications and then deploy them to the Oracle Big Data Appliance. Or, you can use Big Data Lite as a client to the BDA during application development.

Chapter 4. ADMINISTERING ORACLE BIG DATA APPLIANCE

4.1 MONITORING MULTIPLE CLUSTERS USING ORACLE ENTERPRISE MANAGER

An Oracle Enterprise Manager plug-in enables you to use the same system monitoring tool for Oracle Big Data Appliance as you use for Oracle Exadata Database Machine or any other Oracle Database installation. With the plug-in, you can view the status of the installed software components in tabular or graphic presentations, and start and stop these software services. You can also monitor the health of the network and the rack components.

Oracle Enterprise Manager enables you to monitor all Oracle Big Data Appliance racks on the same InfiniBand fabric. It provides summary views of both the rack hardware and the software layout of the logical clusters.

4.1.1 Using the Enterprise Manager Web Interface

After opening the Oracle Enterprise Manager web interface, logging in, and selecting a target cluster, you can drill down into these primary areas:

- InfiniBand network: Network topology and status for InfiniBand switches and ports. See the next figure.
- Hadoop cluster: Software services for HDFS, MapReduce, and ZooKeeper.
- Oracle Big Data Appliance rack: Hardware status including server hosts, Oracle Integrated Lights Out Manager (Oracle ILOM) servers, power distribution units (PDUs), and the Ethernet switch.

The next figure shows a small section of the cluster home page (the YARN page in Oracle Enterprise Manager).

To monitor Oracle Big Data Appliance using Oracle Enterprise Manager:

1. Download and install the plug-in.
2. Log in to Oracle Enterprise Manager as a privileged user.
3. From the Targets menu, choose Big Data Appliance to view the Big Data page. You can see the overall status of the targets already discovered by Oracle Enterprise Manager.
4. Select a target cluster to view its detail pages.
5. Expand the target navigation tree to display the components. Information is available at all levels.
6. Select a component in the tree to display its home page.
7. To change the display, choose an item from the drop-down menu at the top left of the main display area.

4.1.2 Using the Enterprise Manager Command-Line Interface

The Enterprise Manager command-line interface (emcli) is installed on Oracle Big Data Appliance along with all the other software. It provides the same functionality as the web interface. You must provide credentials to connect to Oracle Management Server.
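For instance (the administrator account and target name pattern below are hypothetical), a short session might log in, synchronize the client with the Oracle Management Server, and list the discovered Big Data Appliance targets:

  # Authenticate against the Oracle Management Server and refresh the client state.
  emcli login -username=sysman
  emcli sync

  # List targets whose names match the appliance naming pattern (any type).
  emcli get_targets -targets="bda1%:%"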

4.2 MANAGING OPERATIONS USING CLOUDERA MANAGER

Cloudera Manager is installed on Oracle Big Data Appliance to help you with Cloudera's Distribution including Apache Hadoop (CDH) operations. Cloudera Manager provides a single administrative interface to all Oracle Big Data Appliance servers configured as part of the Hadoop cluster.

Cloudera Manager simplifies the performance of these administrative tasks:

- Monitor jobs and services
- Start and stop services
- Manage security and Kerberos credentials
- Monitor user activity
- Monitor the health of the system
- Monitor performance metrics
- Track hardware use (disk, CPU, and RAM)

Cloudera Manager runs on the ResourceManager node (node03) and is available on port 7180.

To use Cloudera Manager:

1. Open a browser and enter a URL like the following:

   http://bda1node03.example.com:7180

   In this example, bda1 is the name of the appliance, node03 is the name of the server, example.com is the domain, and 7180 is the default port number for Cloudera Manager.

2. Log in with a user name and password for Cloudera Manager. Only a user with administrative privileges can change the settings. Other Cloudera Manager users can view the status of Oracle Big Data Appliance.

4.2.1 Monitoring the Status of Oracle Big Data Appliance

In Cloudera Manager, you can choose any of the following pages from the menu bar across the top of the display:

- Home: Provides a graphic overview of activities and links to all services controlled by Cloudera Manager. See the next figure.
- Clusters: Accesses the services on multiple clusters.
- Hosts: Monitors the health, disk usage, load, physical memory, swap space, and other statistics for all servers in the cluster.
- Diagnostics: Accesses events and logs. Cloudera Manager collects historical information about the systems and services. You can search for a particular phrase for a selected server, service, and time period. You can also select the minimum severity level of the logged messages included in the search: TRACE, DEBUG, INFO, WARN, ERROR, or FATAL.
- Audits: Displays the audit history log for a selected time range. You can filter the results by user name, service, or other criteria, and download the log as a CSV file.
- Charts: Enables you to view metrics from the Cloudera Manager time-series data store in a variety of chart types, such as line and bar.
- Backup: Accesses snapshot policies and scheduled replications.
- Administration: Provides a variety of administrative options, including Settings, Alerts, Users, and Kerberos.

The next figure shows the Cloudera Manager home page.

4.2.2 Performing Administrative Tasks

As a Cloudera Manager administrator, you can change various properties for monitoring the health and use of Oracle Big Data Appliance, add users, and set up Kerberos security.

To access Cloudera Manager Administration:

1. Log in to Cloudera Manager with administrative privileges.
2. Click Administration, and select a task from the menu.

4.2.3 Managing CDH Services With Cloudera Manager

Cloudera Manager provides the interface for managing these services:

- HDFS
- Hive
- Hue
- Oozie
- YARN
- ZooKeeper

You can use Cloudera Manager to change the configuration of these services, and to stop and restart them. Additional services are also available, but they require configuration before you can use them.

4.3 USING HADOOP MONITORING UTILITIES

You also have the option of using the native Hadoop utilities. These utilities are read-only and do not require authentication. Cloudera Manager provides an easy way to obtain the correct URLs for these utilities. On the YARN service page, expand the Web UI submenu.

4.3.1 Monitoring MapReduce Jobs

You can monitor MapReduce jobs using the resource manager interface.

To monitor MapReduce jobs, open a browser and enter a URL like the following:

http://bda1node03.example.com:8088

In this example, bda1 is the name of the rack, node03 is the name of the server where the YARN resource manager runs, and 8088 is the default port number for the user interface. The next figure shows the YARN resource manager interface.
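If you prefer the command line to the web interface, the standard YARN and MapReduce clients report the same information (the job ID shown is a placeholder):

  # List MapReduce applications currently running under YARN.
  yarn application -list -appStates RUNNING

  # Show the status of a specific job once you have its ID.
  mapred job -status job_1400000000000_0001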

4.3.2 Monitoring the Health of HDFS

You can monitor the health of the Hadoop file system by using the DFS health utility on the first two nodes of a cluster.

To monitor HDFS, open a browser and enter a URL like the following:

http://bda1node01.example.com:50070

In this example, bda1 is the name of the rack, node01 is the name of the server where the dfshealth utility runs, and 50070 is the default port number for the user interface. The next figure shows the DFS health utility interface.
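The same health information is available from the command line with the standard Hadoop utilities (nothing BDA-specific is assumed here):

  # Summarize capacity, live and dead DataNodes, and under-replicated blocks.
  hdfs dfsadmin -report

  # Check the whole file system for missing or corrupt blocks.
  hdfs fsck /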

4.4 USING CLOUDERA HUE TO INTERACT WITH HADOOP

Hue runs in a browser and provides an easy-to-use interface to several applications that support interaction with Hadoop and HDFS. You can use Hue to perform any of the following tasks:

- Query Hive data stores
- Create, load, and delete Hive tables
- Work with HDFS files and directories
- Create, submit, and monitor MapReduce jobs
- Create, edit, and submit workflows using the Oozie dashboard
- Manage users and groups

Hue is automatically installed and configured on Oracle Big Data Appliance. It runs on port 8888 of the ResourceManager node (node03).

To use Hue:

1. Log in to Cloudera Manager and click the hue service on the Home page.
2. On the hue page, click Hue Web UI.
3. Bookmark the Hue URL, so that you can open Hue directly in your browser. The following URL is an example: http://bda1node03.example.com:8888
4. Log in with your Hue credentials. Oracle Big Data Appliance is not configured initially with any Hue user accounts. The first user who connects to Hue can log in with any user name and password, and automatically becomes an administrator. This user can create other user and administrator accounts.

The next figure shows the Hive Query Editor.

4.5 ABOUT THE ORACLE BIG DATA APPLIANCE SOFTWARE

The following sections identify the software installed on Oracle Big Data Appliance. Some components require a minimum Oracle Database release. This section contains the following topics:

- Software Components
- Unconfigured Software
- Allocating Resources Among Services

4.5.1 Software Components

These software components are installed on all servers in the cluster. Oracle Linux, required drivers, firmware, and hardware verification utilities are factory installed. All other software is installed on site. The optional software components may not be configured in your installation.

You do not need to install additional software on Oracle Big Data Appliance. Doing so may result in a loss of warranty and support.

Base image software:

- Oracle Linux 6.4 (upgrades stay at 5.8) with Oracle Unbreakable Enterprise Kernel version 2 (UEK2)
- Java HotSpot Virtual Machine 7 version 25 (JDK 7u25)
- Oracle R Distribution
- MySQL Database Advanced Edition
- Puppet, firmware, and Oracle Big Data Appliance utilities
- Oracle InfiniBand software

Mammoth installation:

- Cloudera's Distribution including Apache Hadoop Release 5 (5.1.0), including:
  - Apache Hive 0.12
  - Apache HBase
  - Apache Sentry
  - Apache Spark
  - Cloudera Impala
  - Cloudera Search
- Cloudera Manager Release 5 (5.1.1), including Cloudera Navigator
- Oracle Database Instant Client 12.1
- Oracle Big Data SQL (optional)
- Oracle NoSQL Database Community Edition or Enterprise Edition 12c Release 1 (optional)
- Oracle Big Data Connectors 4.0 (optional):
  - Oracle SQL Connector for Hadoop Distributed File System (HDFS)
  - Oracle Loader for Hadoop
  - Oracle Data Integrator Agent
  - Oracle XQuery for Hadoop
  - Oracle R Advanced Analytics for Hadoop

The next figure shows the relationships among the major software components of Oracle Big Data Appliance.

4.5.2 Unconfigured Software

Your Oracle Big Data Appliance license includes all components in Cloudera Enterprise Data Hub Edition. All CDH components are installed automatically by the Mammoth utility. Do not download them from the Cloudera website.

However, you must use Cloudera Manager to add the following services before you can use them:

- Apache Flume
- Apache HBase
- Apache Spark
- Apache Sqoop
- Cloudera Impala
- Cloudera Navigator
- Cloudera Search

To add a service:

1. Log in to Cloudera Manager as the admin user.
2. On the Home page, expand the cluster menu in the left panel and choose Add a Service to open the Add Service wizard. The first page lists the services you can add.
3. Follow the steps of the wizard.

You can find the RPM files on the first server of each cluster in /opt/oracle/bdamammoth/bdarepo/rpms/noarch.

4.5.3 Allocating Resources Among Services

You can allocate resources to each service (HDFS, YARN, Oracle Big Data SQL, Hive, and so forth) as a percentage of the total resource pool. Cloudera Manager automatically calculates the recommended resource management settings based on these percentages. The static service pools isolate services on the cluster, so that a high load on one service has a limited impact on the other services.

To allocate resources among services:

1. Log in as admin to Cloudera Manager.
2. Open the Clusters menu at the top of the page, then select Static Service Pools under Resource Management.
3. Select Configuration.
4. Follow the steps of the wizard, or click Change Settings Directly to edit the current settings.

4.6 STOPPING AND STARTING ORACLE BIG DATA APPLIANCE

This section describes how to shut down Oracle Big Data Appliance gracefully and restart it:

- Prerequisites
- Stopping Oracle Big Data Appliance
- Starting Oracle Big Data Appliance

4.6.1 Prerequisites

You must have root access. Passwordless SSH must be set up on the cluster, so that you can use the dcli utility.

To ensure that passwordless SSH is set up:

1. Log in to the first node of the cluster as root.
2. Use a dcli command to verify that it is working. This command should return the IP address and host name of every node in the cluster:

   # dcli -C hostname
   : bda1node01.example.com
   : bda1node02.example.com
   ...

3. If you do not get these results, then set up dcli on the cluster:

   # setup-root-ssh -C

4.6.2 Stopping Oracle Big Data Appliance

Follow these procedures to shut down all Oracle Big Data Appliance software and hardware components.

Note: The following services stop automatically when the system shuts down. You do not need to take any action:

- Oracle Enterprise Manager agent
- Auto Service Request agents

Task 1 Stopping All Managed Services

Use Cloudera Manager to stop the services it manages, including flume, hbase, hdfs, hive, hue, mapreduce, oozie, and zookeeper.

1. Log in to Cloudera Manager as the admin user.
2. In the Status pane of the opening page, expand the menu for the cluster and click Stop, and then click Stop again when prompted to confirm. See the next figure. To navigate to this page, click the Home tab, and then the Status subtab.
3. On the Command Details page, click Close when all processes are stopped.
4. In the same pane under Cloudera Management Services, expand the menu for the mgmt service and click Stop.
5. Log out of Cloudera Manager.

The next figure shows stopping the HDFS services.

Task 2 Stopping Cloudera Manager Server

Follow this procedure to stop Cloudera Manager Server.

1. Log in as root to the node where Cloudera Manager runs (initially node03).

   Note: The remaining tasks presume that you are logged in to a server as root. You can enter the commands from any server by using the dcli command. This example runs the pwd command on node03 from any node in the cluster:

   # dcli -c node03 pwd

2. Stop the Cloudera Manager server:

   # service cloudera-scm-server stop
   Stopping cloudera-scm-server: [ OK ]

3. Verify that the server is stopped:

   # service cloudera-scm-server status
   cloudera-scm-server is stopped

After stopping Cloudera Manager, you cannot access it using the web console.

Task 3 Stopping Oracle Data Integrator Agent

If Oracle Data Integrator Application Adapter for Hadoop is installed on the cluster, then stop the agent.

1. Check the status of the Oracle Data Integrator service:

   # dcli -C service odi-agent status

2. Stop the Oracle Data Integrator agent, if it is running:

   # dcli -C service odi-agent stop

3. Ensure that the Oracle Data Integrator service stopped running:

   # dcli -C service odi-agent status

Task 4 Dismounting NFS Directories

All nodes share an NFS directory on node03, and additional directories may also exist. If a server with the NFS directory (/opt/exportdir) is unavailable, then the other servers hang when attempting to shut down. Thus, you must dismount the NFS directories first.

1. Locate any mounted NFS directories:

   # dcli -C mount | grep shareddir
   : bda1node03.example.com:/opt/exportdir on /opt/shareddir type nfs (rw,tcp,soft,intr,timeo=10,retrans=10,addr=...)
   : bda1node03.example.com:/opt/exportdir on /opt/shareddir type nfs (rw,tcp,soft,intr,timeo=10,retrans=10,addr=...)
   : /opt/exportdir on /opt/shareddir type none (rw,bind)
   ...

   The sample output shows a shared directory on node03.

2. Dismount the shared directory:

   # dcli -C umount /opt/shareddir

3. Dismount any custom NFS directories.

Task 5 Stopping the Servers

The Linux shutdown -h command powers down individual servers. You can use the dcli -g command to stop multiple servers.

1. Create a file that lists the names or IP addresses of the other servers in the cluster, that is, not including the one you are logged in to.

2. Stop the other servers:

   # dcli -g filename shutdown -h now

   For filename, enter the name of the file that you created in step 1.

3. Stop the server you are logged in to:

   # shutdown -h now

Task 6 Stopping the InfiniBand and Cisco Switches

The network switches do not have power buttons; they shut down only when power is removed. To stop the switches, turn off all breakers in the two PDUs, or turn off the relevant PDU or breaker in the data center.

4.6.3 Starting Oracle Big Data Appliance

Follow these procedures to power up the hardware and start all services on Oracle Big Data Appliance.

Task 1 Powering Up Oracle Big Data Appliance

1. Switch on all 12 breakers on both PDUs.
2. Allow 4 to 5 minutes for Oracle ILOM and the Linux operating system to start on the servers.
3. If password-based, on-disk encryption is enabled, then log in and mount the Hadoop directories on those servers:

   $ mount-hadoop-dirs
   Enter password to mount Hadoop directories: password

If the servers do not start automatically, then you can start them locally by pressing the power button on the front of the servers, or remotely by using Oracle ILOM. Oracle ILOM has several interfaces, including a command-line interface (CLI) and a web console. Use whichever interface you prefer. For example, you can log in to the web interface as root and start the server from the Remote Power Control page. The URL for Oracle ILOM is the same as for the host, except that it typically has a -c or -ilom extension (for example, bda1node04-ilom.example.com connects to Oracle ILOM for bda1node04).

Task 2 Starting the HDFS Software Services

Use Cloudera Manager to start all the HDFS services that it controls.

1. Log in as root to the node where Cloudera Manager runs (initially node03).

   Note: The remaining tasks presume that you are logged in to a server as root. You can enter the commands from any server by using the dcli command. This example runs the pwd command on node03 from any node in the cluster:

   # dcli -c node03 pwd

2. Verify that Cloudera Manager started automatically on node03:

   # service cloudera-scm-server status
   cloudera-scm-server (pid 11399) is running

3. If it is not running, then start it:

   # service cloudera-scm-server start

4. Log in to Cloudera Manager as the admin user.
5. In the Status pane of the opening page, expand the menu for the cluster and click Start, and then click Start again when prompted to confirm. See the next figure. To navigate to this page, click the Home tab, and then the Status subtab.
6. On the Command Details page, click Close when all processes are started.
7. In the same pane under Cloudera Management Services, expand the menu for the mgmt service and click Start.
8. Log out of Cloudera Manager (optional).

Task 3 Starting Oracle Data Integrator Agent

If Oracle Data Integrator Application Adapter for Hadoop is used on this cluster, then start the agent.

1. Check the status of the agent:

   # /opt/oracle/odiagent/agent_standalone/oracledi/agent/bin/startcmd.sh OdiPingAgent [-AGENT_NAME=agent_name]

2. Start the agent:

   # /opt/oracle/odiagent/agent_standalone/oracledi/agent/bin/agent.sh [-NAME=agent_name] [-PORT=port_number]

4.7 MANAGING ORACLE BIG DATA SQL

Oracle Big Data SQL is registered with Cloudera Manager as an add-on service. You can use Cloudera Manager to start, stop, and restart the Oracle Big Data SQL service or individual role instances, the same way as a CDH service. Cloudera Manager also monitors the health of the Oracle Big Data SQL service, reports service outages, and sends alerts if the service is not healthy.

4.7.1 Adding and Removing the Oracle Big Data SQL Service

Oracle Big Data SQL is an optional service on Oracle Big Data Appliance. It may be installed with the other client software during the initial software installation or an upgrade. Use Cloudera Manager to determine whether it is installed. A separate license is required; Oracle Big Data SQL is not included with the Oracle Big Data Appliance license.

You cannot use Cloudera Manager to add or remove the Oracle Big Data SQL service from a CDH cluster on Oracle Big Data Appliance. Instead, log in to the server where Mammoth is installed (usually the first node of the cluster) and use the following commands in the bdacli utility.

To enable Oracle Big Data SQL:

   bdacli enable big_data_sql

To disable Oracle Big Data SQL:

   bdacli disable big_data_sql

4.7.2 Allocating Resources to Oracle Big Data SQL

You can modify the property values in a Linux kernel Control Group (Cgroup) to reserve resources for Oracle Big Data SQL.

To modify the resource management configuration settings:

1. Log in as admin to Cloudera Manager.
2. On the Home page, click bigdatasql from the list of services.
3. On the bigdatasql page, click Configuration.
4. Under Category, expand BDS Server Default Group and click Resource Management.
5. Modify the values of the following properties as required:
   - Cgroup CPU Shares
   - Cgroup I/O Weight
   - Cgroup Memory Soft Limit
   - Cgroup Memory Hard Limit
6. Click Save Changes.
7. From the Actions menu, click Restart.

The next figure shows the bigdatasql service configuration page.


4.8 SWITCHING FROM YARN TO MAPREDUCE 1

Oracle Big Data Appliance uses the Yet Another Resource Negotiator (YARN) implementation of MapReduce by default. You have the option of using classic MapReduce (MRv1) instead. You cannot use both implementations in the same cluster; you can activate either the MapReduce or the YARN service.

To switch a cluster to MRv1:

1. Log in to Cloudera Manager as the admin user.
2. Stop the YARN service:
   a. Locate YARN in the list of services on the Status tab of the Home page.
   b. Expand the YARN menu and click Stop.
3. On the cluster menu, click Add a Service to start the Add Service wizard:
   a. Select MapReduce as the type of service you want to add.
   b. Select hdfs/zookeeper as a dependency (default).
   c. Customize the role assignments:
      - JobTracker: Click the field to display a list of nodes in the cluster. Select the third node.
      - TaskTracker: For a six-node cluster, keep the TaskTrackers on all nodes (default). For larger clusters, remove the TaskTrackers from the first two nodes.
   d. On the Review Changes page, change the parameter values:
      - TaskTracker Local Data Directory List: Change the default group and group 1 to the directories /u01/hadoop/mapred through /u12/hadoop/mapred.
      - JobTracker Local Data Directory List: Change the default group to the directories /u01/hadoop/mapred through /u12/hadoop/mapred.
   e. Complete the steps of the wizard with no further changes. Click Finish to save the configuration and exit.
4. Update the Hive service configuration:
   a. On the Status tab of the Home page, click hive to display the hive page.
   b. Expand the Configuration submenu and click View and Edit.
   c. Select mapreduce as the value of the MapReduce Service property.
   d. Click Save Changes.
5. Repeat step 4 to update the Oozie service configuration to use mapreduce.

6. On the Status tab of the Home page, expand the hive and oozie menus and choose Restart.
7. Optional: Expand the yarn service menu and choose Delete. If you retain the yarn service, then after every cluster restart you will see Memory Overcommit Validation warnings, and you must manually stop the yarn service.
8. Update the MapReduce service configuration:
   a. On the Status tab of the Home page, click mapreduce to display the mapreduce page.
   b. Expand the Configuration submenu and click View and Edit.
   c. Under Category, expand TaskTracker Default Group, and then click Resource Management.
   d. Set the following properties:
      - Java Heap Size of TaskTracker in Bytes: Reset to the default value of 1 GiB.
      - Maximum Number of Simultaneous Map Tasks: Set to 15 for Sun Fire X4270 M2 racks or 20 for all other racks.
      - Maximum Number of Simultaneous Reduce Tasks: Set to 10 for Sun Fire X4270 M2 racks or 13 for all other racks.
   e. Click Save Changes.
9. Add overrides for nodes 3 and 4 (or nodes 1 and 2 in a six-node cluster).
10. Click the mapreduce1 service to display the mapreduce page.
11. Expand the Actions menu and select Enable High Availability to start the Enable JobTracker High Availability wizard:
    a. On the Assign Roles page, select the fourth node (node04) as the Standby JobTracker.
    b. Complete the steps of the wizard with no further changes. Click Finish to save the configuration and exit.

12. Verify that all services in the cluster are healthy, with no configuration issues.
13. Reconfigure Perfect Balance for the MRv1 cluster:
    a. Log in as root to a node of the cluster.
    b. Configure Perfect Balance on all nodes of the cluster:

       $ dcli -C /opt/oracle/orabalancer-[version]/bin/configure.sh

4.9 SECURITY ON ORACLE BIG DATA APPLIANCE

You can take precautions to prevent unauthorized use of the software and data on Oracle Big Data Appliance. This section contains these topics:

1. About Predefined Users and Groups
2. About User Authentication
3. About Fine-Grained Authorization
4. About On-Disk Encryption
5. Port Numbers Used on Oracle Big Data Appliance
6. About Puppet Security

4.9.1 About Predefined Users and Groups

Every open-source package installed on Oracle Big Data Appliance creates one or more users and groups. Most of these users do not have login privileges, shells, or home directories. They are used by daemons and are not intended as an interface for individual users. For example, Hadoop operates as the hdfs user, MapReduce operates as mapred, and Hive operates as hive.

You can use the oracle identity to run Hadoop and Hive jobs immediately after the Oracle Big Data Appliance software is installed. This user account has login privileges, a shell, and a home directory. Oracle NoSQL Database and Oracle Data Integrator run as the oracle user. Its primary group is oinstall.

Do not delete, re-create, or modify the users that are created during installation, because they are required for the software to operate.

The next table identifies the operating system users and groups that are created automatically during installation of Oracle Big Data Appliance software for use by CDH components and other software packages.

User Name   Group           Used By                                                                                      Login Rights
flume       flume           Apache Flume parent and nodes                                                                No
hbase       hbase           Apache HBase processes                                                                       No
hdfs        hadoop          NameNode, DataNode                                                                           No
hive        hive            Hive metastore and server processes                                                          No
hue         hue             Hue processes                                                                                No
mapred      hadoop          ResourceManager, NodeManager, Hive Thrift daemon                                             Yes
mysql       mysql           MySQL server                                                                                 Yes
oozie       oozie           Oozie server                                                                                 No
oracle      dba, oinstall   Oracle NoSQL Database, Oracle Loader for Hadoop, Oracle Data Integrator, and the Oracle DBA  Yes
puppet      puppet          Puppet parent (puppet nodes run as root)                                                     No
sqoop       sqoop           Apache Sqoop metastore                                                                       No
svctag                      Auto Service Request                                                                         No
zookeeper   zookeeper       ZooKeeper processes                                                                          No

4.9.2 About User Authentication

Oracle Big Data Appliance supports Kerberos security as a software installation option.

4.9.3 About Fine-Grained Authorization

The typical authorization model on Hadoop is at the HDFS file level: users either have access to all of the data in a file or none of it. In contrast, Apache Sentry integrates with the Hive and Impala SQL query engines to provide fine-grained authorization to data and metadata stored in Hadoop.

Oracle Big Data Appliance automatically configures Sentry during software installation, beginning with Mammoth utility version 2.5.

4.9.4 About On-Disk Encryption

On-disk encryption protects data that is at rest on disk. When on-disk encryption is enabled, Oracle Big Data Appliance automatically encrypts and decrypts data stored on disk. On-disk encryption does not affect user access to Hadoop data, although it can have a minor impact on performance.

Password-based encryption encodes Hadoop data based on a password, which is the same for all servers in a cluster. You can change the password at any time by using the mammoth-reconfig update command.

If a disk is removed from a server, then the encrypted data remains protected until you install the disk in a server (the same server or a different one), start up the server, and provide the password. If a server is powered off and removed from an Oracle Big Data Appliance rack, then the encrypted data remains protected until you restart the server and provide the password. You must enter the password after every startup of every server to enable access to the data.

On-disk encryption is an option that you can select during the initial installation of the software by the Mammoth utility. You can also enable or disable on-disk encryption at any time by using either the mammoth-reconfig or bdacli utilities.

4.9.5 Port Numbers Used on Oracle Big Data Appliance

The next table identifies the port numbers that might be used in addition to those used by CDH.

To view the ports used on a particular server:

1. In Cloudera Manager, click the Hosts tab at the top of the page to display the Hosts page.
2. In the Name column, click a server link to see its detail page.
3. Scroll down to the Ports section.

Oracle Big Data Appliance Port Numbers

Service                                   Port
Automated Service Monitor (ASM)
HBase master service (node01)
MySQL Database                            3306
Oracle Data Integrator Agent
Oracle NoSQL Database administration      5001
Oracle NoSQL Database processes           5010 to 5020
Oracle NoSQL Database registration        5000
Port map                                  111
Puppet master service                     8140
Puppet node service                       8139
rpc.statd                                 668
ssh                                       22
xinetd (service tag)                      6481

4.9.6 About Puppet Security

The puppet node service (puppetd) runs continuously as root on all servers. It listens on port 8139 for "kick" requests, which trigger it to request updates from the puppet master. It does not receive updates on this port.

The puppet master service (puppetmasterd) runs continuously as the puppet user on the first server of the primary Oracle Big Data Appliance rack. It listens on port 8140 for requests to push updates to puppet nodes.

The puppet nodes generate and send certificates to the puppet master to register initially during installation of the software. For updates to the software, the puppet master signals ("kicks") the puppet nodes, which then request all configuration changes from the puppet master node that they are registered with.

The puppet master sends updates only to puppet nodes that have known, valid certificates. Puppet nodes only accept updates from the puppet master host name they initially registered with. Because Oracle Big Data Appliance uses an internal network for communication within the rack, the puppet master host name resolves using /etc/hosts to an internal, private IP address.

4.10 AUDITING ORACLE BIG DATA APPLIANCE

You can use Oracle Audit Vault and Database Firewall to create and monitor the audit trails for HDFS and MapReduce on Oracle Big Data Appliance. This section describes the Oracle Big Data Appliance plug-in:

1. About Oracle Audit Vault and Database Firewall
2. Setting Up the Oracle Big Data Appliance Plug-in
3. Monitoring Oracle Big Data Appliance

4.10.1 About Oracle Audit Vault and Database Firewall

Oracle Audit Vault and Database Firewall secures databases and other critical components of IT infrastructure in these key ways:

1. Provides an integrated auditing platform for your enterprise.
2. Captures activity on Oracle Database, Oracle Big Data Appliance, operating systems, directories, file systems, and so forth.
3. Makes the auditing information available in a single reporting framework so that you can understand the activities across the enterprise. You do not need to monitor each system individually; you can view your computer infrastructure as a whole.

Audit Vault Server provides a web-based, graphic user interface for both administrators and auditors. You can configure CDH/Hadoop clusters on Oracle Big Data Appliance as secured targets. The Audit Vault plug-in on Oracle Big Data Appliance collects audit and logging data from these services:

1. HDFS: Who makes changes to the file system.
2. Hive DDL: Who makes Hive database changes.
3. MapReduce: Who runs MapReduce jobs that correspond to file access.
4. Oozie workflows: Who runs workflow activities.

The Audit Vault plug-in is an installation option. The Mammoth utility automatically configures monitoring on Oracle Big Data Appliance as part of the software installation process.

4.10.2 Setting Up the Oracle Big Data Appliance Plug-in

The Mammoth utility on Oracle Big Data Appliance performs all the steps needed to set up the plug-in, using information that you provide.

To set up the Audit Vault plug-in for Oracle Big Data Appliance:

1. Ensure that Oracle Audit Vault and Database Firewall Server is up and running on the same network as Oracle Big Data Appliance.
2. Complete the Audit Vault Plug-in section of the Oracle Big Data Appliance Configuration Generation Utility.
3. Install the Oracle Big Data Appliance software using the Mammoth utility. An Oracle representative typically performs this step. You can also add the plug-in at a later time using either bdacli or mammoth-reconfig.

When the software installation is complete, the Audit Vault plug-in is installed on Oracle Big Data Appliance, and Oracle Audit Vault and Database Firewall is collecting its audit information. You do not need to perform any other installation steps.

4.10.3 Monitoring Oracle Big Data Appliance

After installing the plug-in, you can monitor Oracle Big Data Appliance the same as any other secured target. Audit Vault Server collects activity reports automatically. The following procedure describes one type of monitoring activity.

To view an Oracle Big Data Appliance activity report:

1. Log in to Audit Vault Server as an auditor.
2. Click the Reports tab.
3. Under Built-in Reports, click Audit Reports.
4. To browse all activities, in the Activity Reports list, click the Browse report data icon for All Activity.
5. Add or remove the filters to list the events. Event names include ACCESS, CREATE, DELETE, and OPEN.
6. Click the Single row view icon in the first column to see a detailed report.

The next figure shows the beginning of an activity report, which records access to a Hadoop sequence file.


4.11 COLLECTING DIAGNOSTIC INFORMATION FOR ORACLE CUSTOMER SUPPORT

If you need help from Oracle Support to troubleshoot CDH issues, then you should first collect diagnostic information using the bdadiag utility with the cm option.

To collect diagnostic information:

1. Log in to an Oracle Big Data Appliance server as root.
2. Run bdadiag with at least the cm option. You can include additional options on the command line as appropriate.

   # bdadiag cm

   The command output identifies the name and the location of the diagnostic file.

3. Go to My Oracle Support at support.oracle.com.
4. Open a Service Request (SR) if you have not already done so.
5. Upload the bz2 file into the SR. If the file is too large, then upload it to sftp.oracle.com, as described in the next procedure.

To upload the diagnostics to sftp.oracle.com:

1. Open an SFTP client and connect to sftp.oracle.com. Specify port 2021 and remote directory /support/incoming/target, where target is the folder name given to you by Oracle Support.
2. Log in with your Oracle Single Sign-On account and password.
3. Upload the diagnostic file to the new directory.
4. Update the SR with the full path and the file name.
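Putting the procedure together, a session might look like the following sketch; the SR folder name, the login, and the archive file name are placeholders that Oracle Support supplies or that bdadiag generates:

  # Collect Cloudera Manager diagnostics; the command prints the archive location.
  bdadiag cm

  # Upload the archive to the SR folder assigned by Oracle Support.
  sftp -P 2021 username@sftp.oracle.com
  sftp> cd /support/incoming/target
  sftp> put /tmp/bdadiag_bda1node01_1234567890.tar.bz2
  sftp> bye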

4.12 AUDITING DATA ACCESS ACROSS THE ENTERPRISE

Security has been an important theme across recent Big Data Appliance releases. Our most recent release includes encryption of data at rest and automatic configuration of Sentry for data authorization. This is in addition to the security features previously added to the BDA, including Kerberos-based authentication, network encryption, and auditing.

Auditing data access across the enterprise - including databases, operating systems, and Hadoop - is critically important and oftentimes required for SOX, PCI, and other regulations. Let's take a look at a demonstration of how Oracle Audit Vault and Database Firewall delivers comprehensive audit collection, alerting, and reporting of activity on an Oracle Big Data Appliance and Oracle Database 12c.

Configuration

In this scenario, we've set up auditing for both the BDA and Oracle Database 12c. The Audit Vault Server is deployed to its own secure server and serves as mission control for auditing. It is used to administer audit policies, configure activities that are tracked on the secured targets, and provide robust audit reporting and alerting. In many ways, Audit Vault is a specialized auditing data warehouse. It automates ETL from a variety of sources into an audit schema and then delivers both pre-built and ad hoc reporting capabilities.

For our demonstration, Audit Vault agents are deployed to the BDA and Oracle Database 12c monitored targets; these agents are responsible for managing collectors that gather activity data. This is a secure agent deployment; the Audit Vault Server has a trusted relationship with each agent. To set up the trusted relationship, the agent makes an activation request to the Audit Vault Server; this request is then activated (or "approved") by the AV Administrator. The monitored target then applies an AV Server generated Agent Activation Key to complete the activation.

On the BDA, these installation and configuration steps have all been automated for you. Using the BDA's Configuration Generation Utility, you simply specify that you would like to audit activity in Hadoop. Then, you identify the Audit Vault Server that will receive the audit data. Mammoth - the BDA's installation tool - uses this information to configure the audit processing. Specifically, it sets up audit trails across the following services:

- HDFS: collects all file access activity
- MapReduce: identifies who ran what jobs on the cluster
- Oozie: audits who ran what as part of a workflow
- Hive: captures changes that were made to the Hive metadata

There is much more flexibility when monitoring the Oracle Database. You can create audit policies for SQL statements, schema objects, privileges, and more. Check out the auditor's guide for more details.

In our demonstration, we kept it simple: we are capturing all select statements on the sensitive HR.EMPLOYEES table, all statements made by the HR user, and any unsuccessful attempts at selecting from any table in any schema.

Now that we are capturing activity across the BDA and Oracle Database 12c, we'll set up an alert to fire whenever there is suspicious activity attempted over sensitive HR data in Hadoop. In the alert definition found above, a critical alert is defined as three unsuccessful attempts from a given IP address to access data in the HR directory. Alert definitions are extremely flexible - using any audited field as input into a conditional expression. And they are automatically delivered to the Audit Vault Server's monitoring dashboard - as well as via email to appropriate security administrators.

Now that auditing is configured, we'll generate activity by two different users: oracle and DrEvil. We'll then see how the audit data is consolidated in the Audit Vault Server and how auditors can interrogate that data.

Capturing Activity

The demonstration is driven by a few scripts that generate different types of activity by both the oracle and DrEvil users. These activities include:

- An Oozie workflow that removes salary data from HDFS
- Numerous HDFS commands that upload files, change file access privileges, copy files, and list the contents of directories and files
- Hive commands that query, create, alter, and drop tables
- Oracle Database commands that connect as different users, create and drop users, select from tables, and delete records from a table

After running the scripts, we log into the Audit Vault Server as an auditor. Immediately, we see that our alert has been triggered by the users' activity. Drilling down on the alert reveals DrEvil's three failed attempts to access the sensitive data in HDFS.

Now that we see the alert triggered in the dashboard, let's see what other activity is taking place on the BDA and in the Oracle Database.

Ad Hoc Reporting

Audit Vault Server delivers rich reporting capabilities that enable you to better understand the activity that has taken place across the enterprise. In addition to the numerous reports that are delivered out of the box with Audit Vault, you can create your own custom reports that meet your own needs. Here, we are looking at a BDA monitoring report that focuses on Hadoop activities that occurred in the last 24 hours.

As you can see, the report tells you all of the key elements required to understand: 1) when the activity took place, 2) the source service for the event, 3) what object was referenced, 4) whether or not the event was successful, 5) who executed the event, 6) the IP address (or host) that initiated the event, and 7) how the object was modified or accessed. Stoplight reporting is used to highlight critical activity - including DrEvil's failed attempts to open the sensitive salaries.txt file.

Notice that events may be related to one another. The Hive command ALTER TABLE my_salarys RENAME TO my_salaries will generate two events. The first event is sourced from the Metastore: the ALTER TABLE command is captured and the metadata definition is updated. The Hive command also impacts HDFS: the table name is represented by an HDFS folder, so an HDFS event is logged that renames the my_salarys folder to my_salaries.

Next, consider an Oozie workflow that performs a simple task: delete a file salaries2.txt in HDFS. This Oozie workflow generates the following events:

1. First, an Oozie workflow event is generated indicating the start of the workflow.
2. The workflow definition is read from the workflow.xml file found in HDFS.
3. An Oozie working directory is created.
4. The salaries2.txt file is deleted from HDFS.
5. Oozie runs its clean-up process.

The Audit Vault reports are able to reveal all of the underlying activity that is executed by the Oozie workflow. Its flexible reporting allows you to sequence these independent events into a logical series of related activities.

The reporting focus so far has been on Hadoop - but one of the core strengths of Oracle Audit Vault is its ability to consolidate all audit data. We know that DrEvil had a few unsuccessful attempts to access sensitive salary data in HDFS. But what other unsuccessful events have occurred recently across our data platform? We'll use Audit Vault's ad hoc reporting capabilities to answer that question. Report filters enable users to search audit data based on a range of conditions. Here, we'll keep it pretty simple; let's find all failed access attempts across both the BDA and the Oracle Database within the last two hours.

Again, DrEvil's activity stands out. As you can see, DrEvil is attempting to access sensitive salary data not only in HDFS but also in the Oracle Database.

Summary

Security and integration with the rest of the Oracle ecosystem are two table stakes that are critical to Oracle Big Data Appliance releases. Oracle Audit Vault and Database Firewall's auditing of data across the BDA, databases, and operating systems epitomizes this goal - providing a single repository and reporting environment for all your audit data.

Chapter 5. ORACLE BIG DATA SQL

5.1 INTRODUCTION

Big Data SQL is Oracle's unique approach to providing unified query over data in Oracle Database, Hadoop, and select NoSQL datastores. Oracle Big Data SQL supports queries against vast amounts of big data stored in multiple data sources, including Hadoop. You can view and analyze data from various data stores together, as if it were all stored in an Oracle database.

Using Oracle Big Data SQL, you can query data stored in a Hadoop cluster using the complete SQL syntax. You can execute the most complex SQL SELECT statements against data in Hadoop, either manually or using your existing applications, to tease out the most significant insights.

5.2 SQL ON HADOOP

As anyone paying attention to the Hadoop ecosystem knows, SQL-on-Hadoop has seen a proliferation of solutions in the last 18 months, and just as large a proliferation of press. From good, ol' Apache Hive to Cloudera Impala and SparkSQL, these days you can have SQL-on-Hadoop any way you like it. It does, however, prompt the question: Why SQL?

There's an argument to be made for SQL simply being a form of skill reuse. If people and tools already speak SQL, then give the people what they know. In truth, that argument falls flat when one considers the sheer pace at which the Hadoop ecosystem evolves. If there were a better language for querying Big Data, the community would have turned it up by now.

The reality is that the SQL language endures because it is uniquely suited to querying datasets. Consider: SQL is a declarative language for operating on relations in data. It's a domain-specific language where the domain is datasets. In and of itself, that's powerful: having language elements like FROM, WHERE, and GROUP BY makes reasoning about datasets simpler. It's set theory set into a programming language.

It goes beyond just the language itself. SQL is declarative, which means I only have to reason about the shape of the result I want, not the data access mechanisms to get there, the join algorithms to apply, how to serialize partial aggregations, and so on. SQL lets us think about answers, which lets us get more done.

SQL on Hadoop, then, is somewhat obvious. As data gets bigger, we would prefer to only have to reason about answers.

5.3 SQL ON MORE THAN HADOOP

For all the obvious goodness of SQL on Hadoop, there's a somewhat obvious drawback. Specifically, data rarely lives in a single place. Indeed, if Big Data is causing a proliferation of new ways to store and process data, then there are likely more places to store data than ever before. If SQL on Hadoop is separate from SQL on a DBMS, we run the risk of constructing every IT architect's least favorite solution: the stovepipe.

If we want to avoid stovepipes, what we really need is the ability to run SQL queries that work seamlessly across multiple datastores. Ideally, in a Big Data world, SQL should play data where it lies, using the declarative power of the language to provide answers from all data. This is why we think Oracle Big Data SQL is obvious too. It's just a little more complicated than SQL on any one thing. To pull it off, we have to do a few things:

- Maintain the valuable characteristics of the system storing the data
- Unify metadata to understand how to execute queries
- Optimize execution to take advantage of the systems storing the data

For the case of a relational database, we might say that the valuable storage characteristics include things like straight-through processing, change-data logging, fine-grained access controls, and a host of other things. For Hadoop, the two most valuable storage characteristics are scalability and schema-on-read.

Cost-effective scalability is one of the first things that people look to HDFS for, so any solution that does SQL over a relational database and Hadoop has to understand how HDFS scales and distributes data. Schema-on-read is at least equally important, if not more so. As Daniel Abadi recently wrote, the flexibility of schema-on-read gives Hadoop tremendous power: dump data into HDFS, and access it without having to convert it to a specific format. So, then, any solution that does SQL over a relational database and Hadoop is going to have to respect the schemas of the database, but also be able to really apply schema-on-read principles to data stored in Hadoop.

Oracle Big Data SQL maintains all of these valuable characteristics, and it does so specifically through the approaches taken for unifying metadata and optimizing performance.


5.4 UNIFYING METADATA

To unify metadata for planning and executing SQL queries, we require a catalog of some sort. What tables do I have? What are their column names and types? Are there special options defined on the tables? Who can see which data in these tables?

Given the richness of the Oracle data dictionary, Oracle Big Data SQL unifies metadata using Oracle Database: specifically as external tables. Tables in Hadoop or NoSQL databases are defined as external tables in Oracle. This makes sense, given that the data is external to the DBMS.

Wait a minute, don't lots of vendors have external tables over HDFS, including Oracle? Yes, but what Big Data SQL provides as an external table is uniquely designed to preserve the valuable characteristics of Hadoop. The difficulty with most external tables is that they are designed to work on flat, fixed-definition files, not distributed data which is intended to be consumed through dynamically invoked readers. That both causes poor parallelism and removes the value of schema-on-read.

The external tables Big Data SQL presents are different. They leverage the Hive metastore or user definitions to determine both parallelism and read semantics. That means that if a file in HDFS is 100 blocks, Oracle Database understands there are 100 units which can be read in parallel. If the data was stored in a SequenceFile using a binary SerDe, or as Parquet data, or as Avro, that is how the data is read. Big Data SQL uses the exact same InputFormat, RecordReader, and SerDes defined in the Hive metastore to read the data from HDFS. Once that data is read, we need only to join it with internal data and provide SQL on Hadoop and a relational database.

5.5 OPTIMIZING PERFORMANCE

Being able to join data from Hadoop with Oracle Database is a feat in and of itself. However, given the size of data in Hadoop, it ends up being a lot of data to shift around. In order to optimize performance, we must take advantage of what each system can do.

In the days before data was officially Big, Oracle faced a similar challenge when optimizing Exadata, our then-new database appliance. Since many databases are connected to shared storage, at some point database scan operations can become bound on the network between the storage and the database, or on the shared storage system itself. The solution the group proposed was remarkably similar to much of the ethos that infuses MapReduce and Apache Spark: move the work to the data and minimize data movement. The effect is striking: minimizing data movement by an order of magnitude often yields performance increases of an order of magnitude.

Big Data SQL takes a play from both the Exadata and Hadoop books to optimize performance: it moves work to the data and radically minimizes data movement. It does this via something we call Smart Scan for Hadoop. Moving the work to the data is straightforward. Smart Scan for Hadoop introduces a new service into the Hadoop ecosystem, which is co-resident with HDFS DataNodes and YARN NodeManagers. Queries from the new external tables are sent to these services to ensure that reads are direct path and data-local. Reading close to the data speeds up I/O, but minimizing data movement requires that Smart Scan do some things that are, well, smart.

5.6 SMART SCAN FOR HADOOP

Consider this: most queries don't select all columns, and most queries have some kind of predicate on them. Moving unneeded columns and rows is, by definition, excess data movement, and it impedes performance. Smart Scan for Hadoop gets rid of this excess movement, which in turn radically improves performance.

For example, suppose we were querying hundreds of terabytes of JSON data stored in HDFS, but only cared about a few fields (such as status) and only wanted results from the state of Texas. Once data is read from a DataNode, Smart Scan for Hadoop goes beyond just reading. It applies parsing functions to our JSON data and discards any documents which do not contain TX for the state attribute. Then, for those documents which do match, it projects out only the needed attributes to merge with the rest of the data. Rather than moving every field of every document, we're able to cut down 100s of TB to 100s of GB. The approach we take to optimizing performance with Big Data SQL makes Big Data much slimmer.

So, there you have it: fast queries which join data in Oracle Database with data in Hadoop while preserving what makes each system a valuable part of overall information architectures. Big Data SQL unifies metadata, such that data sources can be queried with the best possible parallelism and the correct read semantics. Big Data SQL optimizes performance using approaches inspired by Exadata: filtering out irrelevant data before it can become a bottleneck. It's SQL that plays data where it lies, letting you place data where you think it belongs.
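A minimal sketch of the kind of query described above, assuming a hypothetical Big Data SQL external table named movielog_json with state and status columns mapped from the JSON documents; Smart Scan pushes both the predicate and the column projection down to the DataNodes, so only the matching rows and selected columns travel back to the database:

SELECT status, COUNT(*)
FROM   movielog_json
WHERE  state = 'TX'
GROUP BY status;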

5.7 ORACLE SQL DEVELOPER & DATA MODELER SUPPORT FOR ORACLE BIG DATA SQL

Oracle SQL Developer and Data Modeler (version 4.0.3) now support Hive and Oracle Big Data SQL. The tools allow you to connect to Hive, use the SQL Worksheet to query, create and alter Hive tables, and automatically generate Big Data SQL-enabled Oracle external tables that dynamically access data sources defined in the Hive metastore. Let's take a look at what it takes to get started and then preview this new capability.

5.7.1 Setting up Connections to Hive

The first thing you need to do is set up a JDBC connection to Hive. Follow these steps to set up the connection.

DOWNLOAD AND UNZIP JDBC DRIVERS

Cloudera provides high-performance JDBC drivers that are required for connectivity:

Download the Hive drivers from the Cloudera Downloads page to a local directory.
Unzip the archive: unzip Cloudera_HiveJDBC_<version>.zip
Two zip files are contained within the archive. Unzip the JDBC4 archive to a target directory that is accessible to SQL Developer (e.g. /home/oracle/jdbc below): unzip Cloudera_HiveJDBC4_<version>.zip -d /home/oracle/jdbc/

Now that the JDBC drivers have been extracted, update SQL Developer to use the new drivers.

UPDATE SQL DEVELOPER TO USE THE CLOUDERA HIVE JDBC DRIVERS

Update the preferences in SQL Developer to leverage the new drivers:

Start SQL Developer.
Go to Tools -> Preferences.
Navigate to Database -> Third Party JDBC Drivers.
Add all of the jar files contained in the zip to the Third-party JDBC Driver Path.
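For reference, the download and extraction steps above might look like the following from a shell; the archive version number and the /home/oracle/jdbc target directory are placeholders, not fixed names:

# fetch the driver archive from the Cloudera Downloads page, then:
unzip Cloudera_HiveJDBC_2.5.4.zip                       # version number is illustrative
unzip Cloudera_HiveJDBC4_2.5.4.zip -d /home/oracle/jdbc/
ls /home/oracle/jdbc/*.jar                              # these jars go on the Third-party JDBC Driver Path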

Restart SQL Developer.

CREATE A CONNECTION

Now that SQL Developer is configured to access Hive, let's create a connection to Hiveserver2. Click the New Connection button in the SQL Developer toolbar. You'll need an ID, a password and the port where Hiveserver2 is running.

The example above creates a connection called hive, which connects to Hiveserver2 on localhost using the specified port. The Database field is optional; here we are specifying the default database.

5.7.2 Using the Hive Connection

The Hive connection is now treated like any other connection in SQL Developer. The tables are organized into Hive databases; you can review the tables' data, properties, partitions, indexes, details and DDL. And, you can use the SQL Worksheet to run custom queries and perform DDL operations - whatever is supported in Hive. Here, we've altered the definition of a Hive table and then queried that table in the worksheet.
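A short sketch of that kind of worksheet session, using a hypothetical Hive table named web_logs (the table and column names are illustrative, not taken from the original example):

ALTER TABLE web_logs ADD COLUMNS (referrer STRING);

SELECT status, COUNT(*) AS hits
FROM   web_logs
GROUP BY status;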

5.7.3 Create Big Data SQL-enabled Tables Using Oracle Data Modeler

Oracle Data Modeler automates the definition of Big Data SQL-enabled external tables. Let's create a few tables using the metadata from the Hive metastore. Invoke the import wizard by selecting the File -> Import -> Data Modeler -> Data Dictionary menu item. You will see the same connections found in the SQL Developer connection navigator. After selecting the hive connection and a database, select the tables to import. There could be any number of tables here - in our case we will select three tables to import. After completing the import, the logical table definitions appear in our palette.

You can update the logical table definitions - and in our case we will want to do so. For example, the recommended column in Hive is defined as a string (i.e. there is no precision), which the Data Modeler casts as a VARCHAR2(4000). We have domain knowledge and understand that this field is really much smaller, so we'll update it to the appropriate size. Now that we're comfortable with the table definitions, let's generate the DDL and create the tables in Oracle Database 12c. Use the Data Modeler DDL Preview to generate the DDL for those tables - and then apply the definitions in the Oracle Database SQL Worksheet.
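The generated DDL takes the ORACLE_HIVE external table form described later in this chapter. A simplified, illustrative sketch (the table, column and cluster names here are hypothetical, and the real output from Data Modeler will differ):

CREATE TABLE movie_ratings (
  movie_id     NUMBER,
  recommended  VARCHAR2(4),
  rating       NUMBER)
ORGANIZATION EXTERNAL
  (TYPE ORACLE_HIVE
   DEFAULT DIRECTORY DEFAULT_DIR
   ACCESS PARAMETERS (
     com.oracle.bigdata.cluster=hadoop_cl_1
     com.oracle.bigdata.tablename=default.movie_ratings
   )
  )
REJECT LIMIT UNLIMITED;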


5.7.4 Edit the Table Definitions

The SQL Developer table editor has been updated so that it now understands all of the properties that control Big Data SQL external table processing. For example, edit the table movieapp_log_json: you can update the source cluster for the data, how invalid records should be processed, how to map Hive table columns to the corresponding Oracle table columns (if they don't match), and much more.

5.7.5 Query All Your Data

You now have full Oracle SQL access to data across the platform. In our example, we can combine data from Hadoop with data in our Oracle Database. The data in Hadoop can be in any format - Avro, JSON, XML, CSV - if there is a SerDe that can parse the data, then Big Data SQL can access it! Below, we're combining click data from the JSON-based movie application log with data in our Oracle Database tables to determine how the company's customers rate blockbuster movies.
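A sketch of that kind of cross-platform query, joining the movieapp_log_json external table mentioned above with Oracle tables; the movies and ratings table names and their columns are assumed purely for illustration:

SELECT m.title,
       AVG(r.rating) AS avg_rating
FROM   movieapp_log_json l
JOIN   movies  m ON m.movie_id = l.movie_id
JOIN   ratings r ON r.movie_id = m.movie_id
WHERE  l.activity = 'rate'
GROUP BY m.title
ORDER BY avg_rating DESC;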

5.8 USING ORACLE BIG DATA SQL FOR DATA ACCESS

Oracle Big Data SQL supports queries against vast amounts of big data stored in multiple data sources, including Hadoop. You can view and analyze data from various data stores together, as if it were all stored in an Oracle database. Using Oracle Big Data SQL, you can query data stored in a Hadoop cluster using the complete SQL syntax. You can execute the most complex SQL SELECT statements against data in Hadoop, either manually or using your existing applications, to tease out the most significant insights.

5.8.1 About Oracle External Tables

Oracle Big Data SQL provides external tables with next-generation performance gains. An external table is an Oracle Database object that identifies and describes the location of data outside of a database. You can query an external table using the same SQL SELECT syntax that you use for any other database table. External tables use access drivers to parse the data outside the database. Each type of external data requires a unique access driver. This release of Oracle Big Data SQL includes two access drivers for big data: one for accessing data stored in Apache Hive, and the other for accessing data stored in Hadoop Distributed File System (HDFS) files.

5.8.2 About the Access Drivers for Oracle Big Data SQL

By querying external tables, you can access data stored in HDFS and Hive tables as if that data were stored in tables in an Oracle database. Oracle Database accesses the data by using the metadata provided when the external table was created. Oracle Database supports two new access drivers for Oracle Big Data SQL:

1. ORACLE_HIVE: Enables you to create Oracle external tables over Apache Hive data sources. Use this access driver when you already have Hive tables defined for your HDFS data sources. ORACLE_HIVE can also access data stored in other locations, such as HBase, that have Hive tables defined for them.

2. ORACLE_HDFS: Enables you to create Oracle external tables directly over files stored in HDFS. This access driver uses Hive syntax to describe a data source, assigning default column names of COL_1, COL_2, and so forth. You do not need to create a Hive table manually as a separate step. Instead of acquiring the metadata from a Hive metadata store the way that ORACLE_HIVE does, the ORACLE_HDFS access driver acquires all of the necessary information from the access parameters. The ORACLE_HDFS access parameters are required to specify the metadata, and are stored as part of the external table definition in Oracle Database.

Oracle Big Data SQL uses these access drivers to optimize query performance.

5.8.3 About Smart Scan Technology

External tables do not have traditional indexes, so queries against them typically require a full table scan. However, Oracle Big Data SQL extends Smart Scan capabilities, such as filter-predicate offloads, to Oracle external tables with the installation of Exadata storage server software on Oracle Big Data Appliance. This technology enables Oracle Big Data Appliance to discard a huge portion of irrelevant data (up to 99 percent of the total) and return much smaller result sets to Oracle Exadata Database Machine. End users obtain the results of their queries significantly faster, as the direct result of a reduced load on Oracle Database and reduced traffic on the network.

5.8.4 About Data Security with Oracle Big Data SQL

Oracle Big Data Appliance already provides numerous security features to protect data stored in a CDH cluster on Oracle Big Data Appliance:

1. Kerberos authentication: Requires users and client software to provide credentials before accessing the cluster.
2. Apache Sentry authorization: Provides fine-grained, role-based authorization to data and metadata.
3. On-disk encryption: Protects the data on disk and at rest. For normal user access, the data is automatically decrypted.
4. Oracle Audit Vault and Database Firewall monitoring: The Audit Vault plug-in on Oracle Big Data Appliance collects audit and logging data from the MapReduce, HDFS, and Oozie services. You can then use Audit Vault Server to monitor these services on Oracle Big Data Appliance.

Oracle Big Data SQL adds the full range of Oracle Database security features to this list. You can apply the same security policies and rules to your Hadoop data that you apply to your relational data.

5.9 INSTALLING ORACLE BIG DATA SQL

Oracle Big Data SQL is available only on Oracle Exadata Database Machine connected to Oracle Big Data Appliance. You must install the Oracle Big Data SQL software on both systems.

5.9.1 Prerequisites for Using Oracle Big Data SQL

Oracle Exadata Database Machine must comply with the following requirements:

1. Compute servers run a supported release of Oracle Database and Oracle Enterprise Manager Grid Control.
2. Storage servers run a supported release of the Oracle Exadata storage server software.
3. Oracle Exadata Database Machine is configured on the same InfiniBand subnet as Oracle Big Data Appliance.
4. Oracle Exadata Database Machine is connected to Oracle Big Data Appliance by the InfiniBand network.

5.9.2 Performing the Installation

Take these steps to install the Oracle Big Data SQL software on Oracle Big Data Appliance and Oracle Exadata Database Machine:

1. Download the required My Oracle Support one-off patch for the RDBMS.
2. On Oracle Exadata Database Machine, install the patch on the Oracle Enterprise Manager Grid Control home and the Oracle Database homes.
3. On Oracle Big Data Appliance, install or upgrade the software to the latest version.
4. You can select Oracle Big Data SQL as an installation option when using the Oracle Big Data Appliance Configuration Generation Utility.
5. Download and install the Mammoth patch from Oracle Automated Release Updates.
6. If Oracle Big Data SQL is not enabled during the installation, then use the bdacli utility:
# bdacli enable big_data_sql
7. On Oracle Exadata Database Machine, run the post-installation script.
8. You can use Cloudera Manager to verify that Oracle Big Data SQL is up and running.

5.9.3 Running the Post-Installation Script for Oracle Big Data SQL

To run the Oracle Big Data SQL post-installation script:

1. On Oracle Exadata Database Machine, ensure that the Oracle Database listener is running and listening on an interprocess communication (IPC) interface.
2. Verify the name of the Oracle installation owner. Typically, the oracle user owns the installation.
3. Verify that the same user name (such as oracle) exists on Oracle Big Data Appliance.
4. Download the bds-exa-install.sh installation script from the node where Mammoth is installed, typically the first node in the cluster. You can use a command such as wget or curl; in this example the script is copied from bda1node07.
5. As root, run the script and pass it the system identifier (SID). In this example, the SID is orcl:
./bds-exa-install.sh oracle_sid=orcl
6. Repeat step 5 for each database instance.

When the script completes, Oracle Big Data SQL is running on the database instance. However, if events cause the Oracle Big Data SQL agent to stop, then you must restart it.
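Steps 4 and 5 might look like the following from a shell on the Exadata compute node; the download URL (host name and path) is hypothetical, since the exact location was not given above:

# step 4: copy the script from the Mammoth node (URL is illustrative)
wget http://bda1node07/bda/bds-exa-install.sh
chmod +x bds-exa-install.sh

# step 5: run as root, once per database instance
./bds-exa-install.sh oracle_sid=orcl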

5.9.4 Running the bds-exa-install Script

The bds-exa-install script generates a custom installation script that is run by the owner of the Oracle home directory. That secondary script installs all the files needed by Oracle Big Data SQL into the $ORACLE_HOME/bigdatasql directory. It also creates the database directory objects, and the database links for the multithreaded Oracle Big Data SQL agent.

If the operating system user who owns the Oracle home is not named oracle, then use the --install-user option to specify the owner. Alternatively, you can use the --generate-only option to create the secondary script, and then run it as the owner of $ORACLE_HOME.

5.9.5 bds-exa-install Syntax

The following is the bds-exa-install syntax:

./bds-exa-install.sh oracle_sid=name [option]

The option names are preceded by two hyphens (--):

--generate-only={true|false}
Set to true to generate the secondary script but not run it, or false to generate and run it in one step (default).

--install-user=user_name
The operating system user who owns the Oracle Database installation. The default value is oracle.
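For example, to generate the secondary script for the orcl instance without running it (a sketch based on the syntax above; the SID is illustrative):

./bds-exa-install.sh oracle_sid=orcl --generate-only=true
# then run the generated script as the owner of $ORACLE_HOME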

5.10 CREATING EXTERNAL TABLES FOR ACCESSING BIG DATA

The SQL CREATE TABLE statement has a clause specifically for creating external tables. The information that you provide in this clause enables the access driver to read data from an external source and prepare the data for the external table.

About the Basic CREATE TABLE Syntax

The following is the basic syntax of the CREATE TABLE statement for external tables:

CREATE TABLE table_name (column_name datatype, column_name datatype[,...])
  ORGANIZATION EXTERNAL (external_table_clause);

You specify the column names and data types the same as for any other table. ORGANIZATION EXTERNAL identifies the table as an external table. The external_table_clause identifies the access driver and provides the information that it needs to load the data. See About the External Table Clause.
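As a minimal illustration of this shape (the table name and the HDFS location below are made up; fuller, driver-specific examples follow in the next sections):

CREATE TABLE weblog_ext (
  log_time VARCHAR2(30),
  url      VARCHAR2(200),
  status   NUMBER)
ORGANIZATION EXTERNAL
  (TYPE ORACLE_HDFS
   DEFAULT DIRECTORY DEFAULT_DIR
   LOCATION ('hdfs:/data/weblogs/*'))
REJECT LIMIT UNLIMITED;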

Creating an External Table for a Hive Table

You can easily create an Oracle external table for data in Apache Hive. Because the metadata is available to Oracle Database, you can query the data dictionary for information about Hive tables. Then you can use a PL/SQL function to generate a basic CREATE TABLE ... ORGANIZATION EXTERNAL statement. You can then modify the statement before execution to customize the external table.

Obtaining Information About a Hive Table

The DBMS_HADOOP PL/SQL package contains a function named CREATE_EXTDDL_FOR_HIVE, which returns the data definition language (DDL) for an external table. This function requires you to provide basic information about the Hive table:

Name of the Hadoop cluster
Name of the Hive database
Name of the Hive table
Whether the Hive table is partitioned

You can obtain this information by querying the ALL_HIVE_TABLES data dictionary view. It displays information about all Hive tables that you can access from Oracle Database. This example shows that the current user has access to an unpartitioned Hive table named RATINGS_HIVE_TABLE in the default database. A user named JDOE is the owner.

SQL> SELECT cluster_id, database_name, owner, table_name, partitioned
     FROM all_hive_tables;

CLUSTER_ID  DATABASE_NAME  OWNER  TABLE_NAME          PARTITIONED
----------  -------------  -----  ------------------  --------------
hadoop1     default        jdoe   ratings_hive_table  UN-PARTITIONED

You can query these data dictionary views to discover information about the Hive tables that are accessible from Oracle Database.

Using the CREATE_EXTDDL_FOR_HIVE Function

With the information from the data dictionary, you can use the CREATE_EXTDDL_FOR_HIVE function of DBMS_HADOOP. This example specifies a database table name of RATINGS_DB_TABLE in the current schema. The function returns the text of the CREATE TABLE command in a local variable named DDLout, but does not execute it.

DECLARE
  DDLout VARCHAR2(4000);
BEGIN
  dbms_hadoop.create_extddl_for_hive(
    CLUSTER_ID      => 'hadoop1',
    DB_NAME         => 'default',
    HIVE_TABLE_NAME => 'ratings_hive_table',
    HIVE_PARTITION  => FALSE,
    TABLE_NAME      => 'ratings_db_table',
    PERFORM_DDL     => FALSE,
    TEXT_OF_DDL     => DDLout
  );
  dbms_output.put_line(DDLout);
END;
/

When this procedure runs, the PUT_LINE function displays the CREATE TABLE command:

CREATE TABLE ratings_db_table (
  c0 VARCHAR2(4000),
  c1 VARCHAR2(4000),
  c2 VARCHAR2(4000),
  c3 VARCHAR2(4000),
  c4 VARCHAR2(4000),
  c5 VARCHAR2(4000),
  c6 VARCHAR2(4000),
  c7 VARCHAR2(4000))
ORGANIZATION EXTERNAL
  (TYPE ORACLE_HIVE
   DEFAULT DIRECTORY DEFAULT_DIR
   ACCESS PARAMETERS (
     com.oracle.bigdata.cluster=hadoop1
     com.oracle.bigdata.tablename=default.ratings_hive_table
   )
  )
PARALLEL 2 REJECT LIMIT UNLIMITED

You can capture this information in a SQL script, and use the access parameters to change the Oracle table name, the column names, and the data types as desired before executing it. You might also use access parameters to specify a date format mask. The ALL_HIVE_COLUMNS view shows how the default column names and data types are derived. This example shows that the Hive column names are C0 to C7, and that the Hive STRING data type maps to VARCHAR2(4000):

SQL> SELECT table_name, column_name, hive_column_type, oracle_column_type
     FROM all_hive_columns;

TABLE_NAME           COLUMN_NAME  HIVE_COLUMN_TYPE  ORACLE_COLUMN_TYPE
-------------------  -----------  ----------------  ------------------
ratings_hive_table   c0           string            VARCHAR2(4000)
ratings_hive_table   c1           string            VARCHAR2(4000)
ratings_hive_table   c2           string            VARCHAR2(4000)
ratings_hive_table   c3           string            VARCHAR2(4000)
ratings_hive_table   c4           string            VARCHAR2(4000)
ratings_hive_table   c5           string            VARCHAR2(4000)
ratings_hive_table   c6           string            VARCHAR2(4000)
ratings_hive_table   c7           string            VARCHAR2(4000)

8 rows selected.

Developing a CREATE TABLE Statement for ORACLE_HIVE

You can choose between using DBMS_HADOOP and developing a CREATE TABLE statement from scratch. In either case, you may need to set some access parameters to modify the default behavior of ORACLE_HIVE.

Using the Default ORACLE_HIVE Settings

The following statement creates an external table named ORDER to access Hive data:

CREATE TABLE order (cust_num VARCHAR2(10),
                    order_num VARCHAR2(20),
                    description VARCHAR2(100),
                    order_total NUMBER(8,2))
  ORGANIZATION EXTERNAL (TYPE oracle_hive);

Because no access parameters are set in the statement, the ORACLE_HIVE access driver uses the default settings to do the following:

Connects to the default Hadoop cluster.
Uses a Hive table named order. An error results if the Hive table does not have fields named CUST_NUM, ORDER_NUM, DESCRIPTION, and ORDER_TOTAL.
Sets the value of a field to NULL if there is a conversion error, such as a CUST_NUM value longer than 10 bytes.

Overriding the Default ORACLE_HIVE Settings

You can set properties in the ACCESS PARAMETERS clause of the external table clause, which override the default behavior of the access driver. The following clause includes the com.oracle.bigdata.overflow access parameter. When this clause is used in the previous example, it truncates the data for the DESCRIPTION column that is longer than 100 characters, instead of throwing an error:

(TYPE oracle_hive
 ACCESS PARAMETERS (
   com.oracle.bigdata.overflow={"action":"truncate", "col":"DESCRIPTION"}
 ))

The next example sets most of the available parameters for ORACLE_HIVE:

CREATE TABLE order (cust_num VARCHAR2(10),
                    order_num VARCHAR2(20),
                    order_date DATE,
                    item_cnt NUMBER,
                    description VARCHAR2(100),
                    order_total NUMBER(8,2))
  ORGANIZATION EXTERNAL (TYPE oracle_hive
    ACCESS PARAMETERS (
      com.oracle.bigdata.tablename: order_db.order_summary
      com.oracle.bigdata.colmap:    {"col":"ITEM_CNT", \
                                     "field":"order_line_item_count"}
      com.oracle.bigdata.overflow:  {"action":"TRUNCATE", \
                                     "col":"DESCRIPTION"}
      com.oracle.bigdata.erroropt:  [{"action":"replace", \
                                      "value":"INVALID_NUM", \
                                      "col":["CUST_NUM","ORDER_NUM"]}, \
                                     {"action":"reject", \
                                      "col":"ORDER_TOTAL"}]
    ));

The parameters make the following changes in the way that the ORACLE_HIVE access driver locates the data and handles error conditions:

com.oracle.bigdata.tablename: Handles differences in table names. ORACLE_HIVE looks for a Hive table named ORDER_SUMMARY in the ORDER_DB database.
com.oracle.bigdata.colmap: Handles differences in column names. The Hive ORDER_LINE_ITEM_COUNT field maps to the Oracle ITEM_CNT column.
com.oracle.bigdata.overflow: Truncates string data. Values longer than 100 characters for the DESCRIPTION column are truncated.
com.oracle.bigdata.erroropt: Replaces bad data. Errors in the data for CUST_NUM or ORDER_NUM set the value to INVALID_NUM.

Creating an External Table for HDFS Files

The ORACLE_HDFS access driver enables you to access many types of data that are stored in HDFS, but which do not have Hive metadata. You can define the record format of text data, or you can specify a SerDe for a particular data format. You must create the external table for HDFS files manually, and provide all the information the access driver needs to locate the data, and parse the records and fields. The following are some examples of CREATE TABLE ORGANIZATION EXTERNAL statements.

Using the Default Access Parameters with ORACLE_HDFS

The following statement creates a table named ORDER to access the data in all files stored in the /usr/cust/summary directory in HDFS:

CREATE TABLE order (cust_num VARCHAR2(10),
                    order_num VARCHAR2(20),
                    order_total NUMBER(8,2))
  ORGANIZATION EXTERNAL (TYPE oracle_hdfs)
  LOCATION ('hdfs:/usr/cust/summary/*');

Because no access parameters are set in the statement, the ORACLE_HDFS access driver uses the default settings to do the following:

Connects to the default Hadoop cluster.
Reads the files as delimited text, and the fields as type STRING.
Assumes that the number of fields in the HDFS files matches the number of columns (three in this example).
Assumes the fields are in the same order as the columns, so that CUST_NUM data is in the first field, ORDER_NUM data is in the second field, and ORDER_TOTAL data is in the third field.
Rejects any record in which a value causes a data conversion error: if the value for CUST_NUM exceeds 10 characters, the value for ORDER_NUM exceeds 20 characters, or the value of ORDER_TOTAL cannot be converted to NUMBER.

Overriding the Default ORACLE_HDFS Settings

You can use many of the same access parameters with ORACLE_HDFS as with ORACLE_HIVE.

Accessing a Delimited Text File

The following example is equivalent to the one shown in Overriding the Default ORACLE_HIVE Settings. The external table accesses a delimited text file stored in HDFS.

CREATE TABLE order (cust_num VARCHAR2(10),
                    order_num VARCHAR2(20),
                    order_date DATE,
                    item_cnt NUMBER,
                    description VARCHAR2(100),
                    order_total NUMBER(8,2))
  ORGANIZATION EXTERNAL (TYPE oracle_hdfs
    ACCESS PARAMETERS (
      com.oracle.bigdata.colmap:   {"col":"item_cnt", \
                                    "field":"order_line_item_count"}
      com.oracle.bigdata.overflow: {"action":"TRUNCATE", \
                                    "col":"DESCRIPTION"}
      com.oracle.bigdata.erroropt: [{"action":"replace", \
                                     "value":"INVALID_NUM", \
                                     "col":["CUST_NUM","ORDER_NUM"]}, \
                                    {"action":"reject", \
                                     "col":"ORDER_TOTAL"}]
    )
    LOCATION ('hdfs:/usr/cust/summary/*'));

The parameters make the following changes in the way that the ORACLE_HDFS access driver locates the data and handles error conditions:

com.oracle.bigdata.colmap: Handles differences in column names. ORDER_LINE_ITEM_COUNT in the HDFS files matches the ITEM_CNT column in the external table.
com.oracle.bigdata.overflow: Truncates string data. Values longer than 100 characters for the DESCRIPTION column are truncated.
com.oracle.bigdata.erroropt: Replaces bad data. Errors in the data for CUST_NUM or ORDER_NUM set the value to INVALID_NUM.

Accessing Avro Container Files

The next example uses a SerDe to access Avro container files.

CREATE TABLE order (cust_num VARCHAR2(10),
                    order_num VARCHAR2(20),
                    order_date DATE,
                    item_cnt NUMBER,
                    description VARCHAR2(100),
                    order_total NUMBER(8,2))
  ORGANIZATION EXTERNAL (TYPE oracle_hdfs
    ACCESS PARAMETERS (
      com.oracle.bigdata.rowformat: \
        SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
      com.oracle.bigdata.fileformat: \
        INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' \
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
      com.oracle.bigdata.colmap:   {"col":"item_cnt", \
                                    "field":"order_line_item_count"}
      com.oracle.bigdata.overflow: {"action":"TRUNCATE", \
                                    "col":"DESCRIPTION"}
    )
    LOCATION ('hdfs:/usr/cust/summary/*'));

The access parameters provide the following information to the ORACLE_HDFS access driver:

com.oracle.bigdata.rowformat: Identifies the SerDe that the access driver needs to use to parse the records and fields. The files are not in delimited text format.
com.oracle.bigdata.fileformat: Identifies the Java classes that can extract records and output them in the desired format.
com.oracle.bigdata.colmap: Handles differences in column names. ORACLE_HDFS matches ORDER_LINE_ITEM_COUNT in the HDFS files with the ITEM_CNT column in the external table.
com.oracle.bigdata.overflow: Truncates string data. Values longer than 100 characters for the DESCRIPTION column are truncated.

5.11 ABOUT THE EXTERNAL TABLE CLAUSE

CREATE TABLE ORGANIZATION EXTERNAL takes the external_table_clause as its argument. It has the following subclauses:

TYPE Clause
DEFAULT DIRECTORY Clause
LOCATION Clause
REJECT LIMIT Clause
ORACLE_HIVE Access Parameters

TYPE Clause

The TYPE clause identifies the access driver. The type of access driver determines how the other parts of the external table definition are interpreted. Specify one of the following values for Oracle Big Data SQL:

ORACLE_HDFS: Accesses files in an HDFS directory.
ORACLE_HIVE: Accesses a Hive table.

The ORACLE_DATAPUMP and ORACLE_LOADER access drivers are not associated with Oracle Big Data SQL.

DEFAULT DIRECTORY Clause

The DEFAULT DIRECTORY clause identifies an Oracle Database directory object. The directory object identifies an operating system directory with files that the external table reads and writes. ORACLE_HDFS and ORACLE_HIVE use the default directory solely to write log files on the Oracle Database system.

LOCATION Clause

The LOCATION clause identifies the data source.

ORACLE_HDFS LOCATION Clause

The LOCATION clause for ORACLE_HDFS contains a comma-separated list of file locations. The files must reside in the HDFS file system on the default cluster. A location can be any of the following:

A fully qualified HDFS directory name, such as /user/hive/warehouse/hive_seed/hive_types. ORACLE_HDFS uses all files in the directory.
A fully qualified HDFS file name, such as /user/hive/warehouse/hive_seed/hive_types/hive_types.csv.
A URL for an HDFS file or a set of files, such as hdfs:/user/hive/warehouse/hive_seed/hive_types/*. Just a directory name is invalid.

The file names can contain any pattern-matching character described in the next table.

Pattern-Matching Characters

?               Matches any one character.
*               Matches zero or more characters.
[abc]           Matches one character in the set {a, b, c}.
[a-b]           Matches one character in the range {a...b}. The character must be less than or equal to b.
[^a]            Matches one character that is not in the character set or range {a}. The carat (^) must immediately follow the left bracket, with no spaces.
\c              Removes any special meaning of c. The backslash is the escape character.
{ab\,cd}        Matches a string from the set {ab, cd}. The escape character (\) removes the meaning of the comma as a path separator.
{ab\,c{de\,fh}} Matches a string from the set {ab, cde, cfh}. The escape character (\) removes the meaning of the comma as a path separator.

ORACLE_HIVE LOCATION Clause

Do not specify the LOCATION clause for ORACLE_HIVE; it raises an error.

The data is stored in Hive, and the access parameters and the metadata store provide the necessary information.

REJECT LIMIT Clause

Limits the number of conversion errors permitted during a query of the external table before Oracle Database stops the query and returns an error. Any processing error that causes a row to be rejected counts against the limit. The reject limit applies individually to each parallel query (PQ) process. It is not the total of all rejected rows for all PQ processes.

ACCESS PARAMETERS Clause

The ACCESS PARAMETERS clause provides information that the access driver needs to load the data correctly into the external table. See CREATE TABLE ACCESS PARAMETERS Clause.

5.12 ABOUT DATA TYPE CONVERSIONS

When the access driver loads data into an external table, it verifies that the Hive data can be converted to the data type of the target column. If they are incompatible, then the access driver returns an error. Otherwise, it makes the appropriate data conversion.

Hive typically provides a table abstraction layer over data stored elsewhere, such as in HDFS files. Hive uses a serializer/deserializer (SerDe) to convert the data as needed from its stored format into a Hive data type. The access driver then converts the data from its Hive data type to an Oracle data type. For example, if a Hive table over a text file has a BIGINT column, then the SerDe converts the data from text to BIGINT. The access driver then converts the data from BIGINT (a Hive data type) to NUMBER (an Oracle data type).

Performance is better when one data type conversion is performed instead of two. The data types for the fields in the HDFS files should therefore indicate the data that is actually stored on disk. For example, JSON is a clear-text format, therefore all data in a JSON file is text. If the Hive type for a field is DATE, then the SerDe converts the data from string (in the data file) to a Hive date, and the access driver then converts the data from a Hive date to an Oracle date. However, if the Hive type for the field is string, then the SerDe does not perform a conversion, and the access driver converts the data from string to an Oracle date. Queries against the external table are faster in the second example, because the access driver performs the only data conversion.

The next table identifies the data type conversions that ORACLE_HIVE can make when loading data into an external table. The Oracle target data types are grouped as follows:

A = VARCHAR2, CHAR, NCHAR2, NCHAR, CLOB
B = NUMBER, FLOAT, BINARY_NUMBER, BINARY_FLOAT
C = BLOB
D = RAW
E = DATE, TIMESTAMP, TIMESTAMP WITH TZ, TIMESTAMP WITH LOCAL TZ
F = INTERVAL YEAR TO MONTH, INTERVAL DAY TO SECOND

Supported Hive to Oracle Data Type Conversions

Hive Data Type                   A        B        C        D     E     F
INT, SMALLINT, TINYINT, BIGINT   yes      yes      yes      yes   no    no
DOUBLE, FLOAT                    yes      yes      yes      yes   no    no
DECIMAL                          yes      yes      no       no    no    no
BOOLEAN                          yes (1)  yes (2)  yes (2)  yes   no    no
BINARY                           yes      no       yes      yes   no    no
STRING                           yes      yes      yes      yes   yes   yes
TIMESTAMP                        yes      no       no       no    yes   no
STRUCT, ARRAY, UNIONTYPE, MAP    yes      no       no       no    no    no

Footnote 1: FALSE maps to the string FALSE, and TRUE maps to the string TRUE.
Footnote 2: FALSE maps to 0, and TRUE maps to 1.

5.13 QUERYING EXTERNAL TABLES

Users can query external tables using the SQL SELECT statement, the same way they query any other table.

Granting User Access

Users who query the data on a Hadoop cluster must have READ access in Oracle Database to the external table and to the database directory object that points to the cluster directory.

About Error Handling

By default, a query returns no data if an error occurs while the value of a column is calculated. Processing continues after most errors, particularly those thrown while the column values are calculated. Use the com.oracle.bigdata.erroropt parameter to determine how errors are handled.

About the Log Files

You can use these access parameters to customize the log files:

com.oracle.bigdata.log.exec
com.oracle.bigdata.log.qc
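A sketch of the grants described above, assuming a hypothetical external table ORDER_EXT, a cluster directory object named BDA_CLUSTER_DIR, and a user named ANALYST (all three names are illustrative):

GRANT READ ON order_ext TO analyst;
GRANT READ ON DIRECTORY bda_cluster_dir TO analyst;

-- the user can then query the external table like any other table
SELECT COUNT(*) FROM order_ext;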

5.14 ABOUT ORACLE BIG DATA SQL ON ORACLE EXADATA DATABASE MACHINE

Oracle Big Data SQL runs exclusively on systems with Oracle Big Data Appliance connected to Oracle Exadata Database Machine. The Oracle Exadata Storage Server Software is deployed on a configurable number of Oracle Big Data Appliance servers. These servers combine the functionality of a CDH node and an Oracle Exadata Storage Server. The Mammoth utility installs the Big Data SQL software on both Oracle Big Data Appliance and Oracle Exadata Database Machine. The information in this section explains the changes that Mammoth makes to the Oracle Database system.

Oracle SQL Connector for HDFS provides access to Hadoop data for all Oracle Big Data Appliance racks, including those that are not connected to Oracle Exadata Database Machine. However, it does not offer the performance benefits of Oracle Big Data SQL, and it is not included under the Oracle Big Data Appliance license.

Starting and Stopping the Big Data SQL Agent

The agtctl utility starts and stops the multithreaded Big Data SQL agent. It has the following syntax:

agtctl {startup | shutdown} bds_clustername
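For example, to restart the agent for a cluster named bda1_cl_1 (the cluster name is illustrative, matching the example used in the cluster directory section below):

agtctl shutdown bda1_cl_1
agtctl startup bda1_cl_1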

About the Common Directory

The common directory contains configuration information that is common to all Hadoop clusters. This directory is located on the Oracle Database system under the Oracle home directory. The oracle file system user (or whichever user owns the Oracle Database instance) owns the common directory. A database directory named ORACLE_BIGDATA_CONFIG points to the common directory.

Common Configuration Properties

The Mammoth installation process creates the following files and stores them in the common directory:

bigdata.properties
bigdata-log4j.properties

The Oracle DBA can edit these configuration files as necessary.

bigdata.properties

The bigdata.properties file in the common directory contains property-value pairs that define the Java class paths and native library paths required for accessing data in HDFS. These properties must be set:

bigdata.cluster.default
java.classpath.hadoop
java.classpath.hive
java.classpath.oracle

The following list describes all properties permitted in bigdata.properties.

bigdata.cluster.default: The name of the default Hadoop cluster. The access driver uses this name when the access parameters do not specify a cluster. Required. Changing the default cluster name might break external tables that were created previously without an explicit cluster name.

bigdata.cluster.list: A comma-separated list of Hadoop cluster names. Optional.

java.classpath.hadoop: The Hadoop class path. Required.

java.classpath.hive: The Hive class path. Required.

java.classpath.oracle: The path to the Oracle JXAD Java JAR file. Required.

java.classpath.user: The path to user JAR files. Optional.

java.libjvm.file: The full file path to the JVM shared library (such as libjvm.so). Required.

java.options: A comma-separated list of options to pass to the JVM. Optional. This example sets the maximum heap size to 2 GB, and verbose logging for Java Native Interface (JNI) calls:

-Xmx2048m,-verbose=jni

LD_LIBRARY_PATH: A colon-separated (:) list of directory paths to search for the Hadoop native libraries. Recommended. If you set this option, then do not set java.library.path in java.options.

The next example shows a sample bigdata.properties file.

# bigdata.properties
#
# Copyright (c) 2014, Oracle and/or its affiliates. All rights reserved.
#
# NAME
#   bigdata.properties - Big Data Properties File
#
# DESCRIPTION
#   Properties file containing parameters for allowing access to Big Data
#   Fixed value properties can be added here
#
java.libjvm.file=$ORACLE_HOME/jdk/jre/lib/amd64/server/libjvm.so
java.classpath.oracle=$ORACLE_HOME/hadoopcore/jlib/*:$ORACLE_HOME/hadoop/jlib/hver-2/*:$ORACLE_HOME/dbjava/lib/*
java.classpath.hadoop=$HADOOP_HOME/*:$HADOOP_HOME/lib/*
java.classpath.hive=$HIVE_HOME/lib/*
LD_LIBRARY_PATH=$ORACLE_HOME/jdk/jre/lib
bigdata.cluster.default=hadoop_cl_1

bigdata-log4j.properties

The bigdata-log4j.properties file in the common directory defines the logging behavior of queries against external tables in the Java code. Any log4j properties are allowed in this file. The next example shows a sample bigdata-log4j.properties file with the relevant log4j properties.

# bigdata-log4j.properties
#
# Copyright (c) 2014, Oracle and/or its affiliates. All rights reserved.
#
# NAME
#   bigdata-log4j.properties - Big Data Logging Properties File
#
# DESCRIPTION
#   Properties file containing logging parameters for Big Data
#   Fixed value properties can be added here
#
bigsql.rootlogger=INFO,console
log4j.rootLogger=DEBUG, file
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.Target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
log4j.logger.oracle.hadoop.sql=ALL, file
bigsql.log.dir=.
bigsql.log.file=bigsql.log
log4j.appender.file.File=$ORACLE_HOME/bigdatalogs/bigdata-log4j.log

About the Cluster Directory

The cluster directory contains configuration information for a CDH cluster. Each cluster that Oracle Database will access using Oracle Big Data SQL has a cluster directory. This directory is located on the Oracle Database system under the common directory. For example, a cluster named bda1_cl_1 would have a directory by the same name (bda1_cl_1) in the common directory. The cluster directory contains the CDH client configuration files for accessing the cluster, such as the following:

core-site.xml
hdfs-site.xml
hive-site.xml
mapred-site.xml (optional)
log4j property files (such as hive-log4j.properties)

A database directory object points to the cluster directory. Users who want to access the data in a cluster must have read access to the directory object.

About Permissions

The oracle operating system user (or whatever user owns the Oracle Database installation directory) must have the following setup:

READ/WRITE access to the database directory that points to the log directory. These permissions enable the access driver to create the log files, and the user to read them.
A corresponding oracle operating system user defined on Oracle Big Data Appliance, with READ access in the operating system to the HDFS directory where the source data is stored.

Chapter 6. HIVE USER DEFINED FUNCTIONS (UDFs)

6.1 INTRODUCTION

User-defined functions (UDFs) have a long history of usefulness in SQL-derived languages. While query languages can be rich in their expressiveness, there's just no way they can anticipate all the things a developer wants to do. Thus, the custom UDF has become commonplace in our data manipulation toolbox. Apache Hive is no different in this respect from other SQL-like languages. Hive allows extensibility via both Hadoop Streaming and compiled Java.

However, largely because of the underlying MapReduce paradigm, all Hive UDFs are not created equally. Some UDFs are intended for map-side execution, while others are portable and can be run on the reduce side. Moreover, UDF behavior via streaming requires that queries be formatted so as to direct script execution where we desire it. The intricacies of where and how a UDF executes may seem like minutiae, but we would be disappointed if the time spent coding a cumulative sum UDF produced a function that only executed on single rows. To that end, I'm going to spend the rest of this chapter diving into the three primary types of Java-based UDFs in Hive.

The Three Little UDFs

Hive provides three classes of UDFs that most users are interested in: UDFs, UDTFs, and UDAFs. Broken down simply, the three classes can be explained as follows:

UDFs - User Defined Functions; these operate row-wise, generally during map execution. They're the simplest UDFs to write, but constrained in their functionality.
UDTFs - User Defined Table-Generating Functions; these also execute row-wise, but they produce multiple rows of output (i.e., they generate a table). The most common example of this is Hive's explode function.
UDAFs - User Defined Aggregating Functions; these can execute on either the map side or the reduce side and are far more flexible than UDFs. The challenge, however, is that in writing UDAFs we have to think not just about what to do with a single row, or even a group of rows: here, one has to consider partial aggregation and serialization between the map and reduce processes.

6.2 THREE LITTLE HIVE UDFS: EXAMPLE 1

6.2.1 Introduction

In the ongoing series of posts explaining the ins and outs of Hive user-defined functions, we're starting with the simplest case. Of the three little UDFs, today's entry built a straw house: simple, easy to put together, but limited in applicability. We'll walk through the important parts of the code, but you can grab the whole source from GitHub.

6.2.2 Extending UDF

The first few lines of interest are straightforward:

@Description(name = "moving_avg",
             value = "_FUNC_(x, n) - Returns the moving mean of a set of numbers over a window of n observations")
@UDFType(deterministic = false, stateful = true)
public class UDFSimpleMovingAverage extends UDF

We're extending the UDF class with some decoration. The decoration is important for usability and functionality. The Description decorator allows us to give Hive some information to show users about how to use our UDF and what its method signature will be. The UDFType decoration tells Hive what sort of behavior to expect from our function.

A deterministic UDF will always return the same output given a particular input. A square-root-computing UDF will always return the same square root for 4, so we can say it is deterministic; a call to get the system time would not be. The stateful annotation of the UDFType decoration is relatively new to Hive (e.g., CDH4 and above). The stateful directive allows Hive to keep some static variables available across rows. The simplest example of this is a row-sequence, which maintains a static counter which increments with each row processed.

Since square roots and row counting aren't terribly interesting, we'll use the stateful annotation to build a simple moving average function. We'll return to the notion of a moving average later when we build a UDAF, so as to compare the two approaches.

  private DoubleWritable result = new DoubleWritable();
  private static ArrayDeque<Double> window;
  int windowSize;

  public UDFSimpleMovingAverage() {
    result.set(0);
  }

The above code is basic initialization. We make a double in which to hold the result, but it needs to be of class DoubleWritable so that MapReduce can properly serialize the data. We use a deque to hold our sliding window, and we need to keep track of the window's size. Finally, we implement a constructor for the UDF class.

  public DoubleWritable evaluate(DoubleWritable v, IntWritable n)
  {
    double sum = 0.0;
    double moving_average;
    double residual;

    if (window == null)
    {
      window = new ArrayDeque<Double>();
    }

Here's the meat of the class: the evaluate method. This method will be called on each row by the map tasks. For any given row, we can't say whether or not our sliding window exists, so we initialize it if it's null.

    //slide the window
    if (window.size() == n.get())
      window.pop();

    window.addLast(new Double(v.get()));

    // compute the average
    for (Iterator<Double> i = window.iterator(); i.hasNext();)
      sum += i.next().doubleValue();

Here we deal with the deque and compute the sum of the window's elements. Deques are essentially double-ended queues, so they make excellent sliding windows. If the window is full, we pop the oldest element and add the current value.

    moving_average = sum/window.size();
    result.set(moving_average);
    return result;

Computing the moving average without weighting is simply dividing the sum of our window by its size. We then set that value in our Writable variable and return it. The value is then emitted as part of the map task executing the UDF function.
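Once compiled into a jar, the UDF can be registered and used much like the UDTF example later in this chapter; the jar path, package name, table and columns below are hypothetical, chosen only to illustrate the pattern:

-- add the jar and register the function
add jar /mnt/shared/udf_examples/moving_avg.jar;
CREATE temporary function moving_avg AS 'com.oracle.hive.udf.UDFSimpleMovingAverage';

-- 5-observation moving average of a metric column
SELECT ticker, moving_avg(price, 5) FROM stock_ticks;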

6.3 THREE LITTLE HIVE UDFS: EXAMPLE 2

6.3.1 Introduction

In our ongoing exploration of Hive UDFs, we've covered the basic row-wise UDF. Today we'll move to the UDTF, which generates multiple rows for every row processed. This UDF built its house from sticks: it's slightly more complicated than the basic UDF and allows us an opportunity to explore how Hive functions manage type checking.

6.3.2 Extending GenericUDTF

Our UDTF is going to produce pairwise combinations of elements in a comma-separated string. So, for a string column "Apples, Bananas, Carrots" we'll produce three rows:

Apples, Bananas
Apples, Carrots
Bananas, Carrots

As with the UDF, the first few lines are a simple class extension with a decorator so that Hive can describe what the function does:

@Description(name = "pairwise",
             value = "_FUNC_(doc) - emits pairwise combinations of an input array")
public class PairwiseUDTF extends GenericUDTF {

  private PrimitiveObjectInspector stringOI = null;

We also create an object of PrimitiveObjectInspector, which we'll use to ensure that the input is a string. Once this is done, we need to override methods for initialization, row processing, and cleanup:

  @Override
  public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException
  {
    if (args.length != 1) {
      throw new UDFArgumentException("pairwise() takes exactly one argument");
    }

    if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE
        && ((PrimitiveObjectInspector) args[0]).getPrimitiveCategory() !=
           PrimitiveObjectInspector.PrimitiveCategory.STRING) {
      throw new UDFArgumentException("pairwise() takes a string as a parameter");
    }

    stringOI = (PrimitiveObjectInspector) args[0];

This UDTF is going to return an array of structs, so the initialize method needs to return a StructObjectInspector object. Note that the arguments to the constructor come in as an array of ObjectInspector objects. This allows us to handle arguments in a normal fashion but with the benefit of methods to broadly inspect type. We only allow a single argument (the string column to be processed), so we check the length of the array and validate that the sole element is both a primitive and a string. The second half of the initialize method is more interesting:

    List<String> fieldNames = new ArrayList<String>(2);
    List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(2);
    fieldNames.add("memberA");
    fieldNames.add("memberB");
    fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

    return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
  }

Here we set up information about what the UDTF returns. We need this in place before we start processing rows, otherwise Hive can't correctly build execution plans before submitting jobs to MapReduce. The structures we're returning will be two strings per struct, which means we'll need ObjectInspector objects for both the values and the names of the fields. We create two lists: one of strings for the names, the other of ObjectInspector objects. We pack them manually and then use a factory to get the StructObjectInspector which defines the actual return value. Now we're ready to actually do some processing, so we override the process method:

  @Override
  public void process(Object[] record) throws HiveException {
    final String document = (String) stringOI.getPrimitiveJavaObject(record[0]);

    if (document == null) {
      return;
    }

    String[] members = document.split(",");

    java.util.Arrays.sort(members);

    for (int i = 0; i < members.length - 1; i++)
      for (int j = 1; j < members.length; j++)
        if (!members[i].equals(members[j]))
          forward(new Object[] {members[i], members[j]});
  }

This is simple pairwise expansion, so the logic isn't anything more than a nested for-loop. There are, though, some interesting things to note. First, to actually get a string object to operate on, we have to use an ObjectInspector and some typecasting. This allows us to bail out early if the column value is null. Once we have the string, splitting, sorting, and looping is textbook stuff.

The last notable piece is that the process method does not return anything. Instead, we call forward to emit our newly created structs. For those used to database internals, this follows the producer-consumer model of most RDBMSs. For those used to MapReduce semantics, this is equivalent to calling write on the Context object.

  @Override
  public void close() throws HiveException {
    // do nothing
  }

If there were any cleanup to do, we'd take care of it here. But this is simple emission, so our override doesn't need to do anything.

6.3.3 Using the UDTF

Once we've built our UDTF, we can access it via Hive by adding the jar and assigning it to a temporary function. However, mixing the results of a UDTF with other columns from the base table requires that we use a LATERAL VIEW.

#Add the Jar
add jar /mnt/shared/market_basket_example/pairwise.jar;

#Create a function
CREATE temporary function pairwise AS 'com.oracle.hive.udtf.PairwiseUDTF';

# view the pairwise expansion output
SELECT m1, m2, COUNT(*)
FROM market_basket
LATERAL VIEW pairwise(basket) pwise AS m1, m2
GROUP BY m1, m2;

6.4 THREE LITTLE HIVE UDFS: EXAMPLE 3

6.4.1 Introduction

In the final installment in the series on Hive UDFs, we're going to tackle the least intuitive of the three types: the User Defined Aggregating Function. While they're challenging to implement, UDAFs are necessary if we want functions for which the distinction between map-side and reduce-side operations is opaque to the user. If a user is writing a query, most would prefer to focus on the data they're trying to compute, not on which part of the plan is running a given function.

The UDAF also provides a valuable opportunity to consider some of the nuances of distributed programming and parallel database operations. Since each task in a MapReduce job operates in a bit of a vacuum (e.g. Map task A does not know what data Map task B has), a UDAF has to explicitly account for more operational states than a simple UDF. We'll return to the notion of a simple moving average function, but ask yourself: how do we compute a moving average if we don't have state or order around the data?

6.4.2 Prefix Sum: Moving Average without State

In order to compute a moving average without state, we're going to need a specialized parallel algorithm. For moving average, the trick is to use a prefix sum, effectively keeping a table of running totals for quick computation (and recomputation) of our moving average. A full discussion of prefix sums for moving averages is beyond the scope of this section, but John Jenq provides an excellent discussion of the technique as applied to CUDA implementations. What we'll cover here is the necessary implementation of a pair of classes to store and operate on our prefix sum entry within the UDAF.

public class PrefixSumMovingAverage {
  static class PrefixSumEntry implements Comparable
  {
    int period;
    double value;
    double prefixSum;
    double subsequenceTotal;
    double movingAverage;

    public int compareTo(Object other)
    {
      PrefixSumEntry o = (PrefixSumEntry)other;
      if (period < o.period)
        return -1;
      if (period > o.period)
        return 1;
      return 0;
    }
  }

Here we have the definition of our moving average class and the static inner class which serves as an entry in our table. What's important here are some of the variables we define for each entry in the table: the time index or period of the value (its order), the value itself, the prefix sum, the subsequence total, and the moving average itself.

Every entry in our table requires not just the current value to compute the moving average, but also the sum of entries in our moving average window. It's the pair of these two values which allows prefix sum methods to work their magic.

  //class variables
  private int windowSize;
  private ArrayList<PrefixSumEntry> entries;

  public PrefixSumMovingAverage() {
    windowSize = 0;
  }

  public void reset() {
    windowSize = 0;
    entries = null;
  }

  public boolean isReady() {
    return (windowSize > 0);
  }

The above are simple initialization routines: a constructor, a method to reset the table, and a boolean method on whether or not the object has a prefix sum table on which to operate. From here, there are three important methods to examine: add, merge, and serialize. The first is intuitive: as we scan rows in Hive we want to add them to our prefix sum table. The other two are important because of partial aggregation.

We cannot say ahead of time where this UDAF will run, and partial aggregation may be required. That is, it's entirely possible that some values may run through the UDAF during a map task, but then be passed to a reduce task to be combined with other values. The serialize method will allow Hive to pass the partial results from the map side to the reduce side. The merge method allows reducers to combine the results of partial aggregations from the map tasks.

  @SuppressWarnings("unchecked")
  public void add(int period, double v) {
    //Add a new entry to the list and update the table
    PrefixSumEntry e = new PrefixSumEntry();
    e.period = period;
    e.value = v;
    entries.add(e);
    // do we need to ensure this is sorted?
    //if (needsSorting(entries))
    Collections.sort(entries);
    // update the table
    // prefix sums first
    double prefixSum = 0;
    for (int i = 0; i < entries.size(); i++) {
      PrefixSumEntry thisEntry = entries.get(i);
      prefixSum += thisEntry.value;
      thisEntry.prefixSum = prefixSum;
      entries.set(i, thisEntry);
    }

The first part of the add method is simple: we add the element to the list and update our table's prefix sums.

    // now do the subsequence totals and moving averages
    for (int i = 0; i < entries.size(); i++) {
      double subsequenceTotal;
      double movingAverage;

      PrefixSumEntry thisEntry = entries.get(i);
      PrefixSumEntry backEntry = null;
      if (i >= windowSize)
        backEntry = entries.get(i - windowSize);
      if (backEntry != null) {
        subsequenceTotal = thisEntry.prefixSum - backEntry.prefixSum;
      } else {
        subsequenceTotal = thisEntry.prefixSum;
      }
      movingAverage = subsequenceTotal / (double) windowSize;
      thisEntry.subsequenceTotal = subsequenceTotal;
      thisEntry.movingAverage = movingAverage;
      entries.set(i, thisEntry);
    }

In the second half of the add function, we compute our moving averages based on the prefix sums. It's here you can see the hinge on which the algorithm swings: thisEntry.prefixSum - backEntry.prefixSum. That offset between the current table entry and its nth predecessor makes the whole thing work.

  public ArrayList<DoubleWritable> serialize() {
    ArrayList<DoubleWritable> result = new ArrayList<DoubleWritable>();
    result.add(new DoubleWritable(windowSize));
    if (entries != null) {
      for (PrefixSumEntry i : entries)

      {
        result.add(new DoubleWritable(i.period));
        result.add(new DoubleWritable(i.value));
      }
    }
    return result;
  }

The serialize method needs to package the results of our algorithm to pass to another instance of the same algorithm, and it needs to do so in a type that Hadoop can serialize. In the case of a method like sum, this would be relatively simple: we would only need to pass the sum up to this point. However, because we cannot be certain whether this instance of our algorithm has seen all the values, or seen them in the correct order, we actually need to serialize the whole table. To do this, we create a list of DoubleWritables, pack the window size at its head, and then add each period and value. This gives us a structure that's easy to unpack and merge with other lists with the same structure.

  @SuppressWarnings("unchecked")
  public void merge(List<DoubleWritable> other) {
    if (other == null)
      return;
    // if this is an empty buffer, just copy in other
    // but deserialize the list
    if (windowSize == 0) {
      windowSize = (int) other.get(0).get();
      entries = new ArrayList<PrefixSumEntry>();
      // we're serialized as period, value, period, value
      for (int i = 1; i < other.size(); i += 2) {
        PrefixSumEntry e = new PrefixSumEntry();

        e.period = (int) other.get(i).get();
        e.value = other.get(i + 1).get();
        entries.add(e);
      }
    }

Merging results is perhaps the most complicated thing we need to handle. First, we check the case in which there was no partial result passed: just return and continue. Second, we check to see if this instance of PrefixSumMovingAverage already has a table. If it doesn't, we can simply unpack the serialized result and treat it as our window.

    // if we already have a buffer, we need to add these entries
    else {
      // we're serialized as period, value, period, value
      for (int i = 1; i < other.size(); i += 2) {
        PrefixSumEntry e = new PrefixSumEntry();
        e.period = (int) other.get(i).get();
        e.value = other.get(i + 1).get();
        entries.add(e);
      }
    }

The third case is the non-trivial one: if this instance has a table and receives a serialized table, we must merge them together. Consider a Reduce task: as it receives outputs from multiple Map tasks, it needs to merge all of them together to form a larger table. Thus, merge will be called many times to add these results and reassemble a larger time series.

    // sort and recompute
    Collections.sort(entries);
    // update the table
    // prefix sums first
    double prefixSum = 0;

    for (int i = 0; i < entries.size(); i++) {
      PrefixSumEntry thisEntry = entries.get(i);
      prefixSum += thisEntry.value;
      thisEntry.prefixSum = prefixSum;
      entries.set(i, thisEntry);
    }

This part should look familiar: it's just like the add method. Now that we have new entries in our table, we need to sort by period and recompute the moving averages. In fact, the rest of the merge method is exactly like the add method, so we might consider putting the sorting and recomputation in a separate method.
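Since that prefix-sum subtraction is the heart of the whole UDAF, a small standalone sketch may make it easier to see. The class below is not part of the Hive code; it is an illustrative example with names of our own choosing, computing a 3-period moving average the same way the add method does.

import java.util.ArrayList;
import java.util.List;

public class PrefixSumDemo {
  public static void main(String[] args) {
    double[] values = {2.0, 4.0, 6.0, 8.0, 10.0};
    int windowSize = 3;

    // Build the table of running totals (the prefix sums).
    List<Double> prefixSums = new ArrayList<Double>();
    double running = 0;
    for (double v : values) {
      running += v;
      prefixSums.add(running);
    }

    // A window total is just prefixSum[i] - prefixSum[i - windowSize];
    // like the UDAF, earlier periods fall back to the prefix sum itself.
    for (int i = 0; i < values.length; i++) {
      double windowTotal = (i >= windowSize)
          ? prefixSums.get(i) - prefixSums.get(i - windowSize)
          : prefixSums.get(i);
      System.out.println("period " + i + ": moving average = " + windowTotal / windowSize);
    }
  }
}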

6.4.3 Orchestrating Partial Aggregation

We've got a clever little algorithm for computing moving average in parallel, but Hive can't do anything with it unless we create a UDAF that understands how to use our algorithm. At this point, we need to start writing some real UDAF code. As before, we extend a generic class, in this case GenericUDAFEvaluator.

public static class GenericUDAFMovingAverageEvaluator extends GenericUDAFEvaluator {

  // input inspectors for PARTIAL1 and COMPLETE
  private PrimitiveObjectInspector periodOI;
  private PrimitiveObjectInspector inputOI;
  private PrimitiveObjectInspector windowSizeOI;

  // input inspectors for PARTIAL2 and FINAL
  // list for MAs and one for residuals
  private StandardListObjectInspector loi;

As in the case of a UDTF, we create ObjectInspectors to handle type checking. However, notice that we have inspectors for different states: PARTIAL1, PARTIAL2, COMPLETE, and FINAL. These correspond to the different states in which our UDAF may be executing. Since our serialized prefix sum table isn't the same input type as the values our add method takes, we need different type checking for each.

  @Override
  public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
    super.init(m, parameters);
    // initialize input inspectors
    if (m == Mode.PARTIAL1 || m == Mode.COMPLETE) {
      assert(parameters.length == 3);
      periodOI = (PrimitiveObjectInspector) parameters[0];

      inputOI = (PrimitiveObjectInspector) parameters[1];
      windowSizeOI = (PrimitiveObjectInspector) parameters[2];
    }

Here's the beginning of our overridden initialization function. We check the parameters for two modes, PARTIAL1 and COMPLETE. Here we assume that the arguments to our UDAF are the same ones the user passes in a query: the period, the input, and the size of the window. If the UDAF instance is consuming the results of our partial aggregation, we need a different ObjectInspector. Specifically, this one:

    else {
      loi = (StandardListObjectInspector) parameters[0];
    }

Similar to the UDTF, we also need type checking on the output types, but for both partial and full aggregation. In the case of partial aggregation, we're returning lists of DoubleWritables:

    // init output object inspectors
    if (m == Mode.PARTIAL1 || m == Mode.PARTIAL2) {
      // The output of a partial aggregation is a list of doubles representing the
      // moving average being constructed.
      // the first element in the list will be the window size
      return ObjectInspectorFactory.getStandardListObjectInspector(
          PrimitiveObjectInspectorFactory.writableDoubleObjectInspector);
    }

But in the case of FINAL or COMPLETE, we're dealing with the types that will be returned to the Hive user, so we need to return a different output. We're going to return a list of structs that contain the period, moving average, and residuals (since they're cheap to compute).

    else {

      // The output of FINAL and COMPLETE is a full aggregation, which is a
      // list of structs holding the period, the moving average, and the residual.
      ArrayList<ObjectInspector> foi = new ArrayList<ObjectInspector>();
      foi.add(PrimitiveObjectInspectorFactory.writableDoubleObjectInspector);
      foi.add(PrimitiveObjectInspectorFactory.writableDoubleObjectInspector);
      foi.add(PrimitiveObjectInspectorFactory.writableDoubleObjectInspector);
      ArrayList<String> fname = new ArrayList<String>();
      fname.add("period");
      fname.add("moving_average");
      fname.add("residual");
      return ObjectInspectorFactory.getStandardListObjectInspector(
          ObjectInspectorFactory.getStandardStructObjectInspector(fname, foi));
    }
  }

Next come methods to control what happens when a Map or Reduce task is finished with its data. In the case of partial aggregation, we need to serialize the data. In the case of full aggregation, we need to package the result for Hive.

  @Override
  public Object terminatePartial(AggregationBuffer agg) throws HiveException {
    // return an ArrayList where the first parameter is the window size
    MaAgg myagg = (MaAgg) agg;
    return myagg.prefixSum.serialize();
  }

  @Override
  public Object terminate(AggregationBuffer agg) throws HiveException {
    // final return value goes here
    MaAgg myagg = (MaAgg) agg;

    if (myagg.prefixSum.tableSize() < 1) {
      return null;
    }
    else {
      ArrayList<DoubleWritable[]> result = new ArrayList<DoubleWritable[]>();
      for (int i = 0; i < myagg.prefixSum.tableSize(); i++) {
        double residual = myagg.prefixSum.getEntry(i).value - myagg.prefixSum.getEntry(i).movingAverage;
        DoubleWritable[] entry = new DoubleWritable[3];
        entry[0] = new DoubleWritable(myagg.prefixSum.getEntry(i).period);
        entry[1] = new DoubleWritable(myagg.prefixSum.getEntry(i).movingAverage);
        entry[2] = new DoubleWritable(residual);
        result.add(entry);
      }
      return result;
    }
  }

We also need to provide instruction on how Hive should merge the results of partial aggregation. Fortunately, we already handled this in our PrefixSumMovingAverage class, so we can just call it.

  @SuppressWarnings("unchecked")
  @Override
  public void merge(AggregationBuffer agg, Object partial) throws HiveException {

    // if we're merging two separate sets we're creating one table that's doubly long
    if (partial != null) {
      MaAgg myagg = (MaAgg) agg;
      List<DoubleWritable> partialMovingAverage = (List<DoubleWritable>) loi.getList(partial);
      myagg.prefixSum.merge(partialMovingAverage);
    }
  }

Of course, merging and serializing isn't very useful unless the UDAF has logic for iterating over values. The iterate method handles this and, as one would expect, relies entirely on the PrefixSumMovingAverage class we created.

  @Override
  public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
    assert (parameters.length == 3);
    if (parameters[0] == null || parameters[1] == null || parameters[2] == null) {
      return;
    }
    MaAgg myagg = (MaAgg) agg;
    // Parse out the window size just once if we haven't done so before.
    // We need a window of at least 1, otherwise there's no window.
    if (!myagg.prefixSum.isReady()) {
      int windowSize = PrimitiveObjectInspectorUtils.getInt(parameters[2], windowSizeOI);

      if (windowSize < 1) {
        throw new HiveException(getClass().getSimpleName() + " needs a window size >= 1");
      }
      myagg.prefixSum.allocate(windowSize);
    }

    //Add the current data point and compute the average
    int p = PrimitiveObjectInspectorUtils.getInt(parameters[0], periodOI);
    double v = PrimitiveObjectInspectorUtils.getDouble(parameters[1], inputOI);
    myagg.prefixSum.add(p, v);
  }

6.4.4 Aggregation Buffers: Connecting Algorithms with Execution

One might notice that the code for our UDAF references an object of type AggregationBuffer quite a lot. This is because the AggregationBuffer is the interface which allows us to connect our custom PrefixSumMovingAverage class to Hive's execution framework. While it doesn't constitute a great deal of code, it's the glue that binds our logic to Hive's execution framework. We implement it as such:

  // Aggregation buffer definition and manipulation methods
  static class MaAgg implements AggregationBuffer {
    PrefixSumMovingAverage prefixSum;
  }

  @Override
  public AggregationBuffer getNewAggregationBuffer() throws HiveException {
    MaAgg result = new MaAgg();
    reset(result);
    return result;
  }

6.4.5 Using the UDAF

The goal of a good UDAF is that, no matter how complicated it was for us to implement, it should be simple for our users. For all that code and parallel thinking, usage of the UDAF is very straightforward:

ADD JAR /mnt/shared/hive_udfs/dist/lib/moving_average_udf.jar;
CREATE TEMPORARY FUNCTION moving_avg AS 'com.oracle.hadoop.hive.ql.udf.generic.GenericUDAFMovingAverage';
#get the moving average for a single tail number
SELECT TailNum, moving_avg(timestring, delay, 4) FROM ts_example WHERE TailNum = 'N967CA' GROUP BY TailNum LIMIT 100;

Here we're applying the UDAF to get the moving average of arrival delay for a particular flight. It's a really simple query for all the work we did underneath. We can do a bit more and leverage Hive's ability to handle complex types as columns; here's a query which creates a table of time series as arrays.

#create a set of moving averages for every plane starting with N
#Note: this UDAF blows up unpleasantly in heap; there will be data volumes for which you need to throw
#excessive amounts of memory at the problem
CREATE TABLE moving_averages AS
SELECT TailNum, moving_avg(timestring, delay, 4) AS timeseries
FROM ts_example
WHERE TailNum LIKE 'N%'
GROUP BY TailNum;

6.4.6 Summary

We've covered all manner of UDFs: from simple class extensions which can be written very easily, to very complicated UDAFs which require us to think about distributed execution and the plan orchestration done by query engines. With any luck, the discussion has provided you with the confidence to go out and implement your own UDFs, or at least to pay some attention to the complexities of the ones in use every day.

Chapter 7. ORACLE NoSQL

7.1 INTRODUCTION

NoSQL databases represent a recent evolution in enterprise application architecture, continuing the evolution of the past twenty years. In the 1990s, vertically integrated applications gave way to client-server architectures, and more recently, client-server architectures gave way to three-tier web application architectures. In parallel, the demands of web-scale data analysis added map-reduce processing into the mix, and data architects started eschewing transactional consistency in exchange for incremental scalability and large-scale distribution. The NoSQL movement emerged out of this second ecosystem.

NoSQL is often characterized by what it's not: depending on whom you ask, it's either not only a SQL-based relational database management system or it's simply not a SQL-based RDBMS. While those definitions explain what NoSQL is not, they do little to explain what NoSQL is. Consider the fundamentals that have guided data management for the past forty years. RDBMS systems and large-scale data management have been characterized by the transactional ACID properties of Atomicity, Consistency, Isolation, and Durability. In contrast, NoSQL is sometimes characterized by the BASE acronym:

Basically Available: Use replication to reduce the likelihood of data unavailability and use sharding, or partitioning the data among many different storage servers, to make any remaining failures partial. The result is a system that is always available, even if subsets of the data become unavailable for short periods of time.

Soft state: While ACID systems assume that data consistency is a hard requirement, NoSQL systems allow data to be inconsistent and relegate designing around such inconsistencies to application developers.

Eventually consistent: Although applications must tolerate instantaneous inconsistency, NoSQL systems ensure that at some future point in time the data assumes a consistent state. In contrast to ACID systems that enforce consistency at transaction commit, NoSQL guarantees consistency only at some undefined future time.

NoSQL emerged as companies such as Amazon, Google, LinkedIn and Twitter struggled to deal with unprecedented data and operation volumes under tight latency constraints. Analyzing high-volume, real-time data, such as web-site click streams, provides significant business advantage by harnessing unstructured and semi-structured data sources to create more business value. Traditional relational databases were not up to the task, so enterprises built upon a decade of research on distributed hash tables (DHTs) and either conventional relational database systems or embedded key/value stores, such as Oracle's Berkeley DB, to develop highly available, distributed key-value stores.

Although some of the early NoSQL solutions built their systems atop existing relational database engines, they quickly realized that such systems were designed for SQL-based access patterns and latency demands that are quite different from those of NoSQL systems, so these same organizations began to develop brand new storage layers. In contrast, Oracle's Berkeley DB product line was the original key/value store; Oracle Berkeley DB Java Edition has been in commercial use for over eight years.

By using Oracle Berkeley DB Java Edition as the underlying storage engine beneath a NoSQL system, Oracle brings enterprise robustness and stability to the NoSQL landscape. Furthermore, until recently, integrating NoSQL solutions with an enterprise application architecture required manual integration and custom development; Oracle's NoSQL Database provides all the desirable features of NoSQL solutions necessary for seamless integration into an enterprise application architecture. The next figure shows a canonical acquire-organize-analyze data cycle, demonstrating how Oracle's NoSQL Database fits into such an ecosystem. Oracle-provided adapters allow the Oracle NoSQL Database to integrate with a Hadoop MapReduce framework or with the Oracle Database for in-database MapReduce, Data Mining, R-based analytics, or whatever business needs demand.

The Oracle NoSQL Database, with its No Single Point of Failure architecture, is the right solution when data access is simple in nature and application demands exceed the volume or latency capability of traditional data management solutions. For example, clickstream data from high-volume web sites, high-throughput event processing, and social networking communications all represent application domains that produce extraordinary volumes of simple keyed data. Monitoring online retail behavior, accessing customer profiles, pulling up appropriate customer ads, and storing and forwarding real-time communication are examples of domains requiring the ultimate in low-latency access. Highly distributed applications such as real-time sensor aggregation and scalable authentication also represent domains well-suited to Oracle NoSQL Database.

7.2 DATA MODEL

Oracle NoSQL Database leverages the Oracle Berkeley DB Java Edition High Availability storage engine to provide distributed, highly-available key/value storage for large-volume, latency-sensitive applications or web services. It can also provide fast, reliable, distributed storage to applications that need to integrate with ETL processing. In its simplest form, Oracle NoSQL Database implements a map from user-defined keys (Strings) to opaque data items. It records version numbers for key/data pairs, but maintains only the single latest version in the store. Applications need never worry about reconciling incompatible versions because Oracle NoSQL Database uses single-master replication; the master node always has the most up-to-date value for a given key, while read-only replicas might have slightly older versions. Applications can use version numbers to ensure consistency for read-modify-write operations.

Oracle NoSQL Database hashes keys to provide good distribution over a collection of computers that provide storage for the database. However, applications can take advantage of subkey capabilities to achieve data locality. A key is the concatenation of a Major Key Path and a Minor Key Path, both of which are specified by the application. All records sharing a Major Key Path are co-located to achieve data locality. Within a co-located collection of Major Key Paths, the full key, comprised of both the Major and Minor Key Paths, provides fast, indexed lookup. For example, an application storing user profiles might use the profile name as a Major Key Path and then have several Minor Key Paths for different components of that profile such as address, name, phone number, etc. Because applications have complete control over the composition and interpretation of keys, different Major Key Paths can have entirely different Minor Key Path structures. Continuing our previous example, one might store user profiles and application profiles in the same store and maintain different Minor Key Paths for each. Prefix key compression makes storage of key groups efficient.

While many NoSQL databases state that they provide eventual consistency, Oracle NoSQL Database provides several different consistency policies. At one end of the spectrum, applications can specify absolute consistency, which guarantees that all reads return the most recently written value for a designated key. At the other end of the spectrum, applications capable of tolerating inconsistent data can specify weak consistency, allowing the database to return a value efficiently even if it is not entirely up to date. In between these two extremes, applications can specify time-based consistency to constrain how old a record might be, or version-based consistency to support both atomicity for read-modify-write operations and reads that are at least as recent as the specified version. The next figure shows how the range of flexible consistency policies enables developers to easily create business solutions providing data guarantees while meeting application latency and scalability requirements.
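To make the key structure and the consistency options concrete, here is a minimal sketch using the oracle.kv Java driver. It follows the convention (stated later in this chapter) of an already opened store handle named store; the key components "profiles", "jsmith", and "address" are illustrative placeholders, and the exact method signatures should be checked against the driver release you are using.

import java.util.Arrays;
import java.util.concurrent.TimeUnit;
import oracle.kv.Consistency;
import oracle.kv.Key;
import oracle.kv.ValueVersion;

// The Major Key Path ("profiles", "jsmith") co-locates every record of this profile;
// the Minor Key Path ("address") selects one component of it.
Key addressKey = Key.createKey(Arrays.asList("profiles", "jsmith"),
                               Arrays.asList("address"));

// Absolute consistency: the read is served by the master and sees the latest write.
ValueVersion latest = store.get(addressKey, Consistency.ABSOLUTE, 5, TimeUnit.SECONDS);

// Time-based consistency: any replica lagging the master by at most one second may answer.
ValueVersion recent = store.get(addressKey,
    new Consistency.Time(1, TimeUnit.SECONDS, 5, TimeUnit.SECONDS),
    5, TimeUnit.SECONDS);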

Oracle NoSQL Database also provides a range of durability policies that specify what guarantees the system makes after a crash. At one extreme, applications can request that write requests block until the record has been written to stable storage on all copies. This has obvious performance and availability implications, but ensures that if the application successfully writes data, that data will persist and can be recovered even if all the copies become temporarily unavailable due to multiple simultaneous failures. At the other extreme, applications can request that write operations return as soon as the system has recorded the existence of the write, even if the data is not persistent anywhere. Such a policy provides the best write performance, but provides no durability guarantees. By specifying when the database writes records to disk and what fraction of the copies of the record must be persistent (none, all, or a simple majority), applications can enforce a wide range of durability policies.
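As a rough illustration of how such a policy is expressed with the oracle.kv Java driver (again assuming an open store handle named store and an existing key and value; the signatures should be verified against your driver version), a write can carry its own durability setting:

import java.util.concurrent.TimeUnit;
import oracle.kv.Durability;

// The master flushes to disk, replicas acknowledge from memory, and a simple
// majority of replicas must acknowledge before the write returns.
Durability durability = new Durability(Durability.SyncPolicy.SYNC,
                                       Durability.SyncPolicy.NO_SYNC,
                                       Durability.ReplicaAckPolicy.SIMPLE_MAJORITY);
store.put(key, value, null, durability, 5, TimeUnit.SECONDS);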

7.3 API

Incorporating Oracle NoSQL Database into applications is straightforward. APIs for basic Create, Read, Update and Delete (CRUD) operations and a collection of iterators are packaged in a single jar file. Applications can use the APIs from one or more client processes that access a stand-alone Oracle NoSQL Database server process, alleviating the need to set up multi-system configurations for initial development and testing.
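A minimal sketch of obtaining a store handle with the oracle.kv driver is shown below; the store name "kvstore" and the helper host "node01:5000" are placeholders for your own deployment, not values from the book.

import oracle.kv.KVStore;
import oracle.kv.KVStoreConfig;
import oracle.kv.KVStoreFactory;

// Open a handle to the store through one of its helper hosts.
KVStoreConfig config = new KVStoreConfig("kvstore", "node01:5000");
KVStore store = KVStoreFactory.getStore(config);

// Use the store for reads and writes, then release the handle on shutdown.
store.close();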

7.4 CREATE, REMOVE, UPDATE, AND DELETE

Data create and update operations are provided by several put methods. The putIfAbsent method implements creation while the putIfPresent method implements update. The put method does both, adding a new key/value pair if the key is not currently present in the database or updating the value if the key does exist. Updating a key/value pair generates a new version of the pair, so the API also includes a conditional putIfVersion method that allows applications to implement consistent read-modify-write semantics. The delete and deleteIfVersion methods unconditionally and conditionally remove key/value pairs from the database, respectively. Just as putIfVersion ensures read-modify-write semantics, deleteIfVersion provides deletion of a specific version. The get method retrieves items from the database. The code sample below demonstrates the use of the various CRUD APIs. All code samples assume that you have already opened an Oracle NoSQL Database, referenced by the variable store.
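The original code sample is not reproduced in this extraction. The sketch below, assuming the oracle.kv Java driver and the open store handle named store described above, exercises the same calls; it reuses the "Katana"/"sword" pair that appears later in the chapter.

import oracle.kv.Key;
import oracle.kv.Value;
import oracle.kv.ValueVersion;
import oracle.kv.Version;

Key key = Key.createKey("Katana");
Value value = Value.createValue("sword".getBytes());

// Create: returns null if the key already exists.
Version created = store.putIfAbsent(key, value);

// Read: returns the value together with its version.
ValueVersion vv = store.get(key);

// Conditional update: succeeds only if the record has not changed since we read it,
// giving consistent read-modify-write semantics.
Value newValue = Value.createValue("katana sword".getBytes());
Version updated = store.putIfVersion(key, newValue, vv.getVersion());

// Delete a specific version, or delete unconditionally.
store.deleteIfVersion(key, updated);
store.delete(key);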

7.5 ITERATION

In addition to basic CRUD operations, Oracle NoSQL Database supports two types of iteration: unordered iteration over records and ordered iteration within a Major Key set. In the case of unordered iteration over the entire store, the result is not transactional; the iteration runs at an isolation level of read-committed, which means that the result set will contain only key/value pairs that have been persistently written to the database, but there are no guarantees of semantic consistency across key/value pairs. The API supports both individual key/value returns using the several storeIterator methods and bulk key/value returns within a Major Key Path via the various multiGetIterator methods. The example below demonstrates iterating over an entire store, returning each key/value pair individually. Note that although the iterator returns only a single key/value pair at a time, the storeIterator method takes a second parameter of batchSize, indicating how many key/value pairs to fetch per network round trip. This allows applications to use network bandwidth efficiently, while maintaining the simplicity of key-at-a-time access in the API.
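The iteration example itself is not reproduced in this extraction; the following sketch, assuming the same oracle.kv driver and open store handle, shows both styles of iteration (the batch size of 100 is arbitrary):

import java.util.Iterator;
import oracle.kv.Direction;
import oracle.kv.Key;
import oracle.kv.KeyValueVersion;

// Unordered, read-committed scan of the entire store,
// fetching 100 key/value pairs per network round trip.
Iterator<KeyValueVersion> it = store.storeIterator(Direction.UNORDERED, 100);
while (it.hasNext()) {
  KeyValueVersion kvv = it.next();
  System.out.println(kvv.getKey() + " -> " + new String(kvv.getValue().getValue()));
}

// Ordered iteration restricted to the records under a single Major Key Path.
Iterator<KeyValueVersion> one = store.multiGetIterator(Direction.FORWARD, 100,
    Key.createKey("Katana"), null, null);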

7.6 BULK OPERATION API

In addition to providing single-record operations, Oracle NoSQL Database supports the ability to bundle a collection of operations together using the execute method, providing transactional semantics across multiple updates on records with the same Major Key Path. For example, let's assume that we have the Major Key Path "Katana" from the previous example, with several different Minor Key Paths containing attributes of the Katana, such as length and year of construction. Imagine that we discover that we have an incorrect length and year of construction currently in our store. We can update multiple records atomically using a sequence of operations as shown below.
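Since the original listing is not reproduced here, the following sketch shows what such an atomic multi-record update might look like with the oracle.kv driver; the attribute keys and values are illustrative, and exception handling is omitted for brevity.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import oracle.kv.Key;
import oracle.kv.Operation;
import oracle.kv.OperationFactory;
import oracle.kv.Value;

// Both records share the Major Key Path "Katana", so they can be changed atomically.
OperationFactory factory = store.getOperationFactory();
List<Operation> ops = new ArrayList<Operation>();
ops.add(factory.createPut(
    Key.createKey(Arrays.asList("Katana"), Arrays.asList("length")),
    Value.createValue("37".getBytes())));
ops.add(factory.createPut(
    Key.createKey(Arrays.asList("Katana"), Arrays.asList("year")),
    Value.createValue("1454".getBytes())));

// All operations commit together or none of them do.
store.execute(ops);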

7.7 ADMINISTRATION

Oracle NoSQL Database comes with an Administration Service, accessible from both a command line interface and a web console. Using the service, administrators can configure a database instance, start it, stop it, and monitor system performance, without manually editing configuration files, writing shell scripts, or performing explicit database operations. The Administration Service is itself a highly-available service, but consistent with the Oracle NoSQL Database No Single Point of Failure philosophy, the ongoing operation of an installation is not dependent upon the availability of the Administration Service. Thus, both the database and the Administration Service remain available during configuration changes. In addition to facilitating configuration changes, the Administration Service also collects and maintains performance statistics and logs important system events, providing online monitoring and input to performance tuning.

7.8 ARCHITECTURE

We present the Oracle NoSQL Database architecture by following the execution of an operation through the logical components of the system and then discussing how those components map to actual hardware and software operation. We will create the key/value pair "Katana" and "sword". The next figure depicts the method invocation putIfAbsent("Katana", "sword"). The application issues the putIfAbsent method to the Client Driver (step 1). The client driver hashes the key "Katana" to select one of a fixed number of partitions (step 2). The number of partitions is fixed and set by an administrator at system configuration time and is chosen to be significantly larger than the maximum number of storage nodes expected in the store. In this example, our store contains 25 storage nodes, so we might have configured the system to have 25,000 partitions. Each partition is assigned to a particular replication group. The driver consults the partition table (step 3) to map the partition number to a replication group.

A replication group consists of some (configurable) number of replication nodes. Every replication group consists of the same number of replication nodes. The number of replication nodes in a replication group dictates the number of failures from which the system is resilient; a system with three nodes per group can withstand two failures while continuing to service read requests. Its ability to withstand failures on writes is based upon the configured durability policy. If the application does not require a majority of participants to acknowledge a write, then the system can also withstand up to two failures for writes. A five-node group can withstand up to four failures for reads and up to two failures for writes, even if the application demands a durability policy requiring a majority of sites to acknowledge a write operation.

Given a replication group, the Client Driver next consults the Replication Group State Table (RGST) (step 4). For each replication group, the RGST contains information about each replication node comprising the group (step 5). Based upon information in the RGST, such as the identity of the master and the load on the various nodes in a replication group, the Client Driver selects the node to which to send the request and forwards the request to the appropriate node (step 6). In this case, since we are issuing a write operation, the request must go to the master node. The replication node then applies the operation. In the case of a putIfAbsent, if the key exists, the operation has no effect and returns an error, indicating that the specified entry is already present in the store. If the key does not exist, the replication node adds the key/value pair to the store and then propagates the new key/value pair to the other nodes in the replication group (step 7).

7.9 IMPLEMENTATION

An Oracle NoSQL Database installation consists of two major pieces: a Client Driver and a collection of Storage Nodes. As shown in Figure 3, the client driver implements the partition map and the RGST, while storage nodes implement the replication nodes comprising replication groups. In this section, we'll take a closer look at each of these components.

7.9.1 Storage Nodes

A storage node (SN) is typically a physical machine with its own local persistent storage, either disk or solid state, a CPU with one or more cores, memory, and an IP address. A system with more storage nodes will provide greater aggregate throughput or storage capacity than one with fewer nodes, and systems with a greater degree of replication in replication groups can provide decreased request latency over installations with smaller degrees of replication.

A Storage Node Agent (SNA) runs on each storage node, monitoring that node's behavior. The SNA both receives configuration from and reports monitoring information to the Administration Service, which interfaces to the Oracle NoSQL Database monitoring dashboard. The SNA collects operational data from the storage node on an ongoing basis and then delivers it to the Administration Service when asked for it.

A storage node serves one or more replication nodes. Each replication node belongs to a single replication group. The nodes in a single replication group all serve the same data. Each group has a designated master node that handles all data modification operations (create, update, and delete). The other nodes are read-only replicas, but may assume the role of master should the master node fail. A typical installation uses a replication factor of three in the replication groups, to ensure that the system can survive at least two simultaneous faults and still continue to service read operations. Applications requiring greater or lesser reliability can adjust this parameter accordingly.

The next figure shows an installation with 30 replication groups (0-29). Each replication group has a replication factor of 3 (one master and two replicas) spread across two data centers. Note that we place two of the replication nodes in the larger of the two data centers and the last replication node in the smaller one. This sort of arrangement might be appropriate for an application that uses the larger data center for its primary data access, maintaining the smaller data center in case of catastrophic failure of the primary data center. The 30 replication groups are stored on 30 storage nodes, spread across the two data centers.

Replication nodes support the Oracle NoSQL Database API via RMI calls from the client and obtain data directly from, or write data directly to, the log-structured storage system, which provides outstanding write performance while maintaining index structures that provide low-latency read performance as well. The Oracle NoSQL Database storage engine pioneered the use of log-structured storage in key/value databases with its initial deployment in 2003, and the approach has since been proven in several open-source NoSQL solutions, such as Dynamo, Voldemort, and GenieDB, as well as in enterprise deployments.

Oracle NoSQL Database uses replication to ensure data availability in the case of failure. Its single-master architecture requires that writes are applied at the master node and then propagated to the replicas. In the case of failure of the master node, the nodes in a replication group automatically hold a reliable election (using the Paxos protocol), electing one of the remaining nodes to be the master. The new master then assumes write responsibility.


More information

Innovatus Technologies

Innovatus Technologies HADOOP 2.X BIGDATA ANALYTICS 1. Java Overview of Java Classes and Objects Garbage Collection and Modifiers Inheritance, Aggregation, Polymorphism Command line argument Abstract class and Interfaces String

More information

BPS Suite and the OCEG Capability Model. Mapping the OCEG Capability Model to the BPS Suite s product capability.

BPS Suite and the OCEG Capability Model. Mapping the OCEG Capability Model to the BPS Suite s product capability. BPS Suite and the OCEG Capability Model Mapping the OCEG Capability Model to the BPS Suite s product capability. BPS Contents Introduction... 2 GRC activities... 2 BPS and the Capability Model for GRC...

More information

Chapter 3. Foundations of Business Intelligence: Databases and Information Management

Chapter 3. Foundations of Business Intelligence: Databases and Information Management Chapter 3 Foundations of Business Intelligence: Databases and Information Management THE DATA HIERARCHY TRADITIONAL FILE PROCESSING Organizing Data in a Traditional File Environment Problems with the traditional

More information

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework Scientific Journal of Impact Factor (SJIF): e-issn (O): 2348- International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 A Study: Hadoop Framework Devateja

More information

Big Data Specialized Studies

Big Data Specialized Studies Information Technologies Programs Big Data Specialized Studies Accelerate Your Career extension.uci.edu/bigdata Offered in partnership with University of California, Irvine Extension s professional certificate

More information

Information empowerment for your evolving data ecosystem

Information empowerment for your evolving data ecosystem Information empowerment for your evolving data ecosystem Highlights Enables better results for critical projects and key analytics initiatives Ensures the information is trusted, consistent and governed

More information

Introduction to K2View Fabric

Introduction to K2View Fabric Introduction to K2View Fabric 1 Introduction to K2View Fabric Overview In every industry, the amount of data being created and consumed on a daily basis is growing exponentially. Enterprises are struggling

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

SOLUTION TRACK Finding the Needle in a Big Data Innovator & Problem Solver Cloudera

SOLUTION TRACK Finding the Needle in a Big Data Innovator & Problem Solver Cloudera SOLUTION TRACK Finding the Needle in a Big Data Haystack @EvaAndreasson, Innovator & Problem Solver Cloudera Agenda Problem (Solving) Apache Solr + Apache Hadoop et al Real-world examples Q&A Problem Solving

More information

Oracle Big Data Fundamentals Ed 1

Oracle Big Data Fundamentals Ed 1 Oracle University Contact Us: +0097143909050 Oracle Big Data Fundamentals Ed 1 Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, learn to use Oracle's Integrated Big Data

More information

Gain Insights From Unstructured Data Using Pivotal HD. Copyright 2013 EMC Corporation. All rights reserved.

Gain Insights From Unstructured Data Using Pivotal HD. Copyright 2013 EMC Corporation. All rights reserved. Gain Insights From Unstructured Data Using Pivotal HD 1 Traditional Enterprise Analytics Process 2 The Fundamental Paradigm Shift Internet age and exploding data growth Enterprises leverage new data sources

More information

ETL is No Longer King, Long Live SDD

ETL is No Longer King, Long Live SDD ETL is No Longer King, Long Live SDD How to Close the Loop from Discovery to Information () to Insights (Analytics) to Outcomes (Business Processes) A presentation by Brian McCalley of DXC Technology,

More information

Rethinking VDI: The Role of Client-Hosted Virtual Desktops. White Paper Virtual Computer, Inc. All Rights Reserved.

Rethinking VDI: The Role of Client-Hosted Virtual Desktops. White Paper Virtual Computer, Inc. All Rights Reserved. Rethinking VDI: The Role of Client-Hosted Virtual Desktops White Paper 2011 Virtual Computer, Inc. All Rights Reserved. www.virtualcomputer.com The Evolving Corporate Desktop Personal computers are now

More information

Optimized Data Integration for the MSO Market

Optimized Data Integration for the MSO Market Optimized Data Integration for the MSO Market Actions at the speed of data For Real-time Decisioning and Big Data Problems VelociData for FinTech and the Enterprise VelociData s technology has been providing

More information

Modern Data Warehouse The New Approach to Azure BI

Modern Data Warehouse The New Approach to Azure BI Modern Data Warehouse The New Approach to Azure BI History On-Premise SQL Server Big Data Solutions Technical Barriers Modern Analytics Platform On-Premise SQL Server Big Data Solutions Modern Analytics

More information

Capture Business Opportunities from Systems of Record and Systems of Innovation

Capture Business Opportunities from Systems of Record and Systems of Innovation Capture Business Opportunities from Systems of Record and Systems of Innovation Amit Satoor, SAP March Hartz, SAP PUBLIC Big Data transformation powers digital innovation system Relevant nuggets of information

More information

CHAPTER. Overview of Oracle NoSQL Database and Big Data

CHAPTER. Overview of Oracle NoSQL Database and Big Data CHAPTER 1 Overview of Oracle NoSQL Database and Big Data 2 Oracle NoSQL Database Since the invention of the transistor, the proliferation and application of computer technologies has been shaped by Moore

More information

BUILDING the VIRtUAL enterprise

BUILDING the VIRtUAL enterprise BUILDING the VIRTUAL ENTERPRISE A Red Hat WHITEPAPER www.redhat.com As an IT shop or business owner, your ability to meet the fluctuating needs of your business while balancing changing priorities, schedules,

More information

THE RISE OF. The Disruptive Data Warehouse

THE RISE OF. The Disruptive Data Warehouse THE RISE OF The Disruptive Data Warehouse CONTENTS What Is the Disruptive Data Warehouse? 1 Old School Query a single database The data warehouse is for business intelligence The data warehouse is based

More information

Oracle Big Data Appliance

Oracle Big Data Appliance Oracle Big Data Appliance Software User's Guide Release 4 (4.6) E77518-02 November 2016 Describes the Oracle Big Data Appliance software available to administrators and software developers. Oracle Big

More information

A Survey on Big Data

A Survey on Big Data A Survey on Big Data D.Prudhvi 1, D.Jaswitha 2, B. Mounika 3, Monika Bagal 4 1 2 3 4 B.Tech Final Year, CSE, Dadi Institute of Engineering & Technology,Andhra Pradesh,INDIA ---------------------------------------------------------------------***---------------------------------------------------------------------

More information