The Hadoop Paradigm & the Need for Dataset Management


1. Hadoop Adoption

Hadoop is being adopted rapidly by many different types of enterprises and government entities, even though it is an extraordinarily complex technology that is hard to use and not yet feature rich. Two factors are driving the adoption: price - it is much less expensive than current data processing platforms - and scale - it can process very large sets of data. Much of the data that enterprises hold today, spread across hundreds or thousands of databases, is not used at all for advanced analytics because of the cost. A lot of that data certainly contains valuable business insights, and with this new low-cost platform enterprises can now put any amount of it to work for advanced analytics. They will develop many new insights, and once those insights are in hand, business decision making changes radically.

Data warehousing has been around for a very long time, and it is the standard by which enterprises manage their data for analytic purposes today. But it is definitely not low cost, it is not infinitely scalable, and although it is somewhat fault tolerant, it is not nearly as fault tolerant as Hadoop. Hadoop offers a very low-cost, almost infinitely scalable, and highly fault-tolerant data processing platform. Enterprises tend to move into Hadoop a lot of data they would never have moved into a data warehouse, and once it is there, they start figuring out how to process it into a form on which they can perform new kinds of analytics they could not have afforded in the past.

Hadoop serves a number of purposes in the analytics pipeline. To create high-quality analytics, you need data that has been collected, assembled, and refined again and again, generally from a number of different systems - which is what Hadoop does well. It is a very powerful, scalable framework for shaping these diverse collections of data into a finished dataset suitable for advanced analytics. Hadoop itself does not really possess advanced analytic engines; instead it has standard data processing engines such as Hive and MapReduce. In many cases, however, users want to conduct machine learning or statistical analysis to get different kinds of insights from the data. We have hooked the open source analytics package R into Loom, which means you can reach any of the data in Hadoop, work with it, pull it back into R, and perform advanced analytics.

The advent of this new data processing paradigm has led directly to the rapid emergence of a key new role: the data scientist. This role is a natural evolution of the business analyst role that developed in the 1990s. Data scientists typically have stronger math skills than business analysts and in many cases have significant computer science skills. These people are the main users of the new platform. Taken together, we have a paradigm shift in how data is used to build products and manage the business.

2. Dataset Management in Hadoop

Computer-based data management systems have been around since the 1960s. In the beginning you would put a piece of data in a memory register, remember where to find it, and write that location into your program - not really a data management system, just data processing. Then data was organized in hierarchies, giving structure to the way individual data entities are related, and we had the first generation of the database management system. In 1970, E.F. Codd, working at IBM, wrote a seminal paper that laid out a new way to organize data so that programs could interact with an abstraction of the data and not have to account for its exact location. Further, the data elements were structured not in hierarchies but in tables, columns, and rows, where the relations were understood formally and processing could operate on the abstraction.

We define data processing as the computational activities occurring on collections of data elements. We define data management as an abstraction that precisely defines the relationships amongst data elements and amongst collections of data elements. Database Management Systems (DBMS) have both capabilities, data processing and data management. The abstraction layer above the data makes using the data vastly simpler, requiring much less coding and much less time to understand the data. Data processing in relational database management systems (RDBMS) is greatly simplified by understanding how to use the abstraction - tables, columns, rows, and keys - that the management system provides.
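As a minimal illustration of what that relational abstraction buys you, the sketch below uses Python's built-in sqlite3 module; the table and column names are hypothetical, chosen only for this example. The program declares tables and keys once, and thereafter works entirely by name and relation, never by physical location.

```python
import sqlite3

# In-memory database: the engine, not the program, decides where data lives.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, "
             "customer_id INTEGER REFERENCES customers(id), amount REAL)")

conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Acme Corp"), (2, "Globex")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, 250.0), (11, 1, 75.5), (12, 2, 410.0)])

# The program manipulates only the abstraction: tables, columns, rows, keys.
for name, total in conn.execute(
        "SELECT c.name, SUM(o.amount) FROM customers c "
        "JOIN orders o ON o.customer_id = c.id GROUP BY c.name"):
    print(name, total)
```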

Hadoop is not a database management system; it is just a data processing system. But it is so inexpensive, so scalable, and so fault tolerant that it is used to process many sets of data in the same cluster. The core difference between an RDBMS and Hadoop is that in a DBMS we have one set of data, while in Hadoop we may have hundreds of sets of data. It is not possible to have one abstraction (table-row-column) for each of those sets - although Hive tries - so it is useful to have some other sort of abstraction that simplifies the processing of data in Hadoop. That abstraction has to sit above the level of any one set's schema; the schema of each collection of data should still be available as a description of the set, but the abstraction itself has to operate at the level of the data set and the operations that have affected each data set over time.

Hadoop is not normally used as a transactional system in which lots of data is being created by an application or a machine, as with ERPs or CRMs; it is used to store lots of data that was created in other systems or machines and is later processed to meet analytic requirements. When we have many sets of data that are processed, transformed, and perhaps combined over multiple operations, we have a new kind of problem: how can we efficiently use the sets of data in Hadoop to produce the desired analytics? Studies [1] have confirmed that finding the right collection of data is the first time-consuming step in producing new insights. Knowing enough about the data once it is found - its structure and other important characteristics - is a second time-consuming step. RDBMSs do not need this kind of information associated with the data, because the system itself imposes structure and findability directly onto the data. In Hadoop it is otherwise: finding the right data and understanding it well enough to process it is an enormously time-consuming effort. Most observers estimate that analysts spend 75% of their time finding and processing raw data into a form suitable to support analysis.

3. Tracking Data Lineage

We have proposed an abstraction that introduces the capabilities of a management system on top of the core data processing capability of Hadoop. The abstraction consists of a data set, a query or transform, and a job. All data in Hadoop is then related by being included in a named data set, which can be transformed by a job. In this way we can track all data assets in Hadoop and maintain relations between the original data sets and all data sets derived from any combination of them. This technique is called tracking data lineage.

[Figure: Hadoop Information Model - Dataset Lineage]
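The information model in the figure can be sketched in a few lines of code. The Python below is a hypothetical, minimal illustration of the data set / transform / job abstraction and is not Loom's actual API: every data set is named, every job records the transform it ran together with the data sets it read and wrote, and lineage is recovered by walking those links.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Dataset:
    name: str                                          # every data set is named
    derived_from: List["Job"] = field(default_factory=list)

@dataclass
class Transform:
    description: str                                   # e.g. a Hive query or MapReduce step

@dataclass
class Job:
    transform: Transform
    inputs: List[Dataset]
    output: Dataset

    def run(self) -> Dataset:
        # (execution elided) record the lineage link on the output data set
        self.output.derived_from.append(self)
        return self.output

def lineage(ds: Dataset, depth: int = 0) -> None:
    """Print how a data set was derived, back to the original sets."""
    print("  " * depth + ds.name)
    for job in ds.derived_from:
        for parent in job.inputs:
            lineage(parent, depth + 1)

# Example: a curated data set derived from two raw ones.
raw_orders = Dataset("raw_orders")
raw_customers = Dataset("raw_customers")
curated = Dataset("orders_by_customer")
Job(Transform("join orders to customers"), [raw_orders, raw_customers], curated).run()
lineage(curated)
```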

Further, it is necessary to collect many additional properties about each of the core abstractions (data set, transform, job), such as:

- Schema
- Location
- Number of columns and rows
- Originating system
- Time and date loaded or last transformed
- and more

There may be dozens of these properties that are useful in using and processing a data set. Other sorts of properties are collected for transforms and jobs so that the provenance of a data set can be precisely determined. With this basic abstraction available to organize and manage all data in a single Hadoop cluster and across collections of Hadoop clusters, the data processing work undertaken by the data scientist or developer is greatly simplified, and she becomes enormously more productive, yielding better insights faster. Above, the Loom home page displays recent Datasets, Queries, and Jobs. Loom's Extensible Registry and Auto-Scan dataset tracking function represent best practice for Hadoop.
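A minimal sketch of what such a property record might look like, again with hypothetical field names and paths rather than Loom's actual registry format:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict

@dataclass
class DatasetProperties:
    """Descriptive properties kept alongside a data set in the cluster."""
    name: str
    schema: Dict[str, str]          # column name -> type
    location: str                   # e.g. an HDFS path
    num_columns: int
    num_rows: int
    originating_system: str
    last_loaded_or_transformed: datetime

props = DatasetProperties(
    name="orders_by_customer",
    schema={"customer": "string", "total": "double"},
    location="hdfs:///data/curated/orders_by_customer",   # hypothetical path
    num_columns=2,
    num_rows=1_250_000,
    originating_system="orders ERP",
    last_loaded_or_transformed=datetime(2014, 6, 1, 14, 30),
)
print(props.name, props.location)
```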

Over time this core abstraction may grow, but probably not by much. We are currently adding an abstraction for the cluster itself, to keep track of data sets and operations across clusters. Additional roadmap functionality is being driven by active Hadoop-user organizations.