The Hadoop Paradigm & the Need for Dataset Management

1. Hadoop Adoption

Hadoop is being adopted rapidly by many different types of enterprises and government entities, even though it is an extraordinarily complex technology that is hard to use and not yet feature rich. Two factors are driving the adoption: price, because it is much less expensive than current data processing platforms, and scale, because it can process very large sets of data. Much of the data that enterprises hold today, spread across hundreds or thousands of databases, is not used at all to develop advanced analytics because of the cost. Within a lot of that data there are certainly very valuable business insights, and with this new low-cost platform enterprises can now put any amount of it to work for advanced analytics. Enterprises will develop many new insights, and those insights will change business decision making radically.

Data warehousing has been around for a very long time, and it is the standard by which enterprises manage their data for analytic purposes today. But it is definitely not low cost, it is not infinitely scalable, and although it is somewhat fault tolerant, it is not nearly as fault tolerant as Hadoop. Hadoop offers a very low-cost, almost infinitely scalable, and completely fault-tolerant data processing platform. Enterprises therefore tend to move into Hadoop a great deal of data they would never have moved into a data warehouse, and once it is there, they start figuring out how to process it into a form on which they can perform new kinds of analytics that they could not have afforded in the past.

Hadoop serves a number of purposes in the analytics pipeline. To create high-quality analytics, you need data that has been collected, assembled, and refined again and again, generally from a number of different systems, and this is what Hadoop does well.
It is a very powerful, scalable framework that lets you manipulate these diverse collections of data into finished datasets suitable for advanced analytics. Hadoop itself does not really possess advanced analytic engines; instead it has standard data processing engines such as Hive and MapReduce. In many cases, however, users want to apply machine learning or statistical techniques to get different kinds of insights from the data. We have hooked the open source statistical environment R into Loom, which means you can reach any of the data in Hadoop, work with it, pull it back into R, and perform advanced analytics on it.

The advent of this new data processing paradigm has led directly to the rapid emergence of a key new role: the data scientist. This role is a natural evolution of the business analyst role that developed in the 1990s. Data scientists typically have stronger math skills than business analysts do, and in many cases they have significant computer science skills as well. These people are the main users of the new platform. Taken together, we have a paradigm shift in how data is used to build products and manage the business.

2. Dataset Management in Hadoop

Computer-based data management systems have been around since the 1960s. In the beginning you would put a piece of data in a memory register, remember where to find it, and write that location into your program: not really a data management system, just data processing. Later, data was organized in hierarchies that gave structure to the way individual data entities relate to one another, and we had the first generation of database management systems. In 1970, E.F. Codd, then working at IBM, wrote a seminal paper that laid out a new way to organize data so that programs could interact with an abstraction of the data and not have to account for its exact location. Further, the data elements were structured not in hierarchies but in tables, columns, and rows, where the relations were understood formally, and this facilitated processing through the abstraction.

We define data processing as the computational activities occurring on collections of data elements. We define data management as an abstraction that precisely defines the relationships amongst data elements and amongst collections of data elements. Database management systems (DBMS) have both capabilities: data processing and data management. The abstraction layer above the data makes using the data vastly simpler, requiring much less coding and much less time to understand the data. Data processing in a relational database management system (RDBMS) is greatly simplified by understanding how to use the abstraction that the management system provides: tables, columns, rows, and keys.
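To make the relational abstraction concrete, here is a minimal sketch in Python using the standard library's sqlite3 module (the table and values are invented for illustration). The program addresses data only through tables, columns, and keys, never through physical locations:

    import sqlite3

    # An in-memory relational database; the program never needs to know
    # where any row is physically stored.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT)")
    conn.executemany(
        "INSERT INTO customers (id, name, region) VALUES (?, ?, ?)",
        [(1, "Acme", "East"), (2, "Globex", "West")],
    )

    # Retrieval is expressed against the abstraction (table, columns, key),
    # not against storage addresses or file offsets.
    row = conn.execute("SELECT name, region FROM customers WHERE id = ?", (2,)).fetchone()
    print(row)  # ('Globex', 'West')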
Hadoop is not a database management system; it is just a data processing system. But it is so inexpensive, so scalable, and so fault tolerant that it is used to process many sets of data in the same cluster. The core difference between an RDBMS and Hadoop is that in a DBMS we have one set of data, while in Hadoop we have perhaps hundreds of sets of data. It is not possible to impose one table-row-column abstraction on every one of those sets (although Hive tries), so it will be useful to have some other sort of abstraction that simplifies the processing of data in Hadoop. That abstraction will have to sit above the level of an individual dataset's schema; the schema of each collection of data should still be available as a description of the dataset, but the abstraction itself must operate at the level of the dataset and the operations that have affected it over time.

Hadoop is not normally used as a transactional system in which lots of data is created by an application or a machine, as with ERP or CRM systems. It is used to store lots of data that was created in other systems or machines, which is later processed to meet analytic requirements. When we have many sets of data that are processed, transformed, and perhaps combined over multiple operations, we have a new kind of problem: how can we efficiently use the sets of data in Hadoop to produce the desired analytics? Studies [1] have confirmed that finding the right collection of data is the first time-consuming step in producing new insights. Knowing enough about the data once it is found, its structure and other important characteristics, is a second time-consuming step. RDBMSs do not need this kind of information associated with the data, because the system itself imposes structure and findability directly onto the data. In Hadoop it is otherwise: finding the right data, and understanding it well enough to process it, is an enormous, time-consuming effort. Most observers estimate that analysts spend 75% of their time finding and processing the raw data into a form suitable to support analysis.

3. Tracking Data Lineage

We have proposed an abstraction that introduces the capabilities of a management system on top of Hadoop's core data processing capability. The abstraction consists of three entities: the dataset, the transform (or query), and the job. All data in Hadoop is then related by being included in a named dataset, which can be transformed by a job. In this way we can track all data assets in Hadoop and maintain the relations between the original datasets and every dataset derived from any combination of them. This technique is called tracking data lineage.
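As a hedged illustration of this abstraction (this is not Loom's implementation; every class and field name below is invented), the following Python sketch records datasets, transforms, and jobs, then walks the recorded relations backwards to answer the lineage question of where a dataset came from:

    from dataclasses import dataclass

    # Invented, minimal stand-ins for the three core abstractions.
    @dataclass
    class Dataset:
        name: str

    @dataclass
    class Transform:
        description: str   # e.g. a Hive query or a MapReduce step

    @dataclass
    class Job:
        transform: Transform
        inputs: list       # the datasets consumed
        output: Dataset    # the dataset produced

    class LineageRegistry:
        """Records every job so any dataset's provenance can be traced."""
        def __init__(self):
            self.produced_by = {}   # dataset name -> Job that created it

        def record(self, job):
            self.produced_by[job.output.name] = job

        def lineage(self, name, depth=0):
            """Print the tree of transforms and sources behind a dataset."""
            job = self.produced_by.get(name)
            print("  " * depth + name + ("" if job else "  (original source)"))
            if job:
                print("  " * (depth + 1) + "<- " + job.transform.description)
                for ds in job.inputs:
                    self.lineage(ds.name, depth + 2)

    # Two original datasets are joined, and the result is then filtered.
    raw_orders = Dataset("raw_orders")
    raw_customers = Dataset("raw_customers")
    joined = Dataset("orders_by_customer")
    east = Dataset("east_region_orders")

    reg = LineageRegistry()
    reg.record(Job(Transform("join on customer_id"), [raw_orders, raw_customers], joined))
    reg.record(Job(Transform("filter region = 'East'"), [joined], east))
    reg.lineage("east_region_orders")

Running the sketch prints the derived dataset, the transform that produced it, and the original source datasets beneath it, which is exactly the relation a lineage tracker has to maintain as jobs accumulate.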
[Figure: Hadoop Information Model, showing dataset lineage.]
Further, it is necessary to collect many additional properties about each of the core abstractions (dataset, transform, job), such as:

- Schema
- Location
- Number of columns and rows
- Originating system
- Time and date loaded or last transformed
- and more

There may be dozens of these properties that are useful in using and processing a dataset; a sketch of one such property record appears below. Other sorts of properties are collected for transforms and jobs, so that the provenance of a dataset can be precisely determined. With this basic abstraction available to organize and manage all data in a single Hadoop cluster, and across collections of Hadoop clusters, the job of data processing undertaken by the data scientist or developer is greatly simplified, and she becomes enormously more productive, yielding better insights faster.

[Figure: the Loom home page, displaying recent Datasets, Queries, and Jobs.]

Loom's Extensible Registry and its Auto-Scan dataset tracking function represent best practice for Hadoop.
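As a rough sketch of what such a property record might contain (the field names and values are invented for illustration and are not Loom's registry schema), in the same vein as the lineage sketch above:

    from dataclasses import dataclass
    from datetime import datetime

    # An invented property record for one dataset; a real registry would
    # hold dozens of properties, plus comparable records for transforms
    # and jobs so that provenance can be determined precisely.
    @dataclass
    class DatasetProperties:
        name: str
        schema: dict              # column name -> type, describing the set
        location: str             # e.g. an HDFS path
        num_columns: int
        num_rows: int
        originating_system: str
        last_transformed: datetime

    props = DatasetProperties(
        name="east_region_orders",
        schema={"order_id": "bigint", "customer_id": "bigint", "total": "double"},
        location="hdfs:///data/derived/east_region_orders",
        num_columns=3,
        num_rows=1250000,
        originating_system="orders_crm",
        last_transformed=datetime(2014, 3, 1, 9, 30),
    )

With a record like this attached to every dataset, the finding and understanding steps described above become lookups rather than investigations.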
Over time this core abstraction may grow, but probably not by much. We are currently adding a cluster abstraction to keep track of datasets and operations across clusters. Additional roadmap functionality is being driven by the needs of active Hadoop-user organizations.