MAPR TECHNOLOGIES, INC. WHITE PAPER JANUARY 2018 MAPR DATA GOVERNANCE
TABLE OF CONTENTS
EXECUTIVE SUMMARY
BACKGROUND
MAPR DATA GOVERNANCE
CONCLUSION
EXECUTIVE SUMMARY

The MapR DataOps Governance Framework is designed to provide a complete, enterprisewide solution for governing data. It supports data lineage, a metadata catalog, a data dictionary, and data lifecycle management.

Critical business decisions are being made against data. The result is tremendous pressure to create and maintain trust in data quality and regulatory data compliance. To achieve a high level of confidence in the quality of data, the MapR solution considers more than a single environment such as Hadoop, because most data originates and is processed outside of a single platform. An enterprise solution must consider the entire enterprise and not focus only on a single point solution.

The MapR DataOps Governance Framework is a blend of technology options that assist the data governance process. These technologies can be tailored to your organizational data transformation and data lineage requirements. Our complete enterprise-centric management capabilities include platform-based security, data lineage, metadata management at scale, self-service data discovery, and data lifecycle management.

Platform-Based Security. As the only data platform with built-in security, MapR is designed to apply security semantics automatically as data is being stored and retrieved from the platform. MapR solves for all four pillars of security (authentication, authorization, auditing, and data protection) using platform-level capabilities that don't require external security tools or plugins. Such a solution is therefore complete and cannot be bypassed by components that have not been carefully altered to work with an external security tool.

Data Lineage. MapR provides a robust, scalable mechanism to capture data evolution across the enterprise and tracks the complete data transformation inside and outside of the big data platform.

Metadata Management at Scale.
MapR offers one complete metadata catalog to store and query metadata such as data source, transformations, and stewardship in a highly scalable and efficient manner.

Secure, Self-Service Data Discovery. Using interactive SQL powered by Apache Drill, MapR allows users to discover data without first having to create a schema. Granular security is preserved during the discovery process because data owners and administrators can expose only portions (even obfuscated portions) of data.

Data Lifecycle Management. MapR assigns policies to place data in restricted zones based on criteria such as the data's age, temperature, or tenancy requirements. Cold data can be archived or deleted at once.
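The lifecycle policy idea described above can be illustrated with a minimal sketch. This is not MapR's implementation or API: the `Dataset` record, the `assign_zone` function, the zone names, and the 30-day and 365-day thresholds are all hypothetical, chosen only to show how age, temperature (approximated here by days since last access), and tenancy might drive placement.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Dataset:
    """Hypothetical catalog record; field names are illustrative only."""
    name: str
    tenant: str
    created: date
    last_accessed: date


def assign_zone(ds: Dataset, today: date,
                hot_days: int = 30, retention_days: int = 365) -> str:
    """Pick a placement zone from age, temperature, and tenancy.

    Thresholds and zone names are assumptions for this sketch.
    """
    age = (today - ds.created).days          # age of the data
    idle = (today - ds.last_accessed).days   # proxy for temperature
    if age > retention_days and idle > hot_days:
        return "archive"                     # old and cold: archive or delete
    if idle > hot_days:
        return f"{ds.tenant}/cold"           # cold, still within retention
    return f"{ds.tenant}/hot"                # recently used, tenant-scoped


today = date(2018, 1, 31)
old_logs = Dataset("web-logs-2016", "marketing", date(2016, 1, 1), date(2017, 1, 1))
ledger = Dataset("ledger-q4", "finance", date(2017, 12, 1), date(2017, 12, 5))
orders = Dataset("orders-jan", "sales", date(2018, 1, 1), date(2018, 1, 30))

for ds in (old_logs, ledger, orders):
    print(ds.name, "->", assign_zone(ds, today))
```

In practice the thresholds would come from the organization's own retention and compliance rules rather than from constants in code.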
BACKGROUND

Data governance is less about the technology and more about a set of processes for tracking and managing the origin of data and all subsequent transformations. The goal of the MapR DataOps Governance Framework is to achieve a high level of data quality and integrity to gain a competitive advantage and to meet mandated compliance.

It is critical to understand your existing processes and objectives before choosing a technology. Technology can be leveraged to support data governance processes, but the challenge is selecting the right technologies to track the full transformation of your data. Choosing the right technology requires a solid understanding of your organization's business needs:

How do you define the owner of the data?
What is your data management strategy?
What is the data-cleaning process, and what are the criteria for data validation, correctness, and completeness?
What are the various data transformations used against your data today?
Are there any industry or regulatory requirements?
What are the data access policies for your organization?
What data controls and change recording are required?

Today, no single technology or vendor offers a one-size-fits-all solution. Any vendor making this product claim is misleading you. Every industry and organization has unique processes and requirements that demand great care when selecting technologies to assist in the data governance process. Before choosing a technology, you must understand the full transformation process of the data so that you can select technologies that track and manage data with an enterprisewide view.

Having an enterprisewide view of data is critical to achieving a core goal of data governance: addressing data quality. A data governance solution is only truly helpful if it addresses all enterprise data management processes and flows, not just those within a single domain or big data platform.
After all, data quality problems can be introduced anywhere in the chain, even before the data reaches the big data platform.

Other big data vendors claim to offer complete data governance. Their solutions mostly focus on data governance within the walls of the big data world and have significant gaps when managing data governance from an enterprisewide view. These are point solutions to an enterprise problem.

It is crucial to leverage the right technology for the organization. The MapR Converged Data Platform is specifically designed to be open and pluggable. This allows teams to leverage the right data governance technology in tandem with existing MapR data governance capabilities.
MAPR DATA GOVERNANCE

The MapR data governance solution consists of two main components: the MapR Converged Data Platform and the MapR DataOps Governance Framework.

[Figure: MapR DataOps Governance Framework. MapR Open Approach to Governance for All Data. Diagram labels: data sources (relational, SaaS, mainframe; documents, emails; blogs, social media, link data; log files, clickstreams); enterprisewide governance (workflow, platform-based security, search, compliance-ready lineage, discovery, scalable metadata repository); business intelligence, analytics, operational applications; Converged Data Platform (cloud-scale data store, analytics & ML engines, operational database, global event streams; high availability, real-time, unified security, multi-tenancy, disaster recovery, global namespace).]

The MapR Converged Data Platform offers a robust and unmatched protection scheme for data within the MapR platform. MapR security is built directly into the platform and applies security protection as data comes into and out of the platform, without requiring an external security manager server or specific security plugins for each ecosystem component. MapR security semantics are applied automatically, by design and out of the box, to data being retrieved or stored by any ecosystem component, application, or user.

The MapR DataOps Governance Framework is built on an open architecture, allowing customers to extend and use the right technology to support processes that match their use cases. With MapR, businesses can track and manage the data transformation process to achieve a complete data governance and data lineage monitoring solution. MapR offers a rich set of APIs to data governance technologies suitable for tracking and managing data across the enterprise.

The MapR DataOps Governance Framework architecture leverages the right partner technology to provide the best data governance approach. Big-data-only solutions offered by others do not provide full end-to-end data governance.
Their patchwork of disparate security models and ad hoc security services adds complexity without actually solving the problem. Our open architecture allows for a best-of-breed solution from industry data governance leaders, giving you a broad range of technology options tailored to specific use cases and requirements.

Every organization has unique data quality procedures in place. Great care is required in selecting technologies to assist in the data governance process to successfully keep track of the metadata and the transformation process. For this reason, the MapR DataOps Governance Framework is designed explicitly around an open architecture. This lets customers plug in the right technology to extend MapR to support and assist in data governance processes and procedures.

The MapR open architecture is supported by leading industry data governance solutions such as Cask, Waterline, Informatica, Collibra, Podium, Dataguise, Talend, and Alation. In addition, MapR data governance partners provide an even tighter integration and certified arrangement so that MapR customers have one metadata catalog and a clear path of data lineage, as illustrated by the graphic below. MapR is currently pursuing arrangements with Cask and Waterline.
Cask provides a unified integration platform for big data. The open source Cask Data Application Platform (CDAP) lets architects, data scientists, and business analysts focus on applications and insights rather than infrastructure and integration. Through powerful self-service data lineage tools and APIs, CDAP provides users with visibility into how data is flowing into, through, and out of data lakes. It allows them to perform impact and root cause analysis and provides an audit trail for compliance.

CDAP provides the capabilities and standardization to collect technical, operational, and business metadata from data ingestion and transformation, which is needed to create rich metadata for governance. Programmatic APIs allow existing Spark or MapReduce-based applications to publish metadata, which enables better tracking and visibility with preexisting solutions. CDAP also aggregates and indexes data at the level of the entities that users interact with. It supports searching based on tags, properties, or schema fields and types, which is critical for discovering datasets in an operational cluster. Both a data dictionary and preferred tags provide a way to standardize the tags and fields that are applied to datasets.

[Figure: MapR DataOps Governance Framework with Cask vs the Competition. Diagram labels: EDW optimization; managed data lake; business-critical data ops & IoT; data preparation; data ingestion; operations & management; security & governance; app development; ecosystem (Navigator, NiFi / HDF) versus Converged Data Platform.]
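To make the impact and root cause analysis described above concrete, the following is a minimal, generic sketch of a lineage graph. It does not use CDAP's or MapR's APIs: the `LineageGraph` class, its method names, and the dataset names are hypothetical. Each recorded transformation is an edge from a source dataset to a target dataset; walking edges backward gives root cause candidates, and walking them forward gives the impact set.

```python
from collections import defaultdict


class LineageGraph:
    """Hypothetical in-memory lineage store (illustration only)."""

    def __init__(self):
        self._parents = defaultdict(set)   # target -> direct sources
        self._children = defaultdict(set)  # source -> direct targets
        self.transforms = {}               # (source, target) -> transform name

    def record(self, source, target, transform):
        """Record that `target` was derived from `source` via `transform`."""
        self._parents[target].add(source)
        self._children[source].add(target)
        self.transforms[(source, target)] = transform

    @staticmethod
    def _reach(start, edges):
        """Collect every node reachable from `start` along `edges`."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def upstream(self, dataset):
        """Root cause analysis: everything `dataset` was derived from."""
        return self._reach(dataset, self._parents)

    def downstream(self, dataset):
        """Impact analysis: everything derived from `dataset`."""
        return self._reach(dataset, self._children)


graph = LineageGraph()
graph.record("crm.orders", "staging.orders", "dedupe")
graph.record("staging.orders", "mart.revenue", "aggregate")
graph.record("erp.invoices", "mart.revenue", "join")

print(sorted(graph.upstream("mart.revenue")))
print(sorted(graph.downstream("crm.orders")))
```

Note that the first source here, `crm.orders`, stands in for a system outside the big data platform, which is the case the enterprisewide view is meant to cover.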
Waterline Data provides a business-centric data catalog for the enterprise. Companies often have problems finding, organizing, and effectively using their data. Most organizations track their data using tribal knowledge in the heads of their data analysts, scientists, and stewards. Waterline's Smart Data Catalog replaces this tribal knowledge with software that automatically profiles and tags data using machine learning plus a system of ratings and reviews (think of it as "Google meets Yelp" for data) to catalog data consistently so users can quickly search for and find data.

Waterline provides solutions for self-service analytics and for data governance and compliance that automate the discovery, curation, and resolution of critical data. This allows users to spend more time using data and less time searching for it, to better comply with data regulatory requirements, and to reduce the costs associated with data redundancy and data hoarding.

[Figure: MapR DataOps Governance Framework with Waterline Data vs the Competition. Diagram title: MapR and Waterline Data Extends Governance. Diagram labels: infrastructure; metadata services; security; data source support; fingerprinting; discovery services; enable security for dark data; catalog data sources beyond Hadoop; tag discovery & suggestions; statistical demographics; near real-time security updates; sensitive data discovery; relational, Azure Blobs, S3 + Redshift; inferred lineage; curation; metadata repository (Navigator or Atlas); tag-based access control; infrastructure (HDFS, Hive; CDH & HDP); basic services.]
MapR Data Governance Without Compromise provides a way to feed relevant MapR data governance data into a customized solution. MapR Professional Services can develop a custom data governance solution that integrates with an existing or new solution. During a six-week engagement, MapR Professional Services develops the foundation for a custom solution using core features of the MapR Converged Data Platform to create an enterprisewide platform for cataloging metadata, collating data evolution events for lineage, and organizing data and assigning policies to facilitate data lifecycle management.

CONCLUSION

Data governance is not just about the technology. Rather, it is a set of processes that track and manage the origin and transformation of all data to achieve a high level of data quality and integrity. The end result is a competitive advantage for your business.

Data governance ensures business data is efficiently managed throughout the enterprise data lifecycle, resulting in data that benefits the business through its high quality, integrity, and trustworthiness. This enterprisewide process is established by the people responsible for data quality. The role of technology is to support the process and the people managing it. Choosing the right technology to align with your organization's goals is essential in establishing a holistic data management program.

For the data to be useful, you must manage it. Because decisions are being made against this data, creating and maintaining trust in the data quality is essential for data governance success.

The MapR DataOps Governance Framework is built on an open architecture. This design provides the flexibility to plug in and extend the right technology that aligns with your organizational processes. Data scientists need an enterprisewide view of the data to ensure that the data maintained is of high quality. This cannot be achieved using technologies that are only focused on big data.
More information on the professional services governance engagement can be found here: https://mapr.com/solutions/quickstart/data-governance/

For more information, visit mapr.com.

MapR and the MapR logo are registered trademarks of MapR and its subsidiaries in the United States and other countries. Other marks and brands may be claimed as the property of others. The product plans, specifications, and descriptions herein are provided for information only and subject to change without notice, and are provided without warranty of any kind, express or implied. Copyright 2018 MapR Technologies, Inc.