Hortonworks DataFlow: Accelerating Big Data Collection and DataFlow Management
A Hortonworks White Paper, December 2015
© 2015 Hortonworks, www.hortonworks.com

Contents

What is Hortonworks DataFlow?
Benefits of Hortonworks DataFlow
  Leverage Operational Efficiency
  Make Better Business Decisions
  Increase Data Security
Features of Hortonworks DataFlow
  Data Collection
  Real Time Decisions
  Operational Efficiency
  Security and Provenance
  Bi-directional DataFlow
  Command and Control
Common Applications of Hortonworks DataFlow
  Accelerated Data Collection and Operational Effectiveness
  Increased Security and Unprecedented Chain of Custody
  The Internet of Any Thing with Hortonworks DataFlow
  Adaptive to Resource Constraints
  Secure Data Collection
  Prioritized Data Transfer and Bi-directional Feedback Loop
Why Hortonworks for Apache Hadoop?
About Hortonworks

What is Hortonworks DataFlow?

Hortonworks DataFlow (HDF), powered by Apache NiFi, is the first integrated platform that solves the complexity and challenges of collecting and transporting data from a multitude of sources, be they big or small, fast or slow, always connected or intermittently available. Hortonworks DataFlow is a single combined platform for data acquisition, simple event processing, transport and delivery, designed to accommodate the highly diverse and complicated dataflows generated by a world of connected people, systems and things.

An ideal solution for the Internet of Any Thing (IoAT), HDF enables simple, fast data acquisition, secure data transport, prioritized dataflow and clear traceability of data from the very edge of your network all the way to the core data center. Through a combination of an intuitive visual interface, a high-fidelity access and authorization mechanism and an always-on chain of custody (data provenance) framework, HDF is the perfect complement to HDP for bringing together historical and perishable insights for your business.

Hortonworks DataFlow is based on Apache NiFi, technology originally created by the NSA (National Security Agency) in 2006 to address the challenge of automating the flow of data between systems of all types, the very same problem that enterprises are encountering today. After eight years of development and use at scale, the NSA Technology Transfer Program released NiFi to the Apache Software Foundation in the fall of 2014.

A single integrated platform for data acquisition, simple event processing, transport and delivery from source to storage, Hortonworks DataFlow enables the real-time collection and processing of perishable insights, while Hortonworks Data Platform can be used to enrich content and support changes to real-time dataflows.
Figure 1: Hortonworks DataFlow. Hortonworks DataFlow is designed to securely collect and transport data from highly diverse data sources, be they big or small, fast or slow, always connected or intermittently available. (Figure labels: Enrich Content; Store Data and Metadata; Perishable Insights; Historical Insights.)

Benefits of Hortonworks DataFlow

DataFlow was designed from the outset to meet the practical challenges of collecting data from a wide range of disparate data sources: securely, efficiently and over a geographically dispersed and possibly fragmented network. Because the NSA encountered many of the issues enterprises are facing now, this technology has been field-proven with built-in capabilities for security, scalability, integration, reliability and extensibility, and has a proven track record for operational usability and deployment.

HORTONWORKS DATAFLOW ENABLES ENTERPRISES TO:

Leverage Operational Efficiency
- Accelerate big data ROI via simplified data collection and a visually intuitive dataflow management interface
- Significantly reduce the cost and complexity of managing, maintaining and evolving dataflows
- Trace and verify the value of data sources for future investments
- Quickly adapt to new data sources through an extremely scalable, extensible platform

Make Better Business Decisions
- Make better business decisions with highly granular data sharing policies
- Focus on innovation by automating dataflow routing, management and troubleshooting without the need for coding
- Enable on-time, immediate decision making by leveraging real-time, bi-directional dataflows
- Increase business agility with prioritized data collection policies

Increase Data Security
- Support unprecedented yet simple-to-implement data security from source to storage
- Improve compliance and reduce risk through highly granular data access, data sharing and data usage policies
- Create a secure dataflow ecosystem with the ability to run the same security and encryption on small-scale, JVM-capable data sources as well as enterprise-class datacenters

In summary, Hortonworks DataFlow helps enterprises:
- Accelerate big data ROI through a single, data-source agnostic collection platform
- Reduce cost and complexity through an intuitive, real-time visual user interface
- Implement unprecedented yet simple data security from source to storage
- Make better business decisions with highly granular data sharing policies
- React in real time by leveraging bi-directional dataflows and prioritized data feeds
- Adapt to new data sources through an extremely scalable, extensible platform

Features of Hortonworks DataFlow

Data Collection: Integrated collection from dynamic, disparate and distributed sources of differing formats, schemas, protocols, speeds and sizes, such as machines, geolocation devices, click streams, files, social feeds, log files and videos.

Real Time Decisions: Real-time evaluation at the edge of whether perishable insights are pertinent or not, and execution of the consequent decisions to send, drop or locally store data as needed.

Operational Efficiency: Fast, effective drag-and-drop interface for the creation, management, tuning and troubleshooting of dataflows, enabling coding-free creation and adjustment of dataflows in five minutes or less.

Security and Provenance: Secure end-to-end routing from source to destination, with discrete user authorization and detailed, real-time visual chain of custody and metadata (data provenance).

Bi-directional DataFlow: Reliable prioritization and transport of data in real time, leveraging bi-directional dataflows to dynamically adapt to fluctuations in data volume, network connectivity, and source and endpoint capacity.

Command and Control: Immediate ability to create, change, tune, view, start, stop, trace, parse, filter, join, merge, transform, fork, clone or replay dataflows through a visual user interface with real-time operational visibility and feedback.

Figure 2: Apache NiFi Real-Time Visual User Interface

Common Applications of Hortonworks DataFlow

Hortonworks DataFlow accelerates time to insight by securely enabling off-the-shelf, flow-based programming for big data infrastructure and simplifying the current complexity of secure data acquisition, ingestion and real-time analysis of distributed, disparate data sources. An ideal framework for the collection of data and the management of dataflows, the most popular uses of Hortonworks DataFlow are: simplified, streamlined big data ingest; increased security for the collection and sharing of data, with high-fidelity chain-of-custody metadata; and as the underlying infrastructure for the Internet of Things.

Case 1: Accelerated Data Collection and Operational Effectiveness

Streamlined Big Data Ingestion

Hortonworks DataFlow accelerates big data pipeline ingest through a single, integrated and easily extensible visual interface for acquiring and ingesting data from different, disparate, distributed data sources in real time. The simplified and integrated creation, control and analysis of dataflows results in faster ROI on big data projects and increased operational effectiveness.

WHAT IS A COMMAND AND CONTROL INTERFACE?

A command and control interface provides the ability to manipulate the dataflow in real time, such that current contextual data can be fed back to the system to immediately change its output. This is in contrast to a design-and-deploy approach, which involves statically programming a dataflow system before the data is flowed through it, and then returning to a static programming phase to make adjustments and restarting the dataflow again. An analogy would be the difference between 3D printing, which requires pre-planning before execution, and molding clay, which provides immediate feedback and adjustment of the end product in real time.

To learn more about Hortonworks DataFlow go to http://hortonworks.com/hdf

Figure 2: An integrated, data source agnostic collection platform. (Callouts: easily access and trace changes to the dataflow; make real-time changes to processors; dynamically adjust the dataflow.)

Why aren't current data collection systems ideal?

Current big data collection and ingest tools are purpose-built and over-engineered because they were not originally designed with universally applicable, operationally efficient design principles in mind. This creates a complex architecture of disparate acquisition, messaging and often customized transformation tools that makes big data ingest complex, time consuming and expensive from both a deployment and a maintenance perspective. Further, the time lag associated with command-line and coding-dependent tools fetters access to data and prevents the on-time, operational decision making required of today's business environment.

Figure 3: Current big data ingest solutions are complex and operationally inefficient

Case 2: Increased Security and Unprecedented Chain of Custody

Increased security and provenance with Hortonworks DataFlow

Data security is growing ever more important in a world of ever-connected devices, and the need to adhere to compliance and data security regulations is currently difficult, complex and costly. Verification of data access and usage is difficult and time consuming, and often involves a manual process of piecing together different systems and reports to verify where data is sourced from, how it is used, who has used it and how often. The tools used for transporting electronic data today are not designed for the expected security requirements of tomorrow. It is difficult, if not impossible, for current tools to share discrete bits of data, much less to do so dynamically, a problem that had to be addressed in the environment of Apache NiFi as a dataflow platform used in governmental agencies.

Hortonworks DataFlow addresses the security and data provenance needs of an electronic world of distributed, real-time big data flow management. Hortonworks DataFlow augments existing systems with a secure, reliable, simplified and integrated big data ingestion platform that ensures data security from all sources, be they centrally located, high-volume data centers or remotely distributed data sources over geographically dispersed communication links. As part of its security features, HDF inherently provides end-to-end data provenance: a chain of custody for data. Beyond the ability to meet compliance regulations, provenance provides a method for tracing data from its point of origin, or from any point in the dataflow, in order to determine which data sources are most used and most valuable.

WHAT IS DATA PROVENANCE?

Provenance is defined as the place of origin or earliest known history of something. In the context of a dataflow, data provenance is the ability to trace the path of a piece of data within a dataflow from its place of creation through to its final destination. Within Hortonworks DataFlow, data provenance provides the ability to visually verify where data came from, how it was used, who viewed it, and whether it was sent, copied, transformed or received. Any system or person that came in contact with a specific piece of data is captured in terms of time, date, action, precedents and dependents, for a complete picture of the chain of custody of that precise piece of data within a dataflow. This provenance metadata is used to support data sharing compliance requirements, as well as dataflow troubleshooting and optimization.

Figure 4: Secure from source to storage with high fidelity data provenance
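The chain-of-custody idea described above can be pictured as an append-only list of events attached to each piece of data as it moves through a flow. The sketch below is purely illustrative: the `ProvenanceEvent` and `FlowRecord` structures and the event-type names are assumptions for this example, not NiFi's actual provenance data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

# Hypothetical provenance event, loosely modeled on the chain-of-custody
# description above (not NiFi's real provenance repository schema).
@dataclass(frozen=True)
class ProvenanceEvent:
    event_type: str  # e.g. CREATE, TRANSFORM, SEND, RECEIVE
    actor: str       # the system or person that touched the data
    timestamp: str   # when the action happened (UTC, ISO 8601)

@dataclass
class FlowRecord:
    payload: bytes
    lineage: List[ProvenanceEvent] = field(default_factory=list)

    def record(self, event_type: str, actor: str) -> None:
        # Every action appends an immutable event, so the complete
        # history of the data survives from source to storage.
        self.lineage.append(ProvenanceEvent(
            event_type, actor, datetime.now(timezone.utc).isoformat()))

# Trace one piece of data from an edge sensor to the core data center.
rec = FlowRecord(b"sensor-reading-42")
rec.record("CREATE", "edge-sensor-7")
rec.record("TRANSFORM", "site-gateway")
rec.record("SEND", "site-gateway")
rec.record("RECEIVE", "core-datacenter")

print([e.event_type for e in rec.lineage])
# -> ['CREATE', 'TRANSFORM', 'SEND', 'RECEIVE']
```

Because every actor appends rather than overwrites, auditing "who touched this data, when, and how" reduces to reading the lineage list, which is the property the white paper attributes to HDF's provenance framework.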

Data democratization with unprecedented security

Hortonworks DataFlow opens new doors to business insight by enabling secure access without hindering analysis, since very specific data can be shared or withheld. For example, Mary could be given access to discrete pieces of data tagged with the term "finance" within a dataflow, while Jan could be given access to the same dataflow but only to data tagged with "2015" and "revenue". This removes the disadvantages of role-based data access, which can inadvertently create security risks, while still enabling the democratization of data for comprehensive analysis and decision making.

Hortonworks DataFlow, with its inherent ability to support fine-grained provenance data and metadata throughout the collection, transport and ingest process, provides the comprehensive and detailed information needed for audit and remediation, unmatched by any existing data ingest system in place today.

Figure 5: Fine-grained data access and control
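The Mary-and-Jan scenario above amounts to tag-based (attribute-based) filtering: each piece of data carries tags, and a user sees only the pieces whose tags satisfy that user's policy. The following is a minimal, hypothetical sketch of that idea; the record layout, the `policies` mapping and the "all required tags must be present" rule are assumptions for illustration, not HDF's actual authorization model.

```python
# Illustrative tag-based access sketch: a user's policy is a set of
# required tags, and a record is visible only if it carries them all.
records = [
    {"id": 1, "tags": {"finance"}, "value": "Q3 forecast"},
    {"id": 2, "tags": {"finance", "2015", "revenue"}, "value": "2015 revenue"},
    {"id": 3, "tags": {"hr"}, "value": "salary bands"},
]

# Hypothetical per-user policies matching the example in the text:
# Mary sees anything tagged "finance"; Jan needs both "2015" and "revenue".
policies = {
    "mary": {"finance"},
    "jan": {"2015", "revenue"},
}

def visible(user: str, record: dict) -> bool:
    # Access is granted when every required tag appears on the record.
    return policies[user].issubset(record["tags"])

mary_view = [r["id"] for r in records if visible("mary", r)]
jan_view = [r["id"] for r in records if visible("jan", r)]
print(mary_view, jan_view)  # -> [1, 2] [2]
```

Note how the same dataflow yields different views per user without duplicating or re-routing the data, which is the contrast with coarse role-based access that the paragraph above draws.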

Case 3: The Internet of Any Thing (IoAT)

The Internet of Any Thing with Hortonworks DataFlow

Designed in the field, where resources are scarce (power, connectivity, bandwidth), Hortonworks DataFlow is a scalable, proven platform for the acquisition and ingestion of data for the Internet of Things (IoT), or even more broadly, the Internet of Any Thing (IoAT).

Adaptive to Resource Constraints

There are many challenges in enabling an ever-connected yet physically dispersed Internet of Things. Data sources may be remote, physical footprints may be limited, and power and bandwidth are likely to be both variable and constrained. Unreliable connectivity disrupts communication and causes data loss, while the lack of security on most of the world's deployed sensors puts businesses and safety at risk. At the same time, devices are producing more data than ever before. Much of the data being produced is data-in-motion, and unlocking the business value from this data is crucial to the business transformations of the modern economy. Yet business transformation relies on accurate, secure access to data from the source through to storage.

Hortonworks DataFlow was designed with all of these real-world constraints in mind: power limitations, connectivity fluctuations, data security and traceability, and data source diversity and geographical distribution, all in support of accurate, time-sensitive decision making.

Figure 6: A proven platform for the Internet of Things

Secure Data Collection

Hortonworks DataFlow addresses the security needs of IoT with a secure, reliable, simplified and integrated big data collection platform that ensures data security from distributed data sources over geographically dispersed communication links. The security features of HDF include end-to-end data provenance: a chain of custody for data. This enables IoT systems to verify the origins of a dataflow, to troubleshoot problems from point of origin through to destination, and to determine which data sources are most frequently used and most valuable.

Hortonworks DataFlow is able to run security and encryption on small-scale, JVM-capable data sources as well as enterprise-class datacenters. This gives the Internet of Things a reliable, secure, common data collection and transport platform with a real-time feedback loop to continually and immediately improve algorithms and analysis for accurate, informed, on-time decision making.

Prioritized Data Transfer and Bi-directional Feedback Loop

Because connectivity and available bandwidth may fluctuate, and the volume of data being produced by a source may exceed what can be accepted by the destination, Hortonworks DataFlow supports the prioritization of data within a dataflow. This means that under resource constraints, the data source can be instructed to automatically promote the more important pieces of information to be sent first, while holding less important data for future windows of transmission opportunity, or possibly not sending it at all. For example, should an outage of a remote device occur, it is critical to send the most important data from that device first as the outage is repaired and communication is re-established. Once this critical data has been sent, it can be followed by the backlog of lower-priority data that is vital to historical analysis but less critical to immediate decision making.

Hortonworks DataFlow enables the decision of whether to send, drop or locally store data to be made at the edge, as needed and as conditions change. Additionally, with a fine-grained command and control interface, data queues can be slowed down or accelerated to balance the demands of the situation at hand with the current availability and cost of resources. With the ability to seamlessly adapt to resource constraints in real time, ensure secure data collection and prioritize data transfer, Hortonworks DataFlow is a proven platform for the Internet of Things.
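The promote-and-hold behaviour described above can be sketched as a priority queue at the edge: when a transmission window opens, only as many items as the bandwidth budget allows are sent, highest priority first, and the rest wait locally for the next window. This is a generic sketch of the technique, not NiFi's actual prioritizer implementation; the `EdgeBuffer` class, the numeric priority scheme and the `budget` parameter are assumptions for illustration.

```python
import heapq

# Minimal sketch of edge-side prioritized transfer: during a constrained
# transmission window, send the most important items first and hold the
# rest locally. Priorities and budgets here are illustrative.
class EdgeBuffer:
    def __init__(self):
        self._heap = []  # (priority, seq, item); lower number = more urgent
        self._seq = 0    # tie-breaker keeps insertion order stable

    def store(self, priority: int, item: str) -> None:
        heapq.heappush(self._heap, (priority, self._seq, item))
        self._seq += 1

    def drain(self, budget: int) -> list:
        """Send up to `budget` items, highest priority first."""
        sent = []
        while self._heap and len(sent) < budget:
            _, _, item = heapq.heappop(self._heap)
            sent.append(item)
        return sent  # anything still queued waits for the next window

buf = EdgeBuffer()
buf.store(2, "hourly-telemetry")
buf.store(0, "outage-alert")   # most urgent
buf.store(1, "device-status")

# Connectivity is restored, but bandwidth allows only two items per window.
first = buf.drain(budget=2)
second = buf.drain(budget=2)
print(first, second)
# -> ['outage-alert', 'device-status'] ['hourly-telemetry']
```

Slowing down or accelerating a queue, as the command and control interface described above allows, corresponds here to adjusting `budget` per window; dropping data under extreme constraints would simply discard low-priority entries instead of holding them.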

Why Hortonworks for Apache Hadoop?

Founded in 2011 by 24 engineers from the original Yahoo! Apache Hadoop development and operations team, Hortonworks has amassed more Apache Hadoop experience under one roof than any other organization. Our team members are active participants and leaders in the Apache Hadoop community, developing, designing, building and testing the core of the Apache Hadoop platform. We have years of experience in Apache Hadoop operations and are best suited to support your mission-critical Apache Hadoop project.

For an independent analysis of Hortonworks Data Platform and its leadership among Apache Hadoop vendors, you can download the Forrester Wave: Big Data Apache Hadoop Solutions, Q1 2014 report from Forrester Research.

About Hortonworks

Hortonworks develops, distributes and supports the only 100% open source Apache Hadoop data platform. Our team comprises the largest contingent of builders and architects within the Apache Hadoop ecosystem, who represent and lead the broader enterprise requirements within these communities. Hortonworks Data Platform deeply integrates with existing IT investments, upon which enterprises can build and deploy Apache Hadoop-based applications. Hortonworks has deep relationships with the key strategic data center partners that enable our customers to unlock the broadest opportunities from Apache Hadoop. For more information, visit www.hortonworks.com.