
Kylo FAQ

General

What is Kylo?
Capturing and processing big data isn't easy. That's why Apache products such as Spark, Kafka, Hadoop, and NiFi, which scale, process, and manage immense data volumes, are so popular. The drawback is that they do not make data flows easy for business users to access, optimize, and analyze. Kylo, from Think Big, overcomes that challenge. It lets businesses easily configure and monitor data pipelines in and through the data lake, so users have constant access to high-quality data. It also enhances data profiling and discovery with extensible metadata.

When will Kylo be generally available?
Think Big is actively using Kylo with multiple large-scale enterprises globally. We are currently limiting use to Think Big services-driven projects until the open source release. We expect a public open source release in Q

Is NiFi compatible with Cloudera, MapR, Hortonworks, EMR, and vanilla distributions?
Yes. NiFi operates on the "edge" and isn't bound to any particular Hadoop distribution.

What is Kylo's value-add over plain NiFi?
NiFi provides flow-based processing and acts as an orchestration engine and framework for data processing on the edge. It doesn't itself provide all the tooling required for a data lake solution. Key benefits of Kylo include:

Write-once, use many times
o Apache NiFi is a powerful IT tool for designing pipelines, but in practice most data-lake-oriented feeds utilize just a small number of unique flows or "patterns". Kylo allows IT the flexibility to design these unique flows as a model and then register the NiFi template with Kylo. This enables non-technical business users to configure dozens or hundreds of new feeds through a simple, guided stepper UI. In other words, Kylo's UI allows users to set up pipelines without having to code in NiFi. As long as the basic ingestion pattern is the same, there is no need for new coding. Business users will be able to bring in new data sources, perform standard transformations, and publish to target systems.

Superior UI for monitoring data feeds
o Kylo's Operations Dashboard provides centralized health monitoring of feeds and of underlying data quality to provide data confidence. It can also integrate with Hadoop monitoring tools such as Ambari or Cloudera Navigator to correlate feed health with cluster service issues. Kylo can enforce Service Level Agreements, data quality metrics, and alerts.

Key data lake features
o Metadata search, data discovery, wrangling, data browsing, and event-based feed execution (to chain together flows).

Accelerates NiFi development through NiFi extensions
o Includes custom NiFi processors for operations such as: Data Profile, Data Cleanse, Data Validate, Merge/Dedupe, Extract Table with high-water mark, etc.
o Includes custom NiFi processors for utilizing Hadoop for processing: Spark exec, Sqoop, Spark shell, Hive and Spark via JDBC/Thrift, and others. These processors aren't yet available with vanilla NiFi.
o Pre-built NiFi templates for implementing data lake best practices: Data Ingest, ILM, and Data Processing.

Open Source

What are Think Big's plans to open source Kylo?
Think Big plans to release Kylo as open source under the Apache 2.0 license. Customers given the source code prior to release will be asked to sign an NDA restricting them from publicly releasing the source until after the formal release. Think Big will offer paid commercial support for the framework under the open source model.

Architecture

What is the deployment architecture?
Kylo typically runs on a Linux edge node of the Hadoop cluster, either on-premise or in the cloud. Kylo integrates with Apache NiFi, which can be co-located or on separate edge machines. Kylo currently requires Postgres or MySQL for a metadata repository. It requires Java 8 or later and has been tested on several major generations of Cloudera and Hortonworks Hadoop distributions and Apache Spark 1.5.x+.
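These version floors can be checked mechanically before installation. A minimal sketch of such a preflight check; the component names and the "installed" versions below are illustrative values, not read from a real environment:

```python
def meets_minimum(version: str, minimum: str) -> bool:
    """Compare dotted version strings numerically, e.g. '1.8.0' satisfies '1.8'."""
    parse = lambda v: [int(p) for p in v.split(".") if p.isdigit()]
    return parse(version) >= parse(minimum)

# Minimums taken from the prerequisites in this FAQ; installed values are examples.
requirements = {"java": "1.8", "hadoop": "2.4", "spark": "1.5"}
installed = {"java": "1.8.0", "hadoop": "2.7.3", "spark": "1.6.2"}

for name, minimum in requirements.items():
    status = "OK" if meets_minimum(installed[name], minimum) else "TOO OLD"
    print(f"{name}: {status}")
```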
It is installed via RPM.

Are there any pre-requisites for the Kylo installation and setup?
- Redhat/GNU/Linux distributions
- RPM (for install)
- Java 1.8 (or greater)
- Hadoop 2.4+

- Spark 1.5.x+
- Apache NiFi 0.5+ (or Hortonworks DataFlow)
- Hive
- MySQL

Metadata

What type of metadata does Kylo capture?
Kylo captures all business and technical (e.g. schema) metadata defined during the creation of feeds and categories. Kylo captures lineage as relationships between feeds and automatically captures all operational metadata generated during a pipeline. Kylo captures job and feed performance metadata and SLA metrics, and also generates data profile statistics, which act as metadata. Version metadata and feed configuration changes are also captured.

How does Kylo support metadata exchange with 3rd party metadata servers?
Kylo's metadata server has REST APIs which can be used for metadata exchange. Kylo does not have a single API call to export everything, so one would have to be written in the integration layer, or added as a new API via customization work.

How does Kylo deal with custom data formats?
The Kylo team is actively working on making the entire schema discovery mechanism a pluggable component, so future data formats can be supported as plug-ins. This also includes the ability to supply a schema and the business glossary as a definition file during feed creation. The advantage of this approach is that it can leverage existing metadata.

What is the metadata server?
A key part of Kylo's architecture relies on the open-source JBoss ModeShape framework, which allows for dynamic schemas. This gives the business the ability to extend entities with business metadata. Its features include:
- Dynamic schemas - extensible support for extending the schema with custom business metadata in the field
- Versioning - ability to track changes to metadata over time
- Text search - flexible searching of the metastore
- Portability - can run on SQL and NoSQL databases

How extensible is Kylo's metadata model?

Kylo's metadata model is very extensible due to the use of ModeShape (see above). The Kylo application allows an administrator to define standard business metadata that users will be prompted to enter when creating feeds and categories. The configuration can be set up so that all feeds in a particular category collect the same type of business metadata. This configuration is entirely UI-driven.

Is business-related data captured, or is it all operational metadata?
Yes, see above. Business metadata fields can be defined by the customer and will appear in the UI during the feed setup process.

Does Kylo's metadata server provide APIs?
Yes, Kylo provides REST APIs documented using Swagger.

Does Kylo provide a visual lineage?
Not today. Kylo's API allows users to produce a lineage graph via JSON but does not visualize it (yet). The Kylo metadata server has REST APIs which could allow a pipeline designer to supplement Kylo's lineage with additional metadata, providing a much finer-grained capability. Additionally, the REST APIs can be used to record metadata that originated in 3rd party metadata repositories.

What type of process metadata does Kylo capture?
Kylo captures information on the status of feeds: how long a run took, when it started and finished, any errors, etc. Kylo captures operational metadata at each step, which can include record counts and similar measures, depending on the type of step.

What type of data or record lineage is captured?
Kylo tracks lineage as relationships between feeds. A feed in Kylo represents a significant unit of data movement between source(s) and sink (e.g. an ingest, a transformation pipeline, or an export of data), but it does not imply a particular technology, since transformations can occur in Spark, Hive, Pig, shell scripts, or even 3rd party tools like Informatica, Talend, etc. At Think Big we believe that feed lineage has advantages over the bottom-up approach that tools like Cloudera Navigator (object lineage) provide.

A feed is enriched with business data, Service Level Agreements, job history, and technical metadata about any sources and sinks it uses, as well as operational metadata about datasets. When tracing lineage, Kylo is capable of providing a much more relatable representation of dependencies (either forwards or backwards through the chain).

Does Kylo track object-level lineage (table, attribute)?
Kylo does not automatically capture metadata for each transform at the lowest level, nor does it currently perform impact analysis on table structure changes. Object lineage is possible through tools such as Cloudera Navigator or Atlas, which can be used as a supplement to Kylo. Keep in mind that these tools have blind spots: they are limited to certain technologies like Hive or Impala. If a transform occurs in Spark, they will not be able to trace it. These tools also do not perform automatic impact analysis.

Why is lineage automatically tracked between feeds and not table objects?

In a traditional EDW/RDBMS solution, a table is the de facto storage unit, and SQL primitives (filter, join, union, etc.) can fully represent all transforms. In Hadoop one must consider non-traditional concepts such as streams, queues, NoSQL/HBase, flat files, external tables with HDFS, Spark/Pig jobs, MapReduce, Python, etc. NiFi has 150 existing connectors to these different technologies and transforms, and Kylo specifically allows a designer to use all of these capabilities. The downside is that there is no reliable mechanism to automatically capture object-level lineage through all the potential sources, sinks, and processes that could come into play.

Atlas and Navigator ignore this reality and only track transforms between Hive/Impala tables via HQL. These two tools are constrained to tracking lineage for Hive transactions. This works just fine until the introduction of a source outside of Hive or an unsupported transformation technology (e.g. Spark, Pig), at which point your lineage is broken.

A feed in Kylo's metadata model is a 1st-class entity representing a meaningful movement of data. Feeds generally process data between source(s) and sink(s); an example is an ingest or wrangle job. The internals of a feed can involve very complex steps, and the feed abstraction makes those messy details a black box. The beauty of a feed is that it is an incredibly enriched object for communicating:
- Business metadata: descriptions of the feed's purpose, as well as any other business metadata specified by the creator
- Intra-feed lineage: all job executions, steps, and operational metadata, including profile statistics (operational metadata includes source files, counts, etc.)
- DAG: Kylo can provide access to the full pipeline in human-readable form (i.e. the NiFi flow)
- Service Level Agreement and its performance over time
- Technical metadata, such as any tables created, their schemas, and validation and cleansing rules

Finally, and most importantly for lineage, a feed can declare a dependency on other feed(s). Currently this can be declared through Kylo's UI via the precondition capability. This dependency relationship can be n-deep and n-wide, and can then be queried (forward or backward) through the REST API. This allows Kylo to understand lineage from the perspective of chains of feeds, each with its associated treasure trove of meaningful metadata.

Is there a way to start from a table object and understand its lineage?
Yes. If a table is created by a feed, it is possible to navigate from the table to its parent feed, to dependent feed(s), and on to their associated tables. The metadata relationship is:
1. Feed_B explicitly has a dependency on Feed_A. Navigate: Feed_A <- (depends) Feed_B
2. Feed_A writes to Table_A, Feed_B writes to Table_B. Navigate: Feed_A (sink: table_a) <- (depends) Feed_B (sink: table_b)

Can Kylo capture enhanced lineage using its metadata model if a customer wants a more explicit relationship between sources, sinks, and processes?
Yes, this is possible using the REST API. The approach rests with the designer role: the designer can create a NiFi model that explicitly updates the metadata repository to create detailed relationships with the deep knowledge he or she has. It is extra up-front effort but provides total flexibility. Think Big R&D can provide examples of using the REST API for this effort. This includes using the REST API to document external processes, for example transforms and flows outside of Kylo's purview (e.g. Informatica, Bteq, Talend, ...)
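The feed-to-feed and table-to-feed navigation described above can be sketched as a small graph walk. The feed names, the JSON shape, and the "dependsOn"/"sink" fields below are illustrative assumptions about what a lineage query might return, not Kylo's actual API response format:

```python
import json

# Illustrative payload: each feed lists the feeds it depends on and its sink table.
payload = json.loads("""
{
  "feed_b": {"dependsOn": ["feed_a"], "sink": "table_b"},
  "feed_a": {"dependsOn": [], "sink": "table_a"}
}
""")

def upstream_chain(feeds, name):
    """Walk 'dependsOn' links backwards from a feed, depth-first (n-deep, n-wide)."""
    chain = []
    for dep in feeds[name]["dependsOn"]:
        chain.append(dep)
        chain.extend(upstream_chain(feeds, dep))
    return chain

def table_lineage(feeds, table):
    """Start from a table, find the feed that writes it, then list upstream tables."""
    owner = next(f for f, v in feeds.items() if v["sink"] == table)
    return [feeds[f]["sink"] for f in upstream_chain(feeds, owner)]

print(upstream_chain(payload, "feed_b"))   # ['feed_a']
print(table_lineage(payload, "table_b"))   # ['table_a']
```

This mirrors the two-step navigation in the answer above: Table_B leads to Feed_B, Feed_B's precondition leads to Feed_A, and Feed_A's sink leads to Table_A.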

Development Lifecycle

What's the development lifecycle process using Kylo?
Pipelines developed with Apache NiFi by IT designers can be built in one environment, then imported into UAT and production after testing. Thus the production NiFi environment would typically be limited to an administrator. Once the NiFi template is registered with Kylo's system, a business analyst can configure new feeds from it through the guided user interface.

Does Kylo support an approval process to move feeds into production?
Kylo, using Apache NiFi, does not require a restart to deploy new pipelines. By locking down production NiFi access, users can be restricted from creating new types of pipelines without a formal approval process. The Kylo user interface does not yet support authorization, roles, etc.

Suppose a client has over 100 source systems and over 10 thousand tables to be ingested into Hadoop. What's the best way to configure data feeds for them in Kylo? One by one?
One could theoretically write scripts that use Kylo's APIs to generate those feeds. Kylo does not currently have a utility to do it.

Tool Comparisons

Is Kylo similar to Cloudera Navigator or Apache Atlas?
Navigator is a governance tool that comes as part of the Cloudera Enterprise license. Among other features, it provides data lineage of Hive SQL queries. This is useful but only provides part of the picture. Kylo as a framework is really the foundation of an entire solution:
- Captures both business and operational metadata
- Tracks lineage at the feed level (much more useful)
- Provides IT Operations with a useful dashboard, plus the ability to track and enforce Service Level Agreements, performance metrics, etc.

How does Kylo compare to traditional ETL tools like Talend, Informatica, or DataStage?
Many ETL tools are focused on SQL transformations using their own technology cluster. Hadoop is really ELT (extract and load raw data, then transform). Typically, a data-warehouse-style transformation targets a relational schema such as a star or snowflake; in Hadoop, the target is another flat, denormalized structure. Kylo provides a user interface for an end user to configure new data feeds, including schema, security, validation, cleansing, etc. Kylo's wrangling feature provides the ability to perform complex visual data transformations using Spark as the engine. Kylo can theoretically support any transformation technology that Hadoop supports. Potentially, 3rd party technologies such as Talend can be orchestrated via NiFi, leveraging those technologies as well.

How does Kylo compare with Teradata Listener?

Teradata Listener is a technology for self-service data ingest. Listener reduces complexity for end users (such as application developers or marketing intelligence) and for IT by providing a single platform to deploy and manage an end-user-specified ingestion and distribution model, significantly reducing deployment time and cost of ownership. Kylo, by contrast, is a solutions framework for delivering data lakes on Hadoop and Spark. It performs ELT, etc., with UI modules for IT Operations, data analysts, and data scientists.

Scheduler

What is the best way to schedule job priorities in Kylo?
Typically scheduling is performed through the built-in scheduler. There are some advanced techniques in NiFi that allow further prioritization for shared resources.

Can Kylo support complicated ETL scheduling?
Kylo supports Cron, timer-based, or event-based scheduling using rules. Cron is very flexible.

What's the difference between the timer and Cron schedule strategies?
Timer is for a fixed interval (i.e. every 5 minutes or every 10 seconds). Cron can be configured to do that as well, but can also handle more complex cases such as "every Tuesday at 8 AM and 4 PM".

Does Kylo support a message-triggered schedule strategy?
Yes, Kylo can absolutely support message-triggered schedule strategies. This is merely a modification to Kylo's generic ingest template.

Does Kylo support chaining feeds (i.e. one data feed consumed by another data feed)?
Yes, Kylo supports event-based triggering of feeds. Start by defining rules that determine when to run a feed, such as "run when data has been processed by feed A and feed B, and wait up to an hour before running anyway". Kylo supports everything from simple rules up to very complicated rules requiring use of its API.

Security

Does Kylo have a roles, users, and privileges management function?
Kylo uses Spring Security. As such, it can integrate with Active Directory, LDAP, or most likely any authentication provider.

Kylo's Operations Dashboard does not currently support roles, as it is typically oriented to a single role (IT Operations). Authorization could be added in the future.

How does the incremental loading strategy of a data feed work?

Kylo supports a simple incremental extract component. Kylo maintains a high-water mark for each load using a date field in the source record. Kylo can further be configured with a backoff or overlap to ensure that it does not miss records. At this time there isn't CDC tool integration with Kylo.

When creating a data feed for a relational database, how should one source the database's schema?
Kylo introspects the source schema and exposes it through its user interface for users to configure feeds.

What kinds of databases can be supported in Kylo?
Kylo stores metadata and job history in MySQL or Postgres. For sourcing data, Kylo can theoretically support any database that provides a JDBC driver.

Does Kylo support creating a Hive table automatically after the source data is put into Hadoop?
Kylo has a stepper wizard that can be used to configure feeds and can define a table schema in Hive. The stepper infers the schema by looking at a sample file or at the database source. It automatically creates a Hive table in the first run of the feed.

Where is the pipeline configuration data stored? In a database or the file system?
Kylo provides a user interface to configure pipelines or feeds. The metadata is stored in a metadata server backed by MySQL (or alternatively Postgres).

How can a user rerun a feed? What are the steps to restore the original state before data ingest?
One exciting feature of Kylo is the ability for NiFi to replay a failed step. This can be particularly useful for secondary steps of a pipeline (e.g. a flow succeeds in processing data into Hive but fails to archive into S3); it may be possible to re-execute just the S3 portion without a full re-execution of the data. In general, the engineers who built Kylo strive for idempotent behavior, so any step and its data can be reprocessed without duplication.

For more information on Kylo visit or follow Think Big on Twitter for the latest Kylo program updates.
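The high-water-mark-with-overlap pattern described above can be sketched in a few lines. This uses SQLite as a stand-in source; the table name, date column, and overlap window are hypothetical, and a real feed would merge or dedupe the rows re-read inside the overlap:

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER, updated_at TEXT)")
conn.executemany("INSERT INTO src VALUES (?, ?)", [
    (1, "2017-01-01 00:00:00"),
    (2, "2017-01-02 00:00:00"),
    (3, "2017-01-03 00:00:00"),
])

def incremental_extract(conn, watermark: str, overlap: timedelta):
    """Fetch rows newer than the watermark minus an overlap, so late records aren't missed."""
    cutoff = (datetime.fromisoformat(watermark) - overlap).isoformat(sep=" ")
    rows = conn.execute(
        "SELECT id, updated_at FROM src WHERE updated_at > ? ORDER BY updated_at",
        (cutoff,),
    ).fetchall()
    # Advance the watermark to the newest date seen; downstream dedupe absorbs re-reads.
    new_watermark = max((r[1] for r in rows), default=watermark)
    return rows, new_watermark

rows, wm = incremental_extract(conn, "2017-01-02 00:00:00", timedelta(hours=1))
print(rows)  # rows 2 and 3: row 2 falls inside the one-hour overlap window
print(wm)    # '2017-01-03 00:00:00'
```

ISO-8601 date strings compare correctly as text, which is why the SQL comparison on the TEXT column is safe here.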


More information

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture Big Data Syllabus Hadoop YARN Setup Programming in YARN framework j Understanding big data and Hadoop Big Data Limitations and Solutions of existing Data Analytics Architecture Hadoop Features Hadoop Ecosystem

More information

IBM Data Replication for Big Data

IBM Data Replication for Big Data IBM Data Replication for Big Data Highlights Stream changes in realtime in Hadoop or Kafka data lakes or hubs Provide agility to data in data warehouses and data lakes Achieve minimum impact on source

More information

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training:: Module Title Duration : Cloudera Data Analyst Training : 4 days Overview Take your knowledge to the next level Cloudera University s four-day data analyst training course will teach you to apply traditional

More information

Big Data Hadoop Course Content

Big Data Hadoop Course Content Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux

More information

docs.hortonworks.com

docs.hortonworks.com docs.hortonworks.com : Getting Started Guide Copyright 2012, 2014 Hortonworks, Inc. Some rights reserved. The, powered by Apache Hadoop, is a massively scalable and 100% open source platform for storing,

More information

StreamSets Control Hub Installation Guide

StreamSets Control Hub Installation Guide StreamSets Control Hub Installation Guide Version 3.2.1 2018, StreamSets, Inc. All rights reserved. Table of Contents 2 Table of Contents Chapter 1: What's New...1 What's New in 3.2.1... 2 What's New in

More information

Schema Registry Overview

Schema Registry Overview 3 Date of Publish: 2018-11-15 https://docs.hortonworks.com/ Contents...3 Examples of Interacting with Schema Registry...4 Schema Registry Use Cases...6 Use Case 1: Registering and Querying a Schema for

More information

Big Data with Hadoop Ecosystem

Big Data with Hadoop Ecosystem Diógenes Pires Big Data with Hadoop Ecosystem Hands-on (HBase, MySql and Hive + Power BI) Internet Live http://www.internetlivestats.com/ Introduction Business Intelligence Business Intelligence Process

More information

Increase Value from Big Data with Real-Time Data Integration and Streaming Analytics

Increase Value from Big Data with Real-Time Data Integration and Streaming Analytics Increase Value from Big Data with Real-Time Data Integration and Streaming Analytics Cy Erbay Senior Director Striim Executive Summary Striim is Uniquely Qualified to Solve the Challenges of Real-Time

More information

Installing an HDF cluster

Installing an HDF cluster 3 Installing an HDF cluster Date of Publish: 2018-08-13 http://docs.hortonworks.com Contents Installing Ambari...3 Installing Databases...3 Installing MySQL... 3 Configuring SAM and Schema Registry Metadata

More information

Certified Big Data and Hadoop Course Curriculum

Certified Big Data and Hadoop Course Curriculum Certified Big Data and Hadoop Course Curriculum The Certified Big Data and Hadoop course by DataFlair is a perfect blend of in-depth theoretical knowledge and strong practical skills via implementation

More information

Migrate from Netezza Workload Migration

Migrate from Netezza Workload Migration Migrate from Netezza Automated Big Data Open Netezza Source Workload Migration CASE SOLUTION STUDY BRIEF Automated Netezza Workload Migration To achieve greater scalability and tighter integration with

More information

Hortonworks and The Internet of Things

Hortonworks and The Internet of Things Hortonworks and The Internet of Things Dr. Bernhard Walter Solutions Engineer About Hortonworks Customer Momentum ~700 customers (as of November 4, 2015) 152 customers added in Q3 2015 Publicly traded

More information

Hortonworks Data Platform

Hortonworks Data Platform Data Governance () docs.hortonworks.com : Data Governance Copyright 2012-2017 Hortonworks, Inc. Some rights reserved. The, powered by Apache Hadoop, is a massively scalable and 100% open source platform

More information

Migrate from Netezza Workload Migration

Migrate from Netezza Workload Migration Migrate from Netezza Automated Big Data Open Netezza Source Workload Migration CASE SOLUTION STUDY BRIEF Automated Netezza Workload Migration To achieve greater scalability and tighter integration with

More information

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks Asanka Padmakumara ETL 2.0: Data Engineering with Azure Databricks Who am I? Asanka Padmakumara Business Intelligence Consultant, More than 8 years in BI and Data Warehousing A regular speaker in data

More information

microsoft

microsoft 70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series

More information

Oracle Big Data Discovery

Oracle Big Data Discovery Oracle Big Data Discovery Turning Data into Business Value Harald Erb Oracle Business Analytics & Big Data 1 Safe Harbor Statement The following is intended to outline our general product direction. It

More information

BIG DATA COURSE CONTENT

BIG DATA COURSE CONTENT BIG DATA COURSE CONTENT [I] Get Started with Big Data Microsoft Professional Orientation: Big Data Duration: 12 hrs Course Content: Introduction Course Introduction Data Fundamentals Introduction to Data

More information

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem Zohar Elkayam www.realdbamagic.com Twitter: @realmgic Who am I? Zohar Elkayam, CTO at Brillix Programmer, DBA, team leader, database trainer,

More information

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou The Hadoop Ecosystem EECS 4415 Big Data Systems Tilemachos Pechlivanoglou tipech@eecs.yorku.ca A lot of tools designed to work with Hadoop 2 HDFS, MapReduce Hadoop Distributed File System Core Hadoop component

More information

Data sources. Gartner, The State of Data Warehousing in 2012

Data sources. Gartner, The State of Data Warehousing in 2012 data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing. Gartner, The State of Data Warehousing

More information

Hadoop. Introduction / Overview

Hadoop. Introduction / Overview Hadoop Introduction / Overview Preface We will use these PowerPoint slides to guide us through our topic. Expect 15 minute segments of lecture Expect 1-4 hour lab segments Expect minimal pretty pictures

More information

Enterprise Data Catalog Fixed Limitations ( Update 1)

Enterprise Data Catalog Fixed Limitations ( Update 1) Informatica LLC Enterprise Data Catalog 10.2.1 Update 1 Release Notes September 2018 Copyright Informatica LLC 2015, 2018 Contents Enterprise Data Catalog Fixed Limitations (10.2.1 Update 1)... 1 Enterprise

More information

Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk

Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk Raanan Dagan and Rohit Pujari September 25, 2017 Washington, DC Forward-Looking Statements During the course of this presentation, we may

More information

Alexander Klein. #SQLSatDenmark. ETL meets Azure

Alexander Klein. #SQLSatDenmark. ETL meets Azure Alexander Klein ETL meets Azure BIG Thanks to SQLSat Denmark sponsors Save the date for exiting upcoming events PASS Camp 2017 Main Camp 05.12. 07.12.2017 (04.12. Kick-Off abends) Lufthansa Training &

More information

An Oracle White Paper October 12 th, Oracle Metadata Management v New Features Overview

An Oracle White Paper October 12 th, Oracle Metadata Management v New Features Overview An Oracle White Paper October 12 th, 2018 Oracle Metadata Management v12.2.1.3.0 Disclaimer This document is for informational purposes. It is not a commitment to deliver any material, code, or functionality,

More information

The future of Subsurface Data Management? Building a Data Science Lab Data Lake Jane McConnell, Practice Partner Oil and Gas, Teradata DEJ KL, 3

The future of Subsurface Data Management? Building a Data Science Lab Data Lake Jane McConnell, Practice Partner Oil and Gas, Teradata DEJ KL, 3 The future of Subsurface Data Management? Building a Data Science Lab Data Lake Jane McConnell, Practice Partner Oil and Gas, Teradata DEJ KL, 3 October 2017 Analytics and AI is gaining ground in Subsurface

More information

Big Data Hadoop Stack

Big Data Hadoop Stack Big Data Hadoop Stack Lecture #1 Hadoop Beginnings What is Hadoop? Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware

More information

Data Virtualization Implementation Methodology and Best Practices

Data Virtualization Implementation Methodology and Best Practices White Paper Data Virtualization Implementation Methodology and Best Practices INTRODUCTION Cisco s proven Data Virtualization Implementation Methodology and Best Practices is compiled from our successful

More information

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI) CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI) The Certificate in Software Development Life Cycle in BIGDATA, Business Intelligence and Tableau program

More information

Data sources. Gartner, The State of Data Warehousing in 2012

Data sources. Gartner, The State of Data Warehousing in 2012 data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing. Gartner, The State of Data Warehousing

More information

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development:: Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized

More information

Administration 1. DLM Administration. Date of Publish:

Administration 1. DLM Administration. Date of Publish: 1 DLM Administration Date of Publish: 2018-05-18 http://docs.hortonworks.com Contents Replication concepts... 3 HDFS cloud replication...3 Hive cloud replication... 3 Cloud replication guidelines and considerations...4

More information

Datameer for Data Preparation:

Datameer for Data Preparation: Datameer for Data Preparation: Explore, Profile, Blend, Cleanse, Enrich, Share, Operationalize DATAMEER FOR DATA PREPARATION: EXPLORE, PROFILE, BLEND, CLEANSE, ENRICH, SHARE, OPERATIONALIZE Datameer Datameer

More information

iway iway Big Data Integrator New Features Bulletin and Release Notes Version DN

iway iway Big Data Integrator New Features Bulletin and Release Notes Version DN iway iway Big Data Integrator New Features Bulletin and Release Notes Version 1.5.0 DN3502232.1216 Active Technologies, EDA, EDA/SQL, FIDEL, FOCUS, Information Builders, the Information Builders logo,

More information

Spotfire Advanced Data Services. Lunch & Learn Tuesday, 21 November 2017

Spotfire Advanced Data Services. Lunch & Learn Tuesday, 21 November 2017 Spotfire Advanced Data Services Lunch & Learn Tuesday, 21 November 2017 CONFIDENTIALITY The following information is confidential information of TIBCO Software Inc. Use, duplication, transmission, or republication

More information

Oracle Big Data. A NA LYT ICS A ND MA NAG E MENT.

Oracle Big Data. A NA LYT ICS A ND MA NAG E MENT. Oracle Big Data. A NALYTICS A ND MANAG E MENT. Oracle Big Data: Redundância. Compatível com ecossistema Hadoop, HIVE, HBASE, SPARK. Integração com Cloudera Manager. Possibilidade de Utilização da Linguagem

More information

Improving Your Business with Oracle Data Integration See How Oracle Enterprise Metadata Management Can Help You

Improving Your Business with Oracle Data Integration See How Oracle Enterprise Metadata Management Can Help You Improving Your Business with Oracle Data Integration See How Oracle Enterprise Metadata Management Can Help You Özgür Yiğit Oracle Data Integration, Senior Manager, ECEMEA Safe Harbor Statement The following

More information

Talend Big Data Sandbox. Big Data Insights Cookbook

Talend Big Data Sandbox. Big Data Insights Cookbook Overview Pre-requisites Setup & Configuration Hadoop Distribution Download Demo (Scenario) Overview Pre-requisites Setup & Configuration Hadoop Distribution Demo (Scenario) About this cookbook What is

More information

The Technology of the Business Data Lake. Appendix

The Technology of the Business Data Lake. Appendix The Technology of the Business Data Lake Appendix Pivotal data products Term Greenplum Database GemFire Pivotal HD Spring XD Pivotal Data Dispatch Pivotal Analytics Description A massively parallel platform

More information

GDPR Data Discovery and Reporting

GDPR Data Discovery and Reporting GDPR Data Discovery and Reporting PRODUCT OVERVIEW The GDPR Challenge The EU General Data Protection Regulation (GDPR) is a regulation mainly concerned with how data is captured and retained, and how organizations

More information

FEATURES BENEFITS SUPPORTED PLATFORMS. Reduce costs associated with testing data projects. Expedite time to market

FEATURES BENEFITS SUPPORTED PLATFORMS. Reduce costs associated with testing data projects. Expedite time to market E TL VALIDATOR DATA SHEET FEATURES BENEFITS SUPPORTED PLATFORMS ETL Testing Automation Data Quality Testing Flat File Testing Big Data Testing Data Integration Testing Wizard Based Test Creation No Custom

More information

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com HI, I M SUJEE MANIYAM Founder / Principal @ ElephantScale

More information

exam. Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0

exam.   Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0 70-775.exam Number: 70-775 Passing Score: 800 Time Limit: 120 min File Version: 1.0 Microsoft 70-775 Perform Data Engineering on Microsoft Azure HDInsight Version 1.0 Exam A QUESTION 1 You use YARN to

More information

Hortonworks Data Platform

Hortonworks Data Platform Data Governance () docs.hortonworks.com : Data Governance Copyright 2012-2017 Hortonworks, Inc. Some rights reserved. The, powered by Apache Hadoop, is a massively scalable and 100% open source platform

More information

Prototyping Data Intensive Apps: TrendingTopics.org

Prototyping Data Intensive Apps: TrendingTopics.org Prototyping Data Intensive Apps: TrendingTopics.org Pete Skomoroch Research Scientist at LinkedIn Consultant at Data Wrangling @peteskomoroch 09/29/09 1 Talk Outline TrendingTopics Overview Wikipedia Page

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation)

HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation) HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation) Introduction to BIGDATA and HADOOP What is Big Data? What is Hadoop? Relation between Big

More information

iway iway Big Data Integrator New Features Bulletin and Release Notes Version DN

iway iway Big Data Integrator New Features Bulletin and Release Notes Version DN iway iway Big Data Integrator New Features Bulletin and Release Notes Version 1.5.1 DN3502232.0517 Active Technologies, EDA, EDA/SQL, FIDEL, FOCUS, Information Builders, the Information Builders logo,

More information

Agile Data Management Challenges in Enterprise Big Data Landscape

Agile Data Management Challenges in Enterprise Big Data Landscape Agile Data Management Challenges in Enterprise Big Data Landscape Eric Simon, SAP Big Data October, 2017 1 Evolution Towards Enterprise Big Data Landscape administrator Data analyst Athena Redshift #123

More information

iway iway Big Data Integrator User s Guide Version DN

iway iway Big Data Integrator User s Guide Version DN iway iway Big Data Integrator User s Guide Version 1.5.0 DN3502221.1216 Active Technologies, EDA, EDA/SQL, FIDEL, FOCUS, Information Builders, the Information Builders logo, iway, iway Software, Parlay,

More information

Certified Big Data Hadoop and Spark Scala Course Curriculum

Certified Big Data Hadoop and Spark Scala Course Curriculum Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills

More information

Own change. TECHNICAL WHITE PAPER Data Integration With REST API

Own change. TECHNICAL WHITE PAPER Data Integration With REST API TECHNICAL WHITE PAPER Data Integration With REST API Real-time or near real-time accurate and fast retrieval of key metrics is a critical need for an organization. Many times, valuable data are stored

More information

Hortonworks Data Platform

Hortonworks Data Platform Data Governance () docs.hortonworks.com : Data Governance Copyright 2012-2017 Hortonworks, Inc. Some rights reserved. The, powered by Apache Hadoop, is a massively scalable and 100% open source platform

More information

Unifying Big Data Workloads in Apache Spark

Unifying Big Data Workloads in Apache Spark Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache

More information

Informatica Cloud Data Integration Spring 2018 April. What's New

Informatica Cloud Data Integration Spring 2018 April. What's New Informatica Cloud Data Integration Spring 2018 April What's New Informatica Cloud Data Integration What's New Spring 2018 April April 2018 Copyright Informatica LLC 2016, 2018 This software and documentation

More information