
Kylo FAQ

General

What is Kylo?
Capturing and processing big data isn't easy. That's why Apache products such as Spark, Kafka, Hadoop, and NiFi, which scale, process, and manage immense data volumes, are so popular. The drawback is that they do not make data flows easy for business users to access, optimize, and analyze. Kylo, from Think Big, overcomes that challenge. It lets businesses easily configure and monitor data pipelines in and through the data lake, so users have constant access to high-quality data. It also enhances data profiling and discovery with extensible metadata.

When will Kylo be generally available?
Think Big is actively using Kylo with multiple large-scale enterprises globally. We are currently limiting use to Think Big services-driven projects until the open source release. We expect a public open source release in Q

Is NiFi compatible with Cloudera, MapR, Hortonworks, EMR, and vanilla distributions?
Yes. NiFi operates on the "edge" and isn't bound to any particular Hadoop distribution.

What is Kylo's value-add over plain NiFi?
NiFi provides flow-based processing and acts as an orchestration engine and framework for data processing on the edge. It doesn't itself provide all the tooling required for a data lake solution. Key benefits of Kylo include:

Write-once, use many times
o Apache NiFi is a powerful IT tool for designing pipelines, but in practice most data-lake-oriented feeds utilize just a small number of unique flows or "patterns". Kylo allows IT the flexibility to design these unique flows as a model and then register the NiFi template with Kylo. This enables non-technical business users to configure dozens or hundreds of new feeds through a simple, guided stepper UI. In other words, Kylo's UI allows users to set up pipelines without having to code in NiFi. As long as the basic ingestion pattern is the same, there is no need for new coding. Business users will be able to bring in new data sources, perform standard transformations, and publish to target systems.

Superior UI for monitoring data feeds
o Kylo's Operations Dashboard provides centralized health monitoring of feeds and of underlying data quality to provide data confidence. It can also integrate with Hadoop monitoring tools such as Ambari or Cloudera Navigator to correlate feed health with cluster service issues. Kylo can enforce Service Level Agreements, data quality metrics, and alerts.

Key data lake features
o Metadata search, data discovery, wrangling, data browsing, and event-based feed execution (to chain together flows).

Accelerates NiFi development through NiFi extensions
o Includes custom NiFi processors for operations such as: Data Profile, Data Cleanse, Data Validate, Merge/Dedupe, Extract Table with high-water mark, etc.
o Includes custom NiFi processors for utilizing Hadoop for processing: Spark exec, Sqoop, Spark shell, Hive and Spark via JDBC/Thrift, and others. These processors aren't yet available with vanilla NiFi.
o Pre-built NiFi templates for implementing data lake best practices: Data Ingest, ILM, and Data Processing.

Open Source

What are Think Big's plans to open source Kylo?
Think Big plans to release Kylo as open source under the Apache 2.0 license. Customers given the source code prior to release will be asked to sign an NDA restricting them from publicly releasing the source until after the formal release. Think Big will offer paid commercial support for the framework under the open source model.

Architecture

What is the deployment architecture?
Kylo typically runs on a Linux edge node of the Hadoop cluster, either on-premise or in the cloud. Kylo integrates with Apache NiFi, which can be co-located or on separate edge machines. Kylo currently requires Postgres or MySQL for a metadata repository. It requires Java 8 or later and has been tested on several major generations of Cloudera and Hortonworks Hadoop distributions and Apache Spark 1.5.x+.
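These version floors can be checked mechanically before installation. A minimal sketch of such a preflight check; the component names and the "installed" versions below are illustrative values, not read from a real environment:

```python
def meets_minimum(version: str, minimum: str) -> bool:
    """Compare dotted version strings numerically, e.g. '1.8.0' satisfies '1.8'."""
    parse = lambda v: [int(p) for p in v.split(".") if p.isdigit()]
    return parse(version) >= parse(minimum)

# Minimums taken from the prerequisites in this FAQ; installed values are examples.
requirements = {"java": "1.8", "hadoop": "2.4", "spark": "1.5"}
installed = {"java": "1.8.0", "hadoop": "2.7.3", "spark": "1.6.2"}

for name, minimum in requirements.items():
    status = "OK" if meets_minimum(installed[name], minimum) else "TOO OLD"
    print(f"{name}: {status}")
```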
It is installed via RPM.

Are there any pre-requisites for the Kylo installation and setup?
- Redhat/GNU/Linux distributions
- RPM (for install)
- Java 1.8 (or greater)
- Hadoop 2.4+

- Spark 1.5.x+
- Apache NiFi 0.5+ (or Hortonworks DataFlow)
- Hive
- MySQL

Metadata

What type of metadata does Kylo capture?
Kylo captures all business and technical (e.g. schema) metadata defined during the creation of feeds and categories. Kylo captures lineage as relationships between feeds and automatically captures all operational metadata generated during a pipeline. Kylo captures job and feed performance metadata and SLA metrics, and also generates data profile statistics, which act as metadata. Version metadata and feed configuration changes are also captured.

How does Kylo support metadata exchange with 3rd party metadata servers?
Kylo's metadata server has REST APIs which can be used for metadata exchange. Kylo does not have a single API call to export everything, so one would have to be written in the integration layer, or added as a new API via customization work.

How does Kylo deal with custom data formats?
The Kylo team is actively working on making the entire schema discovery mechanism a pluggable component, so future data formats can be supported as plug-ins. This also includes the ability to supply a schema and the business glossary as a definition file during feed creation. The advantage of this approach is that it can leverage existing metadata.

What is the metadata server?
A key part of Kylo's architecture relies on the open-source JBoss ModeShape framework, which allows for dynamic schemas. This gives the business the ability to extend entities with business metadata. Its features include:
- Dynamic schemas - extensible support for extending the schema with custom business metadata in the field
- Versioning - ability to track changes to metadata over time
- Text search - flexible searching of the metastore
- Portability - can run on SQL and NoSQL databases

How extensible is Kylo's metadata model?

Kylo's metadata model is very extensible due to the use of ModeShape (see above). The Kylo application allows an administrator to define standard business metadata that users will be prompted to enter when creating feeds and categories. The configuration can be set up so that all feeds in a particular category collect the same type of business metadata. This configuration is entirely UI-driven.

Is business-related data captured, or is it all operational metadata?
Yes, see above. Business metadata fields can be defined by the customer and will appear in the UI during the feed setup process.

Does Kylo's metadata server provide APIs?
Yes, Kylo provides REST APIs documented using Swagger.

Does Kylo provide a visual lineage?
Not today. Kylo's API allows users to produce a lineage graph via JSON but does not visualize it (yet). The Kylo metadata server has REST APIs which could allow a pipeline designer to supplement Kylo's lineage with additional metadata, providing a much finer-grained capability. Additionally, the REST APIs can be used to record metadata that originated in 3rd party metadata repositories.

What type of process metadata does Kylo capture?
Kylo captures information on the status of feeds: how long a run took, when it started and finished, any errors, etc. Kylo captures operational metadata at each step, which can include record counts and similar measures, depending on the type of step.

What type of data or record lineage is captured?
Kylo tracks lineage as relationships between feeds. A feed in Kylo represents a significant unit of data movement between source(s) and sink (e.g. an ingest, a transformation pipeline, or an export of data), but it does not imply a particular technology, since transformations can occur in Spark, Hive, Pig, shell scripts, or even 3rd party tools like Informatica, Talend, etc. At Think Big we believe that feed lineage has advantages over the bottom-up approach that tools like Cloudera Navigator (object lineage) provide.

A feed is enriched with business data, Service Level Agreements, job history, and technical metadata about any sources and sinks it uses, as well as operational metadata about datasets. When tracing lineage, Kylo is capable of providing a much more relatable representation of dependencies (either forwards or backwards through the chain).

Does Kylo track object-level lineage (table, attribute)?
Kylo does not automatically capture metadata for each transform at the lowest level, nor does it currently perform impact analysis on table structure changes. Object lineage is possible through tools such as Cloudera Navigator or Atlas, which can be used as a supplement to Kylo. Keep in mind that these tools have blind spots: they are limited to certain technologies like Hive or Impala. If a transform occurs in Spark, they will not be able to trace it. These tools also do not perform automatic impact analysis.

Why is lineage automatically tracked between feeds and not table objects?

In a traditional EDW/RDBMS solution, a table is the de facto storage unit, and SQL primitives (filter, join, union, etc.) can fully represent all transforms. In Hadoop one must consider non-traditional concepts such as streams, queues, NoSQL/HBase, flat files, external tables with HDFS, Spark/Pig jobs, MapReduce, Python, etc. NiFi has 150 existing connectors to these different technologies and transforms, and Kylo specifically allows a designer to use all of these capabilities. The downside is that there is no reliable mechanism to automatically capture object-level lineage through all the potential sources, sinks, and processes that could come into play.

Atlas and Navigator ignore this reality and only track transforms between Hive/Impala tables via HQL. These two tools are constrained to tracking lineage for Hive transactions. This works just fine until the introduction of a source outside of Hive or an unsupported transformation technology (e.g. Spark, Pig), at which point your lineage is broken.

A feed in Kylo's metadata model is a 1st-class entity representing a meaningful movement of data. Feeds generally process data between source(s) and sink(s); an example is an ingest or wrangle job. The internals of a feed can involve very complex steps, and the feed abstraction makes those messy details a black box. The beauty of a feed is that it is an incredibly enriched object for communicating:
- Business metadata: descriptions of the feed's purpose, as well as any other business metadata specified by the creator
- Intra-feed lineage: all job executions, steps, and operational metadata, including profile statistics (operational metadata includes source files, counts, etc.)
- DAG: Kylo can provide access to the full pipeline in human-readable form (i.e. the NiFi flow)
- Service Level Agreement and its performance over time
- Technical metadata, such as any tables created, their schemas, and validation and cleansing rules

Finally, and most importantly for lineage, a feed can declare a dependency on other feed(s). Currently this can be declared through Kylo's UI via the precondition capability. This dependency relationship can be n-deep and n-wide, and can then be queried (forward or backward) through the REST API. This allows Kylo to understand lineage from the perspective of chains of feeds, each with its associated treasure trove of meaningful metadata.

Is there a way to start from a table object and understand its lineage?
Yes. If a table is created by a feed, it is possible to navigate from the table to its parent feed, to dependent feed(s), and on to their associated tables. The metadata relationship is:
1. Feed_B explicitly has a dependency on Feed_A. Navigate: Feed_A <- (depends) Feed_B
2. Feed_A writes to Table_A, Feed_B writes to Table_B. Navigate: Feed_A (sink: table_a) <- (depends) Feed_B (sink: table_b)

Can Kylo capture enhanced lineage using its metadata model if a customer wants a more explicit relationship between sources, sinks, and processes?
Yes, this is possible using the REST API. The approach rests with the designer role: the designer can create a NiFi model that explicitly updates the metadata repository to create detailed relationships with the deep knowledge he or she has. It is extra up-front effort but provides total flexibility. Think Big R&D can provide examples of using the REST API for this effort. This includes using the REST API to document external processes, for example transforms and flows outside of Kylo's purview (e.g. Informatica, Bteq, Talend, ...)
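The feed-to-feed and table-to-feed navigation described above can be sketched as a small graph walk. The feed names, the JSON shape, and the "dependsOn"/"sink" fields below are illustrative assumptions about what a lineage query might return, not Kylo's actual API response format:

```python
import json

# Illustrative payload: each feed lists the feeds it depends on and its sink table.
payload = json.loads("""
{
  "feed_b": {"dependsOn": ["feed_a"], "sink": "table_b"},
  "feed_a": {"dependsOn": [], "sink": "table_a"}
}
""")

def upstream_chain(feeds, name):
    """Walk 'dependsOn' links backwards from a feed, depth-first (n-deep, n-wide)."""
    chain = []
    for dep in feeds[name]["dependsOn"]:
        chain.append(dep)
        chain.extend(upstream_chain(feeds, dep))
    return chain

def table_lineage(feeds, table):
    """Start from a table, find the feed that writes it, then list upstream tables."""
    owner = next(f for f, v in feeds.items() if v["sink"] == table)
    return [feeds[f]["sink"] for f in upstream_chain(feeds, owner)]

print(upstream_chain(payload, "feed_b"))   # ['feed_a']
print(table_lineage(payload, "table_b"))   # ['table_a']
```

This mirrors the two-step navigation in the answer above: Table_B leads to Feed_B, Feed_B's precondition leads to Feed_A, and Feed_A's sink leads to Table_A.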

Development Lifecycle

What's the development lifecycle process using Kylo?
Pipelines developed with Apache NiFi by IT designers can be built in one environment, then imported into UAT and production after testing. Thus the production NiFi environment would typically be limited to an administrator. Once the NiFi template is registered with Kylo's system, a business analyst can configure new feeds from it through the guided user interface.

Does Kylo support an approval process to move feeds into production?
Kylo, using Apache NiFi, does not require a restart to deploy new pipelines. By locking down production NiFi access, users can be restricted from creating new types of pipelines without a formal approval process. The Kylo user interface does not yet support authorization, roles, etc.

Suppose a client has over 100 source systems and over 10 thousand tables to be ingested into Hadoop. What's the best way to configure data feeds for them in Kylo? One by one?
One could theoretically write scripts that use Kylo's APIs to generate those feeds. Kylo does not currently have a utility to do it.

Tool Comparisons

Is Kylo similar to Cloudera Navigator or Apache Atlas?
Navigator is a governance tool that comes as part of the Cloudera Enterprise license. Among other features, it provides data lineage of Hive SQL queries. This is useful but only provides part of the picture. Kylo as a framework is really the foundation of an entire solution:
- Captures both business and operational metadata
- Tracks lineage at the feed level (much more useful)
- Provides IT Operations with a useful dashboard, plus the ability to track and enforce Service Level Agreements, performance metrics, etc.

How does Kylo compare to traditional ETL tools like Talend, Informatica, or DataStage?
Many ETL tools are focused on SQL transformations using their own technology cluster. Hadoop is really ELT (extract and load raw data, then transform). Typically, a data-warehouse-style transformation targets a relational schema such as a star or snowflake; in Hadoop, the target is another flat, denormalized structure. Kylo provides a user interface for an end user to configure new data feeds, including schema, security, validation, cleansing, etc. Kylo's wrangling feature provides the ability to perform complex visual data transformations using Spark as the engine. Kylo can theoretically support any transformation technology that Hadoop supports. Potentially, 3rd party technologies such as Talend can be orchestrated via NiFi, leveraging those technologies as well.

How does Kylo compare with Teradata Listener?

Teradata Listener is a technology for self-service data ingest. Listener reduces complexity for end users (such as application developers or marketing intelligence) and for IT by providing a single platform to deploy and manage an end-user-specified ingestion and distribution model, significantly reducing deployment time and cost of ownership. Kylo, by contrast, is a solutions framework for delivering data lakes on Hadoop and Spark. It performs ELT, etc., with UI modules for IT Operations, data analysts, and data scientists.

Scheduler

What is the best way to schedule job priorities in Kylo?
Typically scheduling is performed through the built-in scheduler. There are some advanced techniques in NiFi that allow further prioritization for shared resources.

Can Kylo support complicated ETL scheduling?
Kylo supports Cron, timer-based, or event-based scheduling using rules. Cron is very flexible.

What's the difference between the timer and Cron schedule strategies?
Timer is for a fixed interval (i.e. every 5 minutes or every 10 seconds). Cron can be configured to do that as well, but can also handle more complex cases such as "every Tuesday at 8 AM and 4 PM".

Does Kylo support a message-triggered schedule strategy?
Yes, Kylo can absolutely support message-triggered schedule strategies. This is merely a modification to Kylo's generic ingest template.

Does Kylo support chaining feeds (i.e. one data feed consumed by another data feed)?
Yes, Kylo supports event-based triggering of feeds. Start by defining rules that determine when to run a feed, such as "run when data has been processed by feed A and feed B, and wait up to an hour before running anyway". Kylo supports everything from simple rules up to very complicated rules requiring use of its API.

Security

Does Kylo have a roles, users, and privileges management function?
Kylo uses Spring Security. As such, it can integrate with Active Directory, LDAP, or most likely any authentication provider.

Kylo's Operations Dashboard does not currently support roles, as it is typically oriented to a single role (IT Operations). Authorization could be added in the future.

How does the incremental loading strategy of a data feed work?

Kylo supports a simple incremental extract component. Kylo maintains a high-water mark for each load using a date field in the source record. Kylo can further be configured with a backoff or overlap to ensure that it does not miss records. At this time there isn't CDC tool integration with Kylo.

When creating a data feed for a relational database, how should one source the database's schema?
Kylo introspects the source schema and exposes it through its user interface for users to configure feeds.

What kinds of databases can be supported in Kylo?
Kylo stores metadata and job history in MySQL or Postgres. For sourcing data, Kylo can theoretically support any database that provides a JDBC driver.

Does Kylo support creating a Hive table automatically after the source data is put into Hadoop?
Kylo has a stepper wizard that can be used to configure feeds and can define a table schema in Hive. The stepper infers the schema by looking at a sample file or at the database source. It automatically creates a Hive table in the first run of the feed.

Where is the pipeline configuration data stored? In a database or the file system?
Kylo provides a user interface to configure pipelines or feeds. The metadata is stored in a metadata server backed by MySQL (or alternatively Postgres).

How can a user rerun a feed? What are the steps to restore the original state before data ingest?
One exciting feature of Kylo is the ability for NiFi to replay a failed step. This can be particularly useful for secondary steps of a pipeline (e.g. a flow succeeds in processing data into Hive but fails to archive into S3); it may be possible to re-execute just the S3 portion without a full re-execution of the data. In general, the engineers who built Kylo strive for idempotent behavior, so any step and its data can be reprocessed without duplication.

For more information on Kylo visit or follow Think Big on Twitter for the latest Kylo program updates.
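The high-water-mark-with-overlap pattern described above can be sketched in a few lines. This uses SQLite as a stand-in source; the table name, date column, and overlap window are hypothetical, and a real feed would merge or dedupe the rows re-read inside the overlap:

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER, updated_at TEXT)")
conn.executemany("INSERT INTO src VALUES (?, ?)", [
    (1, "2017-01-01 00:00:00"),
    (2, "2017-01-02 00:00:00"),
    (3, "2017-01-03 00:00:00"),
])

def incremental_extract(conn, watermark: str, overlap: timedelta):
    """Fetch rows newer than the watermark minus an overlap, so late records aren't missed."""
    cutoff = (datetime.fromisoformat(watermark) - overlap).isoformat(sep=" ")
    rows = conn.execute(
        "SELECT id, updated_at FROM src WHERE updated_at > ? ORDER BY updated_at",
        (cutoff,),
    ).fetchall()
    # Advance the watermark to the newest date seen; downstream dedupe absorbs re-reads.
    new_watermark = max((r[1] for r in rows), default=watermark)
    return rows, new_watermark

rows, wm = incremental_extract(conn, "2017-01-02 00:00:00", timedelta(hours=1))
print(rows)  # rows 2 and 3: row 2 falls inside the one-hour overlap window
print(wm)    # '2017-01-03 00:00:00'
```

ISO-8601 date strings compare correctly as text, which is why the SQL comparison on the TEXT column is safe here.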


More information

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture Big Data Syllabus Hadoop YARN Setup Programming in YARN framework j Understanding big data and Hadoop Big Data Limitations and Solutions of existing Data Analytics Architecture Hadoop Features Hadoop Ecosystem

More information

IBM Data Replication for Big Data

IBM Data Replication for Big Data IBM Data Replication for Big Data Highlights Stream changes in realtime in Hadoop or Kafka data lakes or hubs Provide agility to data in data warehouses and data lakes Achieve minimum impact on source

More information

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training:: Module Title Duration : Cloudera Data Analyst Training : 4 days Overview Take your knowledge to the next level Cloudera University s four-day data analyst training course will teach you to apply traditional

More information

Big Data Hadoop Course Content

Big Data Hadoop Course Content Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux

More information

docs.hortonworks.com

docs.hortonworks.com docs.hortonworks.com : Getting Started Guide Copyright 2012, 2014 Hortonworks, Inc. Some rights reserved. The, powered by Apache Hadoop, is a massively scalable and 100% open source platform for storing,

More information

StreamSets Control Hub Installation Guide

StreamSets Control Hub Installation Guide StreamSets Control Hub Installation Guide Version 3.2.1 2018, StreamSets, Inc. All rights reserved. Table of Contents 2 Table of Contents Chapter 1: What's New...1 What's New in 3.2.1... 2 What's New in

More information

Schema Registry Overview

Schema Registry Overview 3 Date of Publish: 2018-11-15 https://docs.hortonworks.com/ Contents...3 Examples of Interacting with Schema Registry...4 Schema Registry Use Cases...6 Use Case 1: Registering and Querying a Schema for

More information

Big Data with Hadoop Ecosystem

Big Data with Hadoop Ecosystem Diógenes Pires Big Data with Hadoop Ecosystem Hands-on (HBase, MySql and Hive + Power BI) Internet Live http://www.internetlivestats.com/ Introduction Business Intelligence Business Intelligence Process

More information

Increase Value from Big Data with Real-Time Data Integration and Streaming Analytics

Increase Value from Big Data with Real-Time Data Integration and Streaming Analytics Increase Value from Big Data with Real-Time Data Integration and Streaming Analytics Cy Erbay Senior Director Striim Executive Summary Striim is Uniquely Qualified to Solve the Challenges of Real-Time

More information

Installing an HDF cluster

Installing an HDF cluster 3 Installing an HDF cluster Date of Publish: 2018-08-13 http://docs.hortonworks.com Contents Installing Ambari...3 Installing Databases...3 Installing MySQL... 3 Configuring SAM and Schema Registry Metadata

More information

Certified Big Data and Hadoop Course Curriculum

Certified Big Data and Hadoop Course Curriculum Certified Big Data and Hadoop Course Curriculum The Certified Big Data and Hadoop course by DataFlair is a perfect blend of in-depth theoretical knowledge and strong practical skills via implementation

More information

Migrate from Netezza Workload Migration

Migrate from Netezza Workload Migration Migrate from Netezza Automated Big Data Open Netezza Source Workload Migration CASE SOLUTION STUDY BRIEF Automated Netezza Workload Migration To achieve greater scalability and tighter integration with

More information

Hortonworks and The Internet of Things

Hortonworks and The Internet of Things Hortonworks and The Internet of Things Dr. Bernhard Walter Solutions Engineer About Hortonworks Customer Momentum ~700 customers (as of November 4, 2015) 152 customers added in Q3 2015 Publicly traded

More information

Hortonworks Data Platform

Hortonworks Data Platform Data Governance () docs.hortonworks.com : Data Governance Copyright 2012-2017 Hortonworks, Inc. Some rights reserved. The, powered by Apache Hadoop, is a massively scalable and 100% open source platform

More information

Migrate from Netezza Workload Migration

Migrate from Netezza Workload Migration Migrate from Netezza Automated Big Data Open Netezza Source Workload Migration CASE SOLUTION STUDY BRIEF Automated Netezza Workload Migration To achieve greater scalability and tighter integration with

More information

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks Asanka Padmakumara ETL 2.0: Data Engineering with Azure Databricks Who am I? Asanka Padmakumara Business Intelligence Consultant, More than 8 years in BI and Data Warehousing A regular speaker in data

More information

microsoft

microsoft 70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series

More information

Oracle Big Data Discovery

Oracle Big Data Discovery Oracle Big Data Discovery Turning Data into Business Value Harald Erb Oracle Business Analytics & Big Data 1 Safe Harbor Statement The following is intended to outline our general product direction. It

More information

BIG DATA COURSE CONTENT

BIG DATA COURSE CONTENT BIG DATA COURSE CONTENT [I] Get Started with Big Data Microsoft Professional Orientation: Big Data Duration: 12 hrs Course Content: Introduction Course Introduction Data Fundamentals Introduction to Data

More information

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem Zohar Elkayam www.realdbamagic.com Twitter: @realmgic Who am I? Zohar Elkayam, CTO at Brillix Programmer, DBA, team leader, database trainer,

More information

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou The Hadoop Ecosystem EECS 4415 Big Data Systems Tilemachos Pechlivanoglou tipech@eecs.yorku.ca A lot of tools designed to work with Hadoop 2 HDFS, MapReduce Hadoop Distributed File System Core Hadoop component

More information

Data sources. Gartner, The State of Data Warehousing in 2012

Data sources. Gartner, The State of Data Warehousing in 2012 data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing. Gartner, The State of Data Warehousing

More information

Hadoop. Introduction / Overview

Hadoop. Introduction / Overview Hadoop Introduction / Overview Preface We will use these PowerPoint slides to guide us through our topic. Expect 15 minute segments of lecture Expect 1-4 hour lab segments Expect minimal pretty pictures

More information

Enterprise Data Catalog Fixed Limitations ( Update 1)

Enterprise Data Catalog Fixed Limitations ( Update 1) Informatica LLC Enterprise Data Catalog 10.2.1 Update 1 Release Notes September 2018 Copyright Informatica LLC 2015, 2018 Contents Enterprise Data Catalog Fixed Limitations (10.2.1 Update 1)... 1 Enterprise

More information

Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk

Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk Raanan Dagan and Rohit Pujari September 25, 2017 Washington, DC Forward-Looking Statements During the course of this presentation, we may

More information

Alexander Klein. #SQLSatDenmark. ETL meets Azure

Alexander Klein. #SQLSatDenmark. ETL meets Azure Alexander Klein ETL meets Azure BIG Thanks to SQLSat Denmark sponsors Save the date for exiting upcoming events PASS Camp 2017 Main Camp 05.12. 07.12.2017 (04.12. Kick-Off abends) Lufthansa Training &

More information

An Oracle White Paper October 12 th, Oracle Metadata Management v New Features Overview

An Oracle White Paper October 12 th, Oracle Metadata Management v New Features Overview An Oracle White Paper October 12 th, 2018 Oracle Metadata Management v12.2.1.3.0 Disclaimer This document is for informational purposes. It is not a commitment to deliver any material, code, or functionality,

More information

The future of Subsurface Data Management? Building a Data Science Lab Data Lake Jane McConnell, Practice Partner Oil and Gas, Teradata DEJ KL, 3

The future of Subsurface Data Management? Building a Data Science Lab Data Lake Jane McConnell, Practice Partner Oil and Gas, Teradata DEJ KL, 3 The future of Subsurface Data Management? Building a Data Science Lab Data Lake Jane McConnell, Practice Partner Oil and Gas, Teradata DEJ KL, 3 October 2017 Analytics and AI is gaining ground in Subsurface

More information

Big Data Hadoop Stack

Big Data Hadoop Stack Big Data Hadoop Stack Lecture #1 Hadoop Beginnings What is Hadoop? Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware

More information

Data Virtualization Implementation Methodology and Best Practices

Data Virtualization Implementation Methodology and Best Practices White Paper Data Virtualization Implementation Methodology and Best Practices INTRODUCTION Cisco s proven Data Virtualization Implementation Methodology and Best Practices is compiled from our successful

More information

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI) CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI) The Certificate in Software Development Life Cycle in BIGDATA, Business Intelligence and Tableau program

More information

Data sources. Gartner, The State of Data Warehousing in 2012

Data sources. Gartner, The State of Data Warehousing in 2012 data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing. Gartner, The State of Data Warehousing

More information

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development:: Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized

More information

Administration 1. DLM Administration. Date of Publish:

Administration 1. DLM Administration. Date of Publish: 1 DLM Administration Date of Publish: 2018-05-18 http://docs.hortonworks.com Contents Replication concepts... 3 HDFS cloud replication...3 Hive cloud replication... 3 Cloud replication guidelines and considerations...4

More information

Datameer for Data Preparation:

Datameer for Data Preparation: Datameer for Data Preparation: Explore, Profile, Blend, Cleanse, Enrich, Share, Operationalize DATAMEER FOR DATA PREPARATION: EXPLORE, PROFILE, BLEND, CLEANSE, ENRICH, SHARE, OPERATIONALIZE Datameer Datameer

More information

iway iway Big Data Integrator New Features Bulletin and Release Notes Version DN

iway iway Big Data Integrator New Features Bulletin and Release Notes Version DN iway iway Big Data Integrator New Features Bulletin and Release Notes Version 1.5.0 DN3502232.1216 Active Technologies, EDA, EDA/SQL, FIDEL, FOCUS, Information Builders, the Information Builders logo,

More information

Spotfire Advanced Data Services. Lunch & Learn Tuesday, 21 November 2017

Spotfire Advanced Data Services. Lunch & Learn Tuesday, 21 November 2017 Spotfire Advanced Data Services Lunch & Learn Tuesday, 21 November 2017 CONFIDENTIALITY The following information is confidential information of TIBCO Software Inc. Use, duplication, transmission, or republication

More information

Oracle Big Data. A NA LYT ICS A ND MA NAG E MENT.

Oracle Big Data. A NA LYT ICS A ND MA NAG E MENT. Oracle Big Data. A NALYTICS A ND MANAG E MENT. Oracle Big Data: Redundância. Compatível com ecossistema Hadoop, HIVE, HBASE, SPARK. Integração com Cloudera Manager. Possibilidade de Utilização da Linguagem

More information

Improving Your Business with Oracle Data Integration See How Oracle Enterprise Metadata Management Can Help You

Improving Your Business with Oracle Data Integration See How Oracle Enterprise Metadata Management Can Help You Improving Your Business with Oracle Data Integration See How Oracle Enterprise Metadata Management Can Help You Özgür Yiğit Oracle Data Integration, Senior Manager, ECEMEA Safe Harbor Statement The following

More information

Talend Big Data Sandbox. Big Data Insights Cookbook

Talend Big Data Sandbox. Big Data Insights Cookbook Overview Pre-requisites Setup & Configuration Hadoop Distribution Download Demo (Scenario) Overview Pre-requisites Setup & Configuration Hadoop Distribution Demo (Scenario) About this cookbook What is

More information

The Technology of the Business Data Lake. Appendix

The Technology of the Business Data Lake. Appendix The Technology of the Business Data Lake Appendix Pivotal data products Term Greenplum Database GemFire Pivotal HD Spring XD Pivotal Data Dispatch Pivotal Analytics Description A massively parallel platform

More information

GDPR Data Discovery and Reporting

GDPR Data Discovery and Reporting GDPR Data Discovery and Reporting PRODUCT OVERVIEW The GDPR Challenge The EU General Data Protection Regulation (GDPR) is a regulation mainly concerned with how data is captured and retained, and how organizations

More information

FEATURES BENEFITS SUPPORTED PLATFORMS. Reduce costs associated with testing data projects. Expedite time to market

FEATURES BENEFITS SUPPORTED PLATFORMS. Reduce costs associated with testing data projects. Expedite time to market E TL VALIDATOR DATA SHEET FEATURES BENEFITS SUPPORTED PLATFORMS ETL Testing Automation Data Quality Testing Flat File Testing Big Data Testing Data Integration Testing Wizard Based Test Creation No Custom

More information

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com HI, I M SUJEE MANIYAM Founder / Principal @ ElephantScale

More information

exam. Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0

exam.   Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0 70-775.exam Number: 70-775 Passing Score: 800 Time Limit: 120 min File Version: 1.0 Microsoft 70-775 Perform Data Engineering on Microsoft Azure HDInsight Version 1.0 Exam A QUESTION 1 You use YARN to

More information

Hortonworks Data Platform

Hortonworks Data Platform Data Governance () docs.hortonworks.com : Data Governance Copyright 2012-2017 Hortonworks, Inc. Some rights reserved. The, powered by Apache Hadoop, is a massively scalable and 100% open source platform

More information

Prototyping Data Intensive Apps: TrendingTopics.org

Prototyping Data Intensive Apps: TrendingTopics.org Prototyping Data Intensive Apps: TrendingTopics.org Pete Skomoroch Research Scientist at LinkedIn Consultant at Data Wrangling @peteskomoroch 09/29/09 1 Talk Outline TrendingTopics Overview Wikipedia Page

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation)

HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation) HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation) Introduction to BIGDATA and HADOOP What is Big Data? What is Hadoop? Relation between Big

More information

iway iway Big Data Integrator New Features Bulletin and Release Notes Version DN

iway iway Big Data Integrator New Features Bulletin and Release Notes Version DN iway iway Big Data Integrator New Features Bulletin and Release Notes Version 1.5.1 DN3502232.0517 Active Technologies, EDA, EDA/SQL, FIDEL, FOCUS, Information Builders, the Information Builders logo,

More information

Agile Data Management Challenges in Enterprise Big Data Landscape

Agile Data Management Challenges in Enterprise Big Data Landscape Agile Data Management Challenges in Enterprise Big Data Landscape Eric Simon, SAP Big Data October, 2017 1 Evolution Towards Enterprise Big Data Landscape administrator Data analyst Athena Redshift #123

More information

iway iway Big Data Integrator User s Guide Version DN

iway iway Big Data Integrator User s Guide Version DN iway iway Big Data Integrator User s Guide Version 1.5.0 DN3502221.1216 Active Technologies, EDA, EDA/SQL, FIDEL, FOCUS, Information Builders, the Information Builders logo, iway, iway Software, Parlay,

More information

Certified Big Data Hadoop and Spark Scala Course Curriculum

Certified Big Data Hadoop and Spark Scala Course Curriculum Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills

More information

Own change. TECHNICAL WHITE PAPER Data Integration With REST API

Own change. TECHNICAL WHITE PAPER Data Integration With REST API TECHNICAL WHITE PAPER Data Integration With REST API Real-time or near real-time accurate and fast retrieval of key metrics is a critical need for an organization. Many times, valuable data are stored

More information

Hortonworks Data Platform

Hortonworks Data Platform Data Governance () docs.hortonworks.com : Data Governance Copyright 2012-2017 Hortonworks, Inc. Some rights reserved. The, powered by Apache Hadoop, is a massively scalable and 100% open source platform

More information

Unifying Big Data Workloads in Apache Spark

Unifying Big Data Workloads in Apache Spark Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache

More information

Informatica Cloud Data Integration Spring 2018 April. What's New

Informatica Cloud Data Integration Spring 2018 April. What's New Informatica Cloud Data Integration Spring 2018 April What's New Informatica Cloud Data Integration What's New Spring 2018 April April 2018 Copyright Informatica LLC 2016, 2018 This software and documentation

More information