Deploying Custom Step Plugins for Pentaho MapReduce

Contents

Overview
Before You Begin
Pentaho MapReduce Configuration
Plugin Properties Defined
The Copy Process
Configuration Changes
Distributing Custom OSGi Plugins to PMR
Manual Copy
Modify the Archive File
Related Information
Finalization Checklist

Overview

Pentaho MapReduce (PMR) allows ETL developers to design and execute transformations that run in Hadoop MapReduce. If one or more custom steps are used in the mapper, reducer, or combiner transformations, additional configuration is needed to make sure all dependencies are available in the Hadoop environment.

The intended audience is Pentaho Data Integration (PDI) developers or administrators configuring PMR on a Hadoop cluster.

This document discusses these topics generally; the specific versions covered are:

Software: Pentaho
Version(s): 6.x, 7.x, 8.x

The Components Reference in the Pentaho documentation has a complete list of supported software and hardware.

Before You Begin

This document assumes that you have knowledge of Pentaho and that you already have connectivity to a Hadoop cluster. More information about related topics outside of this document can be found in the Pentaho MapReduce documentation.

Pentaho MapReduce Configuration

The big data plugin adds PMR capabilities to PDI. Before PMR can run PDI transformations in MapReduce, many code libraries must be copied to the Hadoop cluster. When a PMR job is invoked, PDI checks whether the installation folder exists on the cluster and copies the libraries if it does not.

You can find details on these topics in the following sections:

Plugin Properties Defined
The Copy Process
Distributing Custom OSGi Plugins to PMR

Plugin Properties Defined

These properties in the file $PENTAHO_HOME/data-integration/plugins/pentaho-big-data-plugin/plugin.properties define what is transferred to HDFS when the folder does not exist:

Table 1: Plugin Properties and Descriptions

pmr.kettle.installation.id: The installation ID, allowing multiple versions or configurations of PDI to be used with the same cluster. Defaults to the version of PDI being used when left blank.

pmr.kettle.dfs.install.dir: The directory in the Hadoop filesystem where the Pentaho libraries should be rooted. Defaults to /opt/pentaho/mapreduce.

pmr.libraries.archive.file: A zip file containing core Kettle components and required libraries for executing PDI from a Hadoop map, combine, or reduce context.

pmr.kettle.additional.plugins: A comma-separated list of folders located in $PENTAHO_HOME/data-integration/plugins to copy to the install directory.

The Copy Process

Whenever a PMR job is executed, PDI checks whether the ${pmr.kettle.dfs.install.dir}/${pmr.kettle.installation.id} folder exists on the Hadoop cluster. If the folder does not exist, it is created on the destination file system, and the contents of the ${pmr.libraries.archive.file} zip file are unzipped to that location. In addition, any of the traditional plugin directories listed in ${pmr.kettle.additional.plugins} are copied to ${pmr.kettle.dfs.install.dir}/${pmr.kettle.installation.id}/plugins/.
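As an illustration, the relevant plugin.properties entries might look like the following. This is a minimal sketch with example values only; the installation ID, archive name, and plugin folder name shown here are assumptions, not defaults of any particular release:

# Illustrative plugin.properties excerpt -- adjust values to your installation
pmr.kettle.installation.id=7.1
pmr.kettle.dfs.install.dir=/opt/pentaho/mapreduce
pmr.libraries.archive.file=pentaho-mapreduce-libraries.zip
# "cust-step-plugin" is a hypothetical legacy plugin folder under data-integration/plugins
pmr.kettle.additional.plugins=cust-step-plugin

With these example values, the copy process targets /opt/pentaho/mapreduce/7.1 on the cluster and copies the cust-step-plugin folder into /opt/pentaho/mapreduce/7.1/plugins/.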

Configuration Changes

If changes are made to the plugin.properties file, you must manually remove the contents of ${pmr.kettle.dfs.install.dir}/${pmr.kettle.installation.id} on the Hadoop file system for the changes to take effect. With the default options, this can be achieved by running:

hadoop fs -rm -R -skipTrash /opt/pentaho/mapreduce/*

The next time a PMR job runs, the folder will not be found, and the copy process will place the libraries in the Hadoop file system as configured.

Distributing Custom OSGi Plugins to PMR

An OSGi-based plugin system was added in the Pentaho 6.0 release. Plugins can be deployed and managed with Apache Karaf in their own container, instead of the legacy class loader plugin architecture. There are two methods for making sure OSGi plugins are available to Kettle running in a MapReduce context. With either method, any changes will need to be re-applied after a patch or a minor or major update of the Pentaho product.

Manual Copy

The first method can be done without external tools, but it has the drawback that the changes are not persisted if the copy process is repeated after a configuration change.

Plugins are usually installed to Karaf by placing a bundled JAR file in the $PENTAHO_HOME/data-integration/system/karaf/deploy folder. Some plugin installations may also require JAR file replacements or XML feature file edits in the system/karaf/system folder.

The copy process creates the /opt/pentaho/mapreduce/<version>/ folder in the Hadoop file system, which has a system/karaf folder under it. The bundled OSGi JAR dependencies can be added by modifying system/karaf in the Hadoop file system with the same process used for the install on the client. With pmr.kettle.dfs.install.dir = /opt/pentaho/mapreduce and pmr.kettle.installation.id = 7.1, the following command adds a custom OSGi bundle for use by PMR transformation tasks:

hadoop fs -put cust-plugin-bundle.jar /opt/pentaho/mapreduce/7.1/system/karaf/deploy
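To confirm that a manually copied bundle landed where Kettle expects it, a quick listing of the deploy folder on the cluster can help; the path below assumes the same example installation ID of 7.1:

hadoop fs -ls /opt/pentaho/mapreduce/7.1/system/karaf/deploy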

Modify the Archive File

This method requires a zip archive manipulation tool, but the changes are automatically re-applied if the copy process is repeated. As mentioned earlier, the ${pmr.libraries.archive.file} archive is extracted to the Hadoop file system. By manipulating the archive file, the plugin installation is deployed during that extraction process when a PDI MapReduce job runs.

1. Make a backup of the archive file, which should be located at $PENTAHO_HOME/data-integration/plugins/pentaho-big-data-plugin/pentaho-mapreduce-libraries.zip.
2. Open the original file in your zip manipulation tool and apply the changes required by the OSGi plugin installation. Inside the archive there is a system/karaf folder, which contains a deploy folder for bundle installation and a system folder for modifying and replacing existing resources.
3. Execute the process described in the Configuration Changes section of this document to deploy the changes to the server.
4. Run a Pentaho MapReduce job that uses the plugin to confirm the installation was successful.

A minimal command-line sketch of steps 1 and 2 follows the Related Information links below.

Related Information

Here are some links to information that you may find helpful while using this best practices document:

Components Reference
Pentaho MapReduce documentation
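The following sketch illustrates steps 1 and 2 of the Modify the Archive File procedure referenced above. It assumes the Info-ZIP zip utility and reuses the example bundle name cust-plugin-bundle.jar; the staging path is illustrative:

# 1. Back up the original archive
BDP="$PENTAHO_HOME/data-integration/plugins/pentaho-big-data-plugin"
cp "$BDP/pentaho-mapreduce-libraries.zip" "$BDP/pentaho-mapreduce-libraries.zip.bak"

# 2. Stage the bundle at the path it should occupy inside the archive, then add it;
#    zip records the relative path system/karaf/deploy/cust-plugin-bundle.jar
mkdir -p /tmp/pmr-staging/system/karaf/deploy
cp cust-plugin-bundle.jar /tmp/pmr-staging/system/karaf/deploy/
cd /tmp/pmr-staging
zip "$BDP/pentaho-mapreduce-libraries.zip" system/karaf/deploy/cust-plugin-bundle.jar

Steps 3 and 4 are then performed as described above: remove the installed folder on the cluster and run a PMR job that uses the plugin.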

Finalization Checklist

This checklist is designed to be added to any implemented project that uses this collection of best practices, to verify that all items have been considered and reviews have been performed. (Compose specific questions about the topics in the document and put them in the table.)

Name of the Project:
Date of the Review:
Name of the Reviewer:

Item (record a Response and Comments for each):

For legacy plugins, did you add the plugin folder name to the pmr.kettle.additional.plugins property and invoke a configuration update?
Did you use the manual copy method for OSGi plugins?
If yes, did you modify all required files for the plugin installation under the system/karaf folder?
If no, did you make a backup of the pentaho-mapreduce-libraries.zip archive before applying any changes?
If yes, did you open the new archive and confirm the changes were made according to the install?
If yes, did you force the configuration update and run a job to invoke the plugin deployment?