oozie #oozie

Table of Contents

About
Chapter 1: Getting started with oozie
    Remarks
    Versions
    Examples
        Installation or Setup
Chapter 2: Oozie 101
    Examples
        Oozie Architecture
        Oozie Application Deployment
        How to pass configuration with Oozie Proxy Job submission
Chapter 3: Oozie data triggered coordinator
    Introduction
    Remarks
    Examples
        oozie coordinator sample
        oozie workflow sample
        job.properties sample
        shell script sample
        submitting the coordinator job
Credits

About

You can share this PDF with anyone you feel could benefit from it; you can download the latest version from: oozie

It is an unofficial and free oozie ebook created for educational purposes. All the content is extracted from Stack Overflow Documentation, which is written by many hardworking individuals at Stack Overflow. It is neither affiliated with Stack Overflow nor official oozie. The content is released under Creative Commons BY-SA, and the list of contributors to each chapter is provided in the credits section at the end of this book. Images may be copyright of their respective owners unless otherwise specified. All trademarks and registered trademarks are the property of their respective company owners.

Use the content presented in this book at your own risk; it is not guaranteed to be correct nor accurate. Please send your feedback and corrections to info@zzzprojects.com

Chapter 1: Getting started with oozie

Remarks

Oozie is an Apache open source project, originally developed at Yahoo. Oozie is a general-purpose scheduling system for multistage Hadoop jobs.

Oozie allows you to form a logical grouping of relevant Hadoop jobs into an entity called a Workflow. Oozie Workflows are DAGs (Directed Acyclic Graphs) of actions. Oozie provides a way to schedule time- or data-dependent Workflows using an entity called a Coordinator. Further, you can combine related Coordinators into an entity called a Bundle, which can be scheduled on an Oozie server for execution.

Oozie supports most of the Hadoop jobs as Oozie Action Nodes, such as: MapReduce, Java, FileSystem (HDFS operations), Hive, Hive2, Pig, Spark, SSH, Shell, DistCp and Sqoop. It provides decision capability using the Decision Control Node, and parallel execution of jobs using the Fork-Join Control Nodes. It allows users to configure Success/Failure email notifications for a Workflow using the Email action.

Versions

Oozie Version    Release Date
4.3.0            2016-12-02
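To make the Workflow idea concrete, below is a minimal workflow.xml sketch (not part of the original text) that runs two Shell actions in parallel with a Fork-Join pair; the action names, scripts and parameters are placeholders:

<workflow-app name="fork-join-demo" xmlns="uri:oozie:workflow:0.4">
    <start to="fork-node"/>
    <!-- the fork starts both branches in parallel -->
    <fork name="fork-node">
        <path start="shell-a"/>
        <path start="shell-b"/>
    </fork>
    <action name="shell-a">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>task-a.sh</exec>
            <file>task-a.sh</file>
        </shell>
        <ok to="join-node"/>
        <error to="fail"/>
    </action>
    <action name="shell-b">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>task-b.sh</exec>
            <file>task-b.sh</file>
        </shell>
        <ok to="join-node"/>
        <error to="fail"/>
    </action>
    <!-- the join waits for all forked branches before continuing -->
    <join name="join-node" to="end"/>
    <kill name="fail">
        <message>Action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>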

Examples

Installation or Setup

Pre-requisites

This article demonstrates installing oozie-4.3.0 on Hadoop 2.7.3.

1. Java 1.7+
2. Hadoop 2.x (here, 2.7.3)
3. Maven 3+
4. Unix box

Step 1: Dist file

Get the oozie tar.gz file from http://www-eu.apache.org/dist/oozie/4.3.0/ and extract it:

cd $HOME
tar -xvf oozie-4.3.0.tar.gz

Step 2: Build Oozie

cd $HOME/oozie-4.3.0/bin
./mkdistro.sh -DskipTests

Step 3: Server Installation

Copy the built binaries to the home directory as oozie:

cd $HOME
cp -R $HOME/oozie-4.3.0/distro/target/oozie-4.3.0-distro/oozie-4.3.0 .

Step 3.1: libext

Create a libext directory inside the oozie directory:

cd $HOME/oozie
mkdir libext

Note: the ExtJS (2.2+) library is optional (it enables the Oozie web console), but it is not bundled with Oozie because it uses a different license :(

Now you need to put the hadoop jars inside the libext directory, else Oozie will throw the below error in the oozie.log file:

WARN ActionStartXCommand:523 - SERVER[data01.teg.io] USER[hadoop] GROUP[-] TOKEN[] APP[map-reduce-wf] JOB[0000000-161215143751620-oozie-hado-W] ACTION[0000000-161215143751620-oozie-hado-W@mr-node] Error starting action [mr-node]. ErrorType [TRANSIENT], ErrorCode [JA009], Message [JA009: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.]

So, let's put the below jars inside the libext directory:

cp $HADOOP_HOME/share/hadoop/common/*.jar oozie/libext/
cp $HADOOP_HOME/share/hadoop/common/lib/*.jar oozie/libext/
cp $HADOOP_HOME/share/hadoop/hdfs/*.jar oozie/libext/
cp $HADOOP_HOME/share/hadoop/hdfs/lib/*.jar oozie/libext/
cp $HADOOP_HOME/share/hadoop/mapreduce/*.jar oozie/libext/
cp $HADOOP_HOME/share/hadoop/mapreduce/lib/*.jar oozie/libext/
cp $HADOOP_HOME/share/hadoop/yarn/*.jar oozie/libext/
cp $HADOOP_HOME/share/hadoop/yarn/lib/*.jar oozie/libext/

Step 3.2: Oozie Impersonation

To avoid impersonation errors in Oozie, modify core-site.xml like below:

<!-- OOZIE -->
<property>
    <name>hadoop.proxyuser.[oozie_server_user].hosts</name>
    <value>[oozie_server_hostname]</value>
</property>
<property>
    <name>hadoop.proxyuser.[oozie_server_user].groups</name>
    <value>[user_groups_that_allow_impersonation]</value>
</property>

Assuming my oozie user is huser, the host is localhost and the group is hadoop:

<!-- OOZIE -->
<property>
    <name>hadoop.proxyuser.huser.hosts</name>
    <value>localhost</value>
</property>
<property>
    <name>hadoop.proxyuser.huser.groups</name>
    <value>hadoop</value>
</property>

Note: you can use * in all values, in case of confusion.
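An aside that is not in the original text: if HDFS and YARN are already running when core-site.xml is edited, the new proxyuser settings are not picked up automatically. Restarting the daemons works; alternatively, the following refresh commands should apply the change in place:

hdfs dfsadmin -refreshSuperUserGroupsConfiguration
yarn rmadmin -refreshSuperUserGroupsConfiguration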

Step 3.3: Prepare the war

cd $HOME/oozie/bin
./oozie-setup.sh prepare-war

This will create the oozie.war file inside the oozie directory. If this war is used as-is, you may face this error:

ERROR ActionStartXCommand:517 - SERVER[data01.teg.io] USER[hadoop] GROUP[-] TOKEN[] APP[map-reduce-wf] JOB[0000000-161220104605103-oozie-hado-W] ACTION[0000000-161220104605103-oozie-hado-W@mr-node] Error, java.lang.NoSuchFieldError: HADOOP_CLASSPATH

Why? Because the Oozie compilation produced Hadoop 2.6.0 jars even when specifying Hadoop 2.7.3 with the option "-Dhadoop.version=2.7.3".

So, to avoid this error, copy the oozie.war file to a different directory and rebuild it without the bundled hadoop and hive jars:

mkdir $HOME/oozie_war_dir
cp $HOME/oozie/oozie.war $HOME/oozie_war_dir
cd $HOME/oozie_war_dir
jar -xvf oozie.war
rm -f WEB-INF/lib/hadoop-*.jar
rm -f WEB-INF/lib/hive-*.jar
rm oozie.war
jar -cvf oozie.war ./*
cp oozie.war $HOME/oozie/

Then, regenerate the oozie.war binaries for oozie with a prepare-war:

cd $HOME/oozie/bin
./oozie-setup.sh prepare-war

Step 3.4: Create sharelib on HDFS

cd $HOME/oozie/bin
./oozie-setup.sh sharelib create -fs hdfs://localhost:9000

Now, this sharelib setup may give you the below error:

org.apache.oozie.service.ServiceException: E0104: Could not fully initialize service [org.apache.oozie.service.ShareLibService], Not able to cache sharelib. An Admin needs to install the sharelib with oozie-setup.sh and issue the 'oozie admin' CLI command to update the sharelib

To avoid this, modify oozie-site.xml like below:

cd $HOME/oozie
vi conf/oozie-site.xml

Add the below property:

<property>
    <name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
    <value>*=/usr/local/hadoop/etc/hadoop/</value>
</property>

The value should be your $HADOOP_HOME/etc/hadoop, where all the hadoop configuration files are present.

Step 3.5: Create the Oozie DB

cd $HOME/oozie
./bin/ooziedb.sh create -sqlfile oozie.sql -run

Step 3.6: Start the Daemon

To start Oozie as a daemon, use the following command:

./bin/oozied.sh start

To stop:

./bin/oozied.sh stop

Check the logs for errors, if any:

cd $HOME/oozie/logs
tail -100f oozie.log

Use the following command to check the status of Oozie from the command line:

$ ./bin/oozie admin -oozie http://localhost:11000/oozie -status
System mode: NORMAL
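Optionally, and not part of the original text, you can sanity-check that the sharelib actually landed on HDFS; a small sketch assuming the default server URL and that the sharelib was created as the huser user (the HDFS path is an assumption; it normally lives under /user/<user>/share/lib):

./bin/oozie admin -oozie http://localhost:11000/oozie -shareliblist
hadoop fs -ls /user/huser/share/lib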

Step 4: Client Installation

$ cd
$ cp oozie/oozie-client-4.3.0.tar.gz .
$ tar -xvf oozie-client-4.3.0.tar.gz
$ mv oozie-client-4.3.0 oozie-client
$ cd bin

Add $HOME/oozie-client/bin to the PATH variable in the .bashrc file and restart your terminal, or do:

source $HOME/.bashrc

For more details on the setup, you can refer to this URL: https://oozie.apache.org/docs/4.3.0/DG_QuickStart.html

Now you can submit hadoop jobs to oozie from your terminal.

To run an example, you can follow this URL and set up your first example to run: https://oozie.apache.org/docs/4.3.0/DG_Examples.html

You may face the below error while running the map-reduce example from the above URL:

java.io.IOException: java.net.ConnectException: Call From localhost.localdomain/127.0.0.1 to 0.0.0.0:10020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

Solution: start the MapReduce job history server:

cd $HADOOP_HOME/sbin
./mr-jobhistory-daemon.sh start historyserver

Another point to note is about modifying the job.properties file:

namenode=hdfs://localhost:9000
jobtracker=localhost:8032

In your case these can be different; I am using apache hadoop, while you may be using cloudera/hdp/anything.

To run a spark job, I tried running with local[*], yarn-client and yarn-cluster as master, but succeeded with local[*] only.
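For concreteness, here is a hedged sketch of what running the bundled map-reduce example typically looks like, assuming the oozie-examples.tar.gz archive shipped with the distro and the default Oozie URL (paths and the job id are illustrative):

# unpack the examples bundled with the distribution and push them to HDFS
cd $HOME/oozie
tar -xvf oozie-examples.tar.gz
hadoop fs -put examples examples

# submit and start the map-reduce example workflow
oozie job -oozie http://localhost:11000/oozie -config examples/apps/map-reduce/job.properties -run

# check its status with the job id returned by the previous command
oozie job -oozie http://localhost:11000/oozie -info 0000000-161220104605103-oozie-hado-W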

Read Getting started with oozie online: https://riptutorial.com/oozie/topic/3437/getting-started-with-oozie

Chapter 2: Oozie 101

Examples

Oozie Architecture

Oozie is built on a client-server architecture. The Oozie server is a Java web application that runs in a Java servlet container within an embedded Apache Tomcat. Oozie provides three different types of clients to interact with the Oozie server: the Command Line, the Java Client API and the HTTP REST API.

The Oozie server does not store any in-memory information about the running jobs. It relies on an RDBMS to store the state and data of all Oozie jobs: every time, it retrieves the job information from the database and stores updated information back into the database.

The Oozie server can sit outside of the Hadoop cluster and performs orchestration of the Hadoop jobs defined in an Oozie Workflow job.
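Since three client types are mentioned, here is the same server status check done from two of them; a small sketch assuming the default server URL:

# Command Line client
oozie admin -oozie http://localhost:11000/oozie -status

# HTTP REST API (returns JSON)
curl http://localhost:11000/oozie/v1/admin/status
# {"systemMode":"NORMAL"}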

Oozie Application Deployment

The simplest Oozie application consists of a workflow logic file (workflow.xml), a workflow properties file (job.properties/job.xml) and the required JAR files, scripts and configuration files. Except for the workflow properties file, all the other files have to be stored in an HDFS location. The workflow properties file should be available locally, from where the Oozie application is submitted and started.

The HDFS directory where workflow.xml is stored, along with the other scripts and configuration files, is called the Oozie workflow application directory. All the JAR files should be stored under a /lib directory in the Oozie application directory.

More complex Oozie applications can also consist of coordinator (coordinator.xml) and bundle (bundle.xml) logic files. These files are likewise stored in HDFS, in the respective Oozie application directory.

How to pass configuration with Oozie Proxy Job submission

When using the Oozie Proxy job submission API for submitting Oozie Hive, Pig and Sqoop actions, any configuration to be passed to the action is required to be in the below format.

For Hive action:

oozie.hive.options.size : the number of options you'll be passing to the Hive action.
oozie.hive.options.n : an argument to pass to Hive; the 'n' should be an integer starting with zero (0) to indicate the option number.

<property>
    <name>oozie.hive.options.1</name>
    <value>-Doozie.launcher.mapreduce.job.queuename=hive</value>
</property>
<property>
    <name>oozie.hive.options.0</name>
    <value>-Dmapreduce.job.queuename=hive</value>
</property>
<property>
    <name>oozie.hive.options.size</name>
    <value>2</value>
</property>

For Pig action:

oozie.pig.options.size : the number of options you'll be passing to the Pig action.
oozie.pig.options.n : an argument to pass to Pig; the 'n' should be an integer starting with zero (0) to indicate the option number.

<property>
    <name>oozie.pig.options.1</name>
    <value>-Doozie.launcher.mapreduce.job.queuename=pig</value>
</property>
<property>
    <name>oozie.pig.options.0</name>
    <value>-Dmapreduce.job.queuename=pig</value>
</property>
<property>
    <name>oozie.pig.options.size</name>
    <value>2</value>
</property>

For Sqoop action:

oozie.sqoop.options.size : the number of options you'll be passing to the Sqoop Hadoop job.
oozie.sqoop.options.n : an argument to pass to the Sqoop Hadoop job conf; the 'n' should be an integer starting with zero (0) to indicate the option number.

<property>
    <name>oozie.sqoop.options.1</name>
    <value>-Doozie.launcher.mapreduce.job.queuename=sqoop</value>
</property>
<property>
    <name>oozie.sqoop.options.0</name>
    <value>-Dmapreduce.job.queuename=sqoop</value>
</property>
<property>
    <name>oozie.sqoop.options.size</name>
    <value>2</value>
</property>
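As an illustration that is not part of the original text: with the proxy submission API, these properties go into the XML configuration document that is POSTed to the Oozie jobs REST endpoint. A rough sketch, assuming a local hive-config.xml file holding the properties above along with the rest of the action configuration (the file name is a placeholder, and the exact property set may vary by Oozie version):

curl -X POST \
     -H "Content-Type: application/xml;charset=UTF-8" \
     -d @hive-config.xml \
     "http://localhost:11000/oozie/v1/jobs?jobtype=hive"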

Read Oozie 101 online: https://riptutorial.com/oozie/topic/4134/oozie-101

Chapter 3: Oozie data triggered coordinator

Introduction

A detailed explanation is given of an Oozie data triggered coordinator job, with example. A coordinator runs periodically from the start time until the end time. Beginning at the start time, the coordinator job checks if the input data is available. When the input data becomes available, a workflow is started to process it, which on completion produces the required output data. This process is repeated at every tick of the frequency until the end time of the coordinator.

Remarks

<done-flag>_SUCCESS</done-flag>

The above snippet in coordinator.xml, on the input dataset, signals the presence of the input data. That means the coordinator action will be in WAITING state till the _SUCCESS file is present in the given input directory. Once it is present, the workflow will start execution.

Examples

oozie coordinator sample

The below coordinator job will trigger a coordinator action once a day that executes a workflow. The workflow has a shell script that moves input to output.

<coordinator-app name="log_process_coordinator" frequency="${coord:days(1)}"
                 start="2017-04-29T06:00Z" end="2018-04-29T23:25Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
    <datasets>
        <dataset name="input_dataset" frequency="${coord:days(1)}"
                 initial-instance="2017-04-29T06:00Z" timezone="GMT">
            <uri-template>${namenode}/mypath/coord_job_example/input/${YEAR}${MONTH}${DAY}</uri-template>
            <done-flag>_SUCCESS</done-flag>
        </dataset>
        <dataset name="output_dataset" frequency="${coord:days(1)}"
                 initial-instance="2017-04-29T06:00Z" timezone="GMT">
            <uri-template>${namenode}/mypath/coord_job_example/output/${YEAR}${MONTH}${DAY}</uri-template>
            <done-flag>_SUCCESS</done-flag>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="input_event" dataset="input_dataset">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <output-events>
        <data-out name="output_event" dataset="output_dataset">
            <instance>${coord:current(0)}</instance>
        </data-out>
    </output-events>
    <action>
        <workflow>
            <app-path>${workflowappuri}</app-path>
            <configuration>
                <property>
                    <name>jobtracker</name>
                    <value>${jobtracker}</value>
                </property>
                <property>
                    <name>namenode</name>
                    <value>${namenode}</value>
                </property>
                <property>
                    <name>pool.name</name>
                    <value>${poolname}</value>
                </property>
                <property>
                    <name>inputdir</name>
                    <value>${coord:dataIn('input_event')}</value>
                </property>
                <property>
                    <name>outputdir</name>
                    <value>${coord:dataOut('output_event')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>

oozie workflow sample

<workflow-app xmlns="uri:oozie:workflow:0.4" name="shell-wf">
    <start to="shell-node"/>
    <action name="shell-node">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobtracker}</job-tracker>
            <name-node>${namenode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${poolname}</value>
                </property>
            </configuration>
            <exec>${myscript}</exec>
            <argument>${inputdir}</argument>
            <argument>${outputdir}</argument>
            <file>${myscriptpath}</file>
            <capture-output/>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

job.properties sample

namenode=hdfs://namenode:port
start=2016-04-12T06:00Z
end=2017-02-26T23:25Z
jobtracker=yourjobtracker
poolname=yourpool
oozie.coord.application.path=${namenode}/hdfs_path/coord_job_example/coord
workflowappuri=${oozie.coord.application.path}
myscript=myscript.sh
myscriptpath=${oozie.coord.application.path}/myscript.sh

shell script sample

inputdir=${1}
outputdir=${2}
hadoop fs -mkdir -p ${outputdir}
hadoop fs -cp ${inputdir}/* ${outputdir}/

submitting the coordinator job

Copy the script, coordinator.xml and workflow.xml into HDFS. coordinator.xml must be present in the directory specified by oozie.coord.application.path in job.properties. workflow.xml should be present in the directory specified by workflowappuri. Once everything is in place, run the below command from the shell to submit and start the job:

oozie job -oozie <oozie_url>/oozie -config job.properties -run
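Once submitted, the coordinator can be monitored and managed from the same CLI; a brief sketch (the job id shown is illustrative):

# list recent coordinator jobs
oozie jobs -oozie <oozie_url>/oozie -jobtype coordinator

# show the status of one coordinator job and its materialized actions
oozie job -oozie <oozie_url>/oozie -info 0000010-170429060000000-oozie-oozi-C

# kill it when it is no longer needed
oozie job -oozie <oozie_url>/oozie -kill 0000010-170429060000000-oozie-oozi-C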

Read Oozie data triggered coordinator online: https://riptutorial.com/oozie/topic/9845/oozie-data-triggered-coordinator

Credits

S. No    Chapters                              Contributors
1        Getting started with oozie            Community, Jyoti Ranjan, YoungHobbit
2        Oozie 101                             YoungHobbit
3        Oozie data triggered coordinator      sunitha