Hue Application for Big Data Ingestion

August 2016

Author: Medina Bandić
Supervisor(s): Antonio Romero Marin, Manuel Martin Marquez

CERN openlab Summer Student Report 2016

Abstract

The purpose of this project was to develop a web application for HLoader, a data ingestion and ETL framework developed at CERN. This framework automates data streaming and ingestion from a wide range of data services into the CERN Hadoop Big Data analytics cluster. The web application uses the HLoader REST API to interact with the framework and expose the ETL functionalities to the users. This report introduces the HUE framework for web applications and explains why it is useful and how to use it. It gives an overview of the frameworks and technologies used for the project, shows how to create an application in HUE and how to use the HLoader REST API, and finally presents the future plans and improvements for the project.

Table of Contents

1 Introduction
2 HUE Framework
  2.1 Why HUE?
  2.2 HUE Architecture
    2.2.1 Django Framework
3 Making the HUE Application
  3.1 Basic HUE Application
  3.2 HLoader
  3.3 REST API Connection
4 Future
5 Conclusion
Acknowledgements
References

1 Introduction

Big Data ingestion is a very complex process, especially in the CERN Hadoop clusters, where a great amount of data is ingested from various sources such as experiments, accelerators, systems, infrastructure and services. This calls for simplifying the process and providing easier access, so that the required tasks can be performed efficiently.

Figure 1: Big Data Ingestion

The purpose of this project is to create a web interface to interact with the HLoader framework developed at CERN to simplify the ETL processes for Hadoop. The web application should facilitate the interaction with HLoader, improving the ETL (extract, transform, load) process for both users and service administrators. For this purpose, the HUE framework offers multiple features that the web application can build on.

2 HUE Framework

2.1 Why HUE?

HUE, which stands for Hadoop User Experience, is a web application built on the Django framework and placed between Hadoop and the browser. It hosts built-in and user-developed HUE apps and provides an environment for analyzing and exploring data.

Figure 2: HUE

There are various reasons to choose this framework for this use case. First of all, it is already integrated with the Hadoop ecosystem. Most noticeably, there are Python file-object-like APIs for interacting with HDFS. These APIs work by making REST API or Thrift calls to the Hadoop daemons; the Hadoop administrators must enable these interfaces on the Hadoop side. Considering that this project automates data ingestion into Hadoop clusters, this feature is a huge benefit. HUE also provides SQL editors for Hive, Impala, MySQL, Oracle, PostgreSQL, SparkSQL, Solr SQL and Phoenix, as well as scheduling of jobs and workflows through an Oozie Editor and Dashboard. Other benefits of using this framework are a collaborative user interface, configuration management, supervision of sub-processes, user management, integrated security and interaction with other applications.
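As a rough illustration of these file-object-like APIs, the sketch below previews an HDFS file from inside a view. It assumes the SDK exposes the filesystem handle as request.fs, as the Hue SDK documentation describes; the staging path is purely illustrative and not part of this project.

```python
# Minimal sketch of using HUE's file-object-like HDFS API inside a view.
# request.fs is assumed to be the HDFS handle the Hue SDK attaches to requests;
# the default path below is a hypothetical example.
from django.http import HttpResponse


def preview_file(request):
    """Show the first bytes of an HDFS file, using the file-object-like API."""
    path = request.GET.get('path', '/user/hloader/staging/sample.csv')  # hypothetical path
    if not request.fs.exists(path):
        return HttpResponse('No such path: %s' % path, status=404)
    f = request.fs.open(path)   # behaves like a Python file object
    head = f.read(1024)         # read only the first kilobyte
    f.close()
    return HttpResponse(head, content_type='text/plain')
```

Under the hood, such calls are translated by HUE into WebHDFS/Thrift requests to the Hadoop daemons, which is why those interfaces have to be enabled by the administrators.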

2.2 HUE Architecture

A HUE application has three layers. The first one is the front-end for user interaction, built with jQuery, Bootstrap and HTML/CSS. The second layer is the overall application logic contained in the web server; the Django framework covers this part, with Python as the main language. The third layer consists of the external backend services with which the application interacts.

Figure 3: HUE Architecture

The minimum that needs to be implemented is a "Django view" controller function that processes the request, together with the associated template that renders the response into HTML. Besides the web server, HUE applications can run other daemon processes. This is the preferred way to manage long-running tasks that exist alongside the web page rendering. The communication between views and daemons is done using Thrift or by exchanging state through the database.
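The sketch below shows this minimal "view plus template" unit. It assumes the render() helper from desktop.lib.django_util described in the Hue SDK documentation; the template name index.mako and the context variable are illustrative only.

```python
# Sketch of the minimal "view + template" unit of a HUE application.
# render() is the Mako-rendering helper described in the Hue SDK documentation;
# 'index.mako' and the 'greeting' variable are illustrative names.
from desktop.lib.django_util import render


def index(request):
    # The view (Django's "controller") prepares the context...
    context = {'greeting': 'Hello from the HLoader UI'}
    # ...and the Mako template turns it into HTML.
    return render('index.mako', request, context)
```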

2.2.1 Django Framework

Django is an MVC (model-view-controller) web framework; the only terminological difference is that the controller is called a view and the view is called a template. Django runs in a WSGI container/web server, manages the URL dispatch, executes the application logic code, and assembles the views from their templates. Python is used for the application logic and HTML or Mako templates for the user interface. Django views accept requests from users and perform the necessary changes on models, which represent the data storage of the application.

Figure 4: MVC Framework

Django uses a database, usually SQLite, to manage and store the session data, and HUE applications can use it as well for their models. For programmers, the essential thing to know is how the "urls.py" file maps between URLs and view functions. These view functions typically use their arguments (captured parameters) and their request object (the POST and GET parameters) to prepare dynamic content to be rendered by a template.
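A minimal urls.py sketch of this mapping is shown below, in the style of the Django 1.x versions HUE shipped with at the time. The app name hloader_ui and the view names are hypothetical, not the ones used in the project.

```python
# Hypothetical urls.py mapping URLs to view functions (Django 1.x style).
# The app name 'hloader_ui' and the view names are illustrative only.
from django.conf.urls import url

from hloader_ui import views

urlpatterns = [
    url(r'^$', views.index),                             # landing page
    url(r'^jobs/$', views.job_list),                     # list of ingestion jobs
    url(r'^jobs/(?P<job_id>\d+)/$', views.job_detail),   # captured parameter passed to the view
]
```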

3 Making the HUE Application

3.1 Basic HUE Application

First of all, all the dependencies of a HUE application have to be installed. On CentOS (the OS used for this project) the required packages were: Oracle's JDK, ant, asciidoc, cyrus-sasl-devel, cyrus-sasl-gssapi, cyrus-sasl-plain, gcc, gcc-c++, krb5-devel, libffi-devel, libtidy (for unit tests only), libxml2-devel, libxslt-devel, make, mvn, mysql, mysql-devel, openldap-devel, python-devel, sqlite-devel, openssl-devel and gmp-devel. After preparing the environment, the application is created and installed with the create_desktop_app command. Once everything is set up, it is possible to start developing the HUE web application itself. Since a HUE application is at its core a Django application, the views and templates are created first and the urls.py file is modified accordingly. There are some minor differences in the code compared to a standard Django application, mainly because the Mako language is used for the templates. The result is an application running on http://localhost:8000/.

Figure 5: HUE Authentication

Figure 6: Basic HUE interface

Figure 7: The Hello World HUE application

Usually models are also necessary, but in this case, instead of building the database through models, the application connects to the HLoader REST API, which uses PostgreSQL as its backend.

3.2 HLoader

HLoader is a framework built around Apache Sqoop for data ingestion jobs between RDBMS systems and the Hadoop Distributed File System (HDFS). It has been developed at CERN to facilitate and automate the data ingestion from relational databases belonging to a wide range of projects, including some CERN experiments and critical systems of the CERN Accelerator Complex. The goal of the HLoader project is to automate data ingestion between RDBMS systems and Hadoop clusters by building a data-pipeline-like framework that interfaces with these systems via Apache Sqoop and other tools, applying all the actions necessary to deliver ready-to-read data on the Hadoop file system for high-end frameworks like Impala or Spark.

Figure 8: HLoader Framework

The HLoader components include an agent, a scheduler, runners and a monitor. The agent is responsible for populating the metadata database with newly added jobs and for notifying the scheduler of the changes. When a new job is added to the metadata database, the user provides all the information required by Sqoop for the transfer. Each transfer launches a Sqoop MapReduce job which imports data from the database into the target Hadoop cluster. The metadata database also stores all the information related to the transfers, including logs and execution status. The users interact with HLoader through a RESTful API, which retrieves some of the data in the metadata database to authenticate the user and allows them to perform basic CRUD operations.

3.3 REST API Connection

In this project, the web application connects to several REST APIs to retrieve information. To facilitate development and testing, the connection with the REST API was first exercised from the command line and the code was then ported to the HUE web application. The same approach was taken for the implementation of the views, writing custom functions and using them in the templates.

On the server side, a development HLoader REST API environment had to be set up. The procedure required first downloading the HLoader project and installing its dependencies using the Python requirements.txt and the Oracle .rpm files. The next step is to configure the config.ini file (AUTH_ properties for the authentication database and POSTGRE_ properties for the metadata database). The metadata database also has to be set up using the scripts provided in the HLoader project. Once everything is configured, the HLoader REST API can be started using the HLoader.py file. The REST API exposes the metadata to the user and allows the submission of new jobs. The HLoader REST API runs by default on http://127.0.0.1:5000. The jobs submitted through the API are executed using Oozie Workflows or Coordinators.

Figure 9: Part of HLoader's JSON
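The kind of command-line check used before porting the calls into the HUE views could look like the sketch below. Only the endpoints and the default address listed in this report are assumed; the structure of the returned JSON is treated as opaque, since its exact fields are not described here.

```python
# Small command-line exploration of the HLoader REST API using the requests library.
# Endpoints and the default address come from this report; the JSON payloads are
# simply pretty-printed without assuming their internal structure.
import json

import requests

API_BASE = 'http://127.0.0.1:5000/api/v1'   # default address of the development HLoader REST API


def get(resource, **params):
    """GET one of the /api/v1 resources and return the parsed JSON."""
    response = requests.get('%s/%s' % (API_BASE, resource), params=params)
    response.raise_for_status()
    return response.json()


if __name__ == '__main__':
    for resource in ('clusters', 'servers', 'jobs'):
        print(resource + ':')
        print(json.dumps(get(resource), indent=2))
```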

The following list describes the available methods in the HLoader REST API:

GET /headers - Returns Python environment variables and request headers.
GET /api/v1 - The index page.
GET /api/v1/clusters - Returns a JSON with an array of clusters, potentially filtered by an attribute value.
GET /api/v1/servers - Returns a JSON with an array of servers, potentially filtered by an attribute value.
GET /api/v1/schemas - Returns a JSON with arrays of available and unavailable schemas given an owner username.
GET /api/v1/jobs - Returns a JSON with an array of jobs, potentially filtered by an attribute value.
POST /api/v1/jobs - Submits a job and returns a JSON containing its ID.
DELETE /api/v1/jobs - Deletes a job given its ID and reports the status of the operation.
GET /api/v1/logs - Returns a JSON with an array of logs, potentially filtered by an attribute value.
GET /api/v1/transfers - Returns a JSON with an array of transfers, potentially filtered by an attribute value.

Figure 10 shows how the list of jobs and their details are displayed in the web interface; a job can also be deleted using the delete button.

Figure 10: HUE Web Interface for HLoader
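The HUE views behind Figure 10 could wrap these endpoints roughly as sketched below. The endpoint paths and the default address come from this report; the view and template names are illustrative, the job JSON is passed to the template untouched, and the way the job ID is sent to the DELETE endpoint is an assumption, since the report does not specify it.

```python
# Sketch of HUE views wrapping the HLoader REST API for the jobs page.
# 'jobs.mako', the view names and the 'job_id' parameter name are assumptions;
# only the endpoints and the default API address come from this report.
import requests

from desktop.lib.django_util import render

API_BASE = 'http://127.0.0.1:5000/api/v1'


def job_list(request):
    """Fetch the jobs from GET /api/v1/jobs and hand them to the Mako template."""
    response = requests.get(API_BASE + '/jobs')
    response.raise_for_status()
    return render('jobs.mako', request, {'jobs': response.json()})


def delete_job(request, job_id):
    """Delete a job through DELETE /api/v1/jobs, then show the updated list."""
    response = requests.delete(API_BASE + '/jobs', params={'job_id': job_id})  # parameter name assumed
    response.raise_for_status()
    return job_list(request)
```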

4 Future

The next steps of the project include completing the web application design and its integration with the REST API; at the moment, only the core set of operations has been integrated. The application should also be extended to support a larger number of ETL functionalities, as well as more data sources and data types. Once the application is fully functional and the core operations are completely supported and tested, additional integration tests have to be prepared to ensure proper functioning with the CERN IT infrastructure. Finally, once the integration is ensured, the application has to be deployed in the production environment.

Figure 11: Example of Finished HUE Interface

5 Conclusion

The automation and simplification of the ETL process into the CERN Hadoop infrastructure plays a key role for CERN users and service administrators. Given these requirements, a web-based interface makes it possible both to monitor and to create new ETL processes in a fast and comfortable environment. The integration of the application into the HUE framework provides many built-in functionalities for the Big Data ingestion process. On the other hand, the development can be challenging at the beginning, mainly because of the lack of technical documentation; a large amount of time was spent on this issue in the first stages of the project. Nevertheless, once developers are familiar with the technology, the framework can greatly simplify and improve the implementation of interfaces to the Big Data ingestion process, especially in complex environments with very large amounts of data like CERN.

Acknowledgements

First of all, I want to thank my supervisors Antonio and Manuel for their guidance and support during the whole summer. They helped me face and fix all of the issues, and thanks to them I can say that I learned a lot while developing this project. I would also like to thank the best group at CERN, the IT-DB group, for providing us students with a really nice working environment during the whole stay. Finally, I would like to thank CERN for giving me the opportunity to have the best summer with all of the amazing people I met and the places I had the opportunity to see. It was a great honor for me to work at CERN for two months and I hope I will get a chance to return some time in the future.

References

thenewboston.com - Django Tutorial for Beginners
www.coursera.org - Hadoop Platform and Application Framework
http://cloudera.github.io/hue/docs-2.5.0/sdk/sdk.html#fast-guide-to-creating-a-new-hue-application
https://github.com/cerndb/hloader
Achari, Shiva. Hadoop Essentials: Delve into the key concepts of Hadoop and get a thorough understanding of the Hadoop ecosystem. Birmingham: Packt Publishing, 2015.