Sandbox Setup Guide for HDP 2.2 and VMware


Waterline Data Inventory
Sandbox Setup Guide for HDP 2.2 and VMware

Product Version 2.0
Document Version 10.15.2015

© 2014-2015 Waterline Data, Inc. All rights reserved. All other trademarks are the property of their respective owners.

Table of Contents

Overview
Related Documents
System requirements
Setting up the sandbox
Opening Waterline Data Inventory in a browser
Running Waterline Data Inventory
Exploring the sample cluster
Shutting down the cluster
Accessing the Hadoop cluster using SSH
Loading data into HDFS
Running Waterline Data Inventory jobs
Monitoring Waterline Data Inventory jobs
Configuring additional Waterline Data Inventory functionality
Accessing Hive tables

Overview

Waterline Data Inventory reveals information about the metadata and data quality of files in an Apache Hadoop cluster so that users of the data can identify the files they need for analysis and downstream processing. The application installs on an edge node in the cluster and runs MapReduce jobs to collect data and metadata from files in HDFS and Hive. It then discovers relationships and patterns in the profiled data and stores the results in its metadata repository. A browser application lets users search, browse, and tag HDFS files and Hive tables using the collected metadata and Waterline Data Inventory's discovered relationships.

This document describes setting up Waterline Data Inventory in a virtual machine image that is pre-configured with the Waterline Data Inventory application and sample cluster data. The image is built from the Hortonworks HDP 2.2 sandbox and runs on VMware Player or VMware Fusion.

Related Documents

Waterline Data Inventory User Guide (also available from the menu in the browser application)

For the most recent documentation and product tutorials, sign in to Waterline Data Inventory support (support.waterlinedata.com) and go to "Product Downloads, Documentation, and Tutorials".

System requirements

The Waterline Data Inventory sandbox is delivered inside the Hortonworks HDP 2.2 sandbox. The system requirements and installation instructions are the same as those Hortonworks describes: hortonworks.com/products/hortonworks-sandbox/#install

Note that the Waterline Data Inventory sandbox is configured with 10 GB of physical RAM rather than the Hortonworks default of 4 GB. The basic requirements for your host computer are at least 10 GB of RAM and a 64-bit computer that supports virtualization. VMware describes the unlikely cases where your hardware may not be compatible with 64-bit virtualization: kb.vmware.com/kb/1003945
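
If you are unsure whether your hardware supports 64-bit virtualization, a quick hedged check is to query the CPU flags from a terminal; these are standard operating-system commands, not part of Waterline Data Inventory, and the exact output varies by machine:

$ egrep -c '(vmx|svm)' /proc/cpuinfo            # Linux host: a count greater than 0 means VT-x/AMD-V is available
$ sysctl machdep.cpu.features | grep -o VMX     # Mac OS X host: prints VMX if Intel VT-x is available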

You also need the following software:

- An operating system supported by VMware Player (including Microsoft Windows XP and later) or by VMware Fusion (including Apple Mac OS X).
- The VMware virtualization application for your operating system. Download the latest version:
  Player (Windows): www.vmware.com/products/player/
  Fusion (Mac): www.vmware.com/products/fusion/
- The Waterline Data Inventory VM image built on the Hortonworks HDP 2.2 sandbox, VMware version: www.waterlinedata.com/downloads

Browser compatibility:

- Microsoft Internet Explorer 10 and later (not supported on Mac OS)
- Chrome 36 or later
- Safari 6 or later
- Firefox 31 or later

Setting up the sandbox

1. Install VMware Player or Fusion.
2. Download the Waterline Data Inventory VM (.ova file).
3. Open the .ova file with VMware (double-click the file).
4. Click Import to accept the default settings for the VM. It takes a few minutes to expand the archive and create the guest environment.
5. (Optional) Configure a way to easily move files between the host and guest. Some options are:
   - Configure a shared directory between the host and guest (Settings > Sharing: enable Shared Folders and identify the host folder to share). From the guest, you can access the shared folder at /media/sf_<shared folder name>.
   - Set up copy and paste (Settings > Isolation: check Enable Copy and Paste).
6. Start the VM. It takes a few minutes for Hadoop and its components to start up.
7. Note the IP address used for SSH access, such as 172.16.238.128, so that you can log in to the guest machine through SSH as waterlinedata/waterlinedata.
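
If the address scrolls past before you can note it, one hedged way to recover it is to open the terminal inside the guest VMware window and query the network interface directly; these are standard Linux commands, and the interface is typically eth0 on this sandbox but may differ in your image:

$ ifconfig eth0 | grep 'inet addr'      # older CentOS-based sandbox images
$ ip addr show eth0 | grep 'inet '      # equivalent on newer images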

Opening Waterline Data Inventory in a browser

The sandbox includes pre-profiled data so you can see the functionality of Waterline Data Inventory before you load your own data.

1. Open a browser to the Waterline Data Inventory application:
   http://localhost:8082 or http://<IP address from step 7>:8082
2. Sign in to Waterline Data Inventory as "waterlinedata", password "waterlinedata".

Running Waterline Data Inventory

If for some reason the browser application does not appear, you may need to sign in to the guest and start Waterline Data Inventory manually. If so, follow these steps:

1. Start an SSH session.
   (Mac OS X) Open a terminal or command prompt on the host and connect to the guest using the guest IP address (from step 7 above):
   $ ssh waterlinedata@172.16.238.128
   Enter the password when prompted ("waterlinedata").
   (Windows) Start an SSH client such as PuTTY and identify the connection parameters:
   Host Name: the guest IP address (from step 7 above)
   Protocol: SSH
   Log in using username waterlinedata and password waterlinedata.
2. You may be prompted to continue connecting even though the authenticity of the host cannot be established. Enter yes.
3. Start the embedded metadata repository database, Derby:
   $ cd /opt/waterlinedata
   $ bin/derbystart
   You'll see a response that ends with "...started and ready to accept connections on port 4444". Press Enter to return to the shell prompt.
4. Start the embedded web server, Jetty:
   $ bin/jettystart
   The console fills with status messages from Jetty. Only messages identified by "ERROR" or "exception" indicate problems.

You are now ready to use the application and its sample data.
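
If the browser still cannot reach the application after these steps, a minimal hedged check from the same SSH session is to confirm that Derby (port 4444) and the web application (port 8082) are actually listening; netstat is a standard Linux tool, not part of Waterline Data Inventory:

$ netstat -tlnp 2>/dev/null | grep -E ':(4444|8082)'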

Exploring the sample cluster

The Waterline Data Inventory sandbox is pre-populated with public data to simulate a set of users analyzing and manipulating the data. As you might expect among a group of users, there are multiple copies of the same data, standards for file and field names are not consistent, and data is not always wrangled into forms that are immediately useful for analysis. In other words, the data is intended to reflect reality.

Here are some entry points to help you use this sample data to explore the capabilities of Waterline Data Inventory:

Tags

Tags help you identify data that you may want to use for analysis. When you place tags on fields, Waterline Data Inventory looks for similar data across the profiled files in the cluster and suggests your tags for other fields. Use the tags you enter, and the automatically suggested tags, in searches and in search filtering with facets.

In the sample data, look for tags for "Food Service" data.

Lineage relationships, landings, and origins

Waterline Data Inventory uses file metadata and data to identify cluster files that are related to each other. It finds copies of the same data, joins between files, and horizontal and vertical subsets of files. If you mark the places where data comes into the cluster with "Landing" labels, Waterline Data Inventory propagates this information through the lineage relationships to show the origin of the data.

In the sample data, look for origins for "data.gov," "Twitter," and "Restaurant Inspections."

Searching with facets

Use the Global Search text box at the top of the page to do keyword searches across your cluster metadata, including file and field names, tags and tag descriptions, and the 50 examples of the most frequent data in each field. Waterline Data Inventory also provides search facets on common file and field properties, such as file size and data density. Some of the most powerful facets are those for tags and origins. Use the facet lists on the Advanced Search page to identify what kind of data you want to find, then use the facets in the left pane to refine the search results further.

In the sample data, use "Food Service" tags in the Advanced Search page, then filter the results by origin, such as "Restaurant Inspections".

Shutting down the cluster

To make sure you can restart the cluster cleanly, follow these steps to shut it down:

1. In a terminal window on the guest (or your SSH connection), stop the Jetty web server and the Derby repository database server:
   $ /opt/waterlinedata/bin/jettystop
   $ /opt/waterlinedata/bin/derbystop
2. Shut down the virtual machine: choose Virtual Machine > Shut Down. If you don't see this option, press the Option key while opening the menu.
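
If you want to be sure Jetty and Derby exited cleanly before you choose Shut Down, a quick hedged check from the guest shell is to look for their processes; the name matching is approximate and based on the start scripts above:

$ ps -ef | grep -iE 'jetty|derby' | grep -v grep     # no output means both services have stopped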

Accessing the Hadoop cluster using SSH

To run Waterline Data Inventory jobs and to upload files in bulk to HDFS, you will want to access the guest machine through a Secure Shell (SSH) connection from a command prompt or terminal on your host computer. Alternatively, you can use the terminal in the guest VMware window, but that can be awkward.

1. Start an SSH session.
   (Mac OS X) In a terminal window, start an SSH session using the IP address provided for the guest instance (step 7 of "Setting up the sandbox") and the username waterlinedata, all lower case:
   $ ssh waterlinedata@<guest IP address>
   or
   $ ssh waterlinedata@localhost
   (Windows) Start an SSH client such as PuTTY and identify the connection parameters:
   Host Name: the guest IP address (step 7 of "Setting up the sandbox")
   Protocol: SSH
   Log in using username waterlinedata and password waterlinedata.
2. You may be prompted to continue connecting even though the authenticity of the host cannot be established. Enter yes.

Loading data into HDFS

Loading data into HDFS is a two-stage process: first you load data from its source, such as your local computer or a public website, to the guest file system; then you copy the data from the guest file system into HDFS.

For a small number of files, the Hadoop utility Hue makes this process very easy by allowing you to select files from the host computer and copy them directly into HDFS. For larger files or large numbers of files, you may decide to use a combination of an SSH client (to move files to the guest machine) and a command-line operation (to move files from the guest file system to HDFS). If you have a shared directory configured between the host and guest, you can access the files directly from the guest.

Using Hue to load files into HDFS

To access Hue from a browser on the host computer:
   http://<cluster IP address>:<Hue port>/filebrowser
   For example, http://localhost:8000/filebrowser

Sign in to Hue as hue with the password 1111.
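
If the Hue page does not load in the host browser, one hedged check from an SSH session on the guest is to ask the Hue port for a response directly; curl is a standard tool on the sandbox, and a 200 or 302 status suggests Hue itself is up and the problem is on the network side:

$ curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8000/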

The following controls on the Hue File Browser page may be useful:

- Home: Home in Hue is /user/hue. Use the navigation controls to go to other user directories.
- New > Directory: Creates a new directory inside the current directory. Feel free to create additional /user directories. Note: Avoid adding directories above /user because it complicates accessing these locations from the Linux command line.
- Upload > Files: Hue allows you to use your local file system to select and upload files. Note: Avoid uploading zip files unless you are familiar with uncompressing these files from inside HDFS.
- Move to Trash > Delete Forever: Trash is just another directory in HDFS, so moving files to trash does not remove them from HDFS.

Loading files into HDFS from a command line

Copying files to HDFS is a two-step process requiring an SSH connection:

1. Make the data accessible from the guest machine. There are several ways to do this:
   - Use an SSH client such as PuTTY, FileZilla, or Cyberduck.
   - Use secure copy (scp).
   - Configure a shared directory in the VMware settings for the VM image.
2. From inside an SSH connection, use the Hadoop file system command copyFromLocal to move files from the guest file system into HDFS.

The following steps describe using scp to copy files into the guest; skip to step 5 if you chose to use a GUI SSH client to copy the files. These instructions have you use separate terminal windows or command prompts to access the guest machine in two ways:

- (Guest) indicates the terminal window or command prompt with an open SSH connection.
- (Host) indicates the terminal window or command prompt that uses scp directly.

To copy files from the host computer to HDFS on the guest:

1. (Guest) Open an SSH connection to the guest. See "Accessing the Hadoop cluster using SSH".
2. (Guest) Create a staging location for your data on the guest file system. The SSH connection's working directory is /home/waterlinedata; from here, create a directory for your staged data:
   $ mkdir data
3. (Guest) If needed, create the HDFS directories into which you will copy the files. Create the directories using Hue or using the following command inside an SSH connection:
   $ hadoop fs -mkdir <HDFS path>
   For example, to create a new staging directory:
   $ hadoop fs -mkdir /user/waterlinedata/newstagingarea
4. (Host) In a separate terminal window or command prompt, copy directories or files from the host to the guest. Navigate to the location of the data to copy on the host and run the scp command:
   $ scp -r ./<directory or file> waterlinedata@<cluster IP address>:<Linux destination>
   For example (all on one line):
   $ scp -r ./newdata waterlinedata@localhost:/home/waterlinedata/data -P 2222
   or
   $ scp -r ./newdata waterlinedata@127.0.0.1:/home/waterlinedata/data
5. (Guest) Back in the SSH terminal window or command prompt, copy the files from the guest file system to the cluster using the HDFS command copyFromLocal. Navigate to the location of the data files you copied in step 4 and copy them into HDFS using the following command:
   $ hadoop fs -copyFromLocal <local dir> <HDFS dir>
   For example (all on one line):
   $ hadoop fs -copyFromLocal /home/waterlinedata/data/ /user/waterlinedata/newstagingarea/
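
Putting the steps above together, a minimal end-to-end sketch for a single directory looks like the following; the names newdata and newstagingarea are just the example names used above, and the guest IP address is the one you noted in step 7 of "Setting up the sandbox":

(Host)  $ scp -r ./newdata waterlinedata@<guest IP address>:/home/waterlinedata/data
(Guest) $ hadoop fs -mkdir /user/waterlinedata/newstagingarea
(Guest) $ hadoop fs -copyFromLocal /home/waterlinedata/data/newdata /user/waterlinedata/newstagingarea/
(Guest) $ hadoop fs -ls /user/waterlinedata/newstagingarea     # confirm the files arrived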

Running Waterline Data Inventory jobs

Waterline Data Inventory format discovery, profiling, and tag propagation jobs are MapReduce jobs run in Hadoop. These jobs populate the Waterline Data Inventory repository with file format and schema information, sample data, and data quality metrics for files in HDFS and Hive. Lineage discovery, collection discovery, and origin propagation jobs run on the edge node where Waterline Data Inventory is installed. These jobs use data from the repository to suggest relationships among files, to suggest additional tag associations, and to propagate origin information.

Waterline Data Inventory jobs are run from a command line on the computer on which Waterline Data Inventory is installed. The jobs are started using scripts located in the bin subdirectory of the installation location; for the VM, the installation location is /opt/waterlinedata. If you are running Waterline Data Inventory jobs in a development environment, consider opening two separate command windows: one for the Jetty console output and a second to run Waterline Data Inventory jobs.
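
To see which job scripts your installation actually provides, a trivial hedged check is to list that bin directory; script names can vary between releases:

$ ls /opt/waterlinedata/bin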

The job commands are as follows:

Full HDFS Processing
   $ waterline profile <HDFS dir>
   Performs a full profile of your cluster. Run it for the initial profile and then on a regular interval to profile new and updated files. This command triggers profiling as well as the discovery processes that use profiling data. The directory you specify sets the scope of the profiling job. When you've profiled the entire cluster (or enough of it to provide sufficient profiling information), you are ready to run the lineage discovery command.

HDFS Profiling
   $ waterline profileonly <HDFS dir>
   Profiles cluster content. Use this command after you've added files to the cluster but you aren't ready to have Waterline Data Inventory suggest tags for the data.
   Example: $ waterline profileonly /user/waterlinedata/landing

Full Hive Processing
   $ profilehive default /user/waterlinedata/.hivetmp
   Full profile and discovery of the tables in the indicated Hive databases ("default" in the case of the sandbox). Indicate more than one database with a comma-separated list. To specify individual tables, use the property waterlinedata.profile.hivenamefilter with a regular expression as an override.

Hive Profiling
   $ profilehiveonly default /user/waterlinedata/.hivetmp
   Profiles the tables in the indicated Hive databases ("default" in the case of the sandbox). No discovery processes are run.

Tag Propagation
   $ waterline tag
   Propagates tags across the cluster. Use this command when you know that you haven't added new files since the last profile but you have tags and tag associations that you want Waterline Data Inventory to consider for propagation.

Lineage Discovery
   $ waterline runlineage
   Discovers lineage relationships and propagates origin information. Use this command when you have marked folders or files with origin labels and want that information propagated through the cluster. Include this command after the full profile for regular cluster profiling.
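
As a hedged example of a routine pass over the sandbox using these commands, run from the installation directory (the HDFS path is the example landing directory used above; adjust the scope for your own data):

$ cd /opt/waterlinedata
$ bin/waterline profile /user/waterlinedata/landing    # profile HDFS and run discovery
$ bin/waterline runlineage                             # then discover lineage and propagate origins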

Monitoring Waterline Data Inventory jobs

Waterline Data Inventory provides a record of job history in the Dashboard of the browser application. In addition, you can follow the detailed progress of each job on the console where you run the command.

Monitoring Hadoop jobs

When you run the profile command, you'll see an initial job for format discovery followed by one or more profiling jobs. There will be at least one profiling job running in parallel for each file type Waterline Data Inventory identifies in the format discovery pass. The console output includes a link to the job log for the running job. For example:

2014-09-20 18:17:27,048 INFO [WaterlineData Format Discovery Workflow V2] mapreduce.job (Job.java:submit(1289)) - The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1913847052944_0004/
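
Besides following the tracking URL printed in the console, a hedged alternative from an SSH session on the guest is to list the applications YARN is currently running; this is a standard Hadoop command on HDP 2.2, not a Waterline Data Inventory one:

$ yarn application -list -appStates RUNNING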

While the job is running, you can follow the tracking link in the console output to see the progress of the MapReduce activity. Alternatively, you can monitor the progress of these jobs using Hue in a browser:
   http://<cluster IP address>:8000/jobbrowser
You'll need to specify the waterlinedata user.

Monitoring local jobs

After the Hadoop jobs complete, Waterline Data Inventory runs local jobs to process the data collected in the repository. You can follow the progress of these jobs by watching the console output in the command window in which you started the job.

Debugging information

There are multiple sources of debugging information available for Waterline Data Inventory. If you encounter a problem, collect the following information for Waterline Data support.

Job messages on the console

Waterline Data Inventory generates console output for jobs run at the command prompt. If a job encounters problems, review the console output for clues to the problem. To report errors to Waterline Data support, copy this output into a text file or email so we can follow what occurred:

$ /opt/waterlinedata/bin/waterline profile <HDFS directory>

These messages appear on the console but are also collected in a log file with debug logging level:
   /var/log/waterline/wdi-inventory.log

Web server console output

The embedded web server, Jetty, produces output corresponding to user interactions with the browser application. These messages appear on the console but are also collected in a log file with debug logging level:
   /var/log/waterline/waterlinedata.log

Use tail to see the most recent entries in the log:
   $ tail -f /var/log/waterline/wdi-ui.log

Advanced Search results

From inside Waterline Data Inventory, use the Advanced Search to identify files that failed to profile. Choose the profile status you are interested in from the Profile Status search facet.
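
When gathering material for a support ticket, a hedged convenience is to bundle the log files named above into a single archive to attach along with the console output; the /var/log/waterline path is the one listed above, so adjust it if your installation logs elsewhere:

$ tar czf ~/waterline-debug-$(date +%Y%m%d).tar.gz /var/log/waterline/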

Configuring additional Waterline Data Inventory functionality

Waterline Data Inventory provides a number of configuration settings and integration interfaces to enable extended functionality. Refer to the Waterline Data Inventory Installation and Administration Guide for details:
   support.waterlinedata.com/hc/en-us/articles/205840116-waterline-data-Inventory-v2-0-Documents

Accessing Hive tables

Waterline Data Inventory makes it easy to create Hive tables from files in your cluster. You can access the Hive instance on the guest through Hue or by connecting to Hive from other third-party query or analysis tools.

Viewing Hive tables in Hue

You can access the Hive tables in your cluster through Hue using the Beeswax query tool:
   http://<cluster IP address>:8000/beeswax

Connecting to the Hive datastore

To access Hive tables from Tableau, Qlik Sense, or another analysis tool, you'll need to configure a connection to the Hive datastore on the cluster. For a Waterline Data-supplied cluster, use the following connection information:

   Parameter        Value
   Server           The same server IP address as you use for Waterline Data Inventory
   Port             10000
   Server Type      HiveServer2
   Authentication   Username and Password
   Username         hive
   Password         hive
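
For a quick hedged test of those connection settings without installing a BI tool, the Beeline client that ships with HDP can connect to HiveServer2 from an SSH session on the guest; the localhost address assumes you run Beeline on the guest itself, and the credentials come from the table above:

$ beeline -u jdbc:hive2://localhost:10000/default -n hive -p hive
> show tables;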