Greenplum Data Loader Installation and User Guide


Greenplum Data Loader 1.2 Installation and User Guide
Rev: A01

Copyright 2012 EMC Corporation. All rights reserved.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED AS IS. EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license. For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com. All other trademarks used herein are the property of their respective owners.

Greenplum Data Loader Installation and User Guide

Overview of Greenplum Data Loader
Benefits of Greenplum Data Loader
Getting Started With Greenplum Data Loader
  Greenplum Data Loader Components
  Greenplum Data Loader Dependencies
  Greenplum Data Loader RPMs
Greenplum Data Loader Deployment Structure
  Master Node
  Slave Node
  BulkLoader CLI
Preparing to Install Greenplum Data Loader
Installing Greenplum Data Loader
Configuring Greenplum Data Loader
Using Greenplum Data Loader
  Registering or Deleting a Data Store
  Submitting a Job
  Suspending a Job
  Resuming a Job
  Stopping a Job
Troubleshooting
Appendix A: Command Line Reference
  bulkloader
  Synopsis
  Submit
  Suspend
  Resume
  Stop
  Query
  List
  Config
Appendix B: Data Store Registration Properties
  FTP Data Store Registration Properties
  HTTP Data Store Registration Properties
  HDFS Data Store Registration Properties
  LocalFS Data Store Registration Properties
  NFS Data Store Registration Properties
  NFS Data Store
  HDFS Data Store
Appendix C: Greenplum Data Loader Copy Strategies
  Copy Strategies
Appendix D: Zookeeper Installation and Configuration
Appendix E: Installing and Configuring the MapReduce Cluster
Appendix F: Installing and Configuring Bookkeeper
Appendix G: Sample Deployment Topology
  Using an Existing MapReduce Cluster
  Installing a Dedicated MapReduce Cluster
Appendix H: Properties for Each Datastore Type
Glossary

Overview of Greenplum Data Loader

Greenplum Data Loader is an advanced Big Data transporting tool that focuses on loading Big Data into data analytics platforms. It is an enterprise solution for staged, batch data loading, designed to load batch data onto large data warehouse or analytics platforms for offline analysis. It deploys code, partitions data into chunks, splits jobs into multiple tasks, schedules the tasks taking into account data locality and network topology, and handles job failures.

Greenplum Data Loader can dynamically scale the execution of data-loading tasks to make the most of system resources. With a single-node deployment, it scales linearly with the number of disks up to the maximum machine bandwidth. With a multi-node cluster deployment, it scales linearly with the number of machines up to the maximum network bandwidth. This horizontal scalability delivers the best possible throughput.

Benefits of Greenplum Data Loader

In summary, Greenplum Data Loader:
- Focuses on optimizing throughput with resource efficiency and linear scalability
- Enables higher throughput via parallel load, data locality, and averaging files into similar-sized chunks
- Supports multiple data transfer jobs simultaneously
- Supports a wide variety of source data stores and access protocols: HDFS, Local FS (DAS), NFS, FTP, HTTPS
- Uses a master/slave architecture and can be managed through both CLI and GUI

Getting Started With Greenplum Data Loader

This topic describes the Greenplum Data Loader components and the RPMs included in the package.

Greenplum Data Loader Components

Bulkloader consists of the following components:

BulkLoader Manager: Provides an operational and administrative graphical user interface. It also provides a REST programmatic interface for integration with other tools.

BulkLoader CLI: A command line tool that interacts with BulkLoader Manager to provide command line access for loading job operations.
BulkLoader Scheduler: Provides a job and task scheduling service.
BulkLoader Worker: Performs the data loading work.

Greenplum Data Loader Dependencies

Greenplum Data Loader has the following dependencies:
- Zookeeper Cluster: Provides registration and coordination services for Greenplum Data Loader
- MapReduce Cluster: Manages the Greenplum Data Loader cluster
- Persistent Storage: Provides distributed, shared storage for the Greenplum Data Loader cluster to store and access data transfer plans

Greenplum Data Loader RPMs

The following RPMs are part of this release:

bulkloader-scheduler-1.0-GA.x86_64.rpm: Provides the essential files to set up the bulkloader master server.
bulkloader-worker-1.0-GA.x86_64.rpm: Provides the essential files to set up the bulkloader slave server.
bulkloader-cli-1.0-GA.x86_64.rpm: Provides the essential files and binaries to set up the bulkloader client. The client can interact with the bulkloader server to perform data loading operations.
bulkloader-datastore-1.0-GA.x86_64.rpm: Provides the essential files to support different data stores.
bulkloader-manager-1.0-GA.x86_64.rpm: Provides the HTTP server.
bulkloader-bookkeeper-1.0-GA.x86_64.rpm: Provides the essential files and binaries to set up Bookkeeper.
bulkloader-httpfs-1.0-GA.x86_64.rpm: Provides the essential files to set up HTTPFS.

bulkloader-zookeeper-1.0-GA.x86_64.rpm: Provides the essential files and binaries to set up Zookeeper.

Greenplum Data Loader Deployment Structure

The Greenplum Data Loader cluster copies data from the source datastore to the destination cluster. The cluster is composed of three types of logical nodes:
- Master Node
- Slave Node
- CLI Node

Note: If you already have a MapReduce deployment, you can choose to leverage the existing MapReduce cluster and use its HDFS as the source or destination data store. Otherwise, you can install a dedicated MapReduce cluster and use its JobTracker file system.

Master Node

You must install the following components:
- BulkLoader Manager
- BulkLoader Scheduler
- BulkLoader DataStore

Note: In a dedicated MapReduce cluster, you can have the following components on the master machine:
- MapReduce JobTracker
- Hadoop-http-fs file system

Slave Node

You must install the following components:
- BulkLoader DataStore
- BulkLoader Worker

Note: Each BulkLoader slave node must also have a TaskTracker installed.

BulkLoader CLI

The CLI can be installed on any client machine that has access to BulkLoader Manager.
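After installing the packages described in the following sections, you can sanity-check which Bulkloader components ended up on a given node. This is a minimal sketch using the standard rpm query tool, assuming the package names shown in the RPM table above.

# List the Bulkloader RPMs installed on this node; on a slave node you would
# expect to see the datastore and worker packages, on the master node the
# datastore, scheduler, and manager packages.
$ rpm -qa | grep bulkloader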

Preparing to Install Greenplum Data Loader

Perform the following tasks to prepare your environment for Greenplum Data Loader.

1. Install the JDK: Download and install the Oracle JDK 1.6 (Java SE 6 or JDK 6) from the Oracle website.

2. After installing the JDK, set the JAVA_HOME environment variable to refer to where you installed the JDK. On a typical Linux installation with Oracle JDK 1.6, the value of this variable should be /usr/java/default/jre. Then add $JAVA_HOME/bin to your PATH environment variable. On a Linux platform with the bash shell, add the following lines to the file ~/.bashrc:

export JAVA_HOME=/usr/java/default/jre
export PATH=$JAVA_HOME/bin:$PATH

3. Install and set up the Zookeeper cluster. Refer to Appendix D: Zookeeper Installation and Configuration.

4. Install and set up the MapReduce cluster. If you need to install a new MapReduce cluster, see Appendix E: Installing and Configuring the MapReduce Cluster for more information.

5. Configure the MapReduce cluster for Greenplum Data Loader.

a. Add the properties mapred.jobtracker.taskScheduler and mapred.job.tracker.http.address to the mapred-site.xml configuration file.

Note: See the following sample mapred-site.xml file for more information.

Sample mapred-site.xml file

<!-- mapred.jobtracker.taskScheduler must be set to the following value -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<!-- replace with your JobTracker host name in the value -->
<property>
  <name>mapred.job.tracker.http.address</name>
  <value>your_jobtracker_hostname:50030</value>
</property>

b. (Optional, only needed if your MapReduce cluster was not set up with the installation package shipped with Bulk Loader.) Find and delete hadoop-fairscheduler-*.*.*.jar in $HADOOP_HOME/lib.

c. (Optional, only needed if your MapReduce cluster was not set up with the installation package shipped with Bulk Loader.) Find the Bulkloader fair scheduler jar file in the bulkloader-hadoop-1.0-xxx.x86_64.rpm and copy it to $HADOOP_HOME/lib.

Sample commands:

sudo rm -f /usr/lib/gphd/hadoop/lib/hadoop-fairscheduler-*.jar
sudo cp /var/gphd/bulkloader-1.0/runtime/hadoop/lib/hadoop-fairscheduler-*.jar /usr/lib/gphd/hadoop/lib

6. Install and configure Bookkeeper.

Note: If you use an existing HDFS as your persistent storage, you can skip this step. If you use Bookkeeper, refer to Appendix F: Installing and Configuring Bookkeeper for more information.

Installing Greenplum Data Loader

To install Greenplum Data Loader on the master node

1. Set up a passwordless SSH connection to enable the Bulkloader Scheduler and Manager:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ ssh hostname0
# copy authorized_keys to all hosts (hostname1, hostname2, etc.) in the cluster using scp
# NOTE: if an authorized_keys file already exists for
# the user, rename your file authorized_keys2
$ scp /home/hadoop/.ssh/authorized_keys hostname1:/home/hadoop/.ssh/
# Set the permissions on the file on all hosts
$ ssh hostname1
$ chmod 0600 ~/.ssh/authorized_keys

2. Install the following packages:

$ sudo rpm -ivh bulkloader-datastore-1.0-xxx.x86_64.rpm
$ sudo rpm -ivh bulkloader-scheduler-1.0-xxx.x86_64.rpm
$ sudo rpm -ivh bulkloader-manager-1.0-xxx.x86_64.rpm

To install Greenplum Data Loader on the Slave Node

Install the following packages on all the slave nodes:

$ sudo rpm -ivh bulkloader-datastore-1.0-xxx.x86_64.rpm
$ sudo rpm -ivh bulkloader-worker-1.0-xxx.x86_64.rpm

To install Greenplum Data Loader on the CLI Node

Install the client package so that the client can interact with the Greenplum Data Loader server:

$ sudo rpm -ivh bulkloader-cli-1.0-xxx.x86_64.rpm

Configuring Greenplum Data Loader

To configure the common properties

1. Update the bulkloader-common.xml file for the Zookeeper and HDFS or Bookkeeper configuration.

2. Copy this file to the corresponding conf directory on each and every node that runs the Scheduler, Manager, or Worker processes. (The folder location is /usr/local/gphd/bulkloader-1.0/<manager|worker|scheduler>/conf for the different kinds of processes.)

<configuration>
  <!-- bulkloader.zk.address is the list of servers where you installed ZooKeeper during the
       "Install and set up Zookeeper cluster" step. Each server appears in this list in the
       format <hostname>:<port>; servers are separated with commas. -->
  <name>bulkloader.zk.address</name>
  <value>sdw2:2181,sdw1:2181,sdw3:2181,sdw4:2181,sdw5:2181</value>

  <!-- bulkloader.storage.type is the storage type. The default value is "bk".
       The value can also be "hdfs". -->
  <name>bulkloader.storage.type</name>
  <value>bk</value>

  <!-- bulkloader.storage.bk.ledger.size is used to set the Bookkeeper ledger size.
       The unit is bytes. It is only available when bulkloader.storage.type is set to "bk" -->
  <name>bulkloader.storage.bk.ledger.size</name>
  <value> </value>

  <!-- bulkloader.storage.bk.entry.size is used to set the Bookkeeper entry size.
       The default value is 524288. The unit is bytes. It is only available when
       bulkloader.storage.type is set to "bk" -->
  <name>bulkloader.storage.bk.entry.size</name>
  <value>524288</value>

  <!-- bulkloader.storage.hdfs.uri is used to set the storage HDFS URI.
       It is only available when bulkloader.storage.type is set to "hdfs" -->
  <name>bulkloader.storage.hdfs.uri</name>
  <value>hdfs://hdfs_hostname:port</value>

  <!-- bulkloader.storage.hdfs.rootdir is used to set the root directory in HDFS.
       It is only available when bulkloader.storage.type is set to "hdfs" -->
  <name>bulkloader.storage.hdfs.rootdir</name>
  <value>/storage/hdfs/root/directory</value>
</configuration>

To configure the datastore

The bulkloader-datastore.xml file contains the configuration properties for the datastore. Copy the bulkloader-datastore.xml file to the corresponding conf directory on each node for the Scheduler, Manager, and Worker processes.

<configuration>
  <!-- bulkloader.datastore.meta.dir is the datastore directory installed with the RPM package.
       The default is "/usr/local/gphd/bulkloader-1.0/datastore" -->
  <name>bulkloader.datastore.meta.dir</name>
  <value>/usr/local/gphd/bulkloader-1.0/datastore</value>
</configuration>

To configure the Scheduler

1. Configure the following properties in the /usr/local/gphd/bulkloader-1.0/scheduler/conf/bulkloader-scheduler.xml file.

<configuration>
  <!-- bulkloader.scheduler.mapred.conf.dir is set to the MapReduce cluster configuration directory -->
  <name>bulkloader.scheduler.mapred.conf.dir</name>
  <value>/mapreduce/cluster/configuration/directory</value>

  <!-- bulkloader.scheduler.service.rest.port is the scheduler service REST port.
       The default value is "12321" -->

  <name>bulkloader.scheduler.service.rest.port</name>
  <value>12321</value>

  <!-- bulkloader.scheduler.service.rest.host is the scheduler service REST host.
       It is the scheduler hostname. -->
  <name>bulkloader.scheduler.service.rest.host</name>
  <value>scheduler_hostname</value>

  <!-- bulkloader.scheduler.taskscheduler.port is the scheduler task scheduler port.
       The default value is 11809. -->
  <name>bulkloader.scheduler.taskscheduler.port</name>
  <value>11809</value>

  <!-- bulkloader.scheduler.taskscheduler.host is the scheduler task scheduler host.
       It is the scheduler hostname. -->
  <name>bulkloader.scheduler.taskscheduler.host</name>
  <value>scheduler_hostname</value>
</configuration>

2. (Optional, only needed if your MapReduce cluster was not set up with the installation package shipped with Bulk Loader.) Delete the hadoop-core jar in /usr/local/gphd/bulkloader-1.0/scheduler/lib.

3. (Optional, only needed if your MapReduce cluster was not set up with the installation package shipped with Bulk Loader.) Copy the hadoop-core-x.x.x-gphd-x.x.x.x.jar from the MapReduce cluster to /usr/local/gphd/bulkloader-1.0/scheduler/lib.

Sample commands:

sudo rm -f /usr/local/gphd/bulkloader-1.0/scheduler/lib/hadoop-core-*.jar
sudo cp /usr/lib/gphd/hadoop/hadoop-core-*-gphd-*.jar /usr/local/gphd/bulkloader-1.0/scheduler/lib

To configure the Manager

1. Copy bulkloader-manager.xml to /usr/local/gphd/bulkloader-1.0/manager/conf on the master node, and to /usr/local/gphd/bulkloader-1.0/worker/conf on the slave nodes.

Sample bulkloader-manager.xml:

<configuration>
  <!-- bulkloader.manager.service.port is the manager service port. The default value is 8080. -->
  <name>bulkloader.manager.service.port</name>
  <value>8080</value>

  <!-- bulkloader.manager.data.dir is the manager data directory. The default is "data".
       The "data" directory would be here: /usr/local/gphd/bulkloader-1.0/manager/bin/data -->
  <name>bulkloader.manager.data.dir</name>
  <value>data</value>

  <!-- bulkloader.manager.monitoring.enable enables or disables manager monitoring.
       The default is true. -->
  <name>bulkloader.manager.monitoring.enable</name>
  <value>true</value>

  <!-- bulkloader.manager.monitoring.host is the manager monitoring host -->
  <name>bulkloader.manager.monitoring.host</name>
  <value>manager_hostname</value>

  <!-- bulkloader.manager.monitoring.port is the manager monitoring port. The default is 12345. -->
  <name>bulkloader.manager.monitoring.port</name>
  <value>12345</value>
</configuration>

2. (Optional, only needed if your MapReduce cluster was not set up with the installation package shipped with Bulk Loader.) Delete the hadoop-core jar from /usr/local/gphd/bulkloader-1.0/manager/lib.

3. (Optional, only needed if your MapReduce cluster was not set up with the installation package shipped with Bulk Loader.) Copy the hadoop-core-x.x.x-gphd-x.x.x.x.jar from the MapReduce cluster to /usr/local/gphd/bulkloader-1.0/manager/webapps/WEB-INF/lib.

Sample commands:

sudo rm -f /usr/local/gphd/bulkloader-1.0/manager/webapps/WEB-INF/lib/hadoop-core-*.jar

sudo cp /usr/lib/gphd/hadoop/hadoop-core-*-gphd-*.jar /usr/local/gphd/bulkloader-1.0/manager/webapps/WEB-INF/lib

To configure the Worker

Copy the bulkloader-worker.xml file to the directory /usr/local/gphd/bulkloader-1.0/worker/conf on each slave node.

Sample bulkloader-worker.xml:

<configuration>
  <!-- bulkloader.worker.reader.num is the worker reader number. The default is 2 -->
  <name>bulkloader.worker.reader.num</name>
  <value>2</value>

  <!-- bulkloader.worker.writer-pipeline.num is the worker writer pipeline number. The default is 5 -->
  <name>bulkloader.worker.writer-pipeline.num</name>
  <value>5</value>

  <!-- bulkloader.worker.buffer.num is the worker buffer number. The default is 16 -->
  <name>bulkloader.worker.buffer.num</name>
  <value>16</value>

  <!-- bulkloader.worker.buffer.size is the worker buffer size. The default is 16M. The unit is bytes -->
  <name>bulkloader.worker.buffer.size</name>
  <value> </value>

  <!-- bulkloader.worker.progress.interval is the worker progress interval. The default is 1800 ms -->
  <name>bulkloader.worker.progress.interval</name>
  <value>1800</value>

  <!-- bulkloader.worker.slice.retry.enable enables or disables the worker slice retry.
       The default is true -->
  <name>bulkloader.worker.slice.retry.enable</name>
  <value>true</value>

  <!-- bulkloader.worker.slice.retry.num is the worker slice retry number. The default is 3 -->
  <name>bulkloader.worker.slice.retry.num</name>
  <value>3</value>
</configuration>

To configure the Client

Configure the /usr/local/gphd/bulkloader-1.0/cli/conf/bulkcli.conf file to point to the manager address, that is, the hostname and port of your BulkLoader master node.

Sample bulkcli.conf:

bulkloader.api.url=

To start the Data Loader service

1. Run the following command to start the Scheduler on the master node:

$ cd /usr/local/gphd/bulkloader-1.0/scheduler/bin
$ ./start.sh

2. Run the following command to start the Manager:

$ cd /usr/local/gphd/bulkloader-1.0/manager/bin
$ ./start.sh

To stop the Data Loader service

1. Run the following command to stop the Manager on the master node:

$ cd /usr/local/gphd/bulkloader-1.0/manager/bin
$ ./stop.sh

2. Run the following command to stop the Scheduler on the master node:

$ cd /usr/local/gphd/bulkloader-1.0/scheduler/bin
$ ./stop.sh

Using Greenplum Data Loader

Before submitting any job to Greenplum Data Loader, you must register your datastore. You can use the command line tool or the Data Loader Console with Greenplum Data Loader for any of the following tasks:
- Registering and unregistering a datastore

- Starting or stopping a job
- Suspending or resuming a job
- Querying a job
- Configuring a job
- Listing jobs
- Monitoring job progress

You can access the Data Loader Console at the BulkLoader Manager address (the hostname and port of your BulkLoader master node).

Registering or Deleting a Data Store

To load data from a data store, the data store must be registered with Greenplum Data Loader. You can register the data store using the command line or through the Data Loader Console.

To register a data store using the command line

Note: Perform the following command line operations on the Bulkloader CLI machine.

1. On the Client node, create the property file for the datastore in the directory /usr/local/gphd/bulkloader-1.0/cli/bin.

Provide values for the following properties:

Data store registry values:
- datahandler.handlers: Type of datastore.
- host: Host name of the datastore to register.
- port: The port number of the datastore. Not required for a local file store.
- scheme: Datastore scheme.

Supported datastore types: nfs, ftp, internal HDFS, http, localfs, GPHD 1.1, GPHD 1.2, GPHD 1.1.0.2, and Apache HDFS 1.0.3.

2. Run the following command to register the data store on the CLI node:

$ cd /usr/local/gphd/bulkloader-1.0/cli/bin
$ ./bulkloader config --command register -i <propertyfile>

Any file in the following Sample Property Files can be used as <propertyfile> in the above command. For example, to register an NFS datastore, the commands would be:

$ cd /usr/local/gphd/bulkloader-1.0/cli/bin
$ ./bulkloader config --command register -i nfs.property

For more information about datastore registration properties, see Appendix B: Data Store Registration Properties.

Sample Property Files

Sample internalhdfs.property file

type=internal_hdfs_
host=mdw
port=9500
rootpath=/
scheme=hdfs
dfs.replication=1

dfs.block.size=

Note: dfs.replication and dfs.block.size are two special properties for the HDFS data store. They are used to set the replication number and block size of the destination HDFS.

Sample hdfs1.1.property file

type=hdfs_gphd1_1_
host=mdw
port=9500
rootpath=/
scheme=hdfs

Sample localfs.property file

type=rawfs
host=hdp1-w2
rootpath=/
scheme=localfs

Note: The host must be the hostname of the node.

Sample ftp.property file

type=ftp
host=sdw6
rootpath=/
port=21
scheme=ftp
user=wangh26
passwd=password

Note: user and passwd are two special properties for the FTP data store. They are the username and password of the FTP user.

Sample http.property file

type=http
host=
rootpath=/
port=80
scheme=http

Sample nfs.property file

type=nfs
host=sdw6
rootpath=/
scheme=nfs
mountpoint=/mnt

Note: mountpoint is specific to the NFS data store. It is the mount point of the NFS client.

Sample GPHD1.2.property file

type=hdfs_gphd1_2_
host=mdw
port=9500
rootpath=/
scheme=hdfs

Sample GPHD 1.1.0.2 property file

type=hdfs_gphd1_1_02_
host=mdw
port=9500
rootpath=/
scheme=hdfs

Sample hdfs_apache_1_0_3_.property file

type=hdfs_apache_1_0_3_
host=mdw
port=9500
rootpath=/
scheme=hdfs

To register a Data Store using the Data Loader Console

1. Go to the Data Loader Console and select the Data Stores button to enter the data store register page.

2. Select the Register new Data Store button to bring up the New Data Store Instance page.

3. Enter values in the New Data Store Instance page to register the new data store.

Note:
a. To register an NFS type data store, you must mount the NFS server on both the master machine and the slave machines. Also, all the mount points must be the same.
b. To register a local FS type data store, the slave machines must be used. You can register any data loader slave machine as a local FS type data store.

4. If you need to specify more data store properties, select Add Property. See Appendix B: Data Store Registration Properties for more information.

5. Select the Create button on the page to complete the registration. The new data store can be found in the Data Store Instances page.

To unregister or delete a data store using the command line

1. Run the following command to list all the registered datastore IDs:

$ cd /usr/local/gphd/bulkloader-1.0/cli/bin

$ ./bulkloader config --command list

2. Run the following command to delete the datastore:

$ ./bulkloader config --command delete -n datastore_id

To unregister or delete a data store using the Data Loader Console

1. Select the Data Stores tab to see the Data Store Instances page.
2. Select the Remove button of the data store you want to remove.

Submitting a Job

To submit a job using the command line

1. Prepare the transfer specification file. You must prepare the transfer specification file to submit a job using the command line. It is an XML file that specifies the source and destination data stores, and the files or folders to load.

Values for the transfer specification file:
- fsuri in source: The source data store hostname and port.
- rootpath in source: The root path of the file sets to load.
- fullpath: The full path of the file to load.
- type: The entry file type. The types are as follows:
  - folder - transfers the folder to the destination cluster
  - file - transfers the data listed in the specification file
  - glob - expands and replaces glob patterns such as *.log with the list of matches
- fsuri in target: The destination data store hostname and port.
- rootpath in target: The destination root path of the files to copy.

2. Submit the job through the command line after the transfer specification file is ready:

$ bulkloader submit -i <transfer specification file> -s <strategy>

See Appendix C: Greenplum Data Loader Copy Strategies for more information.

For example, the command to submit a job with a local FS data store is:

$ bulkloader submit -i localdisk.xml -s localfs --max-disk-mapper 2

See Appendix A: Command Line Reference for more details.
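The following is a minimal end-to-end sketch of the command line flow described above, assuming a registered NFS datastore and a transfer specification file named nfs.xml like the samples that follow; the job ID printed by submit is what you pass to query.

$ bulkloader submit -i nfs.xml -s intelligent
# submit prints a job ID on success; use it to check status and progress
$ bulkloader query -n <jobid>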

File Entry Type Samples

Sample ftp.xml file with file entry type "glob"

<FileTransferSpec xmlns="...">
  <source>
    <fileset fsuri="ftp://your_source_hostname:21" rootpath="/">
      <filepath type="glob">
        <globstring>/photo/*/*</globstring>
        <filterpattern>^MOV.*$</filterpattern>
      </filepath>
    </fileset>
  </source>
  <target fsuri="hdfs://your_target_hostname:9500/" rootpath="/user/hadoop"/>
</FileTransferSpec>

The ftp.xml file transfers filenames that match the pattern "^MOV.*$" from the source ftp://your_source_hostname:21/photo/*/* to the destination hdfs://your_target_hostname:9500/user/hadoop.

Sample http.xml with file entry type "file"

<FileTransferSpec xmlns="...">
  <source>
    <fileset fstype="" fsuri="http://your_source_hostname:80" rootpath="/rootpath">
      <filepath fullpath="/filename_1" type="file"/>
      <filepath fullpath="/filename_2" type="file"/>
      <filepath fullpath="/filename_n" type="file"/>
    </fileset>
  </source>
  <target fstype="" fsuri="hdfs://your_target_hostname:9500/" rootpath="/destination_directory"/>
</FileTransferSpec>

The specification transfers the files filename_1, filename_2, through filename_n from the source to the destination hdfs://your_target_hostname:9500/destination_directory.

Sample localfs.xml

<FileTransferSpec xmlns="...">
  <source>
    <fileset fstype="" fsuri="file://your_slave_hostname/datastore_rootpath" rootpath="rootpath">
      <filepath fullpath="filename_1" type="file"/>
      <filepath fullpath="filename_2" type="file"/>
      <filepath fullpath="filename_n" type="file"/>
    </fileset>

  </source>
  <target fstype="" fsuri="hdfs://your_target_hostname:9500/" rootpath="/destination_directory"/>
</FileTransferSpec>

The hostname of the slave machine is YOUR_SLAVE_HOSTNAME. When you transfer data from the local FS to the destination HDFS, the local FS is a slave machine.

Sample localfs_disk.xml file

<FileTransferSpec xmlns="...">
  <source>
    <fileset fstype="" fsuri="file://your_slave_hostname/datastore_rootpath1" rootpath="rootpath" disk="disk0">
      <filepath fullpath="filename_1" type="file"/>
      <filepath fullpath="filename_2" type="file"/>
      <filepath fullpath="filename_n" type="file"/>
    </fileset>
    <fileset fstype="" fsuri="file://your_slave_hostname/datastore_rootpath2" rootpath="rootpath" disk="disk1">
      <filepath fullpath="filename_1" type="file"/>
      <filepath fullpath="filename_2" type="file"/>
      <filepath fullpath="filename_n" type="file"/>
    </fileset>
  </source>
  <target fstype="" fsuri="hdfs://hdsh211:9500/" rootpath="/destination_directory"/>
</FileTransferSpec>

The example above is the FileTransferSpec for the localfs strategy when you use --max-disk-mapper on the command line. See Appendix A: Command Line Reference for more detail. In the sample, YOUR_SLAVE_HOSTNAME is the hostname of a slave machine. Note that multiple disks can reside on the same host, but they must be different physical disks.

Sample nfs.xml file with file entry type "folder"

<FileTransferSpec xmlns="...">
  <source>
    <fileset fstype="" fsuri="nfs://your_source_hostname/datastore_rootpath" rootpath="rootpath">
      <filepath fullpath="/foldername_1" type="folder"/>
      <filepath fullpath="/foldername_2" type="folder"/>
      <filepath fullpath="/foldername_n" type="folder"/>
    </fileset>
  </source>
  <target fstype="" fsuri="hdfs://your_target_hostname:9500/" rootpath="/destination_directory"/>
</FileTransferSpec>

The specification file transfers folders foldername_1, foldername_2, through foldername_n from the source nfs://your_source_hostname/datastore_rootpath/rootpath to the destination hdfs://your_target_hostname:9500/destination_directory.

Sample hdfs.xml file

<FileTransferSpec xmlns="...">
  <source>
    <fileset fstype="" fsuri="hdfs://your_source_hdfs_hostname:9000/datastore_rootpath" rootpath="rootpath">
      <filepath fullpath="filename_1" type="file"/>
      <filepath fullpath="filename_2" type="file"/>
      <filepath fullpath="filename_n" type="file"/>
    </fileset>
  </source>
  <target fstype="" fsuri="hdfs://your_target_hdfs_hostname:9500/" rootpath="/rootpath"/>
</FileTransferSpec>

To submit a job using the Greenplum Data Loader Console

1. Select Create a New Job on the home page. You can submit your job using basic or advanced properties.

a. To submit using the basic option, provide the required values shown in "Basic property values for submitting a job", then go to Step 2.

Basic property values for submitting a job:
- Source datastore: The source data store hostname and port.
- Source path: The path of the data you want to load in the source data store.
- Target URI: The destination data store hostname and port.
- Target Path: The path you want to copy the data to in the destination data store.

b. To submit using advanced options, click Show Advanced Options. See "Advanced property values for submitting a job" for more information about the required property values.

Advanced property values for submitting a job:
- Copy strategy: Select a copy strategy.

- Mapper number: The number of mappers used to copy the files.
- Band width: Setting this value ensures that the job transfer speed is less than the available bandwidth.
- Chunking: Enables chunking.
- Chunking Threshold: The minimum file size to chunk.
- Chunking Size: Defines the size of the chunks.
- Overwrite existing file: Enables overwriting.
- Compression: Enables compression.
- Disk Mapper: Necessary if you chose the localdisk strategy. Use it to set the mapper number per disk.

2. Select Submit. Greenplum Data Loader uses the default value for optional fields. After submitting the job, you can find it in the Running Jobs list on the home page.

3. Check the detailed information about the job by clicking the job ID.

To monitor a job using the command line

Check the job status and progress through the query command line option:

$ bulkloader query -n <jobid>

To monitor a job using the Greenplum Data Loader Console

Click the job ID to monitor details. You can find the job's detailed information and check the progress bar.

Suspending a Job

To suspend a job using the command line

Find the ID of the job you want to suspend (for example, from the home page), then run:

$ bulkloader suspend -n <jobid>

To suspend a job using the Greenplum Data Loader Console

1. In the Running Jobs list on the home page, find the job ID.
2. Select Suspend.

You can find the job in the Suspended Jobs list.

Resuming a Job

To resume a job using the command line

$ bulkloader resume -n <jobid>

To resume a job using the Data Loader Console

1. In the Suspended Jobs list on the home page, find the job ID.
2. Select Resume in the Job Operations list.

You can check the home page to confirm that the job is running again.

Stopping a Job

Note: You can stop a job while it is running or suspended. Once stopped, a job cannot be resumed.

To stop a job using the command line

$ bulkloader stop -n <jobid>

To stop a job using the Greenplum Data Loader Console

1. On the home page, search the Running Jobs or the Suspended Jobs list to find the job you want to stop.
2. Select the job ID.
3. From the Job Operation list, select the Cancel button.
4. Confirm that the job is listed in the Canceled Jobs list.

Troubleshooting

Check the scheduler log on the master node: /usr/local/gphd/bulkloader-1.0/scheduler/log/scheduler.log.

Check the manager log on the manager node: /usr/local/gphd/bulkloader-1.0/manager/log/manager.log.
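When diagnosing a failing or stalled job, it can help to watch both logs while the job runs. This is a quick sketch assuming the default install paths listed above; the manager log file name is an assumption based on the scheduler log naming.

# Follow the scheduler log on the master node
$ tail -f /usr/local/gphd/bulkloader-1.0/scheduler/log/scheduler.log
# Follow the manager log on the manager node
$ tail -f /usr/local/gphd/bulkloader-1.0/manager/log/manager.log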

Appendix A: Command Line Reference

bulkloader

This is the bulkloader client utility.

Synopsis

bulkloader [COMMAND] [OPTIONS]

The bulkloader utility supports the following commands: submit, suspend, resume, stop, query, list, and config.

Submit

You will need to create a specification file before using the submit command. The bulkloader submit command takes the following options:

$ bulkloader submit -i <transfer specification> [-s <strategy>, -m <mapper number>, -b <bandwidth>, -k true|false, -c <chunking size>, -t true|false, -o true|false, -z]

Sample bulkloader command

bulkloader submit -i myfileset.xml -k true -o true -c 512M -m 24 -b 2M -s intelligent

You can expect the following when you issue this command:
- Receive a job ID.
- See an error in the console if it fails.

Submit options and descriptions

-i (--input): Value type: path to the configuration file. Default: N/A. This value is required. Contains the names of the source and target files.

-m (--mappernum): Value type: number of mappers. Default: the same size as the file. This value is required if you select the connectionlimited strategy. Sets the number of mappers used to copy the files.

-b (--bandwidth): Value type: long value; for example, 3M is interpreted as 3 megabytes. Default: no bandwidth control. Defines the maximum usable bandwidth.

-k (--chunking): Value type: Boolean. Default: false. Indicates whether data chunking is enabled.

-c (--chunksize): Value type: long value; for example, 3M is interpreted as 3 megabytes. Default: 64M. The size of each chunk file, if chunking is enabled.

-t (--chunkingthreshold): Value type: long value; for example, 3M is interpreted as 3 megabytes. Default: 1.6G. The minimum file size to chunk, if chunking is enabled.

-s (--strategy): Value type: one of hdfslocality, uniform, localfs, localdisk, connectionlimited, intelligent. Default: intelligent. See Appendix C: Greenplum Data Loader Copy Strategies for more information.

-o (--overwrite): Value type: Boolean. Default: false. Enables overwriting at the destination.

-z: Takes no value. Enables data compression.

--max-disk-mapper: Value type: number of mappers per disk. This option is only used with the localfs strategy. When it is specified, additional disk configuration is required in the FileTransferSpec. See the Sample localfs_disk.xml file for an example.

Suspend

You can suspend a bulkloader data transfer job.

bulkloader suspend -n <jobid>

The options related to this command are described below. You can expect the following when you issue this command:
- Receive an error if the specified job does not exist or is not running.

- Suspends the target job.

Suspend options and descriptions

-n (--jobid): Value type: string. Default: N/A. This value is required. Contains the ID of the job to suspend.

Resume

You can resume a suspended or failed bulkloader data transfer job.

bulkloader resume -n <jobid>

The options related to this command are described below. You can expect the following when you issue this command:
- Receive an error if the specified job does not exist or is in an unexpected state.
- Resumes the target job.

Resume options and descriptions

-n (--jobid): Value type: string. Default: N/A. This value is required. Contains the ID of the job to resume.

Stop

You can stop a bulkloader data transfer job.

bulkloader stop -n <jobid>

The options related to this command are described below. You can expect the following when you issue this command:
- Receive an error if the specified job does not exist or if the job has already stopped.
- Stops the target job. The job cannot be resumed.

Stop options and descriptions

-n (--jobid): Value type: string. Default: N/A. This value is required. Contains the ID of the job to stop.

Query

You can query the progress of a specified data transfer job.

bulkloader query -n <jobid>

The options related to this command are described below. You can expect the following when you issue this command:
- Receive an error if the specified job does not exist.
- Displays the status of the specified job. If Map Reduce is running, displays the progress of the transfer.

Query options and descriptions

-n (--jobid): Value type: string. Default: N/A. This value is required. Contains the ID of the job to query.

List

You can list all the running and completed jobs.

bulkloader list -status <options>

List options and descriptions

-status: Value type: one of STARTED, COMPLETED, CANCELED, SUSPENDED. Default: N/A. Lists the jobs with the given status.

Config

You can configure the data store.

$ bulkloader config --command list | register -i <property_file> | delete -n <Datastore_ID>

Config options and descriptions

--command: Value type: one of list, register, delete. Default: N/A. Configures the data store.

-i: Value type: string. Default: N/A. Required for the register option. The name of the property file.

-n: Value type: string. Default: N/A. Required for the delete command. The ID of the datastore.
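Putting the three config subcommands together, the following sketch shows a typical datastore lifecycle from the CLI node, assuming a property file named nfs.property like the sample shown earlier.

$ cd /usr/local/gphd/bulkloader-1.0/cli/bin
# Register a datastore from a property file
$ ./bulkloader config --command register -i nfs.property
# List the registered datastores and their IDs
$ ./bulkloader config --command list
# Delete a datastore by its ID
$ ./bulkloader config --command delete -n <Datastore_ID>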

Appendix B: Data Store Registration Properties

This appendix describes the properties used to register each supported data store.

FTP Data Store Registration Properties

- type: ftp
- host: The name or IP address of the FTP server.
- rootpath: The root path of the source FTP server.
- port: The port of the FTP server. The default port is 21.
- scheme: ftp
- user: The FTP username.
- passwd: The FTP user's password.
- transfermode: The FTP transfer mode; it can be stream, block, or compressed.
- passive: Whether the FTP mode is passive.
- filetype: The file type; it can be binary or ascii.

HTTP Data Store Registration Properties

- type: http
- host: The name or IP address of the HTTP server.
- port: The port of the HTTP server. The default is 80.
- rootpath: The root path of the source HTTP server.
- scheme: http
- dfs.replication: The copy replication number in the destination HDFS. Default value: 3.

HDFS Data Store Registration Properties

- type:
  - For Internal HDFS, the value is internal_hdfs_
  - For GPHD 1.1 HDFS, the value is hdfs_gphd1_1_
  - For GPHD 1.2 HDFS, the value is hdfs_gphd1_2_
  - For GPHD 1.1.0.2 HDFS, the value is hdfs_gphd1_1_02_
  - For Apache HDFS, the value is hdfs_apache_1_0_3_
- host: The name of the HDFS host; this is the same as the value in the HDFS dfs.name.dir.
- port: The port of the HDFS directory; this is the same as the value for the port in the HDFS dfs.name.dir.
- rootpath: The root path of the source HDFS cluster.
- scheme: hdfs
- dfs.replication: The copy replication number in the destination HDFS. Default value: 3.

LocalFS Data Store Registration Properties

- type: rawfs
- host: The name of a local host machine. Data is copied from this machine when you select the localfs strategy.
- rootpath: The root path of the local machine from where the data is shared.
- scheme: localfs

NFS Data Store Registration Properties

- type: nfs
- host: The NFS server IP address or host name.

- rootpath: The root path of the NFS server from where the data is shared.
- scheme: nfs
- mountpoint: The NFS mount point on Bulkloader nodes.

NFS Data Store

- mountpoint (default: /): The NFS mount point on Bulkloader nodes.

HDFS Data Store

- dfs.replication (default: 3): The copy replication number in the destination HDFS.
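As an illustration of the optional FTP properties documented above, here is a hypothetical variant of the earlier ftp.property sample; the user name and the transfermode, passive, and filetype values are example choices, not defaults from the guide.

type=ftp
host=sdw6
rootpath=/
port=21
scheme=ftp
user=ftpuser
passwd=password
transfermode=stream
passive=true
filetype=binary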

Appendix C: Greenplum Data Loader Copy Strategies

Copy Strategies

locality
  Description: This strategy applies when the source data is stored in an HDFS cluster. With the locality strategy, Greenplum Data Loader tries to deploy worker threads to HDFS datanodes so that each worker thread collocates with the data, reads from the local HDFS, and writes it to the destination.
  Supported source: HDFS
  Supported target: HDFS that supports concat

localfs
  Description: This is a locality strategy for when the source data is located on a native file system. In this case, the administrator needs to know the list of source data nodes, and Greenplum Data Loader deploys worker threads to the source data nodes.
  Supported source: native file system
  Supported target: HDFS

uniform
  Description: The uniform strategy assigns loading tasks uniformly to all the loader machines according to file size.
  Supported source: HTTP, FTP
  Supported target: HDFS

connectionlimited
  Description: When the source data is stored on an FTP/HTTP server, the server may restrict how many connections are allowed concurrently. When the connection count exceeds the allowance, the server does not respond to data download requests. The connectionlimited strategy is provided for this scenario. You can choose this strategy and specify the maximum connection number; the strategy ensures that the number of concurrent workers does not exceed the threshold.
  Supported source: all data sources
  Supported target: HDFS

intelligent
  Description: With this strategy, Greenplum Data Loader automatically picks the suitable copy strategy for your scenario. For example, if copying from HDFS and the target HDFS supports concat, the locality strategy is selected; if copying from a local file system, the localfs strategy is selected; otherwise, the uniform strategy is used.
  Supported source: all data sources
  Supported target: HDFS

Copy strategies for different data store types

HDFS
  Copy strategies: locality (if the destination data store supports concat*), uniform, connectionlimited, dynamic, intelligent
  Policy: chunking, bandwidth-throttling, overwrite, set number of mappers, compression (if chunking is enabled)

NFS
  Copy strategies: uniform, connectionlimited, dynamic, intelligent
  Policy: chunking, bandwidth-throttling, overwrite, set number of mappers, compression (if chunking is enabled)

LocalFS
  Copy strategies: uniform, connectionlimited, dynamic, intelligent
  Policy: chunking, bandwidth-throttling, overwrite, set number of mappers, compression (if chunking is enabled)

FTP/HTTP
  Copy strategies: uniform, connectionlimited, dynamic, intelligent
  Policy: chunking, bandwidth-throttling, overwrite, set number of mappers, compression (if chunking is enabled)

* The concat feature in HDFS is the ability to concatenate two or more files into one big file.
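As an illustration of choosing a strategy at submit time, the sketch below loads from an FTP source with the connectionlimited strategy described above. The spec file name ftp.xml is taken from the earlier samples, and the mapper count of 8 is an arbitrary example value; the -m option is required with this strategy.

$ bulkloader submit -i ftp.xml -s connectionlimited -m 8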

Appendix D: Zookeeper Installation and Configuration

To install Zookeeper

1. Select the Zookeeper server machines. Typically, the number of Zookeeper servers you install should be an odd number.

2. Run the following command to install Zookeeper on each machine:

$ sudo rpm -ivh bulkloader-zookeeper-1.0-xxx.x86_64.rpm

If your architecture requires more than one Zookeeper server, run the command on each machine.

To configure Zookeeper

1. In the /var/gphd/bulkloader-1.0/runtime/zookeeper/conf directory, find the Zookeeper configuration file, zoo_sample.cfg.

2. Make a copy called zoo.cfg:

$ cp zoo_sample.cfg zoo.cfg

3. Specify the following values:

Values for the Zookeeper configuration file:
- dataDir: The directory where the snapshot is stored.
- clientPort: The port at which the clients will connect.
- server.n: One entry per Zookeeper server (the "n" is the Zookeeper server number). See the sample zoo.cfg file for an example of how the myid file records the server number of each machine.

Sample zoo.cfg file

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between

# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/data2/zookeeper
# the port at which the clients will connect
clientPort=2181
server.1=sdw3:2888:3888
server.2=sdw4:2888:3888
server.3=sdw5:2888:3888
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
# The number of snapshots to retain in dataDir
# autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
# autopurge.purgeInterval=1

4. Create a file called myid on each Zookeeper server and place it under the dataDir directory (specified in zoo.cfg). The myid file contains the server number of the machine.

5. Add the variables $ZK_HOME, $ZOOCFGDIR, and $ZK_HOME/bin to the .bashrc file.

Sample Zookeeper .bashrc file

export ZK_HOME=/var/gphd/bulkloader-1.0/runtime/zookeeper
export ZOOCFGDIR=$ZK_HOME/conf
export PATH=$PATH:$ZK_HOME/bin

6. For the changes to the .bashrc file to take effect, log out and log in again before taking the following steps. For users logged in via SSH, disconnect the SSH connection and connect again.

To start Zookeeper

1. Run the following command to start Zookeeper on each machine:

$ zkServer.sh start

2. Check that the Zookeeper server started successfully:

$ echo ruok | nc sdw3 2181
imok
$

If the Zookeeper server started successfully, the system returns the result imok.
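To confirm the whole ensemble rather than a single server, you can also ask each node for its role. This is a small sketch assuming the three servers from the sample zoo.cfg above, passwordless SSH between the nodes, and that $ZK_HOME/bin is on the PATH as configured in the .bashrc step.

# Each server should report "Mode: leader" or "Mode: follower"
$ for host in sdw3 sdw4 sdw5; do ssh $host 'zkServer.sh status'; done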

To stop Zookeeper

Run the following command to stop Zookeeper on each machine:

$ zkServer.sh stop

Appendix E: Installing and Configuring the MapReduce Cluster

To install a MapReduce Cluster

1. Run the following command to install bulkloader-hadoop-1.0-xxx.x86_64.rpm on the master machine and slave machines:

sudo rpm -ivh bulkloader-hadoop-1.0-xxx.x86_64.rpm

Skip this step if you are using an existing MapReduce installation.

2. (Optional, if you want MapReduce to use HTTPFS as the Job Tracker file system.) To install and use HTTPFS as the Job Tracker file system on your master machine:

sudo rpm -ivh bulkloader-httpfs-1.0-xxx.x86_64.rpm

To configure MapReduce to use HDFS as the Job Tracker file system

1. Change to this directory:

cd /var/gphd/bulkloader-1.0/runtime/hadoop/conf

2. Modify hadoop-env.sh to set JAVA_HOME to point to the local version of the JDK 1.6.

3. Modify core-site.xml to set the fs.default.name property:

Sample core-site.xml file for HDFS

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://smdw:8020</value>
    <final>true</final>
  </property>
</configuration>

4. Modify mapred-site.xml to set up the following properties.

mapred-site.xml file properties:
- mapred.job.tracker: Host or IP and port of the JobTracker. Should be the host or IP of the bulkloader master server.
- mapred.system.dir: Path on the HDFS where the Map/Reduce framework stores system files, e.g. /hadoop/mapred/system/.
- mapred.local.dir: Comma-separated list of paths on the local filesystem where temporary Map/Reduce data is written.
- mapred.jobtracker.taskScheduler: Used to set the task scheduler.

Sample mapred-site.xml file for HDFS

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>smdw:9051</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/hadoop/mapred/system</value>
  </property>
  <property>
    <name>mapred.tmp.dir</name>
    <value>/hadoop/mapred/temp</value>
  </property>
</configuration>

5. Ensure that the directories you specified in mapred-site.xml with the property names mapred.system.dir and mapred.tmp.dir already exist; if not, create them.

6. Modify hdfs-site.xml to set up HDFS.

hdfs-site.xml file properties:
- dfs.name.dir: The namenode directory.
- dfs.data.dir: The data directory on the datanode.
- dfs.permissions: Check that the value is set to false.

Sample hdfs-site.xml file

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/data1/bulkloader_hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data2/bulkloader_hadoop/data</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>

To configure MapReduce to use HTTPFS as the Job Tracker file system

1. Modify hadoop-env.sh to set JAVA_HOME to point to the local version of the JDK 1.6.

2. Modify core-site.xml to set the fs.default.name property. See the following sample core-site.xml file for HTTPFS for more information.

Sample core-site.xml file for HTTPFS

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>...</value>
    <final>true</final>
  </property>
</configuration>

3. Modify mapred-site.xml to set up the properties. See the following sample mapred-site.xml file for HTTPFS for more information.

Sample mapred-site.xml for HTTPFS

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>

    <value>smdw:9051</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/wangh26/mapred/system</value>
  </property>
  <property>
    <name>mapred.tmp.dir</name>
    <value>/wangh26/mapred/temp</value>
  </property>
  <property>
    <name>fs.http.impl</name>
    <value>org.apache.hadoop.fs.http.client.HttpFSFileSystem</value>
  </property>
</configuration>

4. Ensure that the directories you specified in mapred-site.xml with the property names mapred.system.dir and mapred.tmp.dir already exist; if not, create them.

To start a MapReduce Cluster with HDFS

1. Start the NameNode:

$ hadoop-daemon.sh start namenode

2. Start the JobTracker:

$ hadoop-daemon.sh start jobtracker

3. Start the DataNode:

$ hadoop-daemon.sh start datanode

4. Start the TaskTracker:

$ hadoop-daemon.sh start tasktracker

To start a MapReduce Cluster with HTTPFS

1. If your Job Tracker file system is HTTPFS, start HTTPFS before starting your MapReduce cluster.

2. On the master machine, start HTTPFS:

$ cd /usr/local/gphd/bulkloader-1.0/httpfs/bin
$ ./start.sh

3. Start the JobTracker:

$ hadoop-daemon.sh start jobtracker

4. Start the TaskTracker on the slave machines:

$ hadoop-daemon.sh start tasktracker

To stop the MapReduce Cluster with HDFS

1. Stop the NameNode:

$ hadoop-daemon.sh stop namenode

2. Stop the JobTracker:

$ hadoop-daemon.sh stop jobtracker

3. Stop the DataNode:

$ hadoop-daemon.sh stop datanode

4. Stop the TaskTracker:

$ hadoop-daemon.sh stop tasktracker

To stop the MapReduce Cluster with HTTPFS

1. Stop the HTTPFS instance running on the master node:

$ cd /usr/local/gphd/bulkloader-1.0/httpfs/bin
$ ./stop.sh

2. Stop the JobTracker:

$ hadoop-daemon.sh stop jobtracker

3. Stop the TaskTracker on the slave machines:

$ hadoop-daemon.sh stop tasktracker
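A quick way to confirm which Hadoop daemons are actually running on a node after the start or stop commands above is the JDK's jps tool; this assumes the JDK bin directory is on the PATH as set during the preparation steps.

# On the master you would expect to see JobTracker (and NameNode if using HDFS);
# on a slave, TaskTracker (and DataNode if using HDFS).
$ jps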

Appendix F: Installing and Configuring Bookkeeper

To install Bookkeeper

1. Determine which storage you use to record the Bulkloader entities:
- HDFS: If you are using HDFS to record the entities, no Bookkeeper installation is needed.
- Bookkeeper: If you are using Bookkeeper, perform the following steps.

2. Change the property in the bulkloader-common.xml configuration file.

3. Run the following command to install Bookkeeper:

$ sudo rpm -ivh bulkloader-bookkeeper-1.0-xxx.x86_64.rpm

If your architecture requires more than one Bookkeeper server, run the command on each machine.

To configure Bookkeeper

1. Configure the following properties in the bk_server.conf file:

- journalDirectory: Directory where Bookkeeper outputs its write-ahead log.
- ledgerDirectories: Directories where Bookkeeper outputs ledger snapshots.
- zkLedgersRootPath: Root Zookeeper path to store ledger metadata.
- flushInterval: Interval at which ledger index pages are flushed to disk, in milliseconds.
- zkServers: A list of one or more servers on which Zookeeper is running.
- zkTimeout: ZooKeeper client session timeout in milliseconds.

Note: The ledger dirs and the journal dir should be on different devices to reduce the contention between random I/O and sequential writes.

2. Create a directory on the Zookeeper server. The directory name is specified in the zkLedgersRootPath property in the bk_server.conf file.

3. Run the following command on one of the Zookeeper servers:

$ zkCli.sh

4. Check that the Zookeeper client can connect to the Zookeeper server.

5. Run the following command to create the Zookeeper ledgers root path:

[zk: localhost:2181(CONNECTED) 0] create /ledgers ""

You can check the newly created path with the following command:

[zk: localhost:2181(CONNECTED) 0] ls /ledgers

6. Complete the process by creating the available path as follows:

[zk: localhost:2181(CONNECTED) 0] create /ledgers/available ""

Note: If your architecture requires more than one Bookkeeper server, run the command on each machine.

Sample bk_server.conf file

## Bookie settings
# Port that the bookie server listens on
bookiePort=3181
# Directory Bookkeeper outputs its write ahead log
journalDirectory=/data1/bookkeeper/bk-txn
# Directory Bookkeeper outputs ledger snapshots
# could define multiple directories to store snapshots, separated by ','
# For example:
# ledgerDirectories=/tmp/bk1-data,/tmp/bk2-data
#
# Ideally ledger dirs and journal dir are each on a different device,
# which reduces the contention between random I/O and sequential writes.
# It is possible to run with a single disk, but performance will be significantly lower.
ledgerDirectories=/data2/bookkeeper/bk-data
# Root zookeeper path to store ledger metadata
# This parameter is used by the zookeeper-based ledger manager as a root znode to
# store all ledgers.
zkLedgersRootPath=/ledgers
# How long the interval to flush ledger index pages to disk, in milliseconds
# Flushing index files will introduce much random disk I/O.
# If separating journal dir and ledger dirs each on different devices,
# flushing would not affect performance. But if putting journal dir
# and ledger dirs on the same device, performance degrades significantly
# with too frequent flushing. You can consider incrementing the flush interval
# to get better performance, but you need to pay more time on bookie
# server restart after failure.
flushInterval=100

## zookeeper client settings

# A list of one or more servers on which zookeeper is running.
# The server list can be comma separated values, for example:
# zkServers=zk1:2181,zk2:2181,zk3:2181
zkServers=sdw3:2181,sdw4:2181,sdw5:2181
# ZooKeeper client session timeout in milliseconds
# The bookie server will exit if it receives SESSION_EXPIRED because it
# was partitioned off from ZooKeeper for more than the session timeout.
# JVM garbage collection and disk I/O can cause SESSION_EXPIRED.
# Incrementing this value can help avoid this issue.
zkTimeout=

7. Add the $BK_HOME and $BK_HOME/bin variables to the .bashrc file.

Sample Bookkeeper ~/.bashrc file

export BK_HOME=/var/gphd/bulkloader-1.0/bk
export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin:$ZK_HOME/bin:$BK_HOME/bin

To start Bookkeeper

1. Run the following command to start the Bookkeeper server on each Bookkeeper server machine:

$ bookkeeper bookie > book.log 2>&1 &

2. Use the following command to check that the Bookkeeper server is running:

$ ps -ef | grep BookieServer

To stop Bookkeeper

Kill the Bookkeeper process.
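Since the guide stops the bookie by killing its process, one way to do that on each Bookkeeper machine is sketched below; matching on the BookieServer class name with pkill is an assumption on our part, not a command from the original guide.

# Find the bookie process and stop it
$ ps -ef | grep BookieServer
$ pkill -f BookieServer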

Appendix G: Sample Deployment Topology

This appendix contains two deployment samples:
- Using an existing MapReduce cluster and the associated HDFS.
- Installing a dedicated MapReduce cluster and using the JobTracker file system.

Using an Existing MapReduce Cluster

An existing MapReduce cluster should already have an associated JobTracker HDFS. You can reuse this HDFS as a source or destination data store.

Installing a Dedicated MapReduce Cluster

If you install a dedicated MapReduce cluster, Greenplum Data Loader uses the associated JobTracker file system. This file system can be configured using HDFS or HTTPFS.


More information

Tuning Enterprise Information Catalog Performance

Tuning Enterprise Information Catalog Performance Tuning Enterprise Information Catalog Performance Copyright Informatica LLC 2015, 2018. Informatica and the Informatica logo are trademarks or registered trademarks of Informatica LLC in the United States

More information

Deploying Custom Step Plugins for Pentaho MapReduce

Deploying Custom Step Plugins for Pentaho MapReduce Deploying Custom Step Plugins for Pentaho MapReduce This page intentionally left blank. Contents Overview... 1 Before You Begin... 1 Pentaho MapReduce Configuration... 2 Plugin Properties Defined... 2

More information

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours) Bigdata Fundamentals Day1: (2hours) 1. Understanding BigData. a. What is Big Data? b. Big-Data characteristics. c. Challenges with the traditional Data Base Systems and Distributed Systems. 2. Distributions:

More information

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela

More information

Hadoop On Demand User Guide

Hadoop On Demand User Guide Table of contents 1 Introduction...3 2 Getting Started Using HOD... 3 2.1 A typical HOD session... 3 2.2 Running hadoop scripts using HOD...5 3 HOD Features... 6 3.1 Provisioning and Managing Hadoop Clusters...6

More information

Getting Started with Hadoop

Getting Started with Hadoop Getting Started with Hadoop May 28, 2018 Michael Völske, Shahbaz Syed Web Technology & Information Systems Bauhaus-Universität Weimar 1 webis 2018 What is Hadoop Started in 2004 by Yahoo Open-Source implementation

More information

HOD User Guide. Table of contents

HOD User Guide. Table of contents Table of contents 1 Introduction...3 2 Getting Started Using HOD... 3 2.1 A typical HOD session... 3 2.2 Running hadoop scripts using HOD...5 3 HOD Features... 6 3.1 Provisioning and Managing Hadoop Clusters...6

More information

Session 1 Big Data and Hadoop - Overview. - Dr. M. R. Sanghavi

Session 1 Big Data and Hadoop - Overview. - Dr. M. R. Sanghavi Session 1 Big Data and Hadoop - Overview - Dr. M. R. Sanghavi Acknowledgement Prof. Kainjan M. Sanghavi For preparing this prsentation This presentation is available on my blog https://maheshsanghavi.wordpress.com/expert-talk-fdp-workshop/

More information

Distributed Filesystem

Distributed Filesystem Distributed Filesystem 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributing Code! Don t move data to workers move workers to the data! - Store data on the local disks of nodes in the

More information

VMware vsphere Big Data Extensions Command-Line Interface Guide

VMware vsphere Big Data Extensions Command-Line Interface Guide VMware vsphere Big Data Extensions Command-Line Interface Guide vsphere Big Data Extensions 1.0 This document supports the version of each product listed and supports all subsequent versions until the

More information

Elixir Ambience Installation Guide

Elixir Ambience Installation Guide Elixir Ambience Installation Guide Release 2.5.0 Elixir Technology Pte Ltd Elixir Ambience Installation Guide: Release 2.5.0 Elixir Technology Pte Ltd Published 2013 Copyright 2013 Elixir Technology Pte

More information

Linux Administration

Linux Administration Linux Administration This course will cover all aspects of Linux Certification. At the end of the course delegates will have the skills required to administer a Linux System. It is designed for professionals

More information

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017 Hadoop File System 1 S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y Moving Computation is Cheaper than Moving Data Motivation: Big Data! What is BigData? - Google

More information

50 Must Read Hadoop Interview Questions & Answers

50 Must Read Hadoop Interview Questions & Answers 50 Must Read Hadoop Interview Questions & Answers Whizlabs Dec 29th, 2017 Big Data Are you planning to land a job with big data and data analytics? Are you worried about cracking the Hadoop job interview?

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

Vendor: Cloudera. Exam Code: CCA-505. Exam Name: Cloudera Certified Administrator for Apache Hadoop (CCAH) CDH5 Upgrade Exam.

Vendor: Cloudera. Exam Code: CCA-505. Exam Name: Cloudera Certified Administrator for Apache Hadoop (CCAH) CDH5 Upgrade Exam. Vendor: Cloudera Exam Code: CCA-505 Exam Name: Cloudera Certified Administrator for Apache Hadoop (CCAH) CDH5 Upgrade Exam Version: Demo QUESTION 1 You have installed a cluster running HDFS and MapReduce

More information

Distributed Systems 16. Distributed File Systems II

Distributed Systems 16. Distributed File Systems II Distributed Systems 16. Distributed File Systems II Paul Krzyzanowski pxk@cs.rutgers.edu 1 Review NFS RPC-based access AFS Long-term caching CODA Read/write replication & disconnected operation DFS AFS

More information

Batch Inherence of Map Reduce Framework

Batch Inherence of Map Reduce Framework Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.287

More information

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals

More information

Hadoop MapReduce Framework

Hadoop MapReduce Framework Hadoop MapReduce Framework Contents Hadoop MapReduce Framework Architecture Interaction Diagram of MapReduce Framework (Hadoop 1.0) Interaction Diagram of MapReduce Framework (Hadoop 2.0) Hadoop MapReduce

More information

Big Data Analytics by Using Hadoop

Big Data Analytics by Using Hadoop Governors State University OPUS Open Portal to University Scholarship All Capstone Projects Student Capstone Projects Spring 2015 Big Data Analytics by Using Hadoop Chaitanya Arava Governors State University

More information

Hortonworks Data Platform

Hortonworks Data Platform Hortonworks Data Platform Workflow Management (August 31, 2017) docs.hortonworks.com Hortonworks Data Platform: Workflow Management Copyright 2012-2017 Hortonworks, Inc. Some rights reserved. The Hortonworks

More information

ZooKeeper Getting Started Guide

ZooKeeper Getting Started Guide by Table of contents 1 Getting Started: Coordinating Distributed Applications with ZooKeeper...2 1.1 Pre-requisites... 2 1.2 Download... 2 1.3 Standalone Operation... 2 1.4 Managing ZooKeeper Storage...3

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

NexentaStor VVOL

NexentaStor VVOL NexentaStor 5.1.1 VVOL Admin Guide Date: January, 2018 Software Version: NexentaStor 5.1.1 VVOL Part Number: 3000-VVOL-5.1.1-000065-A Table of Contents Preface... 3 Intended Audience 3 References 3 Document

More information

The Analysis and Implementation of the K - Means Algorithm Based on Hadoop Platform

The Analysis and Implementation of the K - Means Algorithm Based on Hadoop Platform Computer and Information Science; Vol. 11, No. 1; 2018 ISSN 1913-8989 E-ISSN 1913-8997 Published by Canadian Center of Science and Education The Analysis and Implementation of the K - Means Algorithm Based

More information

Dell Storage Compellent Integration Tools for VMware

Dell Storage Compellent Integration Tools for VMware Dell Storage Compellent Integration Tools for VMware Version 4.0 Administrator s Guide Notes, Cautions, and Warnings NOTE: A NOTE indicates important information that helps you make better use of your

More information

Isilon InsightIQ. Version Installation Guide

Isilon InsightIQ. Version Installation Guide Isilon InsightIQ Version 4.1.0 Installation Guide Copyright 2009-2016 EMC Corporation All rights reserved. Published October 2016 Dell believes the information in this publication is accurate as of its

More information

Xcalar Installation Guide

Xcalar Installation Guide Xcalar Installation Guide Publication date: 2018-03-16 www.xcalar.com Copyright 2018 Xcalar, Inc. All rights reserved. Table of Contents Xcalar installation overview 5 Audience 5 Overview of the Xcalar

More information

Installation Guide. Community release

Installation Guide. Community release Installation Guide Community 151 release This document details step-by-step deployment procedures, system and environment requirements to assist Jumbune deployment 1 P a g e Table of Contents Introduction

More information

Hadoop. copyright 2011 Trainologic LTD

Hadoop. copyright 2011 Trainologic LTD Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides

More information

Table of Contents HOL-SDC-1409

Table of Contents HOL-SDC-1409 Table of Contents Lab Overview - - vsphere Big Data Extensions... 2 Lab Guidance... 3 Verify Hadoop Clusters are Running... 5 Module 1 - Hadoop POC In Under an Hour (45 Min)... 9 Module Overview... 10

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Dell Storage Integration Tools for VMware

Dell Storage Integration Tools for VMware Dell Storage Integration Tools for VMware Version 4.1 Administrator s Guide Notes, cautions, and warnings NOTE: A NOTE indicates important information that helps you make better use of your product. CAUTION:

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

Hadoop-PR Hortonworks Certified Apache Hadoop 2.0 Developer (Pig and Hive Developer)

Hadoop-PR Hortonworks Certified Apache Hadoop 2.0 Developer (Pig and Hive Developer) Hortonworks Hadoop-PR000007 Hortonworks Certified Apache Hadoop 2.0 Developer (Pig and Hive Developer) http://killexams.com/pass4sure/exam-detail/hadoop-pr000007 QUESTION: 99 Which one of the following

More information

Installing Hadoop / Yarn, Hive 2.1.0, Scala , and Spark 2.0 on Raspberry Pi Cluster of 3 Nodes. By: Nicholas Propes 2016

Installing Hadoop / Yarn, Hive 2.1.0, Scala , and Spark 2.0 on Raspberry Pi Cluster of 3 Nodes. By: Nicholas Propes 2016 Installing Hadoop 2.7.3 / Yarn, Hive 2.1.0, Scala 2.11.8, and Spark 2.0 on Raspberry Pi Cluster of 3 Nodes By: Nicholas Propes 2016 1 NOTES Please follow instructions PARTS in order because the results

More information

Hadoop. Introduction to BIGDATA and HADOOP

Hadoop. Introduction to BIGDATA and HADOOP Hadoop Introduction to BIGDATA and HADOOP What is Big Data? What is Hadoop? Relation between Big Data and Hadoop What is the need of going ahead with Hadoop? Scenarios to apt Hadoop Technology in REAL

More information

HADOOP FRAMEWORK FOR BIG DATA

HADOOP FRAMEWORK FOR BIG DATA HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further

More information

VMware vsphere Big Data Extensions Administrator's and User's Guide

VMware vsphere Big Data Extensions Administrator's and User's Guide VMware vsphere Big Data Extensions Administrator's and User's Guide vsphere Big Data Extensions 1.1 This document supports the version of each product listed and supports all subsequent versions until

More information

Hadoop Map Reduce 10/17/2018 1

Hadoop Map Reduce 10/17/2018 1 Hadoop Map Reduce 10/17/2018 1 MapReduce 2-in-1 A programming paradigm A query execution engine A kind of functional programming We focus on the MapReduce execution engine of Hadoop through YARN 10/17/2018

More information

Before proceeding with this tutorial, you must have a good understanding of Core Java and any of the Linux flavors.

Before proceeding with this tutorial, you must have a good understanding of Core Java and any of the Linux flavors. About the Tutorial Storm was originally created by Nathan Marz and team at BackType. BackType is a social analytics company. Later, Storm was acquired and open-sourced by Twitter. In a short time, Apache

More information

Running Kmeans Spark on EC2 Documentation

Running Kmeans Spark on EC2 Documentation Running Kmeans Spark on EC2 Documentation Pseudo code Input: Dataset D, Number of clusters k Output: Data points with cluster memberships Step1: Read D from HDFS as RDD Step 2: Initialize first k data

More information

Cloud Computing II. Exercises

Cloud Computing II. Exercises Cloud Computing II Exercises Exercise 1 Creating a Private Cloud Overview In this exercise, you will install and configure a private cloud using OpenStack. This will be accomplished using a singlenode

More information

SAS. High- Performance Analytics Infrastructure 1.6 Installation and Configuration Guide. SAS Documentation

SAS. High- Performance Analytics Infrastructure 1.6 Installation and Configuration Guide. SAS Documentation SAS High- Performance Analytics Infrastructure 1.6 Installation and Configuration Guide SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2012. SAS

More information

9.4 Hadoop Configuration Guide for Base SAS. and SAS/ACCESS

9.4 Hadoop Configuration Guide for Base SAS. and SAS/ACCESS SAS 9.4 Hadoop Configuration Guide for Base SAS and SAS/ACCESS Second Edition SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2015. SAS 9.4 Hadoop

More information

Guidelines - Configuring PDI, MapReduce, and MapR

Guidelines - Configuring PDI, MapReduce, and MapR Guidelines - Configuring PDI, MapReduce, and MapR This page intentionally left blank. Contents Overview... 1 Set Up Your Environment... 2 Get MapR Server Information... 2 Set Up Your Host Environment...

More information

Distributed File Systems II

Distributed File Systems II Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation

More information

Data Protection Guide

Data Protection Guide SnapCenter Software 4.0 Data Protection Guide For Custom Plug-ins March 2018 215-12932_C0 doccomments@netapp.com Table of Contents 3 Contents Deciding on whether to read the SnapCenter Data Protection

More information

SAS 9.4 Hadoop Configuration Guide for Base SAS and SAS/ACCESS, Fourth Edition

SAS 9.4 Hadoop Configuration Guide for Base SAS and SAS/ACCESS, Fourth Edition SAS 9.4 Hadoop Configuration Guide for Base SAS and SAS/ACCESS, Fourth Edition SAS Documentation August 31, 2017 The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2016.

More information

MapReduce for Parallel Computing

MapReduce for Parallel Computing MapReduce for Parallel Computing Amit Jain 1/44 Big Data, Big Disks, Cheap Computers In pioneer days they used oxen for heavy pulling, and when one ox couldn t budge a log, they didn t try to grow a larger

More information

EMC Documentum External Viewing Services for SAP

EMC Documentum External Viewing Services for SAP EMC Documentum External Viewing Services for SAP Version 6.0 Administration Guide P/N 300 005 459 Rev A01 EMC Corporation Corporate Headquarters: Hopkinton, MA 01748 9103 1 508 435 1000 www.emc.com Copyright

More information

HOD Scheduler. Table of contents

HOD Scheduler. Table of contents Table of contents 1 Introduction...2 2 HOD Users...2 2.1 Getting Started...2 2.2 HOD Features...5 2.3 Troubleshooting...14 3 HOD Administrators...21 3.1 Getting Started...21 3.2 Prerequisites... 22 3.3

More information

HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation)

HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation) HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation) Introduction to BIGDATA and HADOOP What is Big Data? What is Hadoop? Relation between Big

More information

Big Data and Scripting map reduce in Hadoop

Big Data and Scripting map reduce in Hadoop Big Data and Scripting map reduce in Hadoop 1, 2, connecting to last session set up a local map reduce distribution enable execution of map reduce implementations using local file system only all tasks

More information

How to Install and Configure EBF15545 for MapR with MapReduce 2

How to Install and Configure EBF15545 for MapR with MapReduce 2 How to Install and Configure EBF15545 for MapR 4.0.2 with MapReduce 2 1993-2015 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic,

More information

Veritas NetBackup Backup, Archive, and Restore Getting Started Guide. Release 8.1.2

Veritas NetBackup Backup, Archive, and Restore Getting Started Guide. Release 8.1.2 Veritas NetBackup Backup, Archive, and Restore Getting Started Guide Release 8.1.2 Veritas NetBackup Backup, Archive, and Restore Getting Started Guide Last updated: 2018-09-19 Legal Notice Copyright 2017

More information

How to Configure Big Data Management 10.1 for MapR 5.1 Security Features

How to Configure Big Data Management 10.1 for MapR 5.1 Security Features How to Configure Big Data Management 10.1 for MapR 5.1 Security Features 2014, 2016 Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying,

More information

VMware vsphere Big Data Extensions Command-Line Interface Guide

VMware vsphere Big Data Extensions Command-Line Interface Guide VMware vsphere Big Data Extensions Command-Line Interface Guide vsphere Big Data Extensions 2.1 This document supports the version of each product listed and supports all subsequent versions until the

More information

About ADS 1.1 ADS comprises the following components: HAWQ PXF MADlib

About ADS 1.1 ADS comprises the following components: HAWQ PXF MADlib Rev: A02 Updated: July 15, 2013 Welcome to Pivotal Advanced Database Services 1.1 Pivotal Advanced Database Services (ADS), extends Pivotal Hadoop (HD) Enterprise, adding rich, proven parallel SQL processing

More information

Outline Introduction Big Data Sources of Big Data Tools HDFS Installation Configuration Starting & Stopping Map Reduc.

Outline Introduction Big Data Sources of Big Data Tools HDFS Installation Configuration Starting & Stopping Map Reduc. D. Praveen Kumar Junior Research Fellow Department of Computer Science & Engineering Indian Institute of Technology (Indian School of Mines) Dhanbad, Jharkhand, India Head of IT & ITES, Skill Subsist Impels

More information

Logging on to the Hadoop Cluster Nodes. To login to the Hadoop cluster in ROGER, a user needs to login to ROGER first, for example:

Logging on to the Hadoop Cluster Nodes. To login to the Hadoop cluster in ROGER, a user needs to login to ROGER first, for example: Hadoop User Guide Logging on to the Hadoop Cluster Nodes To login to the Hadoop cluster in ROGER, a user needs to login to ROGER first, for example: ssh username@roger-login.ncsa. illinois.edu after entering

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

High-performance computing on Microsoft Azure: GlusterFS

High-performance computing on Microsoft Azure: GlusterFS High-performance computing on Microsoft Azure: GlusterFS Introduction to creating an Azure HPC cluster and HPC storage Azure Customer Advisory Team (AzureCAT) April 2018 Contents Introduction... 3 Quick

More information

Dynamic Hadoop Clusters

Dynamic Hadoop Clusters Dynamic Hadoop Clusters Steve Loughran Julio Guijarro 2009 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice 2 25 March, 2009 Hadoop on a cluster

More information

VMware vsphere Big Data Extensions Command-Line Interface Guide

VMware vsphere Big Data Extensions Command-Line Interface Guide VMware vsphere Big Data Extensions Command-Line Interface Guide vsphere Big Data Extensions 2.0 This document supports the version of each product listed and supports all subsequent versions until the

More information

Commands Manual. Table of contents

Commands Manual. Table of contents Table of contents 1 Overview...2 1.1 Generic Options...2 2 User Commands...3 2.1 archive... 3 2.2 distcp...3 2.3 fs... 3 2.4 fsck... 3 2.5 jar...4 2.6 job...4 2.7 pipes...5 2.8 version... 6 2.9 CLASSNAME...6

More information

TECHNICAL OVERVIEW OF NEW AND IMPROVED FEATURES OF EMC ISILON ONEFS 7.1.1

TECHNICAL OVERVIEW OF NEW AND IMPROVED FEATURES OF EMC ISILON ONEFS 7.1.1 TECHNICAL OVERVIEW OF NEW AND IMPROVED FEATURES OF EMC ISILON ONEFS 7.1.1 ABSTRACT This introductory white paper provides a technical overview of the new and improved enterprise grade features introduced

More information

ClarityNow Best Practices Guide

ClarityNow Best Practices Guide ClarityNow Best Practices Guide Abstract A guide containing ClarityNow best practices and recommendations for common deployment to help avoid difficulties. Document includes descriptions of some default

More information

EMC Documentum Composer

EMC Documentum Composer EMC Documentum Composer Version 6.5 SP2 User Guide P/N 300-009-462 A01 EMC Corporation Corporate Headquarters: Hopkinton, MA 01748-9103 1-508-435-1000 www.emc.com Copyright 2008 2009 EMC Corporation. All

More information

Getting Started with Hadoop/YARN

Getting Started with Hadoop/YARN Getting Started with Hadoop/YARN Michael Völske 1 April 28, 2016 1 michael.voelske@uni-weimar.de Michael Völske Getting Started with Hadoop/YARN April 28, 2016 1 / 66 Outline Part One: Hadoop, HDFS, and

More information

<Partner Name> <Partner Product> RSA Ready Implementation Guide for. MapR Converged Data Platform 3.1

<Partner Name> <Partner Product> RSA Ready Implementation Guide for. MapR Converged Data Platform 3.1 RSA Ready Implementation Guide for MapR Jeffrey Carlson, RSA Partner Engineering Last Modified: 02/25/2016 Solution Summary RSA Analytics Warehouse provides the capacity

More information

Data Protection Guide

Data Protection Guide SnapCenter Software 4.0 Data Protection Guide For VMs and Datastores using the SnapCenter Plug-in for VMware vsphere March 2018 215-12931_C0 doccomments@netapp.com Table of Contents 3 Contents Deciding

More information

Hortonworks HDPCD. Hortonworks Data Platform Certified Developer. Download Full Version :

Hortonworks HDPCD. Hortonworks Data Platform Certified Developer. Download Full Version : Hortonworks HDPCD Hortonworks Data Platform Certified Developer Download Full Version : https://killexams.com/pass4sure/exam-detail/hdpcd QUESTION: 97 You write MapReduce job to process 100 files in HDFS.

More information

Big Data Retrieving Required Information From Text Files Desmond Hill Yenumula B Reddy (Advisor)

Big Data Retrieving Required Information From Text Files Desmond Hill Yenumula B Reddy (Advisor) Big Data Retrieving Required Information From Text Files Desmond Hill Yenumula B Reddy (Advisor) 1 OUTLINE Objective What is Big data Characteristics of Big Data Setup Requirements Hadoop Setup Word Count

More information

How to Install and Configure EBF14514 for IBM BigInsights 3.0

How to Install and Configure EBF14514 for IBM BigInsights 3.0 How to Install and Configure EBF14514 for IBM BigInsights 3.0 2014 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying,

More information

Gain Insights From Unstructured Data Using Pivotal HD. Copyright 2013 EMC Corporation. All rights reserved.

Gain Insights From Unstructured Data Using Pivotal HD. Copyright 2013 EMC Corporation. All rights reserved. Gain Insights From Unstructured Data Using Pivotal HD 1 Traditional Enterprise Analytics Process 2 The Fundamental Paradigm Shift Internet age and exploding data growth Enterprises leverage new data sources

More information

EMC Greenplum Data Computing Appliance to x Software Upgrade Guide. Rev: A02

EMC Greenplum Data Computing Appliance to x Software Upgrade Guide. Rev: A02 EMC Greenplum Data Computing Appliance 1.2.0.1 to 1.2.1.x Software Upgrade Guide Rev: A02 Copyright 2013 EMC Corporation. All rights reserved. EMC believes the information in this publication is accurate

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

Commands Guide. Table of contents

Commands Guide. Table of contents Table of contents 1 Overview...2 1.1 Generic Options...2 2 User Commands...3 2.1 archive... 3 2.2 distcp...3 2.3 fs... 3 2.4 fsck... 3 2.5 jar...4 2.6 job...4 2.7 pipes...5 2.8 queue...6 2.9 version...

More information

SAS Viya 3.2 and SAS/ACCESS : Hadoop Configuration Guide

SAS Viya 3.2 and SAS/ACCESS : Hadoop Configuration Guide SAS Viya 3.2 and SAS/ACCESS : Hadoop Configuration Guide SAS Documentation July 6, 2017 The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2017. SAS Viya 3.2 and SAS/ACCESS

More information

EMC Voyence Integration Adaptor

EMC Voyence Integration Adaptor EMC Voyence Integration Adaptor Version 2.0.0 EMC SMARTS P/N 300-007-379 REV A03 EMC Corporation Corporate Headquarters Hopkinton, MA 01748-9103 1-508-435-1000 www.emc.com COPYRIGHT Copyright 2008 EMC

More information