Greenplum Data Loader Installation and User Guide


Greenplum Data Loader 1.2 Installation and User Guide
Rev: A01

Copyright 2012 EMC Corporation. All rights reserved.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED AS IS. EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license. For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com. All other trademarks used herein are the property of their respective owners.

Greenplum Data Loader Installation and User Guide

Overview of Greenplum Data Loader
Benefits of Greenplum Data Loader
Getting Started With Greenplum Data Loader
  Greenplum Data Loader Components
  Greenplum Data Loader Dependencies
  Greenplum Data Loader RPMs
Greenplum Data Loader Deployment Structure
  Master Node
  Slave Node
  BulkLoader CLI
Preparing to Install Greenplum Data Loader
Installing Greenplum Data Loader
Configuring Greenplum Data Loader
Using Greenplum Data Loader
  Registering or Deleting a Data Store
  Submitting a Job
  Suspending a Job
  Resuming a Job
  Stopping a Job
Troubleshooting
Appendix A: Command Line Reference
  bulkloader
  Synopsis
  Submit
  Suspend
  Resume
  Stop
  Query
  List
  Config
Appendix B: Data Store Registration Properties
  FTP Data Store Registration Properties
  HTTP Data Store Registration Properties
  HDFS Data Store Registration Properties
  LocalFS Data Store Registration Properties
  NFS Data Store Registration Properties
  NFS Data Store
  HDFS Data Store
Appendix C: Greenplum Data Loader Copy Strategies
  Copy Strategies
Appendix D: Zookeeper Installation and Configuration
Appendix E: Installing and Configuring the MapReduce Cluster
Appendix F: Installing and Configuring Bookkeeper
Appendix G: Sample Deployment Topology
  Using an Existing MapReduce Cluster
  Installing a Dedicated MapReduce Cluster
Appendix H: Properties for Each Datastore Type
Glossary

Overview of Greenplum Data Loader

Greenplum Data Loader is an advanced Big Data transporting tool that focuses on loading Big Data into data analytics platforms. It is an enterprise solution for staged, batch data loading, designed to load batch data onto large data warehouse or analytics platforms for offline analysis. It deploys code, partitions data into chunks, splits jobs into multiple tasks, schedules the tasks taking into account data locality and network topology, and handles job failures.

Greenplum Data Loader can dynamically scale the execution of data-loading tasks to make the most of system resources. With a single-node deployment, it scales linearly with the number of disks up to the maximum machine bandwidth. With a multi-node cluster deployment, it scales linearly with the number of machines up to the maximum network bandwidth. This horizontal scalability delivers the best possible throughput.

Benefits of Greenplum Data Loader

In summary, Greenplum Data Loader:
- Focuses on optimizing throughput with resource efficiency and linear scalability
- Enables higher throughput via parallel load, data locality, and averaging files into similar-sized chunks
- Supports multiple data transfer jobs simultaneously
- Supports a wide variety of source data stores and access protocols: HDFS, Local FS (DAS), NFS, FTP, HTTPS
- Uses a master/slave architecture and can be managed through both CLI and GUI

Getting Started With Greenplum Data Loader

This topic describes the Greenplum Data Loader components and the RPMs included in the package.

Greenplum Data Loader Components

Bulkloader consists of the following components:

BulkLoader Manager: Provides an operational and administrative graphical user interface. It also provides a REST programmatic interface for integration with other tools.

BulkLoader CLI: A command line tool that interacts with BulkLoader Manager to provide command line access for loading job operations.
BulkLoader Scheduler: Provides a job and task scheduling service.
BulkLoader Worker: Performs the data loading work.

Greenplum Data Loader Dependencies

Greenplum Data Loader has the following dependencies:
- Zookeeper Cluster: Provides registration and coordination services for Greenplum Data Loader
- MapReduce Cluster: Manages the Greenplum Data Loader cluster
- Persistent Storage: Provides distributed, shared storage for the Greenplum Data Loader cluster to store and access data transfer plans

Greenplum Data Loader RPMs

The following RPMs are part of this release:

bulkloader-scheduler-1.0-GA.x86_64.rpm: Provides the essential files to set up the bulkloader master server.
bulkloader-worker-1.0-GA.x86_64.rpm: Provides the essential files to set up the bulkloader slave server.
bulkloader-cli-1.0-GA.x86_64.rpm: Provides the essential files and binaries to set up the bulkloader client. The client can interact with the bulkloader server to perform data loading operations.
bulkloader-datastore-1.0-GA.x86_64.rpm: Provides the essential files to support different data stores.
bulkloader-manager-1.0-GA.x86_64.rpm: Provides the HTTP server.
bulkloader-bookkeeper-1.0-GA.x86_64.rpm: Provides the essential files and binaries to set up Bookkeeper.
bulkloader-httpfs-1.0-GA.x86_64.rpm: Provides the essential files to set up HTTPFS.

bulkloader-zookeeper-1.0-GA.x86_64.rpm: Provides the essential files and binaries to set up Zookeeper.

Greenplum Data Loader Deployment Structure

The Greenplum Data Loader cluster copies data from the source datastore to the destination cluster. The cluster is composed of three types of logical nodes:
- Master Node
- Slave Node
- CLI Node

Note: If you already have a MapReduce deployment, you can choose to leverage the existing MapReduce cluster and use its HDFS as the source or destination data store. Otherwise, you can install a dedicated MapReduce cluster and use its JobTracker file system.

Master Node

You must install the following components:
- BulkLoader Manager
- BulkLoader Scheduler
- BulkLoader DataStore

Note: In a dedicated MapReduce cluster, you can have the following components on the master machine:
- MapReduce JobTracker
- Hadoop-http-fs file system

Slave Node

You must install the following components:
- BulkLoader DataStore
- BulkLoader Worker

Note: Each BulkLoader slave node must also have a TaskTracker installed.

BulkLoader CLI

The CLI can be installed on any client machine that has access to BulkLoader Manager.
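After installing the packages described in the following sections, you can sanity-check which Bulkloader components ended up on a given node. This is a minimal sketch using the standard rpm query tool, assuming the package names shown in the RPM table above.

# List the Bulkloader RPMs installed on this node; on a slave node you would
# expect to see the datastore and worker packages, on the master node the
# datastore, scheduler, and manager packages.
$ rpm -qa | grep bulkloader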

Preparing to Install Greenplum Data Loader

Perform the following tasks to prepare your environment for Greenplum Data Loader.

1. Install the JDK: Download and install the Oracle JDK 1.6 (Java SE 6 or JDK 6) from the Oracle website.

2. After installing the JDK, set the JAVA_HOME environment variable to refer to where you installed the JDK. On a typical Linux installation with Oracle JDK 1.6, the value of this variable should be /usr/java/default/jre. Then add $JAVA_HOME/bin to your PATH environment variable. On a Linux platform with the bash shell, add the following lines to the file ~/.bashrc:

export JAVA_HOME=/usr/java/default/jre
export PATH=$JAVA_HOME/bin:$PATH

3. Install and set up the Zookeeper cluster. Refer to Appendix D: Zookeeper Installation and Configuration.

4. Install and set up the MapReduce cluster. If you need to install a new MapReduce cluster, see Appendix E: Installing and Configuring the MapReduce Cluster for more information.

5. Configure the MapReduce cluster for Greenplum Data Loader.

a. Add the properties mapred.jobtracker.taskScheduler and mapred.job.tracker.http.address to the mapred-site.xml configuration file.

Note: See the following sample mapred-site.xml file for more information.

Sample mapred-site.xml file

<!-- mapred.jobtracker.taskScheduler must be set to the following value -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<!-- replace with your JobTracker host name in the value -->
<property>
  <name>mapred.job.tracker.http.address</name>
  <value>your_jobtracker_hostname:50030</value>
</property>

b. (Optional, only needed if your MapReduce cluster was not set up with the installation package shipped with Bulk Loader.) Find and delete hadoop-fairscheduler-*.*.*.jar in $HADOOP_HOME/lib.

c. (Optional, only needed if your MapReduce cluster was not set up with the installation package shipped with Bulk Loader.) Find the Bulkloader fair scheduler jar file in the bulkloader-hadoop-1.0-xxx.x86_64.rpm and copy it to $HADOOP_HOME/lib.

Sample commands:

sudo rm -f /usr/lib/gphd/hadoop/lib/hadoop-fairscheduler-*.jar
sudo cp /var/gphd/bulkloader-1.0/runtime/hadoop/lib/hadoop-fairscheduler-*.jar /usr/lib/gphd/hadoop/lib

6. Install and configure Bookkeeper.

Note: If you use an existing HDFS as your persistent storage, you can skip this step. If you use Bookkeeper, refer to Appendix F: Installing and Configuring Bookkeeper for more information.

Installing Greenplum Data Loader

To install Greenplum Data Loader on the master node

1. Set up a passwordless SSH connection to enable the Bulkloader Scheduler and Manager:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ ssh hostname0
# copy authorized_keys to all hosts (hostname1, hostname2, etc.) in the cluster using scp
# NOTE: if an authorized_keys file already exists for
# the user, rename your file authorized_keys2
$ scp /home/hadoop/.ssh/authorized_keys hostname1:/home/hadoop/.ssh/
# Set the permissions on the file on all hosts
$ ssh hostname1
$ chmod 0600 ~/.ssh/authorized_keys

2. Install the following packages:

$ sudo rpm -ivh bulkloader-datastore-1.0-xxx.x86_64.rpm
$ sudo rpm -ivh bulkloader-scheduler-1.0-xxx.x86_64.rpm
$ sudo rpm -ivh bulkloader-manager-1.0-xxx.x86_64.rpm

To install Greenplum Data Loader on the Slave Node

Install the following packages on all the slave nodes:

$ sudo rpm -ivh bulkloader-datastore-1.0-xxx.x86_64.rpm
$ sudo rpm -ivh bulkloader-worker-1.0-xxx.x86_64.rpm

To install Greenplum Data Loader on the CLI Node

Install the client package so that the client can interact with the Greenplum Data Loader server:

$ sudo rpm -ivh bulkloader-cli-1.0-xxx.x86_64.rpm

Configuring Greenplum Data Loader

To configure the common properties

1. Update the bulkloader-common.xml file for the Zookeeper and HDFS or Bookkeeper configuration.

2. Copy this file to the corresponding conf directory on each and every node that runs the Scheduler, Manager, or Worker processes. (The folder location is /usr/local/gphd/bulkloader-1.0/<manager|worker|scheduler>/conf for the different kinds of processes.)

<configuration>
  <!-- bulkloader.zk.address is the list of servers where you installed ZooKeeper during the
       "Install and set up Zookeeper cluster" step. Each server appears in this list in the
       format <hostname>:<port>; servers are separated with commas. -->
  <name>bulkloader.zk.address</name>
  <value>sdw2:2181,sdw1:2181,sdw3:2181,sdw4:2181,sdw5:2181</value>

  <!-- bulkloader.storage.type is the storage type. The default value is "bk".
       The value can also be "hdfs". -->
  <name>bulkloader.storage.type</name>
  <value>bk</value>

  <!-- bulkloader.storage.bk.ledger.size is used to set the Bookkeeper ledger size.
       The unit is bytes. It is only available when bulkloader.storage.type is set to "bk" -->
  <name>bulkloader.storage.bk.ledger.size</name>
  <value> </value>

  <!-- bulkloader.storage.bk.entry.size is used to set the Bookkeeper entry size.
       The default value is 524288. The unit is bytes. It is only available when
       bulkloader.storage.type is set to "bk" -->
  <name>bulkloader.storage.bk.entry.size</name>
  <value>524288</value>

  <!-- bulkloader.storage.hdfs.uri is used to set the storage HDFS URI.
       It is only available when bulkloader.storage.type is set to "hdfs" -->
  <name>bulkloader.storage.hdfs.uri</name>
  <value>hdfs://hdfs_hostname:port</value>

  <!-- bulkloader.storage.hdfs.rootdir is used to set the root directory in HDFS.
       It is only available when bulkloader.storage.type is set to "hdfs" -->
  <name>bulkloader.storage.hdfs.rootdir</name>
  <value>/storage/hdfs/root/directory</value>
</configuration>

To configure the datastore

The bulkloader-datastore.xml file contains the configuration properties for the datastore. Copy the bulkloader-datastore.xml file to the corresponding conf directory on each node for the Scheduler, Manager, and Worker processes.

<configuration>
  <!-- bulkloader.datastore.meta.dir is the datastore directory installed with the RPM package.
       The default is "/usr/local/gphd/bulkloader-1.0/datastore" -->
  <name>bulkloader.datastore.meta.dir</name>
  <value>/usr/local/gphd/bulkloader-1.0/datastore</value>
</configuration>

To configure the Scheduler

1. Configure the following properties in the /usr/local/gphd/bulkloader-1.0/scheduler/conf/bulkloader-scheduler.xml file.

<configuration>
  <!-- bulkloader.scheduler.mapred.conf.dir is set to the MapReduce cluster configuration directory -->
  <name>bulkloader.scheduler.mapred.conf.dir</name>
  <value>/mapreduce/cluster/configuration/directory</value>

  <!-- bulkloader.scheduler.service.rest.port is the scheduler service REST port.
       The default value is "12321" -->

  <name>bulkloader.scheduler.service.rest.port</name>
  <value>12321</value>

  <!-- bulkloader.scheduler.service.rest.host is the scheduler service REST host.
       It is the scheduler hostname. -->
  <name>bulkloader.scheduler.service.rest.host</name>
  <value>scheduler_hostname</value>

  <!-- bulkloader.scheduler.taskscheduler.port is the scheduler task scheduler port.
       The default value is 11809. -->
  <name>bulkloader.scheduler.taskscheduler.port</name>
  <value>11809</value>

  <!-- bulkloader.scheduler.taskscheduler.host is the scheduler task scheduler host.
       It is the scheduler hostname. -->
  <name>bulkloader.scheduler.taskscheduler.host</name>
  <value>scheduler_hostname</value>
</configuration>

2. (Optional, only needed if your MapReduce cluster was not set up with the installation package shipped with Bulk Loader.) Delete the hadoop-core jar in /usr/local/gphd/bulkloader-1.0/scheduler/lib.

3. (Optional, only needed if your MapReduce cluster was not set up with the installation package shipped with Bulk Loader.) Copy the hadoop-core-x.x.x-gphd-x.x.x.x.jar from the MapReduce cluster to /usr/local/gphd/bulkloader-1.0/scheduler/lib.

Sample commands:

sudo rm -f /usr/local/gphd/bulkloader-1.0/scheduler/lib/hadoop-core-*.jar
sudo cp /usr/lib/gphd/hadoop/hadoop-core-*-gphd-*.jar /usr/local/gphd/bulkloader-1.0/scheduler/lib

To configure the Manager

1. Copy bulkloader-manager.xml to /usr/local/gphd/bulkloader-1.0/manager/conf on the master node, and to /usr/local/gphd/bulkloader-1.0/worker/conf on the slave nodes.

Sample bulkloader-manager.xml:

<configuration>
  <!-- bulkloader.manager.service.port is the manager service port. The default value is 8080. -->
  <name>bulkloader.manager.service.port</name>
  <value>8080</value>

  <!-- bulkloader.manager.data.dir is the manager data directory. The default is "data".
       The "data" directory would be here: /usr/local/gphd/bulkloader-1.0/manager/bin/data -->
  <name>bulkloader.manager.data.dir</name>
  <value>data</value>

  <!-- bulkloader.manager.monitoring.enable enables or disables manager monitoring.
       The default is true. -->
  <name>bulkloader.manager.monitoring.enable</name>
  <value>true</value>

  <!-- bulkloader.manager.monitoring.host is the manager monitoring host -->
  <name>bulkloader.manager.monitoring.host</name>
  <value>manager_hostname</value>

  <!-- bulkloader.manager.monitoring.port is the manager monitoring port. The default is 12345. -->
  <name>bulkloader.manager.monitoring.port</name>
  <value>12345</value>
</configuration>

2. (Optional, only needed if your MapReduce cluster was not set up with the installation package shipped with Bulk Loader.) Delete the hadoop-core jar from /usr/local/gphd/bulkloader-1.0/manager/lib.

3. (Optional, only needed if your MapReduce cluster was not set up with the installation package shipped with Bulk Loader.) Copy the hadoop-core-x.x.x-gphd-x.x.x.x.jar from the MapReduce cluster to /usr/local/gphd/bulkloader-1.0/manager/webapps/WEB-INF/lib.

Sample commands:

sudo rm -f /usr/local/gphd/bulkloader-1.0/manager/webapps/WEB-INF/lib/hadoop-core-*.jar

sudo cp /usr/lib/gphd/hadoop/hadoop-core-*-gphd-*.jar /usr/local/gphd/bulkloader-1.0/manager/webapps/WEB-INF/lib

To configure the Worker

Copy the bulkloader-worker.xml file to the directory /usr/local/gphd/bulkloader-1.0/worker/conf on each slave node.

Sample bulkloader-worker.xml:

<configuration>
  <!-- bulkloader.worker.reader.num is the worker reader number. The default is 2 -->
  <name>bulkloader.worker.reader.num</name>
  <value>2</value>

  <!-- bulkloader.worker.writer-pipeline.num is the worker writer pipeline number. The default is 5 -->
  <name>bulkloader.worker.writer-pipeline.num</name>
  <value>5</value>

  <!-- bulkloader.worker.buffer.num is the worker buffer number. The default is 16 -->
  <name>bulkloader.worker.buffer.num</name>
  <value>16</value>

  <!-- bulkloader.worker.buffer.size is the worker buffer size. The default is 16M. The unit is bytes -->
  <name>bulkloader.worker.buffer.size</name>
  <value> </value>

  <!-- bulkloader.worker.progress.interval is the worker progress interval. The default is 1800 ms -->
  <name>bulkloader.worker.progress.interval</name>
  <value>1800</value>

  <!-- bulkloader.worker.slice.retry.enable enables or disables the worker slice retry.
       The default is true -->
  <name>bulkloader.worker.slice.retry.enable</name>
  <value>true</value>

  <!-- bulkloader.worker.slice.retry.num is the worker slice retry number. The default is 3 -->
  <name>bulkloader.worker.slice.retry.num</name>
  <value>3</value>
</configuration>

To configure the Client

Configure the /usr/local/gphd/bulkloader-1.0/cli/conf/bulkcli.conf file to point to the manager address, that is, the hostname and port of your BulkLoader master node.

Sample bulkcli.conf:

bulkloader.api.url=

To start the Data Loader service

1. Run the following command to start the Scheduler on the master node:

$ cd /usr/local/gphd/bulkloader-1.0/scheduler/bin
$ ./start.sh

2. Run the following command to start the Manager:

$ cd /usr/local/gphd/bulkloader-1.0/manager/bin
$ ./start.sh

To stop the Data Loader service

1. Run the following command to stop the Manager on the master node:

$ cd /usr/local/gphd/bulkloader-1.0/manager/bin
$ ./stop.sh

2. Run the following command to stop the Scheduler on the master node:

$ cd /usr/local/gphd/bulkloader-1.0/scheduler/bin
$ ./stop.sh

Using Greenplum Data Loader

Before submitting any job to Greenplum Data Loader, you must register your datastore. You can use the command line tool or the Data Loader Console with Greenplum Data Loader for any of the following tasks:
- Registering and unregistering a datastore

- Starting or stopping a job
- Suspending or resuming a job
- Querying a job
- Configuring a job
- Listing jobs
- Monitoring job progress

You can access the Data Loader Console at the BulkLoader Manager address (the hostname and port of your BulkLoader master node).

Registering or Deleting a Data Store

To load data from a data store, the data store must be registered with Greenplum Data Loader. You can register the data store using the command line or through the Data Loader Console.

To register a data store using the command line

Note: Perform the following command line operations on the Bulkloader CLI machine.

1. On the Client node, create the property file for the datastore in the directory /usr/local/gphd/bulkloader-1.0/cli/bin.

Provide values for the following properties:

Data store registry values:
- datahandler.handlers: Type of datastore.
- host: Host name of the datastore to register.
- port: The port number of the datastore. Not required for a local file store.
- scheme: Datastore scheme.

Supported datastore types: nfs, ftp, internal HDFS, http, localfs, GPHD 1.1, GPHD 1.2, GPHD 1.1.0.2, and Apache HDFS 1.0.3.

2. Run the following command to register the data store on the CLI node:

$ cd /usr/local/gphd/bulkloader-1.0/cli/bin
$ ./bulkloader config --command register -i <propertyfile>

Any file in the following Sample Property Files can be used as <propertyfile> in the above command. For example, to register an NFS datastore, the commands would be:

$ cd /usr/local/gphd/bulkloader-1.0/cli/bin
$ ./bulkloader config --command register -i nfs.property

For more information about datastore registration properties, see Appendix B: Data Store Registration Properties.

Sample Property Files

Sample internalhdfs.property file

type=internal_hdfs_
host=mdw
port=9500
rootpath=/
scheme=hdfs
dfs.replication=1

dfs.block.size=

Note: dfs.replication and dfs.block.size are two special properties for the HDFS data store. They are used to set the replication number and block size of the destination HDFS.

Sample hdfs1.1.property file

type=hdfs_gphd1_1_
host=mdw
port=9500
rootpath=/
scheme=hdfs

Sample localfs.property file

type=rawfs
host=hdp1-w2
rootpath=/
scheme=localfs

Note: The host must be the hostname of the node.

Sample ftp.property file

type=ftp
host=sdw6
rootpath=/
port=21
scheme=ftp
user=wangh26
passwd=password

Note: user and passwd are two special properties for the FTP data store. They are the username and password of the FTP user.

Sample http.property file

type=http
host=
rootpath=/
port=80
scheme=http

Sample nfs.property file

type=nfs
host=sdw6
rootpath=/
scheme=nfs
mountpoint=/mnt

Note: mountpoint is specific to the NFS data store. It is the mount point of the NFS client.

Sample GPHD1.2.property file

type=hdfs_gphd1_2_
host=mdw
port=9500
rootpath=/
scheme=hdfs

Sample GPHD 1.1.0.2 property file

type=hdfs_gphd1_1_02_
host=mdw
port=9500
rootpath=/
scheme=hdfs

Sample hdfs_apache_1_0_3_.property file

type=hdfs_apache_1_0_3_
host=mdw
port=9500
rootpath=/
scheme=hdfs

To register a Data Store using the Data Loader Console

1. Go to the Data Loader Console and select the Data Stores button to enter the data store register page.

2. Select the Register new Data Store button to bring up the New Data Store Instance page.

3. Enter values in the New Data Store Instance page to register the new data store.

Note:
a. To register an NFS type data store, you must mount the NFS server on both the master machine and the slave machines. Also, all the mount points must be the same.
b. To register a local FS type data store, the slave machines must be used. You can register any data loader slave machine as a local FS type data store.

4. If you need to specify more data store properties, select Add Property. See Appendix B: Data Store Registration Properties for more information.

5. Select the Create button on the page to complete the registration. The new data store can be found in the Data Store Instances page.

To unregister or delete a data store using the command line

1. Run the following command to list all the registered datastore IDs:

$ cd /usr/local/gphd/bulkloader-1.0/cli/bin

$ ./bulkloader config --command list

2. Run the following command to delete the datastore:

$ ./bulkloader config --command delete -n datastore_id

To unregister or delete a data store using the Data Loader Console

1. Select the Data Stores tab to see the Data Store Instances page.
2. Select the Remove button of the data store you want to remove.

Submitting a Job

To submit a job using the command line

1. Prepare the transfer specification file. You must prepare the transfer specification file to submit a job using the command line. It is an XML file that specifies the source and destination data stores, and the files or folders to load.

Values for the transfer specification file:
- fsuri in source: The source data store hostname and port.
- rootpath in source: The root path of the file sets to load.
- fullpath: The full path of the file to load.
- type: The entry file type. The types are as follows:
  - folder - transfers the folder to the destination cluster
  - file - transfers the data listed in the specification file
  - glob - expands and replaces glob patterns such as *.log with the list of matches
- fsuri in target: The destination data store hostname and port.
- rootpath in target: The destination root path of the files to copy.

2. Submit the job through the command line after the transfer specification file is ready:

$ bulkloader submit -i <transfer specification file> -s <strategy>

See Appendix C: Greenplum Data Loader Copy Strategies for more information.

For example, the command to submit a job with a local FS data store is:

$ bulkloader submit -i localdisk.xml -s localfs --max-disk-mapper 2

See Appendix A: Command Line Reference for more details.
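The following is a minimal end-to-end sketch of the command line flow described above, assuming a registered NFS datastore and a transfer specification file named nfs.xml like the samples that follow; the job ID printed by submit is what you pass to query.

$ bulkloader submit -i nfs.xml -s intelligent
# submit prints a job ID on success; use it to check status and progress
$ bulkloader query -n <jobid>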

File Entry Type Samples

Sample ftp.xml file with file entry type "glob"

<FileTransferSpec xmlns="...">
  <source>
    <fileset fsuri="ftp://your_source_hostname:21" rootpath="/">
      <filepath type="glob">
        <globstring>/photo/*/*</globstring>
        <filterpattern>^MOV.*$</filterpattern>
      </filepath>
    </fileset>
  </source>
  <target fsuri="hdfs://your_target_hostname:9500/" rootpath="/user/hadoop"/>
</FileTransferSpec>

The ftp.xml file transfers filenames that match the pattern "^MOV.*$" from the source ftp://your_source_hostname:21/photo/*/* to the destination hdfs://your_target_hostname:9500/user/hadoop.

Sample http.xml with file entry type "file"

<FileTransferSpec xmlns="...">
  <source>
    <fileset fstype="" fsuri="http://your_source_hostname:80" rootpath="/rootpath">
      <filepath fullpath="/filename_1" type="file"/>
      <filepath fullpath="/filename_2" type="file"/>
      <filepath fullpath="/filename_n" type="file"/>
    </fileset>
  </source>
  <target fstype="" fsuri="hdfs://your_target_hostname:9500/" rootpath="/destination_directory"/>
</FileTransferSpec>

The specification transfers the files filename_1, filename_2, through filename_n from the source to the destination hdfs://your_target_hostname:9500/destination_directory.

Sample localfs.xml

<FileTransferSpec xmlns="...">
  <source>
    <fileset fstype="" fsuri="file://your_slave_hostname/datastore_rootpath" rootpath="rootpath">
      <filepath fullpath="filename_1" type="file"/>
      <filepath fullpath="filename_2" type="file"/>
      <filepath fullpath="filename_n" type="file"/>
    </fileset>

  </source>
  <target fstype="" fsuri="hdfs://your_target_hostname:9500/" rootpath="/destination_directory"/>
</FileTransferSpec>

The hostname of the slave machine is YOUR_SLAVE_HOSTNAME. When you transfer data from the local FS to the destination HDFS, the local FS is a slave machine.

Sample localfs_disk.xml file

<FileTransferSpec xmlns="...">
  <source>
    <fileset fstype="" fsuri="file://your_slave_hostname/datastore_rootpath1" rootpath="rootpath" disk="disk0">
      <filepath fullpath="filename_1" type="file"/>
      <filepath fullpath="filename_2" type="file"/>
      <filepath fullpath="filename_n" type="file"/>
    </fileset>
    <fileset fstype="" fsuri="file://your_slave_hostname/datastore_rootpath2" rootpath="rootpath" disk="disk1">
      <filepath fullpath="filename_1" type="file"/>
      <filepath fullpath="filename_2" type="file"/>
      <filepath fullpath="filename_n" type="file"/>
    </fileset>
  </source>
  <target fstype="" fsuri="hdfs://hdsh211:9500/" rootpath="/destination_directory"/>
</FileTransferSpec>

The example above is the FileTransferSpec for the localfs strategy when you use --max-disk-mapper on the command line. See Appendix A: Command Line Reference for more detail. In the sample, YOUR_SLAVE_HOSTNAME is the hostname of a slave machine. Note that multiple disks can reside on the same host, but they must be different physical disks.

Sample nfs.xml file with file entry type "folder"

<FileTransferSpec xmlns="...">
  <source>
    <fileset fstype="" fsuri="nfs://your_source_hostname/datastore_rootpath" rootpath="rootpath">
      <filepath fullpath="/foldername_1" type="folder"/>
      <filepath fullpath="/foldername_2" type="folder"/>
      <filepath fullpath="/foldername_n" type="folder"/>
    </fileset>
  </source>
  <target fstype="" fsuri="hdfs://your_target_hostname:9500/" rootpath="/destination_directory"/>
</FileTransferSpec>

The specification file transfers folders foldername_1, foldername_2, through foldername_n from the source nfs://your_source_hostname/datastore_rootpath/rootpath to the destination hdfs://your_target_hostname:9500/destination_directory.

Sample hdfs.xml file

<FileTransferSpec xmlns="...">
  <source>
    <fileset fstype="" fsuri="hdfs://your_source_hdfs_hostname:9000/datastore_rootpath" rootpath="rootpath">
      <filepath fullpath="filename_1" type="file"/>
      <filepath fullpath="filename_2" type="file"/>
      <filepath fullpath="filename_n" type="file"/>
    </fileset>
  </source>
  <target fstype="" fsuri="hdfs://your_target_hdfs_hostname:9500/" rootpath="/rootpath"/>
</FileTransferSpec>

To submit a job using the Greenplum Data Loader Console

1. Select Create a New Job on the home page. You can submit your job using basic or advanced properties.

a. To submit using the basic option, provide the required values shown in "Basic property values for submitting a job", then go to Step 2.

Basic property values for submitting a job:
- Source datastore: The source data store hostname and port.
- Source path: The path of the data you want to load in the source data store.
- Target URI: The destination data store hostname and port.
- Target Path: The path you want to copy the data to in the destination data store.

b. To submit using advanced options, click Show Advanced Options. See "Advanced property values for submitting a job" for more information about the required property values.

Advanced property values for submitting a job:
- Copy strategy: Select a copy strategy.

- Mapper number: The number of mappers used to copy the files.
- Band width: Setting this value ensures that the job transfer speed is less than the available bandwidth.
- Chunking: Enables chunking.
- Chunking Threshold: The minimum file size to chunk.
- Chunking Size: Defines the size of the chunks.
- Overwrite existing file: Enables overwriting.
- Compression: Enables compression.
- Disk Mapper: Necessary if you chose the localdisk strategy. Use it to set the mapper number per disk.

2. Select Submit. Greenplum Data Loader uses the default value for optional fields. After submitting the job, you can find it in the Running Jobs list on the home page.

3. Check the detailed information about the job by clicking the job ID.

To monitor a job using the command line

Check the job status and progress through the query command line option:

$ bulkloader query -n <jobid>

To monitor a job using the Greenplum Data Loader Console

Click the job ID to monitor details. You can find the job's detailed information and check the progress bar.

Suspending a Job

To suspend a job using the command line

Find the ID of the job you want to suspend (for example, from the home page), then run:

$ bulkloader suspend -n <jobid>

To suspend a job using the Greenplum Data Loader Console

1. In the Running Jobs list on the home page, find the job ID.
2. Select Suspend.

You can find the job in the Suspended Jobs list.

Resuming a Job

To resume a job using the command line

$ bulkloader resume -n <jobid>

To resume a job using the Data Loader Console

1. In the Suspended Jobs list on the home page, find the job ID.
2. Select Resume in the Job Operations list.

You can check the home page to confirm that the job is running again.

Stopping a Job

Note: You can stop a job while it is running or suspended. Once stopped, a job cannot be resumed.

To stop a job using the command line

$ bulkloader stop -n <jobid>

To stop a job using the Greenplum Data Loader Console

1. On the home page, search the Running Jobs or the Suspended Jobs list to find the job you want to stop.
2. Select the job ID.
3. From the Job Operation list, select the Cancel button.
4. Confirm that the job is listed in the Canceled Jobs list.

Troubleshooting

Check the scheduler log on the master node: /usr/local/gphd/bulkloader-1.0/scheduler/log/scheduler.log.

Check the manager log on the manager node: /usr/local/gphd/bulkloader-1.0/manager/log/manager.log.
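When diagnosing a failing or stalled job, it can help to watch both logs while the job runs. This is a quick sketch assuming the default install paths listed above; the manager log file name is an assumption based on the scheduler log naming.

# Follow the scheduler log on the master node
$ tail -f /usr/local/gphd/bulkloader-1.0/scheduler/log/scheduler.log
# Follow the manager log on the manager node
$ tail -f /usr/local/gphd/bulkloader-1.0/manager/log/manager.log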

Appendix A: Command Line Reference

bulkloader

This is the bulkloader client utility.

Synopsis

bulkloader [COMMAND] [OPTIONS]

The bulkloader utility supports the following commands: submit, suspend, resume, stop, query, list, and config.

Submit

You will need to create a specification file before using the submit command. The bulkloader submit command takes the following options:

$ bulkloader submit -i <transfer specification> [-s <strategy>, -m <mapper number>, -b <bandwidth>, -k true|false, -c <chunking size>, -t true|false, -o true|false, -z]

Sample bulkloader command

bulkloader submit -i myfileset.xml -k true -o true -c 512M -m 24 -b 2M -s intelligent

You can expect the following when you issue this command:
- Receive a job ID.
- See an error in the console if it fails.

Submit options and descriptions

-i (--input): Value type: path to the configuration file. Default: N/A. This value is required. Contains the names of the source and target files.

-m (--mappernum): Value type: number of mappers. Default: the same size as the file. This value is required if you select the connectionlimited strategy. Sets the number of mappers used to copy the files.

-b (--bandwidth): Value type: long value; for example, 3M is interpreted as 3 megabytes. Default: no bandwidth control. Defines the maximum usable bandwidth.

-k (--chunking): Value type: Boolean. Default: false. Indicates whether data chunking is enabled.

-c (--chunksize): Value type: long value; for example, 3M is interpreted as 3 megabytes. Default: 64M. The size of each chunk file, if chunking is enabled.

-t (--chunkingthreshold): Value type: long value; for example, 3M is interpreted as 3 megabytes. Default: 1.6G. The minimum file size to chunk, if chunking is enabled.

-s (--strategy): Value type: one of hdfslocality, uniform, localfs, localdisk, connectionlimited, intelligent. Default: intelligent. See Appendix C: Greenplum Data Loader Copy Strategies for more information.

-o (--overwrite): Value type: Boolean. Default: false. Enables overwriting at the destination.

-z: Takes no value. Enables data compression.

--max-disk-mapper: Value type: number of mappers per disk. This option is only used with the localfs strategy. When it is specified, additional disk configuration is required in the FileTransferSpec. See the Sample localfs_disk.xml file for an example.

Suspend

You can suspend a bulkloader data transfer job.

bulkloader suspend -n <jobid>

The options related to this command are described below. You can expect the following when you issue this command:
- Receive an error if the specified job does not exist or is not running.

- Suspends the target job.

Suspend options and descriptions

-n (--jobid): Value type: string. Default: N/A. This value is required. Contains the ID of the job to suspend.

Resume

You can resume a suspended or failed bulkloader data transfer job.

bulkloader resume -n <jobid>

The options related to this command are described below. You can expect the following when you issue this command:
- Receive an error if the specified job does not exist or is in an unexpected state.
- Resumes the target job.

Resume options and descriptions

-n (--jobid): Value type: string. Default: N/A. This value is required. Contains the ID of the job to resume.

Stop

You can stop a bulkloader data transfer job.

bulkloader stop -n <jobid>

The options related to this command are described below. You can expect the following when you issue this command:
- Receive an error if the specified job does not exist or if the job has already stopped.
- Stops the target job. The job cannot be resumed.

Stop options and descriptions

-n (--jobid): Value type: string. Default: N/A. This value is required. Contains the ID of the job to stop.

Query

You can query the progress of a specified data transfer job.

bulkloader query -n <jobid>

The options related to this command are described below. You can expect the following when you issue this command:
- Receive an error if the specified job does not exist.
- Displays the status of the specified job. If Map Reduce is running, displays the progress of the transfer.

Query options and descriptions

-n (--jobid): Value type: string. Default: N/A. This value is required. Contains the ID of the job to query.

List

You can list all the running and completed jobs.

bulkloader list -status <options>

List options and descriptions

-status: Value type: one of STARTED, COMPLETED, CANCELED, SUSPENDED. Default: N/A. Lists the jobs with the given status.

Config

You can configure the data store.

$ bulkloader config --command list | register -i <property_file> | delete -n <Datastore_ID>

Config options and descriptions

--command: Value type: one of list, register, delete. Default: N/A. Configures the data store.

-i: Value type: string. Default: N/A. Required for the register option. The name of the property file.

-n: Value type: string. Default: N/A. Required for the delete command. The ID of the datastore.
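Putting the three config subcommands together, the following sketch shows a typical datastore lifecycle from the CLI node, assuming a property file named nfs.property like the sample shown earlier.

$ cd /usr/local/gphd/bulkloader-1.0/cli/bin
# Register a datastore from a property file
$ ./bulkloader config --command register -i nfs.property
# List the registered datastores and their IDs
$ ./bulkloader config --command list
# Delete a datastore by its ID
$ ./bulkloader config --command delete -n <Datastore_ID>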

Appendix B: Data Store Registration Properties

This appendix describes the properties used to register each supported data store.

FTP Data Store Registration Properties

- type: ftp
- host: The name or IP address of the FTP server.
- rootpath: The root path of the source FTP server.
- port: The port of the FTP server. The default port is 21.
- scheme: ftp
- user: The FTP username.
- passwd: The FTP user's password.
- transfermode: The FTP transfer mode; it can be stream, block, or compressed.
- passive: Whether the FTP mode is passive.
- filetype: The file type; it can be binary or ascii.

HTTP Data Store Registration Properties

- type: http
- host: The name or IP address of the HTTP server.
- port: The port of the HTTP server. The default is 80.
- rootpath: The root path of the source HTTP server.
- scheme: http
- dfs.replication: The copy replication number in the destination HDFS. Default value: 3.

HDFS Data Store Registration Properties

- type:
  - For Internal HDFS, the value is internal_hdfs_
  - For GPHD 1.1 HDFS, the value is hdfs_gphd1_1_
  - For GPHD 1.2 HDFS, the value is hdfs_gphd1_2_
  - For GPHD 1.1.0.2 HDFS, the value is hdfs_gphd1_1_02_
  - For Apache HDFS, the value is hdfs_apache_1_0_3_
- host: The name of the HDFS host; this is the same as the value in the HDFS dfs.name.dir.
- port: The port of the HDFS directory; this is the same as the value for the port in the HDFS dfs.name.dir.
- rootpath: The root path of the source HDFS cluster.
- scheme: hdfs
- dfs.replication: The copy replication number in the destination HDFS. Default value: 3.

LocalFS Data Store Registration Properties

- type: rawfs
- host: The name of a local host machine. Data is copied from this machine when you select the localfs strategy.
- rootpath: The root path of the local machine from where the data is shared.
- scheme: localfs

NFS Data Store Registration Properties

- type: nfs
- host: The NFS server IP address or host name.

- rootpath: The root path of the NFS server from where the data is shared.
- scheme: nfs
- mountpoint: The NFS mount point on Bulkloader nodes.

NFS Data Store

- mountpoint (default: /): The NFS mount point on Bulkloader nodes.

HDFS Data Store

- dfs.replication (default: 3): The copy replication number in the destination HDFS.
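As an illustration of the optional FTP properties documented above, here is a hypothetical variant of the earlier ftp.property sample; the user name and the transfermode, passive, and filetype values are example choices, not defaults from the guide.

type=ftp
host=sdw6
rootpath=/
port=21
scheme=ftp
user=ftpuser
passwd=password
transfermode=stream
passive=true
filetype=binary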

Appendix C: Greenplum Data Loader Copy Strategies

Copy Strategies

locality
  Description: This strategy applies when the source data is stored in an HDFS cluster. With the locality strategy, Greenplum Data Loader tries to deploy worker threads to HDFS datanodes so that each worker thread collocates with the data, reads from the local HDFS, and writes it to the destination.
  Supported source: HDFS
  Supported target: HDFS that supports concat

localfs
  Description: This is a locality strategy for when the source data is located on a native file system. In this case, the administrator needs to know the list of source data nodes, and Greenplum Data Loader deploys worker threads to the source data nodes.
  Supported source: native file system
  Supported target: HDFS

uniform
  Description: The uniform strategy assigns loading tasks uniformly to all the loader machines according to file size.
  Supported source: HTTP, FTP
  Supported target: HDFS

connectionlimited
  Description: When the source data is stored on an FTP/HTTP server, the server may restrict how many connections are allowed concurrently. When the connection count exceeds the allowance, the server does not respond to data download requests. The connectionlimited strategy is provided for this scenario. You can choose this strategy and specify the maximum connection number; the strategy ensures that the number of concurrent workers does not exceed the threshold.
  Supported source: all data sources
  Supported target: HDFS

intelligent
  Description: With this strategy, Greenplum Data Loader automatically picks the suitable copy strategy for your scenario. For example, if copying from HDFS and the target HDFS supports concat, the locality strategy is selected; if copying from a local file system, the localfs strategy is selected; otherwise, the uniform strategy is used.
  Supported source: all data sources
  Supported target: HDFS

Copy strategies for different data store types

HDFS
  Copy strategies: locality (if the destination data store supports concat*), uniform, connectionlimited, dynamic, intelligent
  Policy: chunking, bandwidth-throttling, overwrite, set number of mappers, compression (if chunking is enabled)

NFS
  Copy strategies: uniform, connectionlimited, dynamic, intelligent
  Policy: chunking, bandwidth-throttling, overwrite, set number of mappers, compression (if chunking is enabled)

LocalFS
  Copy strategies: uniform, connectionlimited, dynamic, intelligent
  Policy: chunking, bandwidth-throttling, overwrite, set number of mappers, compression (if chunking is enabled)

FTP/HTTP
  Copy strategies: uniform, connectionlimited, dynamic, intelligent
  Policy: chunking, bandwidth-throttling, overwrite, set number of mappers, compression (if chunking is enabled)

* The concat feature in HDFS is the ability to concatenate two or more files into one big file.
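As an illustration of choosing a strategy at submit time, the sketch below loads from an FTP source with the connectionlimited strategy described above. The spec file name ftp.xml is taken from the earlier samples, and the mapper count of 8 is an arbitrary example value; the -m option is required with this strategy.

$ bulkloader submit -i ftp.xml -s connectionlimited -m 8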

Appendix D: Zookeeper Installation and Configuration

To install Zookeeper

1. Select the Zookeeper server machines. Typically, the number of Zookeeper servers you install should be an odd number.

2. Run the following command to install Zookeeper on each machine:

$ sudo rpm -ivh bulkloader-zookeeper-1.0-xxx.x86_64.rpm

If your architecture requires more than one Zookeeper server, run the command on each machine.

To configure Zookeeper

1. In the /var/gphd/bulkloader-1.0/runtime/zookeeper/conf directory, find the Zookeeper configuration file, zoo_sample.cfg.

2. Make a copy called zoo.cfg:

$ cp zoo_sample.cfg zoo.cfg

3. Specify the following values:

Values for the Zookeeper configuration file:
- dataDir: The directory where the snapshot is stored.
- clientPort: The port at which the clients will connect.
- server.n: One entry per Zookeeper server (the "n" is the Zookeeper server number). See the sample zoo.cfg file for an example of how the myid file records the server number of each machine.

Sample zoo.cfg file

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between

# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/data2/zookeeper
# the port at which the clients will connect
clientPort=2181
server.1=sdw3:2888:3888
server.2=sdw4:2888:3888
server.3=sdw5:2888:3888
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
# The number of snapshots to retain in dataDir
# autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
# autopurge.purgeInterval=1

4. Create a file called myid on each Zookeeper server and place it under the dataDir directory (specified in zoo.cfg). The myid file contains the server number of the machine.

5. Add the variables $ZK_HOME, $ZOOCFGDIR, and $ZK_HOME/bin to the .bashrc file.

Sample Zookeeper .bashrc file

export ZK_HOME=/var/gphd/bulkloader-1.0/runtime/zookeeper
export ZOOCFGDIR=$ZK_HOME/conf
export PATH=$PATH:$ZK_HOME/bin

6. For the changes to the .bashrc file to take effect, log out and log in again before taking the following steps. For users logged in via SSH, disconnect the SSH connection and connect again.

To start Zookeeper

1. Run the following command to start Zookeeper on each machine:

$ zkServer.sh start

2. Check that the Zookeeper server started successfully:

$ echo ruok | nc sdw3 2181
imok
$

If the Zookeeper server started successfully, the system returns the result imok.
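To confirm the whole ensemble rather than a single server, you can also ask each node for its role. This is a small sketch assuming the three servers from the sample zoo.cfg above, passwordless SSH between the nodes, and that $ZK_HOME/bin is on the PATH as configured in the .bashrc step.

# Each server should report "Mode: leader" or "Mode: follower"
$ for host in sdw3 sdw4 sdw5; do ssh $host 'zkServer.sh status'; done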

To stop Zookeeper

Run the following command to stop Zookeeper on each machine:

$ zkServer.sh stop

Appendix E: Installing and Configuring the MapReduce Cluster

To install a MapReduce Cluster

1. Run the following command to install bulkloader-hadoop-1.0-xxx.x86_64.rpm on the master machine and slave machines:

sudo rpm -ivh bulkloader-hadoop-1.0-xxx.x86_64.rpm

Skip this step if you are using an existing MapReduce installation.

2. (Optional, if you want MapReduce to use HTTPFS as the Job Tracker file system.) To install and use HTTPFS as the Job Tracker file system on your master machine:

sudo rpm -ivh bulkloader-httpfs-1.0-xxx.x86_64.rpm

To configure MapReduce to use HDFS as the Job Tracker file system

1. Change to this directory:

cd /var/gphd/bulkloader-1.0/runtime/hadoop/conf

2. Modify hadoop-env.sh to set JAVA_HOME to point to the local version of the JDK 1.6.

3. Modify core-site.xml to set the fs.default.name property:

Sample core-site.xml file for HDFS

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://smdw:8020</value>
    <final>true</final>
  </property>
</configuration>

4. Modify mapred-site.xml to set up the following properties.

mapred-site.xml file properties:
- mapred.job.tracker: Host or IP and port of the JobTracker. Should be the host or IP of the bulkloader master server.
- mapred.system.dir: Path on the HDFS where the Map/Reduce framework stores system files, e.g. /hadoop/mapred/system/.
- mapred.local.dir: Comma-separated list of paths on the local filesystem where temporary Map/Reduce data is written.
- mapred.jobtracker.taskScheduler: Used to set the task scheduler.

Sample mapred-site.xml file for HDFS

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>smdw:9051</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/hadoop/mapred/system</value>
  </property>
  <property>
    <name>mapred.tmp.dir</name>
    <value>/hadoop/mapred/temp</value>
  </property>
</configuration>

5. Ensure that the directories you specified in mapred-site.xml with the property names mapred.system.dir and mapred.tmp.dir already exist; if not, create them.

6. Modify hdfs-site.xml to set up HDFS.

hdfs-site.xml file properties:
- dfs.name.dir: The namenode directory.
- dfs.data.dir: The data directory on the datanode.
- dfs.permissions: Check that the value is set to false.

Sample hdfs-site.xml file

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/data1/bulkloader_hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data2/bulkloader_hadoop/data</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>

To configure MapReduce to use HTTPFS as the Job Tracker file system

1. Modify hadoop-env.sh to set JAVA_HOME to point to the local version of the JDK 1.6.

2. Modify core-site.xml to set the fs.default.name property. See the following sample core-site.xml file for HTTPFS for more information.

Sample core-site.xml file for HTTPFS

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>...</value>
    <final>true</final>
  </property>
</configuration>

3. Modify mapred-site.xml to set up the properties. See the following sample mapred-site.xml file for HTTPFS for more information.

Sample mapred-site.xml for HTTPFS

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>

    <value>smdw:9051</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/wangh26/mapred/system</value>
  </property>
  <property>
    <name>mapred.tmp.dir</name>
    <value>/wangh26/mapred/temp</value>
  </property>
  <property>
    <name>fs.http.impl</name>
    <value>org.apache.hadoop.fs.http.client.HttpFSFileSystem</value>
  </property>
</configuration>

4. Ensure that the directories you specified in mapred-site.xml with the property names mapred.system.dir and mapred.tmp.dir already exist; if not, create them.

To start a MapReduce Cluster with HDFS

1. Start the NameNode:

$ hadoop-daemon.sh start namenode

2. Start the JobTracker:

$ hadoop-daemon.sh start jobtracker

3. Start the DataNode:

$ hadoop-daemon.sh start datanode

4. Start the TaskTracker:

$ hadoop-daemon.sh start tasktracker

To start a MapReduce Cluster with HTTPFS

1. If your Job Tracker file system is HTTPFS, start HTTPFS before starting your MapReduce cluster.

2. On the master machine, start HTTPFS:

$ cd /usr/local/gphd/bulkloader-1.0/httpfs/bin
$ ./start.sh

3. Start the JobTracker:

$ hadoop-daemon.sh start jobtracker

4. Start the TaskTracker on the slave machines:

$ hadoop-daemon.sh start tasktracker

To stop the MapReduce Cluster with HDFS

1. Stop the NameNode:

$ hadoop-daemon.sh stop namenode

2. Stop the JobTracker:

$ hadoop-daemon.sh stop jobtracker

3. Stop the DataNode:

$ hadoop-daemon.sh stop datanode

4. Stop the TaskTracker:

$ hadoop-daemon.sh stop tasktracker

To stop the MapReduce Cluster with HTTPFS

1. Stop the HTTPFS instance running on the master node:

$ cd /usr/local/gphd/bulkloader-1.0/httpfs/bin
$ ./stop.sh

2. Stop the JobTracker:

$ hadoop-daemon.sh stop jobtracker

3. Stop the TaskTracker on the slave machines:

$ hadoop-daemon.sh stop tasktracker
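A quick way to confirm which Hadoop daemons are actually running on a node after the start or stop commands above is the JDK's jps tool; this assumes the JDK bin directory is on the PATH as set during the preparation steps.

# On the master you would expect to see JobTracker (and NameNode if using HDFS);
# on a slave, TaskTracker (and DataNode if using HDFS).
$ jps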

Appendix F: Installing and Configuring Bookkeeper

To install Bookkeeper

1. Determine which storage you use to record the Bulkloader entities:
- HDFS: If you are using HDFS to record the entities, no Bookkeeper installation is needed.
- Bookkeeper: If you are using Bookkeeper, perform the following steps.

2. Change the property in the bulkloader-common.xml configuration file.

3. Run the following command to install Bookkeeper:

$ sudo rpm -ivh bulkloader-bookkeeper-1.0-xxx.x86_64.rpm

If your architecture requires more than one Bookkeeper server, run the command on each machine.

To configure Bookkeeper

1. Configure the following properties in the bk_server.conf file:

- journalDirectory: Directory where Bookkeeper outputs its write-ahead log.
- ledgerDirectories: Directories where Bookkeeper outputs ledger snapshots.
- zkLedgersRootPath: Root Zookeeper path to store ledger metadata.
- flushInterval: Interval at which ledger index pages are flushed to disk, in milliseconds.
- zkServers: A list of one or more servers on which Zookeeper is running.
- zkTimeout: ZooKeeper client session timeout in milliseconds.

Note: The ledger dirs and the journal dir should be on different devices to reduce the contention between random I/O and sequential writes.

2. Create a directory on the Zookeeper server. The directory name is specified in the zkLedgersRootPath property in the bk_server.conf file.

3. Run the following command on one of the Zookeeper servers:

$ zkCli.sh

4. Check that the Zookeeper client can connect to the Zookeeper server.

5. Run the following command to create the Zookeeper ledgers root path:

[zk: localhost:2181(CONNECTED) 0] create /ledgers ""

You can check the newly created path with the following command:

[zk: localhost:2181(CONNECTED) 0] ls /ledgers

6. Complete the process by creating the available path as follows:

[zk: localhost:2181(CONNECTED) 0] create /ledgers/available ""

Note: If your architecture requires more than one Bookkeeper server, run the command on each machine.

Sample bk_server.conf file

## Bookie settings
# Port that the bookie server listens on
bookiePort=3181
# Directory Bookkeeper outputs its write ahead log
journalDirectory=/data1/bookkeeper/bk-txn
# Directory Bookkeeper outputs ledger snapshots
# could define multiple directories to store snapshots, separated by ','
# For example:
# ledgerDirectories=/tmp/bk1-data,/tmp/bk2-data
#
# Ideally ledger dirs and journal dir are each on a different device,
# which reduces the contention between random I/O and sequential writes.
# It is possible to run with a single disk, but performance will be significantly lower.
ledgerDirectories=/data2/bookkeeper/bk-data
# Root zookeeper path to store ledger metadata
# This parameter is used by the zookeeper-based ledger manager as a root znode to
# store all ledgers.
zkLedgersRootPath=/ledgers
# How long the interval to flush ledger index pages to disk, in milliseconds
# Flushing index files will introduce much random disk I/O.
# If separating journal dir and ledger dirs each on different devices,
# flushing would not affect performance. But if putting journal dir
# and ledger dirs on the same device, performance degrades significantly
# with too frequent flushing. You can consider incrementing the flush interval
# to get better performance, but you need to pay more time on bookie
# server restart after failure.
flushInterval=100

## zookeeper client settings

# A list of one or more servers on which zookeeper is running.
# The server list can be comma separated values, for example:
# zkServers=zk1:2181,zk2:2181,zk3:2181
zkServers=sdw3:2181,sdw4:2181,sdw5:2181
# ZooKeeper client session timeout in milliseconds
# The bookie server will exit if it receives SESSION_EXPIRED because it
# was partitioned off from ZooKeeper for more than the session timeout.
# JVM garbage collection and disk I/O can cause SESSION_EXPIRED.
# Incrementing this value can help avoid this issue.
zkTimeout=

7. Add the $BK_HOME and $BK_HOME/bin variables to the .bashrc file.

Sample Bookkeeper ~/.bashrc file

export BK_HOME=/var/gphd/bulkloader-1.0/bk
export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin:$ZK_HOME/bin:$BK_HOME/bin

To start Bookkeeper

1. Run the following command to start the Bookkeeper server on each Bookkeeper server machine:

$ bookkeeper bookie > book.log 2>&1 &

2. Use the following command to check that the Bookkeeper server is running:

$ ps -ef | grep BookieServer

To stop Bookkeeper

Kill the Bookkeeper process.
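Since the guide stops the bookie by killing its process, one way to do that on each Bookkeeper machine is sketched below; matching on the BookieServer class name with pkill is an assumption on our part, not a command from the original guide.

# Find the bookie process and stop it
$ ps -ef | grep BookieServer
$ pkill -f BookieServer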

Appendix G: Sample Deployment Topology

This appendix contains two deployment samples:
- Using an existing MapReduce cluster and the associated HDFS.
- Installing a dedicated MapReduce cluster and using the JobTracker file system.

Using an Existing MapReduce Cluster

An existing MapReduce cluster should already have an associated JobTracker HDFS. You can reuse this HDFS as a source or destination data store.

Installing a Dedicated MapReduce Cluster

If you install a dedicated MapReduce cluster, Greenplum Data Loader uses the associated JobTracker file system. This file system can be configured using HDFS or HTTPFS.


More information

Tuning Enterprise Information Catalog Performance

Tuning Enterprise Information Catalog Performance Tuning Enterprise Information Catalog Performance Copyright Informatica LLC 2015, 2018. Informatica and the Informatica logo are trademarks or registered trademarks of Informatica LLC in the United States

More information

Deploying Custom Step Plugins for Pentaho MapReduce

Deploying Custom Step Plugins for Pentaho MapReduce Deploying Custom Step Plugins for Pentaho MapReduce This page intentionally left blank. Contents Overview... 1 Before You Begin... 1 Pentaho MapReduce Configuration... 2 Plugin Properties Defined... 2

More information

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours) Bigdata Fundamentals Day1: (2hours) 1. Understanding BigData. a. What is Big Data? b. Big-Data characteristics. c. Challenges with the traditional Data Base Systems and Distributed Systems. 2. Distributions:

More information

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela

More information

Hadoop On Demand User Guide

Hadoop On Demand User Guide Table of contents 1 Introduction...3 2 Getting Started Using HOD... 3 2.1 A typical HOD session... 3 2.2 Running hadoop scripts using HOD...5 3 HOD Features... 6 3.1 Provisioning and Managing Hadoop Clusters...6

More information

Getting Started with Hadoop

Getting Started with Hadoop Getting Started with Hadoop May 28, 2018 Michael Völske, Shahbaz Syed Web Technology & Information Systems Bauhaus-Universität Weimar 1 webis 2018 What is Hadoop Started in 2004 by Yahoo Open-Source implementation

More information

HOD User Guide. Table of contents

HOD User Guide. Table of contents Table of contents 1 Introduction...3 2 Getting Started Using HOD... 3 2.1 A typical HOD session... 3 2.2 Running hadoop scripts using HOD...5 3 HOD Features... 6 3.1 Provisioning and Managing Hadoop Clusters...6

More information

Session 1 Big Data and Hadoop - Overview. - Dr. M. R. Sanghavi

Session 1 Big Data and Hadoop - Overview. - Dr. M. R. Sanghavi Session 1 Big Data and Hadoop - Overview - Dr. M. R. Sanghavi Acknowledgement Prof. Kainjan M. Sanghavi For preparing this prsentation This presentation is available on my blog https://maheshsanghavi.wordpress.com/expert-talk-fdp-workshop/

More information

Distributed Filesystem

Distributed Filesystem Distributed Filesystem 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributing Code! Don t move data to workers move workers to the data! - Store data on the local disks of nodes in the

More information

VMware vsphere Big Data Extensions Command-Line Interface Guide

VMware vsphere Big Data Extensions Command-Line Interface Guide VMware vsphere Big Data Extensions Command-Line Interface Guide vsphere Big Data Extensions 1.0 This document supports the version of each product listed and supports all subsequent versions until the

More information

Elixir Ambience Installation Guide

Elixir Ambience Installation Guide Elixir Ambience Installation Guide Release 2.5.0 Elixir Technology Pte Ltd Elixir Ambience Installation Guide: Release 2.5.0 Elixir Technology Pte Ltd Published 2013 Copyright 2013 Elixir Technology Pte

More information

Linux Administration

Linux Administration Linux Administration This course will cover all aspects of Linux Certification. At the end of the course delegates will have the skills required to administer a Linux System. It is designed for professionals

More information

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017 Hadoop File System 1 S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y Moving Computation is Cheaper than Moving Data Motivation: Big Data! What is BigData? - Google

More information

50 Must Read Hadoop Interview Questions & Answers

50 Must Read Hadoop Interview Questions & Answers 50 Must Read Hadoop Interview Questions & Answers Whizlabs Dec 29th, 2017 Big Data Are you planning to land a job with big data and data analytics? Are you worried about cracking the Hadoop job interview?

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

Vendor: Cloudera. Exam Code: CCA-505. Exam Name: Cloudera Certified Administrator for Apache Hadoop (CCAH) CDH5 Upgrade Exam.

Vendor: Cloudera. Exam Code: CCA-505. Exam Name: Cloudera Certified Administrator for Apache Hadoop (CCAH) CDH5 Upgrade Exam. Vendor: Cloudera Exam Code: CCA-505 Exam Name: Cloudera Certified Administrator for Apache Hadoop (CCAH) CDH5 Upgrade Exam Version: Demo QUESTION 1 You have installed a cluster running HDFS and MapReduce

More information

Distributed Systems 16. Distributed File Systems II

Distributed Systems 16. Distributed File Systems II Distributed Systems 16. Distributed File Systems II Paul Krzyzanowski pxk@cs.rutgers.edu 1 Review NFS RPC-based access AFS Long-term caching CODA Read/write replication & disconnected operation DFS AFS

More information

Batch Inherence of Map Reduce Framework

Batch Inherence of Map Reduce Framework Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.287

More information

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals

More information

Hadoop MapReduce Framework

Hadoop MapReduce Framework Hadoop MapReduce Framework Contents Hadoop MapReduce Framework Architecture Interaction Diagram of MapReduce Framework (Hadoop 1.0) Interaction Diagram of MapReduce Framework (Hadoop 2.0) Hadoop MapReduce

More information

Big Data Analytics by Using Hadoop

Big Data Analytics by Using Hadoop Governors State University OPUS Open Portal to University Scholarship All Capstone Projects Student Capstone Projects Spring 2015 Big Data Analytics by Using Hadoop Chaitanya Arava Governors State University

More information

Hortonworks Data Platform

Hortonworks Data Platform Hortonworks Data Platform Workflow Management (August 31, 2017) docs.hortonworks.com Hortonworks Data Platform: Workflow Management Copyright 2012-2017 Hortonworks, Inc. Some rights reserved. The Hortonworks

More information

ZooKeeper Getting Started Guide

ZooKeeper Getting Started Guide by Table of contents 1 Getting Started: Coordinating Distributed Applications with ZooKeeper...2 1.1 Pre-requisites... 2 1.2 Download... 2 1.3 Standalone Operation... 2 1.4 Managing ZooKeeper Storage...3

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

NexentaStor VVOL

NexentaStor VVOL NexentaStor 5.1.1 VVOL Admin Guide Date: January, 2018 Software Version: NexentaStor 5.1.1 VVOL Part Number: 3000-VVOL-5.1.1-000065-A Table of Contents Preface... 3 Intended Audience 3 References 3 Document

More information

The Analysis and Implementation of the K - Means Algorithm Based on Hadoop Platform

The Analysis and Implementation of the K - Means Algorithm Based on Hadoop Platform Computer and Information Science; Vol. 11, No. 1; 2018 ISSN 1913-8989 E-ISSN 1913-8997 Published by Canadian Center of Science and Education The Analysis and Implementation of the K - Means Algorithm Based

More information

Dell Storage Compellent Integration Tools for VMware

Dell Storage Compellent Integration Tools for VMware Dell Storage Compellent Integration Tools for VMware Version 4.0 Administrator s Guide Notes, Cautions, and Warnings NOTE: A NOTE indicates important information that helps you make better use of your

More information

Isilon InsightIQ. Version Installation Guide

Isilon InsightIQ. Version Installation Guide Isilon InsightIQ Version 4.1.0 Installation Guide Copyright 2009-2016 EMC Corporation All rights reserved. Published October 2016 Dell believes the information in this publication is accurate as of its

More information

Xcalar Installation Guide

Xcalar Installation Guide Xcalar Installation Guide Publication date: 2018-03-16 www.xcalar.com Copyright 2018 Xcalar, Inc. All rights reserved. Table of Contents Xcalar installation overview 5 Audience 5 Overview of the Xcalar

More information

Installation Guide. Community release

Installation Guide. Community release Installation Guide Community 151 release This document details step-by-step deployment procedures, system and environment requirements to assist Jumbune deployment 1 P a g e Table of Contents Introduction

More information

Hadoop. copyright 2011 Trainologic LTD

Hadoop. copyright 2011 Trainologic LTD Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides

More information

Table of Contents HOL-SDC-1409

Table of Contents HOL-SDC-1409 Table of Contents Lab Overview - - vsphere Big Data Extensions... 2 Lab Guidance... 3 Verify Hadoop Clusters are Running... 5 Module 1 - Hadoop POC In Under an Hour (45 Min)... 9 Module Overview... 10

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Dell Storage Integration Tools for VMware

Dell Storage Integration Tools for VMware Dell Storage Integration Tools for VMware Version 4.1 Administrator s Guide Notes, cautions, and warnings NOTE: A NOTE indicates important information that helps you make better use of your product. CAUTION:

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

Hadoop-PR Hortonworks Certified Apache Hadoop 2.0 Developer (Pig and Hive Developer)

Hadoop-PR Hortonworks Certified Apache Hadoop 2.0 Developer (Pig and Hive Developer) Hortonworks Hadoop-PR000007 Hortonworks Certified Apache Hadoop 2.0 Developer (Pig and Hive Developer) http://killexams.com/pass4sure/exam-detail/hadoop-pr000007 QUESTION: 99 Which one of the following

More information

Installing Hadoop / Yarn, Hive 2.1.0, Scala , and Spark 2.0 on Raspberry Pi Cluster of 3 Nodes. By: Nicholas Propes 2016

Installing Hadoop / Yarn, Hive 2.1.0, Scala , and Spark 2.0 on Raspberry Pi Cluster of 3 Nodes. By: Nicholas Propes 2016 Installing Hadoop 2.7.3 / Yarn, Hive 2.1.0, Scala 2.11.8, and Spark 2.0 on Raspberry Pi Cluster of 3 Nodes By: Nicholas Propes 2016 1 NOTES Please follow instructions PARTS in order because the results

More information

Hadoop. Introduction to BIGDATA and HADOOP

Hadoop. Introduction to BIGDATA and HADOOP Hadoop Introduction to BIGDATA and HADOOP What is Big Data? What is Hadoop? Relation between Big Data and Hadoop What is the need of going ahead with Hadoop? Scenarios to apt Hadoop Technology in REAL

More information

HADOOP FRAMEWORK FOR BIG DATA

HADOOP FRAMEWORK FOR BIG DATA HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further

More information

VMware vsphere Big Data Extensions Administrator's and User's Guide

VMware vsphere Big Data Extensions Administrator's and User's Guide VMware vsphere Big Data Extensions Administrator's and User's Guide vsphere Big Data Extensions 1.1 This document supports the version of each product listed and supports all subsequent versions until

More information

Hadoop Map Reduce 10/17/2018 1

Hadoop Map Reduce 10/17/2018 1 Hadoop Map Reduce 10/17/2018 1 MapReduce 2-in-1 A programming paradigm A query execution engine A kind of functional programming We focus on the MapReduce execution engine of Hadoop through YARN 10/17/2018

More information

Before proceeding with this tutorial, you must have a good understanding of Core Java and any of the Linux flavors.

Before proceeding with this tutorial, you must have a good understanding of Core Java and any of the Linux flavors. About the Tutorial Storm was originally created by Nathan Marz and team at BackType. BackType is a social analytics company. Later, Storm was acquired and open-sourced by Twitter. In a short time, Apache

More information

Running Kmeans Spark on EC2 Documentation

Running Kmeans Spark on EC2 Documentation Running Kmeans Spark on EC2 Documentation Pseudo code Input: Dataset D, Number of clusters k Output: Data points with cluster memberships Step1: Read D from HDFS as RDD Step 2: Initialize first k data

More information

Cloud Computing II. Exercises

Cloud Computing II. Exercises Cloud Computing II Exercises Exercise 1 Creating a Private Cloud Overview In this exercise, you will install and configure a private cloud using OpenStack. This will be accomplished using a singlenode

More information

SAS. High- Performance Analytics Infrastructure 1.6 Installation and Configuration Guide. SAS Documentation

SAS. High- Performance Analytics Infrastructure 1.6 Installation and Configuration Guide. SAS Documentation SAS High- Performance Analytics Infrastructure 1.6 Installation and Configuration Guide SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2012. SAS

More information

9.4 Hadoop Configuration Guide for Base SAS. and SAS/ACCESS

9.4 Hadoop Configuration Guide for Base SAS. and SAS/ACCESS SAS 9.4 Hadoop Configuration Guide for Base SAS and SAS/ACCESS Second Edition SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2015. SAS 9.4 Hadoop

More information

Guidelines - Configuring PDI, MapReduce, and MapR

Guidelines - Configuring PDI, MapReduce, and MapR Guidelines - Configuring PDI, MapReduce, and MapR This page intentionally left blank. Contents Overview... 1 Set Up Your Environment... 2 Get MapR Server Information... 2 Set Up Your Host Environment...

More information

Distributed File Systems II

Distributed File Systems II Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation

More information

Data Protection Guide

Data Protection Guide SnapCenter Software 4.0 Data Protection Guide For Custom Plug-ins March 2018 215-12932_C0 doccomments@netapp.com Table of Contents 3 Contents Deciding on whether to read the SnapCenter Data Protection

More information

SAS 9.4 Hadoop Configuration Guide for Base SAS and SAS/ACCESS, Fourth Edition

SAS 9.4 Hadoop Configuration Guide for Base SAS and SAS/ACCESS, Fourth Edition SAS 9.4 Hadoop Configuration Guide for Base SAS and SAS/ACCESS, Fourth Edition SAS Documentation August 31, 2017 The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2016.

More information

MapReduce for Parallel Computing

MapReduce for Parallel Computing MapReduce for Parallel Computing Amit Jain 1/44 Big Data, Big Disks, Cheap Computers In pioneer days they used oxen for heavy pulling, and when one ox couldn t budge a log, they didn t try to grow a larger

More information

EMC Documentum External Viewing Services for SAP

EMC Documentum External Viewing Services for SAP EMC Documentum External Viewing Services for SAP Version 6.0 Administration Guide P/N 300 005 459 Rev A01 EMC Corporation Corporate Headquarters: Hopkinton, MA 01748 9103 1 508 435 1000 www.emc.com Copyright

More information

HOD Scheduler. Table of contents

HOD Scheduler. Table of contents Table of contents 1 Introduction...2 2 HOD Users...2 2.1 Getting Started...2 2.2 HOD Features...5 2.3 Troubleshooting...14 3 HOD Administrators...21 3.1 Getting Started...21 3.2 Prerequisites... 22 3.3

More information

HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation)

HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation) HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation) Introduction to BIGDATA and HADOOP What is Big Data? What is Hadoop? Relation between Big

More information

Big Data and Scripting map reduce in Hadoop

Big Data and Scripting map reduce in Hadoop Big Data and Scripting map reduce in Hadoop 1, 2, connecting to last session set up a local map reduce distribution enable execution of map reduce implementations using local file system only all tasks

More information

How to Install and Configure EBF15545 for MapR with MapReduce 2

How to Install and Configure EBF15545 for MapR with MapReduce 2 How to Install and Configure EBF15545 for MapR 4.0.2 with MapReduce 2 1993-2015 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic,

More information

Veritas NetBackup Backup, Archive, and Restore Getting Started Guide. Release 8.1.2

Veritas NetBackup Backup, Archive, and Restore Getting Started Guide. Release 8.1.2 Veritas NetBackup Backup, Archive, and Restore Getting Started Guide Release 8.1.2 Veritas NetBackup Backup, Archive, and Restore Getting Started Guide Last updated: 2018-09-19 Legal Notice Copyright 2017

More information

How to Configure Big Data Management 10.1 for MapR 5.1 Security Features

How to Configure Big Data Management 10.1 for MapR 5.1 Security Features How to Configure Big Data Management 10.1 for MapR 5.1 Security Features 2014, 2016 Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying,

More information

VMware vsphere Big Data Extensions Command-Line Interface Guide

VMware vsphere Big Data Extensions Command-Line Interface Guide VMware vsphere Big Data Extensions Command-Line Interface Guide vsphere Big Data Extensions 2.1 This document supports the version of each product listed and supports all subsequent versions until the

More information

About ADS 1.1 ADS comprises the following components: HAWQ PXF MADlib

About ADS 1.1 ADS comprises the following components: HAWQ PXF MADlib Rev: A02 Updated: July 15, 2013 Welcome to Pivotal Advanced Database Services 1.1 Pivotal Advanced Database Services (ADS), extends Pivotal Hadoop (HD) Enterprise, adding rich, proven parallel SQL processing

More information

Outline Introduction Big Data Sources of Big Data Tools HDFS Installation Configuration Starting & Stopping Map Reduc.

Outline Introduction Big Data Sources of Big Data Tools HDFS Installation Configuration Starting & Stopping Map Reduc. D. Praveen Kumar Junior Research Fellow Department of Computer Science & Engineering Indian Institute of Technology (Indian School of Mines) Dhanbad, Jharkhand, India Head of IT & ITES, Skill Subsist Impels

More information

Logging on to the Hadoop Cluster Nodes. To login to the Hadoop cluster in ROGER, a user needs to login to ROGER first, for example:

Logging on to the Hadoop Cluster Nodes. To login to the Hadoop cluster in ROGER, a user needs to login to ROGER first, for example: Hadoop User Guide Logging on to the Hadoop Cluster Nodes To login to the Hadoop cluster in ROGER, a user needs to login to ROGER first, for example: ssh username@roger-login.ncsa. illinois.edu after entering

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

High-performance computing on Microsoft Azure: GlusterFS

High-performance computing on Microsoft Azure: GlusterFS High-performance computing on Microsoft Azure: GlusterFS Introduction to creating an Azure HPC cluster and HPC storage Azure Customer Advisory Team (AzureCAT) April 2018 Contents Introduction... 3 Quick

More information

Dynamic Hadoop Clusters

Dynamic Hadoop Clusters Dynamic Hadoop Clusters Steve Loughran Julio Guijarro 2009 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice 2 25 March, 2009 Hadoop on a cluster

More information

VMware vsphere Big Data Extensions Command-Line Interface Guide

VMware vsphere Big Data Extensions Command-Line Interface Guide VMware vsphere Big Data Extensions Command-Line Interface Guide vsphere Big Data Extensions 2.0 This document supports the version of each product listed and supports all subsequent versions until the

More information

Commands Manual. Table of contents

Commands Manual. Table of contents Table of contents 1 Overview...2 1.1 Generic Options...2 2 User Commands...3 2.1 archive... 3 2.2 distcp...3 2.3 fs... 3 2.4 fsck... 3 2.5 jar...4 2.6 job...4 2.7 pipes...5 2.8 version... 6 2.9 CLASSNAME...6

More information

TECHNICAL OVERVIEW OF NEW AND IMPROVED FEATURES OF EMC ISILON ONEFS 7.1.1

TECHNICAL OVERVIEW OF NEW AND IMPROVED FEATURES OF EMC ISILON ONEFS 7.1.1 TECHNICAL OVERVIEW OF NEW AND IMPROVED FEATURES OF EMC ISILON ONEFS 7.1.1 ABSTRACT This introductory white paper provides a technical overview of the new and improved enterprise grade features introduced

More information

ClarityNow Best Practices Guide

ClarityNow Best Practices Guide ClarityNow Best Practices Guide Abstract A guide containing ClarityNow best practices and recommendations for common deployment to help avoid difficulties. Document includes descriptions of some default

More information

EMC Documentum Composer

EMC Documentum Composer EMC Documentum Composer Version 6.5 SP2 User Guide P/N 300-009-462 A01 EMC Corporation Corporate Headquarters: Hopkinton, MA 01748-9103 1-508-435-1000 www.emc.com Copyright 2008 2009 EMC Corporation. All

More information

Getting Started with Hadoop/YARN

Getting Started with Hadoop/YARN Getting Started with Hadoop/YARN Michael Völske 1 April 28, 2016 1 michael.voelske@uni-weimar.de Michael Völske Getting Started with Hadoop/YARN April 28, 2016 1 / 66 Outline Part One: Hadoop, HDFS, and

More information

<Partner Name> <Partner Product> RSA Ready Implementation Guide for. MapR Converged Data Platform 3.1

<Partner Name> <Partner Product> RSA Ready Implementation Guide for. MapR Converged Data Platform 3.1 RSA Ready Implementation Guide for MapR Jeffrey Carlson, RSA Partner Engineering Last Modified: 02/25/2016 Solution Summary RSA Analytics Warehouse provides the capacity

More information

Data Protection Guide

Data Protection Guide SnapCenter Software 4.0 Data Protection Guide For VMs and Datastores using the SnapCenter Plug-in for VMware vsphere March 2018 215-12931_C0 doccomments@netapp.com Table of Contents 3 Contents Deciding

More information

Hortonworks HDPCD. Hortonworks Data Platform Certified Developer. Download Full Version :

Hortonworks HDPCD. Hortonworks Data Platform Certified Developer. Download Full Version : Hortonworks HDPCD Hortonworks Data Platform Certified Developer Download Full Version : https://killexams.com/pass4sure/exam-detail/hdpcd QUESTION: 97 You write MapReduce job to process 100 files in HDFS.

More information

Big Data Retrieving Required Information From Text Files Desmond Hill Yenumula B Reddy (Advisor)

Big Data Retrieving Required Information From Text Files Desmond Hill Yenumula B Reddy (Advisor) Big Data Retrieving Required Information From Text Files Desmond Hill Yenumula B Reddy (Advisor) 1 OUTLINE Objective What is Big data Characteristics of Big Data Setup Requirements Hadoop Setup Word Count

More information

How to Install and Configure EBF14514 for IBM BigInsights 3.0

How to Install and Configure EBF14514 for IBM BigInsights 3.0 How to Install and Configure EBF14514 for IBM BigInsights 3.0 2014 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying,

More information

Gain Insights From Unstructured Data Using Pivotal HD. Copyright 2013 EMC Corporation. All rights reserved.

Gain Insights From Unstructured Data Using Pivotal HD. Copyright 2013 EMC Corporation. All rights reserved. Gain Insights From Unstructured Data Using Pivotal HD 1 Traditional Enterprise Analytics Process 2 The Fundamental Paradigm Shift Internet age and exploding data growth Enterprises leverage new data sources

More information

EMC Greenplum Data Computing Appliance to x Software Upgrade Guide. Rev: A02

EMC Greenplum Data Computing Appliance to x Software Upgrade Guide. Rev: A02 EMC Greenplum Data Computing Appliance 1.2.0.1 to 1.2.1.x Software Upgrade Guide Rev: A02 Copyright 2013 EMC Corporation. All rights reserved. EMC believes the information in this publication is accurate

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

Commands Guide. Table of contents

Commands Guide. Table of contents Table of contents 1 Overview...2 1.1 Generic Options...2 2 User Commands...3 2.1 archive... 3 2.2 distcp...3 2.3 fs... 3 2.4 fsck... 3 2.5 jar...4 2.6 job...4 2.7 pipes...5 2.8 queue...6 2.9 version...

More information

SAS Viya 3.2 and SAS/ACCESS : Hadoop Configuration Guide

SAS Viya 3.2 and SAS/ACCESS : Hadoop Configuration Guide SAS Viya 3.2 and SAS/ACCESS : Hadoop Configuration Guide SAS Documentation July 6, 2017 The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2017. SAS Viya 3.2 and SAS/ACCESS

More information

EMC Voyence Integration Adaptor

EMC Voyence Integration Adaptor EMC Voyence Integration Adaptor Version 2.0.0 EMC SMARTS P/N 300-007-379 REV A03 EMC Corporation Corporate Headquarters Hopkinton, MA 01748-9103 1-508-435-1000 www.emc.com COPYRIGHT Copyright 2008 EMC

More information