Integrating with Apache Hadoop


HPE Vertica Analytic Database
Software Version: 7.2.x
Document Release Date: 10/10/2017

Legal Notices

Warranty

The only warranties for Hewlett Packard Enterprise products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HPE shall not be liable for technical or editorial errors or omissions contained herein. The information contained herein is subject to change without notice.

Restricted Rights Legend

Confidential computer software. Valid license from HPE required for possession, use or copying. Consistent with FAR, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial license.

Copyright Notice

Copyright Hewlett Packard Enterprise Development LP

Trademark Notices

Adobe is a trademark of Adobe Systems Incorporated.

Apache Hadoop and Hadoop are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

Microsoft and Windows are U.S. registered trademarks of Microsoft Corporation.

UNIX is a registered trademark of The Open Group.

This product includes an interface of the 'zlib' general purpose compression library, which is Copyright Jean-loup Gailly and Mark Adler.

Contents

Introduction to Hadoop Integration
    Hadoop Distributions
    Integration Options
    File Paths
Cluster Layout
    Co-Located Clusters
    Hardware Recommendations
    Configuring Hadoop for Co-Located Clusters
    webhdfs
    YARN
    Hadoop Balancer
    Replication Factor
    Disk Space for Non-HDFS Use
    Separate Clusters
Choosing Which Hadoop Interface to Use
    Creating an HDFS Storage Location
    Reading ORC and Parquet Files
    Using the HCatalog Connector
    Using the HDFS Connector
    Using the MapReduce Connector
Using Kerberos with Hadoop
    How Vertica uses Kerberos With Hadoop
    User Authentication
    Vertica Authentication
    See Also
    Configuring Kerberos
    Prerequisite: Setting Up Users and the Keytab File
    HCatalog Connector
    HDFS Connector
    HDFS Storage Location
    Token Expiration
    See Also
Reading Native Hadoop File Formats
    Requirements
    Creating External Tables
    Loading Data
    Supported Data Types
    Kerberos Authentication
    Examples
    See Also
    Query Performance
    Considerations When Writing Files
    Predicate Pushdown
    Data Locality
    Configuring hdfs:/// Access
    Troubleshooting Reads from Native File Formats
    webhdfs Error When Using hdfs URIs
    Reads from Parquet Files Report Unexpected Data-Type Mismatches
    Time Zones in Timestamp Values Are Not Correct
    Some Date and Timestamp Values Are Wrong by Several Days
    Error 7087: Wrong Number of Columns
Using the HCatalog Connector
    Hive, HCatalog, and WebHCat Overview
    HCatalog Connection Features
    HCatalog Connection Considerations
    How the HCatalog Connector Works
    HCatalog Connector Requirements
    Vertica Requirements
    Hadoop Requirements
    Testing Connectivity
    Installing the Java Runtime on Your Vertica Cluster
    Installing a Java Runtime
    Setting the JavaBinaryForUDx Configuration Parameter
    Configuring Vertica for HCatalog
    Copy Hadoop Libraries and Configuration Files
    Install the HCatalog Connector
    Upgrading to a New Version of Vertica
    Additional Options for Native File Formats
    Using the HCatalog Connector with HA NameNode
    Defining a Schema Using the HCatalog Connector
    Querying Hive Tables Using HCatalog Connector
    Viewing Hive Schema and Table Metadata
    Synchronizing an HCatalog Schema or Table With a Local Schema or Table
    Examples
    Data Type Conversions from Hive to Vertica
    Data-Width Handling Differences Between Hive and Vertica
    Using Non-Standard SerDes
    Determining Which SerDe You Need
    Installing the SerDe on the Vertica Cluster
    Troubleshooting HCatalog Connector Problems
    Connection Errors
    UDx Failure When Querying Data: Error
    SerDe Errors
    Differing Results Between Hive and Vertica Queries
    Preventing Excessive Query Delays
Using the HDFS Connector
    HDFS Connector Requirements
    Uninstall Prior Versions of the HDFS Connector
    webhdfs Requirements
    Kerberos Authentication Requirements
    Testing Your Hadoop webhdfs Configuration
    Loading Data Using the HDFS Connector
    The HDFS File URL
    Copying Files in Parallel
    Viewing Rejected Rows and Exceptions
    Creating an External Table with an HDFS Source
    Load Errors in External Tables
    HDFS Connector Troubleshooting Tips
    User Unable to Connect to Kerberos-Authenticated Hadoop Cluster
    Resolving Error Transfer Rate Errors
    Error Loading Many Files
Using HDFS Storage Locations
    Storage Location for HDFS Requirements
    HDFS Space Requirements
    Additional Requirements for Backing Up Data Stored on HDFS
    How the HDFS Storage Location Stores Data
    What You Can Store on HDFS
    What HDFS Storage Locations Cannot Do
    Creating an HDFS Storage Location
    Creating a Storage Location Using Vertica for SQL on Apache Hadoop
    Adding HDFS Storage Locations to New Nodes
    Creating a Storage Policy for HDFS Storage Locations
    Storing an Entire Table in an HDFS Storage Location
    Storing Table Partitions in HDFS
    Moving Partitions to a Table Stored on HDFS
    Backing Up Vertica Storage Locations for HDFS
    Configuring Vertica to Restore HDFS Storage Locations
    Configuration Overview
    Installing a Java Runtime
    Finding Your Hadoop Distribution's Package Repository
    Configuring Vertica Nodes to Access the Hadoop Distribution's Package Repository
    Installing the Required Hadoop Packages
    Setting Configuration Parameters
    Setting Kerberos Parameters
    Confirming that distcp Runs
    Troubleshooting
    Configuring Hadoop and Vertica to Enable Backup of HDFS Storage
    Granting Superuser Status on Hortonworks
    Granting Superuser Status on Cloudera
    Manually Enabling Snapshotting for a Directory
    Additional Requirements for Kerberos
    Testing the Database Account's Ability to Make HDFS Directories Snapshottable
    Performing Backups Containing HDFS Storage Locations
    Removing HDFS Storage Locations
    Removing Existing Data from an HDFS Storage Location
    Moving Data to Another Storage Location
    Clearing Storage Policies
    Changing the Usage of HDFS Storage Locations
    Dropping an HDFS Storage Location
    Removing Storage Location Files from HDFS
    Removing Backup Snapshots
    Removing the Storage Location Directories
    Troubleshooting HDFS Storage Locations
    HDFS Storage Disk Consumption
    Kerberos Authentication When Creating a Storage Location
    Backup or Restore Fails When Using Kerberos
Using the MapReduce Connector
    MapReduce Connector Features
    Prerequisites
    Hadoop and Vertica Cluster Scaling
    Installing the Connector
    Accessing Vertica Data From Hadoop
    Selecting VerticaInputFormat
    Setting the Query to Retrieve Data From Vertica
    Using a Simple Query to Extract Data From Vertica
    Using a Parameterized Query and Parameter Lists
    Using a Discrete List of Values
    Using a Collection Object
    Scaling Parameter Lists for the Hadoop Cluster
    Using a Query to Retrieve Parameter Values for a Parameterized Query
    Writing a Map Class That Processes Vertica Data
    Working with the VerticaRecord Class
    Writing Data to Vertica From Hadoop
    Configuring Hadoop to Output to Vertica
    Defining the Output Table
    Writing the Reduce Class
    Storing Data in the VerticaRecord
    Passing Parameters to the Vertica Connector for Hadoop Map Reduce At Run Time
    Specifying the Location of the Connector.jar File
    Specifying the Database Connection Parameters
    Parameters for a Separate Output Database
    Example Vertica Connector for Hadoop Map Reduce Application
    Compiling and Running the Example Application
    Compiling the Example (optional)
    Running the Example Application
    Verifying the Results
    Using Hadoop Streaming with the Vertica Connector for Hadoop Map Reduce
    Reading Data From Vertica in a Streaming Hadoop Job
    Writing Data to Vertica in a Streaming Hadoop Job
    Loading a Text File From HDFS into Vertica
    Accessing Vertica From Pig
    Registering the Vertica.jar Files
    Reading Data From Vertica
    Writing Data to Vertica
Integrating Vertica with the MapR Distribution of Hadoop
Send Documentation Feedback

Introduction to Hadoop Integration

Apache Hadoop, like Vertica, uses a cluster of nodes for distributed processing. The primary component of interest is HDFS, the Hadoop Distributed File System. You can use HDFS from Vertica in several ways:

- You can import HDFS data into locally-stored ROS files.
- You can access HDFS data in place, using external tables.
- You can use HDFS as a storage location for ROS files.

Hadoop includes two other components of interest:

- Hive, a data warehouse that provides the ability to query data stored in Hadoop.
- HCatalog, a component that makes Hive metadata available to applications, such as Vertica, outside of Hadoop.

A Hadoop cluster can use Kerberos authentication to protect data stored in HDFS. Vertica integrates with Kerberos to access HDFS data if needed. See Using Kerberos with Hadoop.

Hadoop Distributions

Vertica can be used with Hadoop distributions from Hortonworks, Cloudera, and MapR. See Vertica Integrations for Hadoop for the specific versions that are supported.

Integration Options

Vertica supports two cluster architectures. Which one you use affects the decisions you make about integration.

- You can co-locate Vertica on some or all of your Hadoop nodes. Vertica can then take advantage of local data. This option is supported only for Vertica for SQL on Apache Hadoop.
- You can build a Vertica cluster that is separate from your Hadoop cluster. In this configuration, Vertica can fully use each of its nodes; it does not share resources with Hadoop. This option is not supported for Vertica for SQL on Apache Hadoop.

These layout options are described in Cluster Layout.

Both layouts support several interfaces for using Hadoop:

- An HDFS Storage Location uses HDFS to hold Vertica data (ROS files).
- The HCatalog Connector lets Vertica query data that is stored in a Hive database the same way you query data stored natively in a Vertica schema.
- Vertica can directly query data stored in native Hadoop file formats (ORC and Parquet). This option is faster than using the HCatalog Connector for this type of data. See Reading Native Hadoop File Formats.
- The HDFS Connector lets Vertica import HDFS data. It also lets Vertica read HDFS data as an external table without using Hive.
- The MapReduce Connector lets you create Hadoop MapReduce jobs that retrieve data from Vertica. These jobs can also insert data into Vertica.

File Paths

Hadoop file paths are generally expressed using the webhdfs scheme, such as 'webhdfs://somehost:port/opt/data/filename'. These paths are URIs, so if you need to escape a special character in a path, use URI escaping. For example:

webhdfs://somehost:port/opt/data/my%20file
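Such a path can be used directly in a statement that reads Hadoop data, for example CREATE EXTERNAL TABLE for an ORC file. The following is a minimal sketch; the host name (hadoopNameNode), port, file path, and column definitions are placeholders, and the column list must cover every column in the file:

=> CREATE EXTERNAL TABLE sales (tx_id INT, amount FLOAT)
   AS COPY FROM 'webhdfs://hadoopNameNode:50070/opt/data/sales%202016.orc' ORC;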

Cluster Layout

Vertica and Hadoop each use a cluster of nodes for distributed processing. These clusters can be co-located, meaning you run both products on the same machines, or separate.

- Co-Located Clusters are for use with Vertica for SQL on Apache Hadoop licenses.
- Separate Clusters are for use with Premium Edition and Community Edition licenses.

With either architecture, if you are using the hdfs scheme to read ORC or Parquet files, you must do some additional configuration. See Configuring hdfs:/// Access.

Co-Located Clusters

With co-located clusters, Vertica is installed on some or all of your Hadoop nodes. The Vertica nodes use a private network in addition to the public network used by all Hadoop nodes, as the following figure shows:

You might choose to place Vertica on all of your Hadoop nodes or only on some of them. If you are using HDFS Storage Locations, you should use at least three Vertica nodes, the minimum number for K-Safety. Using more Vertica nodes can improve performance because the HDFS data needed by a query is more likely to be local.

Normally, both Hadoop and Vertica use the entire node. Because this configuration uses shared nodes, you must address potential resource contention in your configuration on those nodes. See Configuring Hadoop for Co-Located Clusters for more information. No changes are needed on Hadoop-only nodes.

You can place Hadoop and Vertica clusters within a single rack, or you can span across many racks and nodes. Spreading node types across racks can improve efficiency.

Hardware Recommendations

Hadoop clusters frequently do not have identical provisioning requirements or hardware configurations. However, Vertica nodes should be equivalent in size and capability, per the best-practice standards recommended in General Hardware and OS Requirements and Recommendations in Installing Vertica.

Because Hadoop cluster specifications do not always meet these standards, Hewlett Packard Enterprise recommends the following specifications for Vertica nodes in your Hadoop cluster.

Processor: For best performance, run:
- Two-socket servers with 8-14 core CPUs, clocked at or above 2.6 GHz, for clusters over 10 TB
- Single-socket servers with 8-12 cores, clocked at or above 2.6 GHz, for clusters under 10 TB

Memory: Distribute the memory appropriately across all memory channels in the server:
- Minimum 8 GB of memory per physical CPU core in the server
- High-performance applications: GB of memory per physical core
- Type: at least DDR3-1600, preferably DDR

Storage:
- Read/write: minimum 40 MB/s per physical core of the CPU; for best performance, MB/s per physical core
- Storage post RAID: each node should have 1-9 TB. For a production setting, RAID 10 is recommended. In some cases, RAID 50 is acceptable.
- Because of the heavy compression and encoding that Vertica does, SSDs are not required. In most cases, a RAID of more, less-expensive HDDs performs just as well as a RAID of fewer SSDs.
- If you intend to use RAID 50 for your data partition, you should keep a spare node in every rack, allowing for manual failover of a Vertica node in the case of a drive failure. A Vertica node recovery is faster than a RAID 50 rebuild. Also, be sure to never put more than 10 TB compressed on any node, to keep node recovery times at an acceptable rate.

Network: 10 GB networking in almost every case. With the introduction of 10 GB over cat6a (Ethernet), the cost difference is minimal.

Configuring Hadoop for Co-Located Clusters

If you are co-locating Vertica on any HDFS nodes, there are some additional configuration requirements.

webhdfs

Hadoop has two services that can provide web access to HDFS: webhdfs and httpfs. For Vertica, you must use the webhdfs service.

YARN

The YARN service is available in newer releases of Hadoop. It performs resource management for Hadoop clusters. When co-locating Vertica on YARN-managed Hadoop nodes, you must make some changes in YARN.

HPE recommends reserving at least 16 GB of memory for Vertica on shared nodes. Reserving more will improve performance. How you do this depends on your Hadoop distribution:

- If you are using Hortonworks, create a "Vertica" node label and assign it to the nodes that are running Vertica.
- If you are using Cloudera, enable and configure static service pools.

Consult the documentation for your Hadoop distribution for details. Alternatively, you can disable YARN on the shared nodes.

Hadoop Balancer

The Hadoop Balancer can redistribute data blocks across HDFS. For many Hadoop services, this feature is useful. However, for Vertica this can reduce performance under some conditions.

If you are using HDFS storage locations, the Hadoop load balancer can move data away from the Vertica nodes that are operating on it, degrading performance. This behavior can also occur when reading ORC or Parquet files if Vertica is not running on all Hadoop nodes. (If you are using separate Vertica and Hadoop clusters, all Hadoop access is over the network, and the performance cost is less noticeable.)

To prevent the undesired movement of data blocks across the HDFS cluster, consider excluding Vertica nodes from rebalancing. See the Hadoop documentation to learn how to do this.

Replication Factor

By default, HDFS stores three copies of each data block. Vertica is generally set up to store two copies of each data item through K-Safety. Thus, lowering the replication factor to 2 can save space and still provide data protection.

To lower the number of copies HDFS stores, set HadoopFSReplication, as explained in Troubleshooting HDFS Storage Locations.

Disk Space for Non-HDFS Use

You also need to reserve some disk space for non-HDFS use. To reserve disk space using Ambari, set dfs.datanode.du.reserved to a value in the hdfs-site.xml configuration file. Setting this parameter preserves space for non-HDFS files that Vertica requires.

Separate Clusters

In the Premium Edition product, your Vertica and Hadoop clusters must be set up on separate nodes, ideally connected by a high-bandwidth network connection. This is different from the configuration for Vertica for SQL on Apache Hadoop, in which Vertica nodes are co-located on Hadoop nodes.

The following figure illustrates the configuration for separate clusters:

The network is a key performance component of any well-configured cluster. When Vertica stores data to HDFS, it writes and reads data across the network.

The layout shown in the figure calls for two networks, and there are benefits to adding a third:

- Database Private Network: Vertica uses a private network for command and control and moving data between nodes in support of its database functions. In some networks, the command and control and passing of data are split across two networks.
- Database/Hadoop Shared Network: Each Vertica node must be able to connect to each Hadoop data node and the NameNode. Hadoop best practices generally require a dedicated network for the Hadoop cluster. This is not a technical requirement, but a dedicated network improves Hadoop performance. Vertica and Hadoop should share the dedicated Hadoop network.
- Optional Client Network: Outside clients may access the clustered networks through a client network. This is not an absolute requirement, but the use of a third network that supports client connections to either Vertica or Hadoop can improve performance. If the configuration does not support a client network, then client connections should use the shared network.

Choosing Which Hadoop Interface to Use

Vertica provides several ways to interact with data stored in Hadoop. This section explains how to choose among them. Decisions about Cluster Layout can affect the decisions you make about Hadoop interfaces.

Creating an HDFS Storage Location

Using a storage location to store data in the Vertica native file format (ROS) delivers the best query performance among the available Hadoop options. (Storing ROS files on the local disk rather than in Hadoop is faster still.) If you already have data in Hadoop, however, doing this means you are importing that data into Vertica.

For co-located clusters, which do not use local file storage, you might still choose to use an HDFS storage location for better performance. You can use the HDFS Connector to load data that is already in HDFS into Vertica. For separate clusters, which use local file storage, consider using an HDFS storage location for lower-priority data.

See Using HDFS Storage Locations and Using the HDFS Connector.

Reading ORC and Parquet Files

If your data is stored in the Optimized Row Columnar (ORC) or Parquet format, Vertica can query that data directly from HDFS. This option is faster than using the HCatalog Connector, but you cannot pull schema definitions from Hive directly into the database. Vertica reads the data in place; no extra copies are made.

See Reading Native Hadoop File Formats.

Using the HCatalog Connector

The HCatalog Connector uses Hadoop services (Hive and HCatalog) to query data stored in HDFS. Like the ORC reader, it reads data in place rather than making copies. Using this interface you can read all file formats supported by Hadoop, including Parquet and ORC, and Vertica can use Hive's schema definitions. However, performance can be poor in some cases. The HCatalog Connector is also sensitive to changes in the Hadoop libraries on which it depends; upgrading your Hadoop cluster might affect your HCatalog connections.

See Using the HCatalog Connector.

Using the HDFS Connector

The HDFS Connector can be used to create and query external tables, reading the data in place rather than making copies. The HDFS Connector can be used with any data format for which a parser is available. It does not use Hive data; you have to define the table yourself. Its performance can be poor because, like the HCatalog Connector, it cannot take advantage of the benefits of columnar file formats.

See Using the HDFS Connector.

Using the MapReduce Connector

The other interfaces described in this section allow you to read Hadoop data from Vertica or create Vertica data in Hadoop. The MapReduce Connector, in contrast, allows you to integrate with Hadoop's MapReduce jobs. Use this connector to send Vertica data to MapReduce or to have MapReduce jobs create data in Vertica.

See Using the MapReduce Connector.

Using Kerberos with Hadoop

If your Hadoop cluster uses Kerberos authentication to restrict access to HDFS, you must configure Vertica to make authenticated connections. The details of this configuration vary, based on which methods you are using to access HDFS data:

- How Vertica uses Kerberos With Hadoop
- Configuring Kerberos

How Vertica uses Kerberos With Hadoop

Vertica authenticates with Hadoop in two ways that require different configurations:

- User Authentication: on behalf of the user, by passing along the user's existing Kerberos credentials, as occurs with the HDFS Connector and the HCatalog Connector.
- Vertica Authentication: on behalf of system processes (such as the Tuple Mover), by using a special Kerberos credential stored in a keytab file.

User Authentication

To use Vertica with Kerberos and Hadoop, the client user first authenticates with the Kerberos server (Key Distribution Center, or KDC) being used by the Hadoop cluster. A user might run kinit or sign in to Active Directory, for example.

A user who authenticates to a Kerberos server receives a Kerberos ticket. At the beginning of a client session, Vertica automatically retrieves this ticket. The database then uses this ticket to get a Hadoop token, which Hadoop uses to grant access. Vertica uses this token to access HDFS, such as when executing a query on behalf of the user. When the token expires, the database automatically renews it, also renewing the Kerberos ticket if necessary.

The user must have been granted permission to access the relevant files in HDFS. This permission is checked the first time Vertica reads HDFS data.

The following figure shows how the user, Vertica, Hadoop, and Kerberos interact in user authentication:

When using the HDFS Connector or the HCatalog Connector, or when reading an ORC or Parquet file stored in HDFS, Vertica uses the client identity as the preceding figure shows.

Vertica Authentication

Automatic processes, such as the Tuple Mover, do not log in the way users do. Instead, Vertica uses a special identity (principal) stored in a keytab file on every database node. (This approach is also used for Vertica clusters that use Kerberos but do not use Hadoop.) After you configure the keytab file, Vertica uses the principal residing there to automatically obtain and maintain a Kerberos ticket, much as in the client scenario. In this case, the client does not interact with Kerberos.

The following figure shows the interactions required for Vertica authentication:

Each Vertica node uses its own principal; it is common to incorporate the name of the node into the principal name. You can either create one keytab per node, containing only that node's principal, or you can create a single keytab containing all the principals and distribute the file to all nodes. Either way, the node uses its principal to get a Kerberos ticket and then uses that ticket to get a Hadoop token.

For simplicity, the preceding figure shows the full set of interactions for only one database node.

When creating HDFS storage locations, Vertica uses the principal in the keytab file, not the principal of the user issuing the CREATE LOCATION statement.

See Also

For specific configuration instructions, see Configuring Kerberos.
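For reference, a CREATE LOCATION statement that targets HDFS looks similar to the following sketch. The NameNode host (hadoopNameNode), port, directory, and label are placeholders, and the exact options your cluster requires may differ; see Using HDFS Storage Locations for the full requirements:

=> CREATE LOCATION 'webhdfs://hadoopNameNode:50070/user/dbadmin/vertica_data'
   ALL NODES USAGE 'data' LABEL 'hdfs_location';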

Configuring Kerberos

Vertica can connect with Hadoop in several ways, and how you manage Kerberos authentication varies by connection type. This documentation assumes that you are using Kerberos for both your HDFS and Vertica clusters.

Prerequisite: Setting Up Users and the Keytab File

If you have not already configured Kerberos authentication for Vertica, follow the instructions in Configure for Kerberos Authentication. In particular:

- Create one Kerberos principal per node.
- Place the keytab file(s) in the same location on each database node and set its location in KerberosKeytabFile (see Specify the Location of the Keytab File).
- Set KerberosServiceName to the name of the principal (see Inform About the Kerberos Principal).

An example of setting these parameters appears below.

HCatalog Connector

You use the HCatalog Connector to query data in Hive. Queries are executed on behalf of Vertica users. If the current user has a Kerberos key, then Vertica passes it to the HCatalog Connector automatically. Verify that all users who need access to Hive have been granted access to HDFS.

In addition, in your Hadoop configuration files (core-site.xml in most distributions), make sure that you enable all Hadoop components to impersonate the Vertica user. The easiest way to do this is to set the proxyuser property using wildcards for all users on all hosts and in all groups. Consult your Hadoop documentation for instructions. Make sure you do this before running hcatutil (see Configuring Vertica for HCatalog).

HDFS Connector

The HDFS Connector loads data from HDFS into Vertica on behalf of the user, using a User Defined Source. If the user performing the data load has a Kerberos key, then the UDS uses it to access HDFS. Verify that all users who use this connector have been granted access to HDFS.
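The Kerberos-related parameters named in the prerequisite above are ordinary configuration parameters and can be set with ALTER DATABASE, like the token-refresh parameter shown later in this section. In this sketch, the database name, keytab path, and service name are placeholders:

=> ALTER DATABASE exampledb SET KerberosKeytabFile = '/etc/vertica/vertica.keytab';
=> ALTER DATABASE exampledb SET KerberosServiceName = 'vertica';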

HDFS Storage Location

You can create a database storage location in HDFS. An HDFS storage location provides improved performance compared to other HDFS interfaces (such as the HCatalog Connector).

After you create Kerberos principals for each node, give all of them read and write permissions to the HDFS directory you will use as a storage location. If you plan to back up HDFS storage locations, take the following additional steps:

- Grant Hadoop superuser privileges to the new principals.
- Configure backups, including setting the HadoopConfigDir configuration parameter, following the instructions in Configuring Hadoop and Vertica to Enable Backup of HDFS Storage.
- Configure user impersonation to be able to restore from backups, following the instructions in "Setting Kerberos Parameters" in Configuring Vertica to Restore HDFS Storage Locations.

Because the keytab file supplies the principal used to create the location, you must have it in place before creating the storage location. After you deploy keytab files to all database nodes, use the CREATE LOCATION statement to create the storage location as usual.

Token Expiration

Vertica attempts to automatically refresh Hadoop tokens before they expire, but you can also set a minimum refresh frequency if you prefer. The HadoopFSTokenRefreshFrequency configuration parameter specifies the frequency in seconds:

=> ALTER DATABASE exampledb SET HadoopFSTokenRefreshFrequency = '86400';

If the current age of the token is greater than the value specified in this parameter, Vertica refreshes the token before accessing data stored in HDFS.

See Also

- How Vertica uses Kerberos With Hadoop
- Troubleshooting Kerberos Authentication

Reading Native Hadoop File Formats

When you create external tables or copy data into tables, you can access data in certain native Hadoop formats directly. Currently, Vertica supports the ORC (Optimized Row Columnar) and Parquet formats.

Because this approach allows you to define your tables yourself instead of fetching the metadata through WebHCat, these readers can provide slightly better performance than the HCatalog Connector. If you are already using the HCatalog Connector for other reasons, however, you might find it more convenient to use it to read data in these formats also. See Using the HCatalog Connector.

You can use the hdfs scheme to access ORC and Parquet files stored in HDFS, as explained later in this section. To use this scheme you must perform some additional configuration; see Configuring hdfs:/// Access.

Requirements

- The ORC or Parquet files must not use complex data types. All simple data types supported in Hive version 0.11 or later are supported.
- Files compressed by Hive or Impala require Zlib (GZIP) or Snappy compression. Vertica does not support LZO compression for these formats.

Creating External Tables

In the CREATE EXTERNAL TABLE AS COPY statement, specify a format of ORC or PARQUET as follows:

=> CREATE EXTERNAL TABLE tablename (columns) AS COPY FROM path ORC;
=> CREATE EXTERNAL TABLE tablename (columns) AS COPY FROM path PARQUET;

If the file resides on the local file system of the node where you issue the command: use a local file path for path. Escape special characters in file paths with backslashes.

If the file resides elsewhere in HDFS: use the hdfs:/// prefix (three slashes), and then specify the file path. Escape special characters in HDFS paths using URI encoding, for example %20 for space. Vertica automatically converts from the hdfs scheme to the webhdfs scheme if necessary. You can also directly use a webhdfs:// prefix and specify the host name, port, and file path. Using the hdfs scheme potentially provides better performance when reading files not protected by Kerberos.

When defining an external table, you must define all of the columns in the file. Unlike with some other data sources, you cannot select only the columns of interest. If you omit columns, the ORC or Parquet reader aborts with an error.

Files stored in HDFS are governed by HDFS privileges. For files stored on the local disk, however, Vertica requires that users be granted access. All users who have administrative privileges have access. For other users, you must create a storage location and grant access to it. See CREATE EXTERNAL TABLE AS COPY. HDFS privileges are still enforced, so it is safe to create a location for webhdfs://host:port. Only users who have access to both the Vertica user storage location and the HDFS directory can read from the table.

Loading Data

In the COPY statement, specify a format of ORC or PARQUET:

=> COPY tablename FROM path ORC;
=> COPY tablename FROM path PARQUET;

For files that are not local, specify ON ANY NODE to improve performance:

=> COPY t FROM 'hdfs:///opt/data/orcfile' ON ANY NODE ORC;

As with external tables, path may be a local or hdfs:/// path. Be aware that if you load from multiple files in the same COPY statement, and any of them is aborted, the entire load aborts. This behavior differs from that for delimited files, where the COPY statement loads what it can and ignores the rest.

Supported Data Types

The Vertica ORC and Parquet file readers can natively read columns of all data types supported in Hive version 0.11 and later except for complex types. If complex types such as maps are encountered, the COPY or CREATE EXTERNAL TABLE AS COPY statement aborts with an error message. The readers do not attempt to read only some columns; either the entire file is read or the operation fails. For a complete list of supported types, see HIVE Data Types.

Kerberos Authentication

If the file to be read is located on an HDFS cluster that uses Kerberos authentication, Vertica uses the current user's principal to authenticate. It does not use the database's principal.

Examples

The following example shows how you can read from all ORC files in a local directory. This example uses all supported data types.

=> CREATE EXTERNAL TABLE t (a1 TINYINT, a2 SMALLINT, a3 INT, a4 BIGINT, a5 FLOAT,
   a6 DOUBLE PRECISION, a7 BOOLEAN, a8 DATE, a9 TIMESTAMP, a10 VARCHAR(20),
   a11 VARCHAR(20), a12 CHAR(20), a13 BINARY(20), a14 DECIMAL(10,5))
   AS COPY FROM '/data/orc_test_*.orc' ORC;

The following example shows the error that is produced if the file you specify is not recognized as an ORC file:

=> CREATE EXTERNAL TABLE t (a1 TINYINT, a2 SMALLINT, a3 INT, a4 BIGINT, a5 FLOAT)
   AS COPY FROM '/data/not_an_orc_file.orc' ORC;
ERROR 0: Failed to read orc source [/data/not_an_orc_file.orc]: Not an ORC file

See Also

- Query Performance
- Troubleshooting Reads from Native File Formats

Query Performance

When working with external tables in native formats, Vertica tries to improve performance in two ways:

- Pushing query execution closer to the data so less has to be read and transmitted
- Using data locality in planning the query

Considerations When Writing Files

The decisions you make when writing ORC and Parquet files can affect performance when using them. To get the best performance from Vertica, follow these guidelines when writing your files:

- Use the latest available Hive version. (You can still read your files with earlier versions.)
- Use a large stripe size. 256 MB or greater is preferred.
- Partition the data at the table level.
- Sort the columns based on frequency of access, with most-frequently accessed columns appearing first.
- Use Snappy or Zlib/GZIP compression.

Predicate Pushdown

Predicate pushdown moves parts of the query execution closer to the data, reducing the amount of data that must be read from disk or across the network.

ORC files have three levels of indexing: file statistics, stripe statistics, and row group indexes. Predicates are applied only to the first two levels. Parquet files can have statistics in the ColumnMetaData and DataPageHeader. Predicates are applied only to the ColumnMetaData.

Predicate pushdown is automatically applied for files written with Hive version 0.14 and later. Files written with earlier versions of Hive might not contain the required statistics. When executing a query against a file that lacks these statistics, Vertica logs an EXTERNAL_PREDICATE_PUSHDOWN_NOT_SUPPORTED event in the QUERY_EVENTS system table. If you are seeing performance problems with your queries, check this table for these events.

Data Locality

In a cluster where Vertica nodes are co-located on HDFS nodes, the query can use data locality to improve performance. For Vertica to do so, both of the following conditions must exist:

- The data is on an HDFS node where a database node is also present.
- The query is not restricted to specific nodes using ON NODE.

When both these conditions exist, the query planner uses the co-located database node to read that data locally, instead of making a network call.

You can see how much data is being read locally by inspecting the query plan. The label for LoadStep(s) in the plan contains a statement of the form: "X% of ORC data matched with co-located Vertica nodes". To increase the volume of local reads, consider adding more database nodes. HDFS data, by its nature, can't be moved to specific nodes, but if you run more database nodes you increase the likelihood that a database node is local to one of the copies of the data.

Configuring hdfs:/// Access

When reading ORC or Parquet files from HDFS, you can use the hdfs scheme instead of the webhdfs scheme. Using the hdfs scheme can improve performance by bypassing the webhdfs service. To support the hdfs scheme, your Vertica nodes need access to certain Hadoop configuration files:

- If Vertica is co-located on HDFS nodes, then those files are already present. Verify that the HadoopConfDir environment variable is correctly set. Its path should include a directory containing the core-site.xml and hdfs-site.xml files.
- If Vertica is running on a separate cluster, you must copy the required files to those nodes and set the HadoopConfDir environment variable. A simple way to do so is to configure your Vertica nodes as Hadoop edge nodes. Edge nodes are used to run client applications; from Hadoop's perspective, Vertica is a client application. You can use Ambari or Cloudera Manager to configure edge nodes. For more information, see the documentation for your Hadoop vendor.

Using the hdfs scheme does not remove the need for access to the webhdfs service. The hdfs scheme is not available for all files. If hdfs is not available, then Vertica automatically uses webhdfs instead.

If you update the configuration files after starting Vertica, use the following statement to refresh them:

=> SELECT CLEAR_CACHES();

Troubleshooting Reads from Native File Formats

You might encounter the following issues when reading ORC or Parquet files.

webhdfs Error When Using hdfs URIs

When creating an external table or loading data and using the hdfs scheme, you might see errors from webhdfs failures. Such errors indicate that Vertica was not able to use the hdfs scheme and fell back to webhdfs, but that the webhdfs configuration is incorrect.

Verify that the HDFS configuration files in HadoopConfDir have the correct webhdfs configuration for your Hadoop cluster. See Configuring hdfs:/// Access for information about use of these files. See your Hadoop documentation for information about webhdfs configuration.

Reads from Parquet Files Report Unexpected Data-Type Mismatches

If a Parquet file contains a column of type STRING but the column in Vertica is of a different type, such as INT, you might see an unclear error message. In this case Vertica reports the column in the Parquet file as BYTE_ARRAY, as shown in the following example:

ERROR 0: Datatype mismatch: column 2 in the parquet_cpp source [/tmp/nation.0.parquet] has type BYTE_ARRAY, expected int

This behavior is specific to Parquet files; with an ORC file the type is correctly reported as STRING. The problem occurs because Parquet does not natively support the STRING type and uses BYTE_ARRAY for strings instead. Because the Parquet file reports its type as BYTE_ARRAY, Vertica has no way to determine if the type is actually a BYTE_ARRAY or a STRING.

Time Zones in Timestamp Values Are Not Correct

Reading time stamps from an ORC or Parquet file in Vertica might result in different values, based on the local time zone. This issue occurs because the ORC and Parquet formats do not support the SQL TIMESTAMP data type. If you define the column in your table with the TIMESTAMP data type, Vertica interprets time stamps read from ORC or Parquet files as values in the local time zone. This same behavior occurs in Hive. When this situation occurs, Vertica produces a warning at query time, such as the following:

WARNING 0: SQL TIMESTAMPTZ is more appropriate for ORC TIMESTAMP because values are stored in UTC

When creating the table in Vertica, you can avoid this issue by using the TIMESTAMPTZ data type instead of TIMESTAMP.
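For example, a table definition along the following lines avoids the warning. This is a minimal sketch; the table name, column names, and file path are placeholders, and the column list must cover every column in the file:

=> CREATE EXTERNAL TABLE events (id INT, event_time TIMESTAMPTZ)
   AS COPY FROM 'hdfs:///data/events/*.orc' ORC;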

Some Date and Timestamp Values Are Wrong by Several Days

When Hive writes ORC or Parquet files, it converts dates before 1583 from the Gregorian calendar to the Julian calendar. Vertica does not perform this conversion. If your file contains dates before this time, values in Hive and the corresponding values in Vertica can differ by up to ten days. This difference applies to both DATE and TIMESTAMP values.

Error 7087: Wrong Number of Columns

When loading data, you might see an error stating that you have the wrong number of columns:

=> CREATE TABLE nation (nationkey bigint, name varchar(500), regionkey bigint, comment varchar(500));
CREATE TABLE
=> COPY nation from :orc_dir ORC;
ERROR 7087: Attempt to load 4 columns from an orc source [/tmp/orc_glob/test.orc] that has 9 columns

When you load data from Hadoop native file formats, your table must consume all of the data in the file, or this error results. To avoid this problem, add the missing columns to your table definition.
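In this example the fix is to re-create the table with nine columns instead of four. The following sketch is hypothetical; the names and types of the five added columns are placeholders and must match the actual columns in the ORC file:

=> DROP TABLE nation;
=> CREATE TABLE nation (nationkey BIGINT, name VARCHAR(500), regionkey BIGINT,
       comment VARCHAR(500),
       -- placeholder names and types; replace with the remaining columns from the file
       col5 VARCHAR(500), col6 VARCHAR(500), col7 VARCHAR(500),
       col8 VARCHAR(500), col9 VARCHAR(500));
=> COPY nation FROM :orc_dir ORC;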


Using the HCatalog Connector

The Vertica HCatalog Connector lets you access data stored in Apache's Hive data warehouse software the same way you access it within a native Vertica table.

If your files are in the Optimized Row Columnar (ORC) or Parquet format and do not use complex types, the HCatalog Connector creates an external table and uses the ORC or Parquet reader instead of using the Java SerDe. See Reading Native Hadoop File Formats for more information about these readers.

The HCatalog Connector performs predicate pushdown to improve query performance. Instead of reading all data across the network to evaluate a query, the HCatalog Connector moves the evaluation of predicates closer to the data. Predicate pushdown applies to Hive partition pruning, ORC stripe pruning, and Parquet row-group pruning. The HCatalog Connector supports predicate pushdown for the following predicates: >, >=, =, <>, <=, <.

Hive, HCatalog, and WebHCat Overview

There are several Hadoop components that you need to understand in order to use the HCatalog Connector:

- Apache's Hive lets you query data stored in a Hadoop Distributed File System (HDFS) the same way you query data stored in a relational database. Behind the scenes, Hive uses a set of serializer and deserializer (SerDe) classes to extract data from files stored on the HDFS and break it into columns and rows. Each SerDe handles data files in a specific format. For example, one SerDe extracts data from comma-separated data files while another interprets data stored in JSON format.
- Apache HCatalog is a component of the Hadoop ecosystem that makes Hive's metadata available to other Hadoop components (such as Pig).
- WebHCat (formerly known as Templeton) makes HCatalog and Hive data available via a REST web API. Through it, you can make an HTTP request to retrieve data stored in Hive, as well as information about the Hive schema.

Vertica's HCatalog Connector lets you transparently access data that is available through WebHCat. You use the connector to define a schema in Vertica that corresponds to a Hive database or schema. When you query data within this schema, the HCatalog Connector transparently extracts and formats the data from Hadoop into tabular data. The data within this HCatalog schema appears as if it is native to Vertica. You can even perform operations such as joins between Vertica-native tables and HCatalog tables. For more details, see How the HCatalog Connector Works.

HCatalog Connection Features

The HCatalog Connector lets you query data stored in Hive using the Vertica native SQL syntax. Some of its main features are:

- The HCatalog Connector always reflects the current state of data stored in Hive.
- The HCatalog Connector uses the parallel nature of both Vertica and Hadoop to process Hive data. The result is that querying data through the HCatalog Connector is often faster than querying the data directly through Hive.
- Since Vertica performs the extraction and parsing of data, the HCatalog Connector does not significantly increase the load on your Hadoop cluster.
- The data you query through the HCatalog Connector can be used as if it were native Vertica data. For example, you can execute a query that joins data from a table in an HCatalog schema with a native table, as in the sketch following the considerations below.

HCatalog Connection Considerations

There are a few things to keep in mind when using the HCatalog Connector:

- Hive's data is stored in flat files in a distributed filesystem, requiring it to be read and deserialized each time it is queried. This deserialization causes Hive's performance to be much slower than Vertica. The HCatalog Connector has to perform the same process as Hive to read the data. Therefore, querying data stored in Hive using the HCatalog Connector is much slower than querying a native Vertica table. If you need to perform extensive analysis on data stored in Hive, you should consider loading it into Vertica through the HCatalog Connector or the WebHDFS connector. Vertica optimization often makes querying data through the HCatalog Connector faster than directly querying it through Hive.
- Hive supports complex data types such as lists, maps, and structs that Vertica does not support. Columns containing these data types are converted to a JSON representation of the data type and stored as a VARCHAR. See Data Type Conversions from Hive to Vertica.

Note: The HCatalog Connector is read only. It cannot insert data into Hive.
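As an illustration of the join capability mentioned above, such a query might look like the following sketch. The schema names (public and hcat), table names, and columns are placeholders, and the HCatalog schema must already have been defined (see Defining a Schema Using the HCatalog Connector):

=> SELECT n.name, SUM(s.amount) AS total
   FROM public.sales s
   JOIN hcat.nation n ON s.nationkey = n.nationkey
   GROUP BY n.name;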

How the HCatalog Connector Works

When planning a query that accesses data from a Hive table, the Vertica HCatalog Connector on the initiator node contacts the WebHCat server in your Hadoop cluster to determine if the table exists. If it does, the connector retrieves the table's metadata from the metastore database so the query planning can continue. When the query executes, all nodes in the Vertica cluster directly retrieve the data necessary for completing the query from HDFS. They then use the Hive SerDe classes to extract the data so the query can execute.

This approach takes advantage of the parallel nature of both Vertica and Hadoop. In addition, by performing the retrieval and extraction of data directly, the HCatalog Connector reduces the impact of the query on the Hadoop cluster.

HCatalog Connector Requirements

Before you can use the HCatalog Connector, both your Vertica and Hadoop installations must meet the following requirements.

Vertica Requirements

All of the nodes in your cluster must have a Java Virtual Machine (JVM) installed. See Installing the Java Runtime on Your Vertica Cluster.

You must also add certain libraries distributed with Hadoop and Hive to your Vertica installation directory. See Configuring Vertica for HCatalog.

Hadoop Requirements

Your Hadoop cluster must meet several requirements to operate correctly with the Vertica Connector for HCatalog:

- It must have Hive and HCatalog installed and running. See Apache's HCatalog page for more information.
- It must have WebHCat (formerly known as Templeton) installed and running. See Apache's WebHCat page for details.
- The WebHCat server and all of the HDFS nodes that store HCatalog data must be directly accessible from all of the hosts in your Vertica database. Verify that any firewall separating the Hadoop cluster and the Vertica cluster will pass WebHCat, metastore database, and HDFS traffic.
- The data that you want to query must be in an internal or external Hive table.
- If a table you want to query uses a non-standard SerDe, you must install the SerDe's classes on your Vertica cluster before you can query the data. See Using Non-Standard SerDes.

Testing Connectivity

To test the connection between your database cluster and WebHCat, log into a node in your Vertica cluster. Then, run a command such as the following, which queries the WebHCat status endpoint:

$ curl http://webhcatserver:port/templeton/v1/status?user.name=hcatusername

Where:

- webhcatserver is the IP address or hostname of the WebHCat server
- port is the port number assigned to the WebHCat service (usually 50111)
- hcatusername is a valid username authorized to use HCatalog

Usually, you want to append ;echo to the command to add a linefeed after the curl command's output. Otherwise, the command prompt is automatically appended to the command's output, making it harder to read. For example:

$ curl http://webhcatserver:port/templeton/v1/status?user.name=hcatusername; echo


More information

HPE Security Fortify WebInspect Enterprise Software Version: Windows operating systems. Installation and Implementation Guide

HPE Security Fortify WebInspect Enterprise Software Version: Windows operating systems. Installation and Implementation Guide HPE Security Fortify WebInspect Enterprise Software Version: 17.10 Windows operating systems Installation and Implementation Guide Document Release Date: May 2017 Software Release Date: April 2017 Legal

More information

Oracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data

Oracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data Oracle Big Data SQL Release 3.2 The unprecedented explosion in data that can be made useful to enterprises from the Internet of Things, to the social streams of global customer bases has created a tremendous

More information

Configuring and Deploying Hadoop Cluster Deployment Templates

Configuring and Deploying Hadoop Cluster Deployment Templates Configuring and Deploying Hadoop Cluster Deployment Templates This chapter contains the following sections: Hadoop Cluster Profile Templates, on page 1 Creating a Hadoop Cluster Profile Template, on page

More information

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours) Bigdata Fundamentals Day1: (2hours) 1. Understanding BigData. a. What is Big Data? b. Big-Data characteristics. c. Challenges with the traditional Data Base Systems and Distributed Systems. 2. Distributions:

More information

Guidelines for using Internet Information Server with HP StorageWorks Storage Mirroring

Guidelines for using Internet Information Server with HP StorageWorks Storage Mirroring HP StorageWorks Guidelines for using Internet Information Server with HP StorageWorks Storage Mirroring Application Note doc-number Part number: T2558-96338 First edition: June 2009 Legal and notice information

More information

Cloudera JDBC Driver for Apache Hive

Cloudera JDBC Driver for Apache Hive Cloudera JDBC Driver for Apache Hive 2.5.20 Released 2017-12-22 These release notes provide details of enhancements, features, and known issues in Cloudera JDBC Driver for Apache Hive 2.5.20, as well as

More information

HP Database and Middleware Automation

HP Database and Middleware Automation HP Database and Middleware Automation For Windows Software Version: 10.10 SQL Server Database Refresh User Guide Document Release Date: June 2013 Software Release Date: June 2013 Legal Notices Warranty

More information

Additional License Authorizations. For Vertica software products

Additional License Authorizations. For Vertica software products Additional License Authorizations Products and suites covered Products E-LTU or E-Media available * Perpetual License Non-production use category ** Term License Non-production use category (if available)

More information

Connectivity Pack for Microsoft Guide

Connectivity Pack for Microsoft Guide HP Vertica Analytic Database Software Version: 7.0.x Document Release Date: 5/2/2018 Legal Notices Warranty The only warranties for Micro Focus products and services are set forth in the express warranty

More information

Intelligent Provisioning 3.00 Release Notes

Intelligent Provisioning 3.00 Release Notes Intelligent Provisioning 3.00 Release Notes Part Number: 881705-001b Published: October 2017 Edition: 3 Copyright 2017 Hewlett Packard Enterprise Development LP Notices The information contained herein

More information

HP ALM. Software Version: patch 2. Business Views Microsoft Excel Add-in User Guide

HP ALM. Software Version: patch 2. Business Views Microsoft Excel Add-in User Guide HP ALM Software Version: 12.21 patch 2 Business Views Microsoft Excel Add-in User Guide Document Release Date: September 2016 Software Release Date: September 2016 Legal Notices Warranty The only warranties

More information

ORC Files. Owen O June Page 1. Hortonworks Inc. 2012

ORC Files. Owen O June Page 1. Hortonworks Inc. 2012 ORC Files Owen O Malley owen@hortonworks.com @owen_omalley owen@hortonworks.com June 2013 Page 1 Who Am I? First committer added to Hadoop in 2006 First VP of Hadoop at Apache Was architect of MapReduce

More information

Hortonworks Data Platform

Hortonworks Data Platform Hortonworks Data Platform Workflow Management (August 31, 2017) docs.hortonworks.com Hortonworks Data Platform: Workflow Management Copyright 2012-2017 Hortonworks, Inc. Some rights reserved. The Hortonworks

More information

Installing Data Sync Version 2.3

Installing Data Sync Version 2.3 Oracle Cloud Data Sync Readme Release 2.3 DSRM-230 May 2017 Readme for Data Sync This Read Me describes changes, updates, and upgrade instructions for Data Sync Version 2.3. Topics: Installing Data Sync

More information

Hadoop. Introduction / Overview

Hadoop. Introduction / Overview Hadoop Introduction / Overview Preface We will use these PowerPoint slides to guide us through our topic. Expect 15 minute segments of lecture Expect 1-4 hour lab segments Expect minimal pretty pictures

More information

ISILON ONEFS WITH HADOOP KERBEROS AND IDENTITY MANAGEMENT APPROACHES. Technical Solution Guide

ISILON ONEFS WITH HADOOP KERBEROS AND IDENTITY MANAGEMENT APPROACHES. Technical Solution Guide ISILON ONEFS WITH HADOOP KERBEROS AND IDENTITY MANAGEMENT APPROACHES Technical Solution Guide Hadoop and OneFS cluster configurations for secure access and file permissions management ABSTRACT This technical

More information

Apache Hive for Oracle DBAs. Luís Marques

Apache Hive for Oracle DBAs. Luís Marques Apache Hive for Oracle DBAs Luís Marques About me Oracle ACE Alumnus Long time open source supporter Founder of Redglue (www.redglue.eu) works for @redgluept as Lead Data Architect @drune After this talk,

More information

IDOL Site Admin. Software Version: User Guide

IDOL Site Admin. Software Version: User Guide IDOL Site Admin Software Version: 11.5 User Guide Document Release Date: October 2017 Software Release Date: October 2017 Legal notices Warranty The only warranties for Hewlett Packard Enterprise Development

More information

HP Operations Orchestration

HP Operations Orchestration HP Operations Orchestration For Windows and Linux operating systems Software Version: 9.07.0006 System Requirements Document Release Date: April 2014 Software Release Date: February 2014 Legal Notices

More information

HP 3PAR OS MU3 Patch 17

HP 3PAR OS MU3 Patch 17 HP 3PAR OS 3.2.1 MU3 Patch 17 Release Notes This release notes document is for Patch 17 and intended for HP 3PAR Operating System Software. HP Part Number: QL226-98310 Published: July 2015 Edition: 1 Copyright

More information

microsoft

microsoft 70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series

More information

HPE Enterprise Integration Module for SAP Solution Manager 7.1

HPE Enterprise Integration Module for SAP Solution Manager 7.1 HPE Enterprise Integration Module for SAP Solution Manager 7.1 Software Version: 12.55 User Guide Document Release Date: August 2017 Software Release Date: August 2017 HPE Enterprise Integration Module

More information

HP ALM Synchronizer for Agile Manager

HP ALM Synchronizer for Agile Manager HP ALM Synchronizer for Agile Manager Software Version: 2.10 User Guide Document Release Date: August 2014 Software Release Date: August 2014 Legal Notices Warranty The only warranties for HP products

More information

MapR Enterprise Hadoop

MapR Enterprise Hadoop 2014 MapR Technologies 2014 MapR Technologies 1 MapR Enterprise Hadoop Top Ranked Cloud Leaders 500+ Customers 2014 MapR Technologies 2 Key MapR Advantage Partners Business Services APPLICATIONS & OS ANALYTICS

More information

Oracle Big Data SQL High Performance Data Virtualization Explained

Oracle Big Data SQL High Performance Data Virtualization Explained Keywords: Oracle Big Data SQL High Performance Data Virtualization Explained Jean-Pierre Dijcks Oracle Redwood City, CA, USA Big Data SQL, SQL, Big Data, Hadoop, NoSQL Databases, Relational Databases,

More information

Pre-Installation Tasks Before you apply the update, shut down the Informatica domain and perform the pre-installation tasks.

Pre-Installation Tasks Before you apply the update, shut down the Informatica domain and perform the pre-installation tasks. Informatica LLC Big Data Edition Version 9.6.1 HotFix 3 Update 3 Release Notes January 2016 Copyright (c) 1993-2016 Informatica LLC. All rights reserved. Contents Pre-Installation Tasks... 1 Prepare the

More information

Informatica PowerExchange for Microsoft Azure Blob Storage 10.2 HotFix 1. User Guide

Informatica PowerExchange for Microsoft Azure Blob Storage 10.2 HotFix 1. User Guide Informatica PowerExchange for Microsoft Azure Blob Storage 10.2 HotFix 1 User Guide Informatica PowerExchange for Microsoft Azure Blob Storage User Guide 10.2 HotFix 1 July 2018 Copyright Informatica LLC

More information

EsgynDB Multi- DataCenter Replication Guide

EsgynDB Multi- DataCenter Replication Guide Esgyn Corporation EsgynDB Multi- DataCenter Replication Guide Published: November 2015 Edition: EsgynDB Release 2.0.0 Contents 1. About This Document...3 2. Intended Audience...3 3. Overview...3 4. Synchronous

More information

HPE Security Fortify Plugins for Eclipse

HPE Security Fortify Plugins for Eclipse HPE Security Fortify Plugins for Eclipse Software Version: 17.20 Installation and Usage Guide Document Release Date: November 2017 Software Release Date: November 2017 Legal Notices Warranty The only warranties

More information

Informatica Cloud Spring Complex File Connector Guide

Informatica Cloud Spring Complex File Connector Guide Informatica Cloud Spring 2017 Complex File Connector Guide Informatica Cloud Complex File Connector Guide Spring 2017 October 2017 Copyright Informatica LLC 2016, 2017 This software and documentation are

More information

Administration 1. DLM Administration. Date of Publish:

Administration 1. DLM Administration. Date of Publish: 1 DLM Administration Date of Publish: 2018-07-03 http://docs.hortonworks.com Contents ii Contents Replication Concepts... 4 HDFS cloud replication...4 Hive cloud replication... 4 Cloud replication guidelines

More information

HP OpenView Storage Data Protector A.05.10

HP OpenView Storage Data Protector A.05.10 HP OpenView Storage Data Protector A.05.10 ZDB for HP StorageWorks Enterprise Virtual Array (EVA) in the CA Configuration White Paper Edition: August 2004 Manufacturing Part Number: n/a August 2004 Copyright

More information

HPE Security ArcSight Connectors

HPE Security ArcSight Connectors HPE Security ArcSight Connectors SmartConnector for Windows Event Log Unified: Microsoft Network Policy Server Supplemental Configuration Guide March 29, 2013 Supplemental Configuration Guide SmartConnector

More information

HPE Security ArcSight Connectors

HPE Security ArcSight Connectors HPE Security ArcSight Connectors SmartConnector for Microsoft System Center Configuration Manager DB Configuration Guide October 17, 2017 SmartConnector for Microsoft System Center Configuration Manager

More information

Exam Questions

Exam Questions Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) https://www.2passeasy.com/dumps/70-775/ NEW QUESTION 1 You are implementing a batch processing solution by using Azure

More information

HP Video Over Ethernet. User Guide

HP Video Over Ethernet. User Guide HP Video Over Ethernet User Guide 2016 HP Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth

More information

Integrated Smart Update Tools for Windows and Linux User Guide

Integrated Smart Update Tools for Windows and Linux User Guide Integrated Smart Update Tools for Windows and Linux User Guide Version 2.2.0 Abstract This document describes how to use Integrated Smart Update Tools to update firmware and operating system drivers on

More information

HPE Security ArcSight Connectors

HPE Security ArcSight Connectors HPE Security ArcSight Connectors SmartConnector for Application Security AppDetective DB Configuration Guide October 17, 2017 SmartConnector for Application Security AppDetective DB October 17, 2017 Copyright

More information

HPE 1/8 G2 Tape Autoloader and MSL Tape Libraries Encryption Kit User Guide

HPE 1/8 G2 Tape Autoloader and MSL Tape Libraries Encryption Kit User Guide HPE 1/8 G2 Tape Autoloader and MSL Tape Libraries Encryption Kit User Guide Abstract This guide provides information about developing encryption key management processes, configuring the tape autoloader

More information

Operations Orchestration. Software Version: Windows and Linux Operating Systems. Central User Guide

Operations Orchestration. Software Version: Windows and Linux Operating Systems. Central User Guide Operations Orchestration Software Version: 10.70 Windows and Linux Operating Systems Central User Guide Document Release Date: November 2016 Software Release Date: November 2016 Legal Notices Warranty

More information

IBM Big SQL Partner Application Verification Quick Guide

IBM Big SQL Partner Application Verification Quick Guide IBM Big SQL Partner Application Verification Quick Guide VERSION: 1.6 DATE: Sept 13, 2017 EDITORS: R. Wozniak D. Rangarao Table of Contents 1 Overview of the Application Verification Process... 3 2 Platform

More information

SAS Data Loader 2.4 for Hadoop: User s Guide

SAS Data Loader 2.4 for Hadoop: User s Guide SAS Data Loader 2.4 for Hadoop: User s Guide SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2016. SAS Data Loader 2.4 for Hadoop: User s Guide. Cary,

More information

Hortonworks Data Platform

Hortonworks Data Platform Hortonworks Data Platform Teradata Connector User Guide (April 3, 2017) docs.hortonworks.com Hortonworks Data Platform: Teradata Connector User Guide Copyright 2012-2017 Hortonworks, Inc. Some rights reserved.

More information

Using Apache Phoenix to store and access data

Using Apache Phoenix to store and access data 3 Using Apache Phoenix to store and access data Date of Publish: 2018-07-15 http://docs.hortonworks.com Contents ii Contents What's New in Apache Phoenix...4 Orchestrating SQL and APIs with Apache Phoenix...4

More information

HP Asset Manager Software License Optimization Best Practice Package

HP Asset Manager Software License Optimization Best Practice Package HP Asset Manager Software License Optimization Best Practice Package For the Windows operating system Software Version: 9.4.12.4 Technical Details of Software Counters Document Release Date: August 2014

More information

Microsoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo

Microsoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo Microsoft Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo NEW QUESTION 1 You have an Azure HDInsight cluster. You need to store data in a file format that

More information

Important Notice Cloudera, Inc. All rights reserved.

Important Notice Cloudera, Inc. All rights reserved. Important Notice 2010-2017 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, and any other product or service names or slogans contained in this document, except as otherwise disclaimed,

More information

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Blended Learning Outline: Cloudera Data Analyst Training (171219a) Blended Learning Outline: Cloudera Data Analyst Training (171219a) Cloudera Univeristy s data analyst training course will teach you to apply traditional data analytics and business intelligence skills

More information

OMi Management Pack for Microsoft SQL Server. Software Version: For the Operations Manager i for Linux and Windows operating systems.

OMi Management Pack for Microsoft SQL Server. Software Version: For the Operations Manager i for Linux and Windows operating systems. OMi Management Pack for Microsoft Software Version: 1.01 For the Operations Manager i for Linux and Windows operating systems User Guide Document Release Date: April 2017 Software Release Date: December

More information

HP Records Manager. Kofax Capture Template. Software Version: 8.1. Document Release Date: August 2014

HP Records Manager. Kofax Capture Template. Software Version: 8.1. Document Release Date: August 2014 HP Records Manager Software Version: 8.1 Kofax Capture Template Document Release Date: August 2014 Software Release Date: August 2014 Legal Notices Warranty The only warranties for HP products and services

More information

Project and Portfolio Management Center

Project and Portfolio Management Center Project and Portfolio Management Center Software Version: 9.42 Getting Started Go to HELP CENTER ONLINE http://admhelp.microfocus.com/ppm/ Document Release Date: September 2017 Software Release Date: September

More information

PUBLIC SAP Vora Sizing Guide

PUBLIC SAP Vora Sizing Guide SAP Vora 2.0 Document Version: 1.1 2017-11-14 PUBLIC Content 1 Introduction to SAP Vora....3 1.1 System Architecture....5 2 Factors That Influence Performance....6 3 Sizing Fundamentals and Terminology....7

More information

See Types of Data Supported for information about the types of files that you can import into Datameer.

See Types of Data Supported for information about the types of files that you can import into Datameer. Importing Data When you import data, you import it into a connection which is a collection of data from different sources such as various types of files and databases. See Configuring a Connection to learn

More information

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic) Hive and Shark Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 1 / 45 Motivation MapReduce is hard to

More information

ControlPoint. Software Version 5.6. Installation Guide

ControlPoint. Software Version 5.6. Installation Guide ControlPoint Software Version 5.6 Installation Guide Document Release Date: August 2018 Software Release Date: August 2018 Legal notices Copyright notice Copyright 2016-2018 Micro Focus or one of its affiliates.

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

Tanium Asset User Guide. Version 1.3.1

Tanium Asset User Guide. Version 1.3.1 Tanium Asset User Guide Version 1.3.1 June 12, 2018 The information in this document is subject to change without notice. Further, the information provided in this document is provided as is and is believed

More information

HP AutoPass License Server

HP AutoPass License Server HP AutoPass License Server Software Version: 9.0 Windows, Linux and CentOS operating systems Support Matrix Document Release Date: October 2015 Software Release Date: October 2015 Page 2 of 10 Legal Notices

More information

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training:: Module Title Duration : Cloudera Data Analyst Training : 4 days Overview Take your knowledge to the next level Cloudera University s four-day data analyst training course will teach you to apply traditional

More information

HPE Security ArcSight Connectors

HPE Security ArcSight Connectors HPE Security ArcSight Connectors SmartConnector for Windows Event Log Unified: Microsoft Exchange Access Auditing Supplemental Configuration Guide July 15, 2017 Supplemental Configuration Guide SmartConnector

More information

HP UFT Connection Agent

HP UFT Connection Agent HP UFT Connection Agent Software Version: For UFT 12.53 User Guide Document Release Date: June 2016 Software Release Date: June 2016 Legal Notices Warranty The only warranties for Hewlett Packard Enterprise

More information

Cloudera ODBC Driver for Apache Hive

Cloudera ODBC Driver for Apache Hive Cloudera ODBC Driver for Apache Hive Important Notice 2010-2017 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, and any other product or service names or slogans contained in this document,

More information

HPE VMware ESXi and vsphere 5.x, 6.x and Updates Getting Started Guide

HPE VMware ESXi and vsphere 5.x, 6.x and Updates Getting Started Guide HPE VMware ESXi and vsphere 5.x, 6.x and Updates Getting Started Guide Abstract This guide is intended to provide setup information for HPE VMware ESXi and vsphere. Part Number: 818330-003 Published: April

More information

HPE StoreVirtual OS v13.5 Release Notes

HPE StoreVirtual OS v13.5 Release Notes HPE StoreVirtual OS v13.5 Release Notes Part Number: 865552-006 Published: May 2017 Edition: 2 Contents Release notes...4 Description... 4 Platforms supported for this release... 4 Update recommendation...4

More information

Hortonworks Data Platform

Hortonworks Data Platform Apache Ambari Views () docs.hortonworks.com : Apache Ambari Views Copyright 2012-2017 Hortonworks, Inc. All rights reserved. The, powered by Apache Hadoop, is a massively scalable and 100% open source

More information

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam Impala A Modern, Open Source SQL Engine for Hadoop Yogesh Chockalingam Agenda Introduction Architecture Front End Back End Evaluation Comparison with Spark SQL Introduction Why not use Hive or HBase?

More information

HP Virtual Connect Enterprise Manager

HP Virtual Connect Enterprise Manager HP Virtual Connect Enterprise Manager Data Migration Guide HP Part Number: 487488-001 Published: April 2008, first edition Copyright 2008 Hewlett-Packard Development Company, L.P. Legal Notices Confidential

More information

Data Protector. Software Version: Zero Downtime Backup Integration Guide

Data Protector. Software Version: Zero Downtime Backup Integration Guide Data Protector Software Version: 10.00 Zero Downtime Backup Integration Guide Document Release Date: June 2017 Software Release Date: June 2017 Legal Notices Warranty The only warranties for Hewlett Packard

More information

Administration 1. DLM Administration. Date of Publish:

Administration 1. DLM Administration. Date of Publish: 1 DLM Administration Date of Publish: 2018-05-18 http://docs.hortonworks.com Contents Replication concepts... 3 HDFS cloud replication...3 Hive cloud replication... 3 Cloud replication guidelines and considerations...4

More information