Table of Contents

- New in This Version
- Changed in This Version
- Upgrade Notes
- Supported Browsers, Processing Engines, Data Sources and Hadoop Distributions
- Resolved Issues
- Known Issues
New in This Version

Filtergram enhancements

All Filtergrams have been redesigned to provide a unified interface with powerful new tools to dynamically filter your data with great precision. A summary of the new functionality is provided in these release notes. For a comprehensive explanation of the filtering actions and operations you can take for each filter type, see the application's online help.

Numeric Filtergram
Make selections from the histogram to dynamically filter your data. Explore your selections and make additional refinements with the "Show Selected Items" option. All filtering actions you take continue to dynamically update your dataset.
View your entire range of data as a list and continue to make very precise selections from it; your dataset continues to dynamically update with every selection you make. Search for values using the enhanced search function.

Text Filtergram

The new "Show Selected Items" option and the enhanced search capabilities allow you to explore your selections and make additional refinements. All filtering actions you take continue to dynamically update your dataset.
New Date/Time Filtergram

In addition to the Filtergram enhancements, this release introduces a new Filtergram that displays date/time values as a Timeline histogram. All of the new functionality implemented for Numeric Filtergrams is also available for the Timeline histogram:

- Drag your mouse over values in the histogram to make single or multiple selections; the dataset dynamically filters to display your selection(s).
- Move your mouse over the x-axis and use the scroll wheel to zoom into your timeline selections. Data bins display to represent a granular view of the selected range. Mouse over any bin to view the number of values in it, and use the scroll wheel to zoom deeper. Turn on the Overview tool to view your zoomed locations.
- Use the "Show Selected Items" option to explore your selections and make additional refinements: manually add specific dates or date ranges, or toggle to "Exclude" to hide any specified dates or ranges while working with your data.
- View the entire range of date/time values as a list from which you can make very precise selections.
- Search for any values with the enhanced search capability.
Additionally, the Date/Time filter has five different charts for filtering your date/time data. The Timeline is the default chart. The additional charts can be opened from the "available charts" tab, and you can work simultaneously with all open charts. The application's online help provides an explanation of all filtering actions and operations you can take with the Date/Time Filtergram.
Sampling

Sampling on Import, Lookup and Append

Sampling is now available when you import a base dataset into a Project and when you perform a Lookup or Append Step in a Project. This option allows you to sample a very large data source for initial discovery before bringing all of the data into a Project. This is particularly useful if your Administrator has source file size limits in place to protect cluster resources.

Sampling as a Project Step with the new Sampling Tool

In addition to sampling a data source, the new Sampling tool gives you the flexibility to filter down to a specific set of rows in your data and then sample those rows. When your exploration is complete, you can easily remove the sampling operation by muting or deleting it in the Steps panel.

For both Sampling on Import/Lookup/Append and Sampling as a Step in your Project, the sampling operation can be based on a percentage of the dataset or a specific number of rows in the dataset. When sampling by percentage, you have the option to specify a column in the source file to use for generating the sample; in this case, only the data in that column is used for determining the sample.
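The percentage and row-count sampling options described above, together with the sampling seed covered in the note below, can be sketched in ordinary Python. This is an illustration only, not the product's implementation; `sample_rows` is a hypothetical name.

```python
import random

def sample_rows(rows, seed, percent=None, count=None):
    """Illustrative sketch: sample a dataset either by percentage or by a
    fixed number of rows, using a fixed seed so the sample is repeatable."""
    rng = random.Random(seed)  # the "sampling seed" makes runs reproducible
    if percent is not None:
        k = max(1, round(len(rows) * percent / 100))
    else:
        k = min(count, len(rows))
    return rng.sample(rows, k)

data = list(range(1000))
s1 = sample_rows(data, seed=42, percent=10)   # 10% sample -> 100 rows
s2 = sample_rows(data, seed=42, percent=10)
assert s1 == s2                               # same seed -> identical sample
```

The key property the seed provides is the last line: rerunning the operation with the same seed reproduces exactly the same rows.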
Note: when performing any sampling operation, a "sampling seed" is provided to ensure that you can always repeat your sample.

New grid tools

Three new tools are available in the application's footer for working with data on the grid. The application's online help provides details on how to use each of these tools.

Column Lineage

A column's lineage can now be displayed in the Steps panel through "Lineage Mode." In Lineage Mode, all Steps that affected the column are displayed with an orange outline, allowing you to quickly identify the Steps that affected the column or changed its data. Steps in the Editor that did not affect the column are grayed out, collapsed and labeled to note how many Steps are collapsed.

New tools for working with Cluster + Edit
Two new tools provide visual cues to help you recognize how the suggested value for a Cluster was derived:

- Fixed-width font setting: by default, Cluster values display in a variable-width font. Click this option to display Cluster values in a fixed-width font. The fixed-width option aligns all text characters, which allows you to more easily identify extra spaces within a Cluster value and differentiate characters across the Clusters.
- Highlight tools: highlighting allows you to easily recognize how the suggested Cluster replacement value was derived. The Additions tool highlights the characters that exist in addition to all common characters. The Deletions tool indicates where deletions have been made in order to derive the common characters; deletions are condensed into a red (x). The Additions and Deletions tools can be enabled simultaneously.
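As an illustration of the idea behind the Additions and Deletions highlights (not the product's actual algorithm), Python's difflib can classify a variant's characters as added to, or deleted from, a suggested value:

```python
from difflib import SequenceMatcher

def diff_against_suggestion(suggested, variant):
    """Illustrative only: split a Cluster variant's differences from the
    suggested value into 'additions' (extra characters in the variant)
    and 'deletions' (characters the variant lacks) — the kind of
    comparison the Additions/Deletions highlights visualize."""
    additions, deletions = [], []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, suggested, variant).get_opcodes():
        if op in ("insert", "replace"):
            additions.append(variant[j1:j2])    # characters beyond the common ones
        if op in ("delete", "replace"):
            deletions.append(suggested[i1:i2])  # characters removed vs. the suggestion
    return additions, deletions

# An extra space and a trailing period show up as additions:
adds, dels = diff_against_suggestion("Acme Inc", "Acme  Inc.")
```

A fixed-width rendering of the two strings makes the doubled space obvious, which is exactly what the fixed-width font setting is for.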
Data Library enhancements

The following enhancements have been implemented for Data Library import and export functionality:

- Compressed files from HDFS can now be imported into the Data Library. The following compression formats are currently supported: bzip, gzip, deflate.
- When adding a dataset to the library, new options under the Advanced Settings tab allow you to strip smart quotes from a delimited file on import and to parse a file that uses row separators other than the standard defaults for Value and Line Separators.
- When exporting a dataset from the Data Library, you can specify a row separator other than the standard default value.
- Support for proxy connectivity from Cisco Data Prep to Salesforce: a new configuration enhancement allows Cisco Data Prep to connect to a cloud-based Salesforce server through a proxy server. Refer to the "Cisco Data Prep Installation" guide for configuration details.
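As a rough sketch of the kind of work the import options above perform, the following Python fragment decompresses a gzip payload and splits it on non-default value and row separators. The function name and the `|` and `;` separators are illustrative choices, not product settings.

```python
import gzip

def parse_delimited(raw_bytes, value_sep="|", line_sep=";", compressed=False):
    """Illustrative sketch: optionally gunzip a payload (the notes also
    list bzip and deflate support), then split rows on a non-default
    line separator and values on a non-default value separator."""
    data = gzip.decompress(raw_bytes) if compressed else raw_bytes
    text = data.decode("utf-8")
    return [row.split(value_sep) for row in text.split(line_sep) if row]

payload = gzip.compress(b"id|name;1|alpha;2|beta;")
rows = parse_delimited(payload, compressed=True)
# rows -> [['id', 'name'], ['1', 'alpha'], ['2', 'beta']]
```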
Admin updates

The following enhancements have been implemented for the administration pages:

- Users page: when logged in as "superuser" or "admin," a new Last Login column indicates the last login date and time for each user.
- Roles page: a new Library permission has been introduced to separate the rights required to download a dataset locally from those required to export it to HDFS.

Tenant configuration

A tenant-based user session timeout option is now available. This option specifies the maximum number of seconds that a user session can be idle before the system automatically closes the session. The default setting is to never time out users.

New clusters.properties parameter to transform case-insensitive names from LDAP/SAML to HDFS

If you use HDFS for import or export with the passthru option, there is a new parameter in the clusters.properties file:

px.cluster.clustername.passthru.transform

This parameter allows you to specify a function that transforms case-insensitive names from LDAP/SAML to HDFS user names, which are case sensitive. Refer to the Server Administration Guide for details.

Changed in This Version

Nested XML and JSON

Smart parsing for nested XML and JSON data now creates new columns for all of the nested data instead of dropping it or writing it into a single row.

Example: Nested XML data
Results prior to the 1.2 release:

Smart parsing results with this release:

Important note: the order of columns in your source XML or JSON file is not always preserved during import, and you may notice a different column order after the file is imported into the Data Library. This is a known issue and a fix is forthcoming. The current recommended workaround is to bring the data into a Project and use the Columns tool to re-order the columns.
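The column-per-nested-field behavior described above can be illustrated with plain Python on a JSON document. This is a sketch only; the dotted column names are an assumption for the example, not the product's naming scheme.

```python
import json

def flatten(obj, prefix=""):
    """Illustrative sketch: each leaf of a nested document becomes its own
    column, instead of the nested data being dropped or collapsed into a
    single value. Dotted names are assumed purely for this example."""
    cols = {}
    for key, value in obj.items():
        name = prefix + key
        if isinstance(value, dict):
            cols.update(flatten(value, prefix=name + "."))  # recurse into nesting
        else:
            cols[name] = value  # leaf -> one column
    return cols

record = json.loads('{"order": {"id": 7, "customer": {"name": "Ada"}}, "total": 9.5}')
columns = flatten(record)
# columns -> {"order.id": 7, "order.customer.name": "Ada", "total": 9.5}
```

Note that, as with the known issue above, nothing in this kind of flattening guarantees a particular column order.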
Upgrade Notes

Resource Level Permissions are not enforced after export to Hive because the Cisco Data Prep application does not provision any Hive authorization platform.

Supported Browsers, Processing Engines, Data Sources and Hadoop Distributions

Browsers

- Mozilla Firefox: Extended Support Release (ESR) 38.6.1 for Mac and Windows
- Google Chrome: 48 for Mac and Windows

The recommended resolution for the Cisco Data Prep application is 1024x768.

Processing engines

- Apache Spark 1.4
- CDH 5.4.X Spark 1.3
  - YARN
  - standalone

Supported data sources and export formats

- Sources
  - HDFS
  - JDBC
  - Salesforce
  - Local files:
    - Microsoft: .xls, .xlsx, .xlsm
    - Character-separated or fixed-field-length text: .txt, .csv, .tsv
    - Structured text: .json, .xml
    - Hadoop Avro: .avro
- Export formats
  - JSON
  - AVRO
  - XML
  - CSV; delimiter-separated; tab-separated
  - fixed-width

Supported Hadoop Distributions

- Cloudera CDH 5.4.0
- Hortonworks HDP 2.3.2
Resolved Issues

- Filtergrams: if you have more than one filter open for your dataset, the remove row operation now works when one of the filters is "inverted".
- Project automation setup: the Cancel button for "Import Dataset Set it up Now?" now correctly cancels the function for a dataset that cannot be automated.
- The user interface treats all numeric values as double-precision (64-bit) IEEE 754 values. Integer values that exceed this supported range are no longer rounded in the user interface.
- A computed column expression with a negative number in the LEFT or RIGHT argument now returns a syntax error notifying the user that negative numbers are not permitted. Previously, an "unexpected error" was issued and the Project's dataset did not refresh.
- When performing a split column operation on a Lookup Step, the "min" and "max" column values now display correctly.
- When working on a shared Project in which one user does not have permission to a dataset being used by the Project, an error message now displays in the Steps editor panel to notify the user of the permission issue.
- JDBC imports, particularly from Hive, are now significantly faster.
- Newly provisioned users in newly provisioned LDAP groups no longer see an error on initial login to the application.
- The keytab file used for connecting users to a Hadoop cluster can now successively log in to the cluster to obtain new Kerberos authentication tickets.

Known Issues

Issue: Rapid actions on the Steps Editor panel trigger an error and loss of local changes
Description: When performing rapid actions on Steps in a Project, for example quickly muting Steps, an error message is issued and the local changes cannot be saved.

Issue: Erroneous error message when deleting datafiles for projects that no longer exist in Cisco Data Prep
Description: If you delete a Cisco Data Prep project and then later delete the datafile(s) associated with that project, you receive a message indicating the datafile(s) are currently in use even though you've deleted the project.
Issue: Arrays not properly published
Description: When publishing the results of an Array aggregation from a shape step, the Array column is not properly published. To publish an Array column, convert the column to text after the shape step.

Issue: Regular expression syntax with backslash requires an additional backslash
Description: If you need to use a backslash (\) in a regular expression, you need an additional backslash (\\) to escape the first. For example, the RegEx \d is entered as: \\d

Issue: Numeric columns not properly cast after a transpose operation
Description: After a transpose operation on a column with numeric values, the column type is set to text. Before mathematical operations can be conducted on this column, it must be transformed back to a numeric type.

Issue: Pipeline installation package has 4 different versions, each supporting a specific distribution
Description: Run cisco-data-prep-pipeline on the correct distribution:

- cisco-data-prep-pipeline-cdh5.4.0-2.9.2-1.noarch.rpm: compatible with the Cloudera CDH 5.4.x distribution
- cisco-data-prep-pipeline-cdh5.5.0-2.9.2-1.noarch.rpm: compatible with the Cloudera CDH 5.5.x distribution
- cisco-data-prep-pipeline-db1.4.1-2.9.2-1.noarch.rpm: compatible with the Databricks Spark 1.4.1 distribution
- cisco-data-prep-pipeline-db1.5.1-2.9.2-1.noarch.rpm: compatible with the Databricks Spark 1.5.1 distribution

For users upgrading the pipeline to 1.2: first back up the pipeline/config files, then remove the existing cisco-data-prep-pipeline, and then install the new cisco-data-prep-pipeline rpm file.
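The double-backslash requirement noted in the regular-expression known issue above also arises wherever a pattern passes through an ordinary (non-raw) string literal; this Python sketch shows the equivalence:

```python
import re

# A regex backslash must itself be escaped when written inside a plain
# string literal: \d is entered as \\d. Python's raw-string form r"\d"
# produces the same two-character pattern the regex engine receives.
pattern = re.compile("\\d")   # equivalent to re.compile(r"\d")
print(pattern.search("room 42").group())  # first digit found
```

Either spelling reaches the regex engine as the two characters backslash-d, which matches a single digit.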