EMC InfoArchive Documentum Connector
Version 3.0
User Guide

EMC Corporation
Corporate Headquarters
Hopkinton, MA 01748-9103
1-508-435-1000
www.emc.com
Legal Notice

Copyright 2014 EMC Corporation. All Rights Reserved.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED AS IS. EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.

For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com. Adobe and Adobe PDF Library are trademarks or registered trademarks of Adobe Systems Inc. in the U.S. and other countries. All other trademarks used herein are the property of their respective owners.

Documentation Feedback

Your opinion matters. We want to hear from you regarding our product documentation. If you have feedback about how we can make our documentation better or easier to use, please send us your feedback directly at IIGDocumentationFeedback@emc.com
Table of Contents

Chapter 1 Overview
Chapter 2 Generating SIPs
Chapter 1 Overview

EMC InfoArchive Documentum Connector is a command-line data extraction and transformation utility that lets you export content to be archived directly from the EMC Documentum repository and generate Submission Information Packages (SIPs) to be ingested into InfoArchive.

InfoArchive Documentum Connector extracts persistent objects from the repository by executing a DQL query statement defined in a configuration file. You can also specify what type of information about the objects (attributes, contents, object types, folders, and relations) to extract by configuring the settings in the configuration file. All extracted information is stored in the resulting PDI file (eas_pdi.xml) as part of the SIP generated by InfoArchive Documentum Connector.

InfoArchive Documentum Connector can run against any version of the EMC Documentum repository. Because the SIP is repository version-neutral, you can extract content from an earlier version of the repository but archive and access it through a later version. For example, you can extract content from a Documentum 4.2 repository into SIPs and archive them into InfoArchive through a Documentum 7.0 repository, so that you can still access the content even after Documentum 4.2 is decommissioned.

The complete distribution package of InfoArchive Documentum Connector comprises these files:

- Script files for supported platforms: eas-launch-documentum-extractor.bat (Windows) and eas-launch-documentum-extractor.sh (Linux)
- Configuration file eas_documentum_extractor.properties
- PDI schema file DctmDocbase.xsd
- The DctmExtractor.jar file and its dependency .jar files in the lib sub-directory
Chapter 2 Generating SIPs

1. Make sure the computer from which you run InfoArchive Documentum Connector meets the following prerequisites:

   - You can connect to the Content Server from the computer, and the Documentum repository from which you want to extract content for archiving is up and running.
   - Documentum Foundation Classes (DFC), along with a compatible JRE version, is installed locally. DFC is the published and supported programming interface for accessing the functionality of the Documentum platform. Content Server and any EMC Documentum client product that uses DFC install DFC automatically. If you have not installed these products on the computer, you must install DFC as a standalone product. The EMC Documentum Foundation Classes Installation Guide contains information about how to install DFC individually.

   EMC recommends that you run InfoArchive Documentum Connector from the Content Server host, where you can perform data extraction faster and more conveniently.

2. Open eas-launch-documentum-extractor.bat (Windows) or eas-launch-documentum-extractor.sh (Linux) in a text editor and set the following environment variables:

   - JAVA_HOME: The path to your JRE/JDK installation
   - BASEDIR: The path to the directory where the InfoArchive Documentum Connector files are located

   For example:

      SET BASEDIR=C:\eia\DctmExtractor
      SET JAVA_HOME=C:\Program Files\Java\jdk1.7.0_17

3. Edit eas_documentum_extractor.properties and configure data extraction and transformation settings.

   a. Configure how to extract content from the repository by specifying the following information through parameters:

      - The required information to access the Documentum repository
      - The DQL query statement for selecting objects to be extracted for archiving
      - What type of information about the objects will be extracted (object properties and attributes are always extracted)

      Parameter: Docbase
      Description: The name of the repository to connect to.
      Parameter: username/password
      Description: The credentials used to log in to the repository.

      Parameter: dqlpredicate
      Description: Partial DQL query string after the FROM clause. The string you specify here is automatically appended to the preset string "select r_object_id from" to form the complete DQL query statement when you execute the utility. For example, if you set dqlpredicate as follows:

         dqlpredicate=dm_sysobject where folder('/phonecalls')

      The complete DQL query statement that will be constructed and executed is:

         select r_object_id from dm_sysobject where folder('/phonecalls')

      Only objects returned by the query statement will be extracted for archiving.

      Parameter: extractcontents
      Description: Set this to true or false to specify whether to extract information about the content files associated with the selected objects. When extracted, content file information is stored in the contents element in eas_pdi.xml as part of the generated SIP. The default value is true.

      Parameter: extracttypes
      Description: Set this to true or false to specify whether to extract object type information of the selected objects. If the objects belong to multiple object types (a subtype with its supertypes; for example, dm_document and dm_sysobject), all these types will be extracted. When extracted, object type information is stored in the types element in eas_pdi.xml as part of the generated SIP. The default value is true.

      Parameter: extractfolders
      Description: Set this to true or false to specify whether to extract information about the folder (and all its parent folders, if any) of the selected objects. When extracted, folder information is nested in the folders element in eas_pdi.xml as part of the generated SIP. The default value is true.

      Parameter: extractrelations
      Description: Set this to true or false to specify whether to extract information about the relationships (represented by dm_relation_type and dm_relation objects) associated with the selected objects. When extracted, relationship information is stored in the relations element in eas_pdi.xml as part of the generated SIP. The default value is true.

   b. Configure how to generate SIPs by setting the following parameters.

      Parameter: xsdfilevalidation
      Description: Specify DctmDocbase.xsd as the schema (.xsd) file against which the PDI file generated by the utility will be validated. If you do not set a value, no validation is performed.
      Parameter: validationmode
      Description: Whether you want the utility to stop running when errors occur:

         - false: Once an error occurs (for example, when the content files associated with an object being extracted are not available), the extraction process stops, and you must fix the issue and run the utility again. This can be time-consuming when there are many issues during the extraction process.
         - true: When errors occur, the utility displays error messages but does not stop the extraction process. This way, you can catch all the issues and address them before running the utility again.

           Note: Do not ingest SIPs generated in this mode. If errors occur during the extraction process, generated SIPs are invalid.

      Parameter: holding
      Description: The target holding into which the extracted objects will be archived.

      Parameter: pdischema
      Description: The XML schema to be applied to the PDI file. You should also specify the schema version. Default: urn:x-emc:eas:schema:documentum:1.0

      Parameter: pdischemaversion
      Description: Leave it blank. The pdischema parameter should contain the schema version.

      Parameter: application
      Description: The business application that produced the data to be archived.

      Parameter: workingdir
      Description: The full path of the working directory in which to generate SIPs. If the directory does not exist, the utility will create it when executed. For example:

         workingdir=c:\\app\\eia\\dctmextractor\\data\\

      Note: Use double backslashes (\\) as the delimiter in the path.

      Parameter: maxobjectspersip
      Description: The maximum number of objects in a SIP.

      Parameter: priority
      Description: The ingestion priority of the SIP. The greater the value, the higher the priority. The order in which SIPs are ingested is determined first by ingestion priority (higher-priority SIPs are ingested first), and then by ingestion deadline date (SIPs with earlier deadlines are ingested first).

      Parameter: producer
      Description: The application that generates the SIP.
      Parameter: entity
      Description: The business entity that owns the data to be archived.

      Parameter: archiveid
      Description: The string value used to generate a unique archive ID for each SIP. A unique archiveid can contain up to 32 characters and consists of a value, a UUID, and the SIP creation date in the following format:

         archiveid_%uuid%_%yyyy%%mm%%dd%

      For example:

         MyArchive_QULFpiFb_2014116

4. Open a command prompt window and execute eas-launch-documentum-extractor.bat (Windows) or eas-launch-documentum-extractor.sh (Linux).

   Note: On Windows, the dfcpath system variables cannot contain spaces; otherwise, an error message appears: dfc.jar not found.

5. The data extraction and transformation process begins. You can see tracing information in the command prompt window similar to the following:

      >>> START...
      Reader instantiated.
      >>> DQL query : select r_object_id from dm_sysobject
      New PDI File (count:0)
      > 0: Writing 0900c86580000d9b
      >>> DQL query : select relation_name, description, child_id from dm_relation where parent_id = 0900c86580000d9b
      > 1: Writing 0900c86580000d9f
      >>> DQL query : select relation_name, description, child_id from dm_relation where parent_id = 0900c86580000d9f
      [...]
      > 539: Writing 4c00c86580000177
      >>> DQL query : select relation_name, description, child_id from dm_relation where parent_id = 4c00c86580000177
      Deleting c:\eia\dctmextractor\data\archive1
      >>> END!

6. When the data extraction and transformation process is complete, you can see SIP files generated in the working directory (specified by the workingdir parameter in the configuration file).
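The way the dqlpredicate value is combined with the preset string can be illustrated with a short Python sketch. This is not the connector's own code. The function name build_dql_query is hypothetical, and joining the two parts with a single space is an assumption, because this guide does not describe how whitespace is handled.

```python
# Preset string quoted in the dqlpredicate description of this guide.
DQL_PREFIX = "select r_object_id from"


def build_dql_query(dqlpredicate: str) -> str:
    """Return the complete DQL statement for a dqlpredicate value.

    Assumption: the predicate is appended after a single space and
    surrounding whitespace is ignored. The guide only states that the
    value is appended to the preset string.
    """
    predicate = dqlpredicate.strip()
    if not predicate:
        raise ValueError("dqlpredicate must not be empty")
    return f"{DQL_PREFIX} {predicate}"


if __name__ == "__main__":
    # Same predicate as the example in the parameter description.
    print(build_dql_query("dm_sysobject where folder('/phonecalls')"))
```

Checking the resulting statement in a DQL client before running the utility may help confirm that it returns only the objects you intend to archive.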
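The ingestion order described for the priority parameter (higher priority first, then earlier deadline first) can be sketched in Python as follows. The Sip class, its field names, and the sample values are hypothetical and serve only to illustrate the ordering rule; they are not part of InfoArchive or the connector.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Sip:
    name: str       # hypothetical label for the example
    priority: int   # greater value means higher ingestion priority
    deadline: date  # hypothetical field for the ingestion deadline date


def ingestion_order(sips: List[Sip]) -> List[Sip]:
    """Sort by priority (descending), then by deadline (ascending)."""
    return sorted(sips, key=lambda s: (-s.priority, s.deadline))


if __name__ == "__main__":
    queue = [
        Sip("a", priority=1, deadline=date(2014, 3, 1)),
        Sip("b", priority=5, deadline=date(2014, 6, 1)),
        Sip("c", priority=5, deadline=date(2014, 2, 1)),
    ]
    for sip in ingestion_order(queue):
        print(sip.name, sip.priority, sip.deadline)
```

In this sketch, a SIP with a lower priority is always placed after every higher-priority SIP, regardless of its deadline; the deadline only breaks ties between SIPs of equal priority.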