Frequently Asked Questions. Fulltext Indexing on Large Documentum Repositories For Content Server Versions up to 5.2.x

FAQ Version 1.0
Performance Engineering Page 1 of 8

FAQ1. Q. How will my Hardware requirements change?
FAQ2. Q. How many Documents can I store in a Filestore with Fulltext Indexing?
FAQ3. Q. What is Parallel Searching and how does it work?
FAQ4. Q. What are the Limitations of a Fulltext index Collection and its Partitions?
FAQ5. Q. How can I make a large Fulltext Index more efficient?
FAQ6. Q. How can I improve Fulltext Query Performance in a large Documentum Repository?
FAQ7. Q. How do I Tune the Fulltext index on a large Documentum Repository?
FAQ8. Q. What is the Verity Zone Search and how does it work?

Documentum Content Server 5.2.x uses the Verity fulltext engine version 2.7. The Documentum Systems Administration manual covers configuration and administration. This document covers only items of interest for Documentum Repositories with more than 1 million indexable objects, or with indexable objects of very large content size.

Figure 1. The logical components in a 5.2.x Fulltext Index. [Diagram labels: Repository; Filestore01; Filestore02; Collection A, B, C, D; partition a, b, c, d.]

To make documents in a filestore available for searching, you must first create one or more collections. A collection is a set of partitions (directories and files) that lets search users quickly find and display documents matching various search criteria through the Verity search engine.

[Scope] The Challenges of Fulltext Indexing on Large Documentum Repositories

The main issue for large Documentum Repositories is the length of time it takes to process requests. The procedures for building and searching the fulltext index take longer to complete. These procedures include:

o Batch creation of the fulltext index
o Incremental updates to an existing fulltext index
o De-fragmentation of fulltext index files
o Searching and security constraints

A well-planned design for index creation can yield faster search response times.

FAQ1. Q. How will my Hardware requirements change?

A. In addition to increasing the disk space requirements, fulltext indexing consumes additional CPU cycles and increases the I/O to the content disks. A more robust system will provide better performance and throughput when processing fulltext requests.

CPU - The Verity processes that update the index consume CPU cycles. The more processes are spawned, the higher the throughput and the higher the CPU consumption.

Disk Space - As a general rule of thumb, the fulltext index for a given filestore needs an additional 30-40% of disk space over the space consumed by the content. This varies with the percentage of indexable words in the documents and with the number and size of indexable object attributes.

Disk I/O - Because additional data is stored in the Verity collections, the disk I/O subsystem must be able to support the increased I/O rates to the content storage areas.

FAQ2. Q. How many Documents can I store in a Filestore with Fulltext Indexing?

A. A single filestore can contain approximately 10 million indexable documents. Splitting documents across multiple filestores is not advisable when using fulltext indexing, because the default serial search reads through all collections and merges the results before returning them to the requestor, which costs performance. If multiple filestores are necessary, a customized parallel search scheme can be used to increase search performance.
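The disk space rule of thumb from FAQ1 can be turned into a quick sizing estimate. The following is a minimal sketch in Python, not part of any Documentum or Verity tooling; the function name and the 500 GB sample figure are assumptions chosen for illustration.

```python
def index_disk_estimate(content_gb, low=0.30, high=0.40):
    """Return the (low, high) extra disk space, in GB, for a fulltext index.

    Applies the 30-40% rule of thumb to the space consumed by the content.
    The real figure depends on how many indexable words and attributes
    the documents have.
    """
    return content_gb * low, content_gb * high


# Hypothetical filestore holding 500 GB of content.
low_gb, high_gb = index_disk_estimate(500)
print(f"Plan for roughly {low_gb:.0f} to {high_gb:.0f} GB of additional index space")
```

Treat the result as a starting point for capacity planning only; measure the actual index size once a representative sample has been indexed.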

Figure 2. The default serial search path through a Fulltext Index. [Diagram: a single query, Select... From SEARCH TOPIC Enterprise, checks Filestore01, Filestore02, Filestore03 and Filestore04 in turn for matches.]

FAQ3. Q. What is Parallel Searching and how does it work?

A. A customized parallel search specifies a single filestore as part of the search criteria. You can issue one search request per filestore, so that several searches run at the same time. The results of these searches are not merged for you, so the calling program must merge them.

Figure 3. Parallel search through a Fulltext Index. [Diagram: two queries, Select SEARCH TOPIC Enterprise in FTINDEX filestore01 and Select SEARCH TOPIC Enterprise in FTINDEX filestore02, each check their own filestore for matches; Filestore03 and Filestore04 are also shown.]

FAQ4. Q. What are the Limitations of a Fulltext index Collection and its Partitions?

A. There may be a practical limit to the number of objects in a single Verity Collection. The architectural limit is 64,000 documents per partition, or a single partition file size of 2 GB. Within Documentum, a Collection is also divided into multiple partitions based on document format. Eventually, search performance degrades as the collection becomes fragmented into many partitions.
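The parallel search described in FAQ3 leaves the merge step to the caller. The following is a minimal sketch in Python of that fan-out and merge, under stated assumptions: `search_one` is a hypothetical callable that runs one search restricted to a single filestore and returns a list of object identifiers, and `FAKE_INDEX` is stand-in data, not output from a real repository. How the per-filestore query is actually issued is outside this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_search(topic, filestores, search_one):
    """Search each filestore concurrently and concatenate the results.

    search_one(topic, filestore) is a hypothetical callable that returns
    the hits for one filestore; results keep the order of `filestores`.
    """
    if not filestores:
        return []
    with ThreadPoolExecutor(max_workers=len(filestores)) as pool:
        per_store = list(pool.map(lambda fs: search_one(topic, fs), filestores))
    merged = []
    for hits in per_store:
        merged.extend(hits)  # the programmatic merge step
    return merged


# Stand-in data and search function, for illustration only.
FAKE_INDEX = {
    "filestore01": ["doc-a", "doc-b"],
    "filestore02": ["doc-c"],
    "filestore03": [],
}


def fake_search(topic, filestore):
    return list(FAKE_INDEX.get(filestore, []))


results = parallel_search(
    "Enterprise", ["filestore01", "filestore02", "filestore03"], fake_search
)
print(results)
```

A real merge step may also need to re-rank or truncate the combined list, since each filestore's results arrive ordered independently of the others.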

[Diagram: Filestore01 and Filestore02 with collectiona and collectionb; collectiona contains partition_a, partition_b, partition_c and partition_d; labels read "Collection 2GB max" and "Partition 64K docs max".]

FAQ5. Q. How can I make a large Fulltext Index more efficient?

A. To make your fulltext index more efficient, avoid indexing nonselective or commonly found words. The Verity Style Stop files can be configured to apply a word exclusion list when building the collection. The default stop file excludes all single-character words and the most common prepositions, articles, adverbs, and conjunctions, so words like "of", "the", "only", and "or" are not indexed. You can also exclude additional words that are common in your industry and are not useful for querying a document. Your fulltext index will then be smaller and more meaningful for searches.

For example, suppose 75% of your documents contain the word "Documentum". A word that common is not a meaningful search criterion, so adding "Documentum" to the stop file list can reduce the build, maintenance, and search costs of the fulltext index.

[Diagram: the Style.stp file, with the note "Omit the word documentum from the index".]
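One way to find candidates for the stop file is to measure how many documents each word appears in. The following is a minimal sketch in Python under stated assumptions: it works on plain text already extracted from the documents, the function name and the sample texts are hypothetical, and it only lists candidates. It does not read or write the Verity stop file, whose format is described in the Verity documentation.

```python
import re
from collections import Counter


def stop_word_candidates(documents, min_fraction=0.75):
    """Return words that occur in at least min_fraction of the documents."""
    total = len(documents)
    if total == 0:
        return []
    doc_count = Counter()
    for text in documents:
        # Count each word once per document (document frequency).
        doc_count.update(set(re.findall(r"[a-z0-9]+", text.lower())))
    return sorted(w for w, n in doc_count.items() if n / total >= min_fraction)


# Hypothetical sample texts, for illustration only.
sample = [
    "Documentum repository sizing notes",
    "Documentum fulltext index tuning",
    "Verity collection partitions",
    "Documentum search performance",
]
candidates = stop_word_candidates(sample)
print(candidates)
```

Review the candidates by hand before excluding them: a word that is frequent but still used in real queries should stay in the index.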

FAQ6. Q. How can I improve Fulltext Query Performance in a large Documentum Repository?

A. There are a few search options that can improve performance.

Efficient Security Checks - Documentum differs from other "internet-like" search technologies in that it does not just return what "exists"; it returns what "exists" AND "what you are allowed to see". This extra property means that your access is checked for each document that could be in the result set. In 5.2, the algorithm obtains the search result from the fulltext environment and then performs an access join against the ACL tables to filter out the documents that you do not have rights to see. Reducing the number of ACLs through System ACLs, and keeping database statistics current, can help make this join efficient.

Avoid Unselective Queries - In 5.2.x there are several ways to harden your user interface against unselective queries:

o Modify your user interface to run the ESTIMATE_SEARCH method before doing the actual search. ESTIMATE_SEARCH returns an estimate (not an exact count) of the number of rows that qualify for the query. It is an estimate because the security join has not been done. If the estimate is over a pre-defined limit, the user interface can prompt the user to refine the query to make it more selective. For example, at a car company the word "SUV" might produce a large number of hits, but "Fuel Efficient SUV" might be significantly more selective. For example:

    execute ESTIMATE_SEARCH with name = 'filestore01', type = 'dm_document', query = 'Fuel Efficient SUV'

o Use the Search Clause options in DQL. These include the IN FTINDEX and FT_OPTIMIZER conditions. The IN FTINDEX option allows you to run a query against an individual fulltext index. For example:

    select * from dm_document SEARCH TOPIC 'Fuel Efficient SUV' IN FTINDEX filestore_02
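The ESTIMATE_SEARCH check described above amounts to a gate in front of the real search. The following is a minimal sketch in Python of that control flow only. `estimate_fn` and `search_fn` are hypothetical callables that would wrap ESTIMATE_SEARCH and the actual query in a real client; the limit of 500 and the stand-in hit counts are assumptions for illustration.

```python
def guarded_search(query, estimate_fn, search_fn, max_hits=500):
    """Run the search only if the estimated hit count is within max_hits.

    estimate_fn(query) is a hypothetical wrapper around ESTIMATE_SEARCH;
    search_fn(query) is a hypothetical wrapper around the real search.
    """
    estimated = estimate_fn(query)
    if estimated > max_hits:
        # Too unselective: ask the user to refine instead of searching.
        return {"status": "refine", "estimated_hits": estimated, "results": []}
    return {"status": "ok", "estimated_hits": estimated, "results": search_fn(query)}


# Stand-in estimates and search, for illustration only.
FAKE_ESTIMATES = {"SUV": 12000, "Fuel Efficient SUV": 40}


def fake_estimate(query):
    return FAKE_ESTIMATES.get(query, 0)


def fake_search(query):
    return [f"document matching {query!r}"]


broad = guarded_search("SUV", fake_estimate, fake_search)
narrow = guarded_search("Fuel Efficient SUV", fake_estimate, fake_search)
print(broad["status"], narrow["status"])
```

Because the estimate is taken before the security join, the number of documents a user can actually see may be lower than the estimate; the limit should be chosen with that in mind.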

FAQ7. Q. How do I Tune the Fulltext index on a large Documentum Repository?

Note: When using Verity's tools manually, shut down the Documentum Repository before tuning your fulltext index.

A. The VDB (Verity Data Base) is the fundamental storage mechanism responsible for supporting dynamic access to documents in collections. A VDB consists of simple tables with rows and columns that relate to each other by row position. VDB tables are not relational, and their architecture supports quick and efficient searching over textual data. A VDB consists of segments that are packed into a single file. One advantage of a single packed VDB file is optimized search performance: the fewer files that need to be opened during search processing, the faster the search.

To optimize your collection, use the following command:

    mkvdk -collection PATH_TO_COLLECTION_DIRECTORY -optimize tuneup

For example:

    mkvdk -collection /dm/storage_01/verity/24001e5280003d03/dm_sysobject/universal -optimize tuneup

tuneup - This optimization tunes a collection using a combination of optimization types. It performs the following tasks:

o Maximal merging of the partitions, to create partitions that are as large as possible (maximum of 64,000 docs per partition)
o Squeezing deleted documents
o Making linear partition data

Note: Because this effectively destroys and rebuilds your fulltext collection, this task should be performed outside normal production hours.

FAQ8. Q. What is the Verity Zone Search and how does it work?

A. Zone searching is a feature of Verity fulltext searching that allows you to restrict a search to a particular zone or region in an XML document. This lets you be more specific in your search criteria. XML elements and attributes are mapped to zones by default by the XML zone filter, whose parser determines the zones in the document.

When content files have an extension of XML, Verity automatically runs them through the XML zone filter. As attributes are indexed, they are also assigned zones. This feature can make your fulltext queries more efficient, because the set of fulltext matches can be narrowed further. DQL allows you to perform content and metadata searches within the fulltext index, using the following syntax:

    SEARCH TOPIC 'searchstring <IN> attribute_name'

For example:

    select * from dm_document SEARCH TOPIC 'Canada <IN> country'

Note: XML zone searching does not span across Virtual Documents.

[Reference Section]

o Verity, Inc., Introduction to Collections Manual
o Documentum Content Server Administrator's Guide, Chapter 9, Fulltext Indexes
o Documentum Content Server DQL Reference Manual: Search Clause, estimate_search
o Documentum Tech Support Notes 8594 and 16089, How to use Verity's mkvdk tool
o Documentum Managing XML Content Guide