Frequently Asked Questions
Fulltext Indexing on Large Documentum Repositories
For Content Server Versions up to 5.2.x
FAQ Version 1.0
Performance Engineering
FAQ1. Q. How will my Hardware requirements change?
FAQ2. Q. How many Documents can I store in a Filestore with Fulltext Indexing?
FAQ3. Q. What is Parallel Searching and how does it work?
FAQ4. Q. What are the Limitations of a Fulltext index Collection and its Partitions?
FAQ5. Q. How can I make a large Fulltext Index more efficient?
FAQ6. Q. How can I improve Fulltext Query Performance in a large Documentum Repository?
FAQ7. Q. How do I Tune the Fulltext index on a large Documentum Repository?
FAQ8. Q. What is the Verity Zone Search and how does it work?

In Documentum Content Server 5.2.x, the Verity fulltext engine version 2.7 is used. The Documentum Systems Administration manual covers configuration and administration. This document covers only items of interest for Documentum Repositories with more than 1 million indexable objects, or with indexable objects that have very large content.

[Figure 1. The logical components in a 5.2.x Fulltext Index. Labels: Repository; Filestore01 with Collection A and Collection B; Filestore02 with Collection C and Collection D; partitions a to d.]

To make documents in a filestore available for searching, you must first create one or more collections. A collection is a set of partitions (directories and files) that allows search users to use the Verity search engine to quickly find and display documents matching various search criteria.

[Scope]
The Challenges of Fulltext Indexing on Large Documentum Repositories
The main issues for large Documentum Repositories concern the length of time it takes to process requests. The procedures for building and searching the fulltext index take longer to complete. These procedures include:
o Batch creation of the fulltext index
o Incremental updates to an existing fulltext index
o De-fragmentation of fulltext index files
o Searching and security constraints
A well-planned design for index creation can yield faster search response times.

FAQ1. Q. How will my Hardware requirements change?
A. In addition to increasing disk space requirements, fulltext indexing consumes additional CPU cycles and increases the I/O to the content disks. A more robust system will provide better performance and throughput when processing fulltext requests.
CPU: The Verity processes that update the index consume CPU cycles. Spawning more processes increases throughput, but it also increases CPU consumption.
Disk Space: As a general rule of thumb, the fulltext index for a given filestore needs additional disk space equal to 30-40% of the space consumed by the content. This varies with the percentage of indexable words in the documents, and with the number and size of indexable object attributes.
Disk I/O: Because additional data is stored in the Verity collections, the disk I/O subsystem must be able to support the increased I/O rates to the content storage areas.

FAQ2. Q. How many Documents can I store in a Filestore with Fulltext Indexing?
A. A single filestore can contain approximately 10 million indexable documents. It is not advisable to split documents across multiple filestores when using fulltext indexing, because the default serial search reads through all collections and merges the results before returning them to the requestor. If multiple filestores are necessary, a customized parallel search scheme can be used to increase search performance.
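A customized parallel search scheme of this kind might look roughly like the sketch below: one query per fulltext index, issued concurrently, with the result sets merged by the caller. This is only an illustration under stated assumptions. The run_query callable is hypothetical and stands in for whatever client API executes DQL at your site, the index names are placeholders, and the sketch assumes each query returns a list of r_object_id values.

```python
from concurrent.futures import ThreadPoolExecutor


def build_query(topic, index_name):
    # One DQL statement per fulltext index, using the IN FTINDEX clause.
    # Assumption: the topic contains no single quotes (no escaping is done).
    return (
        "select r_object_id from dm_document "
        f"search topic '{topic}' in ftindex {index_name}"
    )


def parallel_search(run_query, topic, index_names):
    """Run one search per index concurrently and merge the results.

    run_query is a hypothetical callable: it takes a DQL string and
    returns a list of object ids.
    """
    if not index_names:
        return []
    queries = [build_query(topic, name) for name in index_names]
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        result_sets = list(pool.map(run_query, queries))
    # The server does not merge these result sets, so merge them here,
    # dropping duplicates while keeping first-seen order.
    merged, seen = [], set()
    for rows in result_sets:
        for object_id in rows:
            if object_id not in seen:
                seen.add(object_id)
                merged.append(object_id)
    return merged
```

Threads are used here because each request mostly waits on the server; any ranking or sorting of the merged list is left to the application.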
[Figure 2. The default serial search path through a Fulltext Index. A single query (Select... From SEARCH TOPIC Enterprise) checks Filestore01, Filestore02, Filestore03 and Filestore04 in turn for matches.]

FAQ3. Q. What is Parallel Searching and how does it work?
A. A customized parallel search specifies a single filestore as part of the search criteria. You can issue a separate search request against each individual filestore, so that several searches run at the same time. The result sets are not merged for you, so the application must merge them programmatically.

[Figure 3. Parallel search through a Fulltext Index. Two queries, Select SEARCH TOPIC Enterprise in FTINDEX filestore01 and Select SEARCH TOPIC Enterprise in FTINDEX filestore02, each check a single filestore for matches. The figure shows Filestore01 through Filestore04.]

FAQ4. Q. What are the Limitations of a Fulltext index Collection and its Partitions?
A. There may be a practical limit to the number of objects in a single Verity Collection. The architectural limits are 64,000 documents per partition and 2 GB for a single partition file. Within Documentum, a Collection is also divided into multiple partitions based on document format. Eventually, search performance degrades as the collection becomes fragmented into many partitions.
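To get a feel for what the 64,000-document limit means at scale, the arithmetic can be sketched as below. This is a lower bound only: it ignores the 2 GB partition file limit and the splitting of partitions by document format, both of which can only increase the count. The function name is illustrative.

```python
MAX_DOCS_PER_PARTITION = 64_000  # architectural limit quoted in FAQ4


def min_partitions(doc_count):
    """Smallest number of partitions that can hold doc_count documents."""
    # Integer ceiling division; avoids floating point rounding.
    return (doc_count + MAX_DOCS_PER_PARTITION - 1) // MAX_DOCS_PER_PARTITION


# A filestore at the approximate 10 million document mark from FAQ2.
print(min_partitions(10_000_000))  # prints 157
```

Even a fully optimized collection of this size therefore spans well over a hundred partitions, which is why fragmentation beyond that minimum is worth watching.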
[Figure: Collection and partition limits. Labels: Collection, 2GB max; Partition, 64K docs max; Filestore01; Filestore02; collectiona containing partition_a, partition_b, partition_c and partition_d; collectionb.]

FAQ5. Q. How can I make a large Fulltext Index more efficient?
A. To make your fulltext index more efficient, avoid indexing nonselective or commonly found words. The Verity Style Stop files can be configured to apply a word exclusion list when the collection is built. The default stop file prevents indexing of all single-character words and of the most common prepositions, articles, adverbs, and conjunctions, so words like "of", "the", "only", and "or" are not indexed. You can also omit additional words that are common in your industry and are not useful for querying a document. This makes your fulltext index smaller and more meaningful for searches.
For example, suppose 75% of your documents contain the word "Documentum". That word is not a meaningful search criterion, so adding "Documentum" to the stop file list can reduce the build, maintenance, and search costs of the fulltext index.

[Figure: The word "documentum" listed in Style.stp so that it is omitted from the index.]
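One way to find candidates for the stop file is to measure how many documents each word appears in. The sketch below is an assumption-laden illustration, not part of Verity or Documentum: it works on plain text you have already extracted, uses a simple tokenizer, and takes the 75% figure from the example above as its default threshold. Review the output by hand before adding any word to the stop file.

```python
import re
from collections import Counter


def stop_word_candidates(documents, min_fraction=0.75):
    """Return words that occur in at least min_fraction of the documents."""
    doc_freq = Counter()
    for text in documents:
        # Count each word once per document (document frequency).
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        doc_freq.update(words)
    threshold = min_fraction * len(documents)
    return sorted(word for word, count in doc_freq.items() if count >= threshold)


sample = [
    "Documentum server guide",
    "Documentum index tuning",
    "Documentum search tips",
    "Verity collections",
]
print(stop_word_candidates(sample))  # prints ['documentum']
```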
FAQ6. Q. How can I improve Fulltext Query Performance in a large Documentum Repository?
A. There are a few search options that can improve performance.

Efficient Security Checks
Documentum differs from other "internet-like" search technologies in that it does not just return what "exists"; it returns what "exists" AND "what you are allowed to see". This extra property means that your access is checked for each document that could be in the result set. In 5.2, the algorithm obtains the search result from the fulltext environment and then does an access join against the ACL tables to filter out the documents that you do not have rights to see. Reducing the number of ACLs through System ACLs, and keeping statistics current, can help to make this join efficient.

Avoid Unselective Queries
In 5.2.x there are several ways to harden your user interface against unselective queries:
o Modify your user interface to run the ESTIMATE_SEARCH method before doing the actual search. ESTIMATE_SEARCH returns an estimate (not an exact count) of the number of rows that qualify for the query. It is an estimate because the security join has not been done. If the estimate is over a pre-defined limit, the user interface can prompt the user to refine the query to make it more selective. For example, at a car company a search for the word "SUV" might produce a large number of hits, but "Fuel Efficient SUV" might be significantly more selective.
  For example:
  execute ESTIMATE_SEARCH with name = 'filestore01', type = 'dm_document', query = 'Fuel Efficient SUV'
o Use the Search Clause options in DQL. These include the IN FTINDEX and FT_OPTIMIZER conditions. The IN FTINDEX option allows you to run a query against an individual fulltext index.
  For example:
  select * from dm_document SEARCH TOPIC 'Fuel Efficient SUV' IN FTINDEX filestore_02
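The ESTIMATE_SEARCH check described above could be wired into a user interface layer along the following lines. This is a minimal sketch under assumptions: run_estimate and run_search are hypothetical callables standing in for your own DQL client code, the limit of 1000 is an arbitrary placeholder, and the query text is assumed to contain no single quotes.

```python
def build_estimate_search(index_name, object_type, query):
    """Build the ESTIMATE_SEARCH statement in the form shown in FAQ6."""
    return (
        f"execute estimate_search with name = '{index_name}', "
        f"type = '{object_type}', query = '{query}'"
    )


def guarded_search(run_estimate, run_search, index_name, query, limit=1000):
    """Run the real search only if the estimate is within the limit.

    run_estimate(statement) -> int and run_search(query) -> list are
    hypothetical callables supplied by the application.
    """
    statement = build_estimate_search(index_name, "dm_document", query)
    estimate = run_estimate(statement)
    if estimate > limit:
        # Too unselective: let the UI ask the user to refine the query.
        return ("refine", estimate)
    return ("results", run_search(query))
```

Because the estimate is taken before the security join, the user may end up seeing fewer rows than the estimate suggests; the guard only protects against queries that are clearly too broad.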
FAQ7. Q. How do I Tune the Fulltext index on a large Documentum Repository?
A. When using Verity's tools manually, you should shut down the Documentum Repository before tuning your fulltext index.

The VDB (Verity Data Base) is the fundamental storage mechanism responsible for supporting dynamic access to documents in collections. A VDB consists of simple tables with rows and columns that relate to each other by row position. VDB tables are not relational, and their architecture supports quick and efficient searching over textual data. A VDB consists of segments that are packed into a single file. One advantage of a single packed VDB file is optimized search performance: the fewer files that need to be opened during search processing, the faster the search.

To optimize your collection, use the following command:
  mkvdk -collection PATH_TO_COLLECTION_DIRECTORY -optimize tuneup
For example:
  mkvdk -collection /dm/storage_01/verity/24001e5280003d03/dm_sysobject/universal -optimize tuneup
tuneup: This optimization tunes a collection using a combination of optimization types. It performs the following tasks: maximal merging of the partitions to create partitions that are as large as possible (maximum of 64,000 docs per partition), squeezing out deleted documents, and making partition data linear.
Note: Because this effectively destroys and rebuilds your fulltext collection, this task should be performed outside normal production hours.

FAQ8. Q. What is the Verity Zone Search and how does it work?
A. Zone searching is a feature of Verity fulltext searching that allows you to restrict a search to a particular zone, or region, in an XML document. This allows you to be more specific in your search criteria. By default, the XML zone filter maps XML elements and attributes to zones. The XML zone filter parser determines the zones in the document.
When content files have an extension of XML, Verity automatically runs these files through the XML zone filter. As attributes are indexed, they are also assigned zones. This feature can make your fulltext queries more efficient, because fulltext matches can be narrowed further. The DQL language allows you to perform content and metadata searches within the fulltext index, using the following syntax:
  SEARCH TOPIC 'searchstring <IN> attribute_name'
For example:
  select * from dm_document SEARCH TOPIC 'Canada <IN> country'
Note: XML zone searching does not span Virtual Documents.

[Reference Section]
Verity, Inc., Introduction to Collections Manual
Documentum Content Server Administrator's Guide, Chapter 9, Fulltext Indexes
Documentum Content Server DQL Reference Manual, Search Clause, estimate_search
Documentum Tech Support Notes 8594 and 16089, How to use Verity's mkvdk tool
Documentum Managing XML Content Guide