Globalbrain Administration Guide Version 5.4

Copyright 2012 by Brainware, Inc. All rights reserved. No part of this publication may be reproduced, transmitted, transcribed, stored in a retrieval system, or translated into any language in any form by any means without the written permission of Brainware, Inc.

Trademarks
Brainware, Inc., and its logos are trademarks of Brainware, Inc. Additional trademarks include FineReader 10, Copyright 1993-2009 ABBYY (BIT Software), Russia. Microsoft SQL Server 2005 and Microsoft SQL Server 2008, Copyright by Microsoft Corporation. Oracle Outside In Technology, Copyright by Oracle. Oracle 10g and Oracle 11g, Copyright by Oracle. Product names mentioned herein are for identification purposes only and may be trademarks and/or registered trademarks of their respective companies.

Warranties
The customer acknowledges that Brainware, Inc. has given no assurance, nor made any representations or warranties of any kind with respect to the product, the results of its use, or otherwise. Brainware, Inc. makes no warranty regarding the applicable software package, its merchantability or fitness for a particular purpose; all other warranties, express or implied, are excluded.

Software License Notice
Your license agreement with Brainware specifies the permitted and prohibited uses of the product. Any unauthorized duplication or use of the Brainware software, in whole or in part, in print, or in any other storage and retrieval system, is forbidden.

Document Number BW-AG-GB54
Version 5.4, June 2012
Brainware, Inc., 20110 Ashbrook Place, Suite 150, Ashburn, VA 20147

Administration Guide Contents Contents 1 INTRODUCTION... 9 1.1 About this Document... 9 1.2 Basic Terms and Concepts... 9 1.2.1 Application Types...9 1.2.2 Important Components...9 1.2.3 Server Instances...9 1.2.4 Data Sources... 10 1.2.5 Security Domains... 10 2 THE DESKTOP ADMINISTRATION CLIENT... 11 2.1 Introduction... 11 2.2 Login... 11 2.3 The Main Window... 12 2.3.1 Selecting a Module... 12 2.3.2 Editing Data... 13 2.3.2.1 Saving Modifications... 13 2.3.2.2 Configuring Times for Scheduled Events... 13 2.3.3 The Server Menu... 14 2.3.3.1 Login and Logout... 14 2.3.3.2 Change Password... 14 2.3.4 2.3.4.1 The Settings Menu... 14 Changing the Language... 14 3 ADMINISTRATION WITH THE WEB CLIENT... 15 3.1 Introduction... 15 3.2 Accessing the Web Client... 15 3.3 The Main Window... 15 4 SERVER INSTANCES... 16 4.1 Configuring a Server Instance... 16 4.1.1 Properties... 16 4.1.2 Adding and Removing Server Instances... 17 4.2 Starting and Stopping a Server Instance... 18 4.2.1 Using Scripts... 18 4.2.2 Using Windows Services... 18 4.3 Monitoring a Server Instance... 18 5 DATA SOURCE... 20 5.1 Configuring a Data Source... 20 5.1.1 Properties... 20 5.1.2 Adding and Removing Data Sources... 21 6 USERS & SECURITY... 22 6.1 Security Domains in Detail... 22 6.1.1 Security Domain Types... 22 6.1.2 System Maintainer Domains... 22 6.1.3 Anonymous Access... 22 6.2 Configuring a Security Domain... 22 6.2.1 Internal User Management... 23 6.2.1.1 Storing Data in a Database... 23 6.2.1.2 Storing Data in Files... 24 6.2.2 Integrating Microsoft Active Directory... 25 6.2.2.1 Adding a Security Domain... 25 6.2.2.2 General Settings... 25 6.2.2.3 LDAP Connection... 27 6.2.2.4 Users & Groups... 28 6.2.2.5 Advanced Settings... 29 6.2.3 Integrating other LDAP Servers... 29 Globalbrain Page 3 of 181

Administration Guide Contents 6.2.3.1 Adding a Security Domain... 29 6.2.3.2 General Settings... 29 6.2.3.3 LDAP Connection... 29 6.2.3.4 Users & Groups... 30 6.2.3.5 Advanced Settings... 30 6.2.4 Password Policy... 31 6.3 Managing Users & Groups... 31 6.3.1 Searching for Users... 31 6.3.2 User Properties... 32 6.3.3 Creating and Deleting Users... 33 6.3.4 Selecting a Group... 33 6.3.5 Group Properties... 34 6.3.6 Creating and Deleting Groups... 34 6.4 Managing Permissions... 35 6.4.1 Introduction... 35 6.4.2 The Permissions Dialog... 35 6.4.3 Domain-level Permissions... 36 7 DOCUMENT REPOSITORIES... 38 7.1 Document Repositories in Detail... 38 7.1.1 Documents and Document Groups... 38 7.1.2 Document Repository Types... 38 7.2 Configuring a Document Repository... 39 7.2.1 Storing Documents in a Database... 39 7.2.1.1 Creating a new Document Repository in a Database... 39 7.2.1.2 Properties... 39 7.2.1.3 Configuring a SharePoint Security Connector... 40 7.2.2 Storing Documents directly in a Search Index... 41 7.2.2.1 Creating a new Document Repository as Standalone Search Index... 41 7.2.2.2 Properties... 42 7.2.3 Embedding an External Database... 43 7.2.3.1 Properties... 43 7.2.3.2 Adding a new Document Repository... 43 7.2.4 Removing Document Repositories... 43 7.3 Browsing Document Repositories... 44 7.3.1 Filtering Documents... 44 7.3.2 Viewing a Document... 46 7.3.3 Actions on Documents... 47 7.3.3.1 Deleting Documents... 47 7.3.3.2 Assigning Documents to Groups... 47 7.3.3.3 Assigning an Attribute to Documents... 48 7.4 Managing Document Groups... 48 7.4.1 Document Group Properties... 48 7.4.1.1 Static Groups... 48 7.4.1.2 Dynamic Groups... 49 7.4.2 Creating and Removing Document Groups Static Group... 50 7.4.3 Creating and Removing Document Groups Dynamic Group... 51 7.5 Permissions on Document Repositories and Groups... 51 7.5.1 Document Repository Permissions... 51 7.5.2 Document Group Permissions... 52 7.5.3 Summary... 52 7.6 Document Rendition Repositories... 53 7.6.1 Document Rendition Repositories in Detail... 53 7.6.2 Configuring a Document Rendition Repository... 53 7.6.2.1 Storing Files in a Network Directory... 53 7.6.2.2 Storing Files in a Local Directory... 54 7.6.3 Deleting a Document Rendition Repository... 54 8 SEARCH INDEX... 55 8.1 Search Index in Detail... 55 8.1.1 What is a Search Index?... 55 Globalbrain Page 4 of 181

Administration Guide Contents 8.1.2 Facets... 55 8.1.3 Configuring the Search Engine... 56 8.1.4 Search Index Types... 56 8.1.4.1 Linked Indexes vs. Standalone Indexes... 56 8.1.4.2 Single-Server Indexes vs. Multi-Server-Indexes... 56 8.1.4.3 Search Index Pools... 56 8.1.5 Actions on Search Indexes... 57 8.1.5.1 Opening and Closing a Search Index... 57 8.1.5.2 Synchronizing and Rebuilding a Linked Search Index... 57 8.1.6 Search Index Status... 57 8.2 Configuring a Linked Search Index... 58 8.2.1 Single-Server Index... 58 8.2.1.1 Creating and Removing an Index... 58 8.2.1.2 Properties... 58 8.2.2 Multi-Server Index... 60 8.2.2.1 Creating and Removing a Multi-server Index... 60 8.2.2.2 Properties... 60 8.2.3 Search Index Pool... 62 8.2.3.1 Creating and Removing a Search Index Pool... 62 8.2.3.2 Properties... 62 8.3 General Search Index Settings... 62 8.3.1 Editing the General Search Index Configuration... 63 8.3.2 Attribute Configuration... 63 8.3.3 Scoring Configuration... 64 8.4 Search Index Administration and Monitoring... 64 8.4.1 Actions on Search Indexes... 65 8.4.2 Displaying Runtime Statistics... 65 8.5 Document Access Statistics... 67 8.5.1 Introduction... 67 8.5.2 Configuring an Access Statistics Repository... 68 8.5.2.1 Storing Statistics in a Database... 68 8.5.2.2 Storing Statistics in Files... 68 8.5.3 Customizing the Delay in the Web Client... 69 8.5.4 Monitoring Document Access Statistics... 69 9 QUERY REPOSITORIES... 70 9.1 Query Repositories in Detail... 70 9.1.1 Configuring a Query Repository... 70 9.1.1.1 Properties... 70 9.1.1.2 Creating and Removing a Query Repository... 70 9.1.2 Permissions on Query Repositories... 71 9.1.3 Viewing and Deleting Queries... 71 10 CRAWLER... 73 10.1 Crawler in Detail... 73 10.1.1 Crawlers and Agents... 73 10.1.2 Crawl Tasks... 74 10.1.2.1 What to crawl... 74 10.1.2.2 Where to store documents... 74 10.1.2.3 Which rules to apply... 74 10.1.2.4 Renditions... 74 10.1.2.5 Scheduling... 75 10.1.3 Jobs... 75 10.1.4 Types of Crawlers... 76 10.2 Configuring a Crawler... 76 10.2.1 Crawler Using a Database... 76 10.2.1.1 Creating a new Crawler... 76 10.2.1.2 Properties... 76 10.2.2 File-based Crawler... 77 10.2.2.1 Creating a new Crawler... 77 10.2.2.2 Properties... 77 10.2.3 Configuring Crawl Agents... 78 Globalbrain Page 5 of 181

Administration Guide Contents 10.2.3.1 10.2.3.2 Adding, Editing and Removing Agents... 78 Properties... 78 10.2.4 Removing Crawlers... 78 10.2.5 Configuring a Proxy... 79 10.3 Permissions on Crawlers... 79 10.4 Configuring Crawl Tasks... 80 10.4.1 Properties... 80 10.4.1.1 General Settings... 81 10.4.1.2 Document Sources... 81 10.4.1.3 Scheduling... 85 10.4.1.4 Authentication... 85 10.4.1.5 Crawling Rules... 86 10.4.1.6 Extraction... 88 10.4.1.7 Attribute Filter... 91 10.4.1.8 URL Transformation... 92 10.4.1.9 Assigning Documents to Groups... 92 10.4.1.10 Renditions... 92 10.4.1.11 Advanced Settings... 93 10.4.1.12 Configuring a Post Processor... 94 10.4.2 Viewing and Editing Crawl Tasks... 96 10.4.3 Creating and Removing a Crawl Task... 97 10.4.4 Using the IMAP Wizard... 98 10.4.5 Importing and Exporting Crawl Tasks... 99 10.4.6 Changing Properties of Multiple Crawl Tasks... 100 10.4.7 Force Recrawling of Crawl Tasks... 101 10.5 Monitoring the Crawlers... 101 10.5.1 Status Monitoring... 101 10.5.2 Starting and Stopping a Crawler... 102 10.6 Monitoring Jobs... 102 10.6.1 Viewing Jobs... 103 10.6.2 Deleting Jobs... 105 10.6.3 Enforce Recrawling of Jobs... 105 10.6.4 Exporting Crawl Jobs... 105 10.6.5 Displaying the Crawl Path... 106 10.6.6 Job Deletion Rules... 106 10.7 Viewing Statistics... 107 10.8 Validating Jobs... 108 11 CLASSIFICATION... 110 11.1 Classification in Detail... 110 11.1.1 What the classification service can do... 110 11.1.2 Available Classifiers... 110 11.1.3 View Repositories... 110 11.2 Configuring a View Repository... 111 11.2.1 View Repository... 111 11.2.1.1 Properties... 111 11.2.1.2 Creating and Deleting View Repositories... 111 11.2.2 Classification Server... 111 11.2.2.1 Properties... 111 11.2.2.2 Creating and Removing Classification Servers... 112 11.2.3 Autoclassification Service... 112 11.2.3.1 Properties... 112 11.2.3.2 Configuring and Deleting the Autoclassification Service... 112 11.3 Working with Views... 113 11.3.1 Creating a View... 113 11.3.1.1 Properties of a View... 113 11.3.1.2 Creating a View... 115 11.3.1.3 Adding and Removing Classes... 116 11.3.1.4 Adding and Removing Example Texts... 117 11.3.1.5 Learning the View... 118 11.3.2 Exporting and Importing Views... 119 11.3.2.1 Exporting a View... 119 Globalbrain Page 6 of 181

Administration Guide Contents 11.3.2.2 Importing a View... 119 11.4 Using the Autoclassification Service... 119 11.4.1 Creating and Deleting Classification Tasks... 119 11.4.2 Properties of a Classification Task... 121 11.4.3 Browsing Classification Results... 122 11.5 Permissions on View Repositories and Views... 123 11.6 Monitoring of Classification Servers... 123 12 SUBSCRIPTIONS... 125 12.1 Subscriptions in Detail... 125 12.2 Configuring Subscriptions... 125 12.2.1 Subscription Manager... 125 12.2.2 Channels... 126 12.2.2.1 Creating a Poll Channel... 126 12.2.2.2 Creating an Email Channel... 126 12.2.2.3 Deleting a Channel... 128 12.3 Permissions on the Subscription Manager... 128 12.4 Monitoring the Push Manager... 129 12.5 Viewing and Deleting Subscriptions... 129 13 SEARCH LISTS... 131 13.1 Search List Processing in Detail... 131 13.1.1 Introduction... 131 13.1.2 General Workflow... 131 13.1.3 Search Lists vs. Subscriptions... 131 13.2 Configuring a Search List Processing Service... 131 13.2.1 Creating and Deleting a Search List Processing Service... 132 13.2.2 Properties... 132 13.3 Permissions for Search List Processing... 132 13.4 Manage Search List Processing Tasks... 133 13.4.1 Properties... 133 13.4.1.1 General... 134 13.4.1.2 Query Source... 134 13.4.1.3 Document Repository... 137 13.4.1.4 Execution Trigger... 138 13.4.2 Viewing and Editing Tasks... 139 13.4.3 Creating and Removing Tasks... 139 13.4.4 Force an Immediate Execution of Tasks... 140 13.5 Viewing Search List Results... 141 13.5.1 Result List... 141 13.5.2 Result Details... 142 13.5.3 Exporting results... 142 13.5.4 Deleting Results... 142 13.6 Monitoring the Search List Processing Service... 142 14 MONITORING... 143 14.1 Dashboard... 143 14.2 Sessions... 144 14.2.1 Configuring the Session Manager... 144 14.2.1.1 Storing Sessions in a Database... 145 14.2.1.2 Holding Sessions in Memory... 145 14.2.2 Viewing Sessions... 145 14.3 System Messages and Client Error Reports... 146 14.3.1 System Messages and Client Error Reports in Detail... 146 14.3.1.1 System Messages... 146 14.3.1.2 Client Error Reports... 147 14.3.2 Configuring a System Message Repository... 147 14.3.3 Viewing System Messages... 148 14.3.4 Viewing Client Error Reports... 150 Globalbrain Page 7 of 181

Administration Guide Contents 14.4 Audit Messages... 151 14.4.1 Audit Messages in Detail... 151 14.4.2 Configuring an Audit Repository... 152 14.4.3 Activating Message Types... 152 14.4.4 Viewing Audit Messages... 153 14.4.5 Generating Reports... 154 15 LICENSE... 156 15.1 Licensing in Globalbrain... 156 15.1.1 The Product License... 156 15.1.2 The Feature Licenses... 156 15.2 License Monitoring... 157 15.3 Importing a new License... 157 15.4 License Info in the Web Application... 157 16 CONFIGURATION OF THE WEB APPLICATION... 159 16.1 Configuring Default Values... 159 16.1.1 Search Defaults... 160 16.1.2 Markup Defaults... 161 16.1.3 Search Result Export... 161 16.1.4 PDF Settings... 162 16.1.5 E-Mail Settings... 162 16.1.6 Subscription... 163 16.2 Configuring Sticky Attributes... 163 16.2.1 Creating a Sticky Attribute... 164 16.2.2 Removing a Sticky Attribute... 165 16.2.3 Ordering Sticky Attributes... 165 16.3 Configuring File System Access... 165 16.4 Activating NTLM Authentication... 166 16.4.1 Preparing NTLM Authentication... 166 16.4.2 Configuring NTLM Authenticaton... 166 16.5 Configuring Ticket Authentication... 167 17 SOAP ACCESS... 169 17.1 Introduction... 169 17.2 Configuring SOAP Access... 169 18 TOOLS... 170 18.1 Command line tools... 170 18.1.1 Introduction: Using the Tools... 170 18.1.2 Exporting and Importing a Document Repository... 170 18.1.3 Crawling a Single URL... 170 18.1.4 Validating a Crawled Directory... 170 18.1.5 Changing the Logging Behavior... 171 18.1.6 Configuration Explorer... 172 18.1.7 Script Runner... 173 18.2 CairoExtractor Designer... 174 18.3 Starting CairoExtractor Designer... 174 18.4 Selecting Preprocessing Methods... 174 18.4.1 Setting Specific OCR Tolerances... 175 18.4.1.1 Box & Comb Removal... 175 18.4.1.2 Lines Manager... 176 18.5.1 General Tab... 177 18.5.2 Recognition Tab... 177 18.5.3 Languages Tab... 178 18.6 Getting Path Information... 178 INDEX... 180 Globalbrain Page 8 of 181

Administration Guide 1 Introduction 1 Introduction 1.1 About this Document This document provides an administrative overview of your Globalbrain installation. This includes starting the server, configuring existing and additional services, and monitoring these services. This document does not cover the installation of Globalbrain please refer to the Installation Guide if you need assistance with setting up a new Globalbrain installation or extending an existing installation. 1.2 Basic Terms and Concepts 1.2.1 Application Types A Globalbrain installation typically consists of three kinds of application: - The Globalbrain Server: each installation includes at least one Globalbrain server instance. Typically, multiple server instances are distributed over several machines. - The Globalbrain Administration Client: this client allows you to configure and monitor your Globalbrain installation. It should be installed on at least one machine multiple installations on multiple machines are possible. - The Globalbrain Web Client: the web client provides an interface to Globalbrain s search capabilities, browsing of classification results and management of subscriptions that can be used by normal users without administration permissions. Additionally, it provides an administration section that allows web-based monitoring and administration of a Globalbrain installation. The web client is typically installed once within a servlet container like Apache Tomcat. Unless otherwise stated, the instructions given for the Globalbrain Administration Client also apply for the Globalbrain Web Client. Screenshots from both the Desktop Client and Web Client will be presented in parallel where appropriate. 1.2.2 Important Components - The core functionality of Globalbrain is to make large numbers of documents searchable. Documents or more accurately the text content and the attributes of documents - are stored in Document Repositories. For each Document Repository, one or more Search Indexes can be configured that index the content of the Document Repository and make it searchable. Documents can be added to Document Repositories either by 3 rd party applications via API calls or by using a Globalbrain Crawler: a Crawler is responsible for collecting documents from a source for instance, a file system, an FTP server, or a web server. 1.2.3 Server Instances Components like a Search Index or a Crawler are hosted by a designated server instance. By having multiple server instances within one Globalbrain installation, the data load can be distributed. It is possible to distribute the server instances across multiple machines as well as having multiple server instances on the same machine. Multiple server instances on the same machine allow you to use different server instances for different tasks. As an example, one instance could host the Crawler while another instance hosts Search Indexes. With multiple server instances on one machine, each server instance runs on a different port Globalbrain Page 9 of 181

Administration Guide 1 Introduction and has a different role. The role is a string that briefly describes the task of this instance (e.g. Crawler ). 1.2.4 Data Sources Globalbrain can either use a relational database (Oracle or Microsoft SQL Server) or proprietary files to store its data. For larger installations, using a database is recommended. When working with relational databases, Globalbrain needs to know how to access the database on which machine the database is running, which account can be used to log in, etc. This is configured as a data source. Globalbrain can store everything in one data source but is also able to make use of different data sources. In a system with multiple data sources, the data source to be used must be configured for each component that stores its data in a database. 1.2.5 Security Domains Users, groups and permissions are organized in security domains. Each Globalbrain installation has at least one security domain. Working with multiple security domains allows data and services to be strictly separated from each other within an installation: a user or administrator logging into one security domain only sees data and services for that domain the data from other security domains is invisible. Security domains can either use Globalbrain s internal user management or access an external user repository, e. g. a Microsoft Active Directory Server. Globalbrain Page 10 of 181
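To make these concepts concrete, a small two-machine installation might look like this (all names, roles and ports are purely illustrative, not defaults shipped with Globalbrain):

- Machine A runs a server instance with the role "Crawler" on RMI port 1499; it hosts the Crawlers that collect documents.
- Machine B runs a server instance with the role "Index" on RMI port 1499; it hosts the Search Indexes.
- A data source named "gbdata" points to an Oracle database on a third machine and is used by the components that store their data in a database.
- A single security domain, e.g. "ACME", with internal user management contains all users, groups and permissions.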

2 The Desktop Administration Client

2.1 Introduction

The Globalbrain Administration Client is a desktop application that allows you to configure and monitor your Globalbrain installation. Written in Java, it can be used on any operating system for which a Java Runtime Environment is available. The Administration Client uses remote access to connect to a Globalbrain server instance and does not necessarily have to be installed on the same machine as the Globalbrain server. You should ensure that the port used by the server instance you wish to connect to is not blocked by a firewall.

On Windows, the Globalbrain Administration Client can be started using GBAdminClient.exe in the main installation directory. On UNIX, use GBAdminClient.sh to start the application. You can also create a desktop shortcut for more convenient access. For operating systems where no start script was created by the setup, use the following command line to start the Administration Client:

java -jar GBAdminClient.jar

2.2 Login

Figure 2-1 The Login Dialog

After you have started the Globalbrain Administration Client, you need to log in to a Globalbrain server. The login dialog consists of two sections:

In the Server section you need to provide access data for a Globalbrain server instance: For Host, enter the name or the IP address of the machine on which the Globalbrain server instance is running. Enter the RMI port used by the server instance in the Port field. By default, this is 1499. If your Globalbrain installation consists of multiple server instances, you can connect to any server instance. If no server instance is available on the given host and port, a "Server is not reachable" message will be displayed.

In the Account section, enter your login details: Select your security domain from the list of available security domains. This list will refresh automatically when you connect to a different server instance.

Enter the login credentials for your account into the fields User and Password. If you are using Globalbrain's internal user management and did not change anything after the installation or creation of the security domain, a default account exists with the user name system and the password secret. It is highly recommended that you change this password after the first login.

Click on Login to perform the login or on Cancel to abort the login. If you have entered incorrect login information, the message "Illegal user name or password" will be displayed. If your password has expired, a dialog for changing the password will be displayed (see section 2.3.3.2 Change Password).

2.3 The Main Window

2.3.1 Selecting a Module

The Globalbrain Administration Client comprises different modules. Depending on the permissions of the account you are logged in with, the available modules are displayed as icons on the left side of the main window. Switching between the different modules changes the content of the main panel. The module panels of the desktop and web applications have the same or similar functionality; they are explained in detail in the following chapters.

Figure 2-2 Module Selection for Desktop Application

The following modules are available:
- Monitoring provides access to different monitoring tools, such as the dashboard and the viewers for system messages and audit messages.
- Configuration allows you to configure your Globalbrain installation. This is where you can create or modify Document Repositories, Search Indexes, Crawlers, etc.
- Users manages the users and groups in your security domain.
- Document Repository provides access to the document repositories that are available in your security domain. You can browse the document repositories and manage their document groups.
- Crawler provides access to your security domain's Crawlers. It allows crawl tasks to be configured and Crawler activity to be monitored.

- Classification allows you to configure views that can be used to assign documents automatically to categories.
- Search Lists allows you to define Search List Processing tasks and to manage the search results.
- User Data gives you access to data that was stored by normal users. This includes bookmarked queries and subscriptions.

2.3.2 Editing Data

2.3.2.1 Saving Modifications

The exact layout of the main panel and how to select editable data depends on the selected module. Where editable data is displayed, you will always find a toolbar on top of the form that contains at least two buttons:
- A Save button that stores the changes.
- A Reset button that restores the original values.
Both buttons are disabled as long as no data has been modified.

2.3.2.2 Configuring Times for Scheduled Events

In some cases you will be able to configure times for scheduled events, for instance for starting and stopping the Crawler or synchronizing Search Indexes.

Figure 2-3 Configuring Times for Scheduled Events

It is possible to configure multiple times by specifying:
- the months in which the event will be launched
- the days of the week on which the event will be launched
- the hours at which the event will be launched
- the minutes at which the event will be launched

You can change the value for a time field by clicking on the field and then selecting the required values from the list that pops up. Clicking a value a second time will deselect it. To delete the time configuration completely, click on the x to the right of the Minutes field.

In the example shown in Figure 2-3, the event is started every hour from midnight to 11 o'clock at night, from Monday to Sunday and from January to December; in short, every hour on the hour. You can limit events to certain months, days of the week or hours of the day by selecting a single month, weekday or hour, or any combination of months, weekdays or hours. To launch an event several times within an hour, select multiple values for the Minutes field. For instance, to launch an event every 30 minutes, select all values for months, weekdays and hours, and 0 and 30 for minutes.

Please be aware that if you limit the hours to certain values and keep using multiple values for minutes, the list of launch times will be the combination of both values. If, for example, you select only the hours 6 and 18 and the minutes 0 and 30, the event will be launched at 6:00, 6:30, 18:00 and 18:30.

2.3.3 The Server Menu

2.3.3.1 Login and Logout

If you are not logged in yet or have logged out, you can log in using the Server / Login menu. This displays the login dialog described in section 2.2 Login. If you are logged in, you can terminate your current session by selecting Server / Logout.

2.3.3.2 Change Password

Once you are logged in to a security domain that uses Globalbrain's internal user management, you can change your own password by selecting Server / Change Password.

Figure 2-4 Changing a Password

Enter your current password as Old Password. Enter the new password twice, as New Password and New Password (Repeat). The new passwords must be the same. Depending on the password policy configured for your security domain, the password may need to have a minimum length or follow a specific pattern. If the new password violates this policy, the password will not be changed and an error message will be displayed. If your security domain uses an external user repository such as an Active Directory Server, the password cannot be changed within the Administration Client.

2.3.4 The Settings Menu

2.3.4.1 Changing the Language

Figure 2-5 Changing the Language

The Globalbrain Administration Client currently supports two languages: English and German. If you have not explicitly configured anything else, the language is chosen based on the preferences of your operating system. You can override this default selection by choosing a different language in the Settings / Language menu. After changing to a different language, you must restart the Administration Client so that the change can be applied.

3 Administration with the Web Client

3.1 Introduction

The Globalbrain Web Client fully supports the configuration, monitoring and administration of a Globalbrain installation. It can be used from any browser without having to install additional software.

3.2 Accessing the Web Client

The administration section of the web client can be reached by adding /admin/ to the base URL. So if the base URL of your web client installation is http://gbserver:8080/globalbrain/, the administration section can be found at http://gbserver:8080/globalbrain/admin/

Figure 3-1 Login Screen

If you are not logged in yet, the URL will lead to a login screen. Please check the Globalbrain User Guide for details about logging in to the web client. Once you have logged in, the main screen is displayed. Please note that anonymous login is not accepted for the administration section. If you have accessed the user section of the web application with the guest account, you will be asked to log in with a real user account when trying to access the administration section.

3.3 The Main Window

Figure 3-2 Module Selection for Web Application

A menu bar at the top of the screen displays the available modules. The modules are the same as in the Desktop Administration Client (see section 2.3.1 Selecting a Module). Depending on the permissions granted to your user account and on what is currently configured for your security domain, not all modules may be visible. Each module can offer multiple panels that are listed as tabs in the second row of the menu. When you select a different module, this list is replaced and the first available panel is preselected. To change to another panel, click on its tab.

Administration Guide 4 Server Instances 4 Server Instances 4.1 Configuring a Server Instance 4.1.1 Properties Once a server instance has been installed, it might be necessary to change the configuration either because a host name or an IP address has changed, or because the data should be stored in a different directory. Figure 4-1 Selecting a Server Instance Desktop Client Figure 4-2 Selecting a Server Instance Web Client In the Globalbrain Administration Client: Select the Configuration module. Select the server instance you wish to edit from the Server Instances node of the tree. In the Web Client: Select the Configuration module. Select the System tab. Select the server instance you wish to edit from the Server Instances node of the tree. This opens up a form that allows you to edit the properties. Figure 4-3 Server Instance Properties Desktop Client Globalbrain Page 16 of 181

Figure 4-4 Server Instance Properties Web Client

The following properties are available:
- Hostname is the name of the host on which the server instance is running.
- IP is the IP address of the host on which the server instance is running.
- Role is the role of the server. This can be any string but should give a brief description of the task this instance will perform. A role must be unique for a host: no two server instances on a host can share the same role name. Please note that role names that only differ in non-alphanumeric characters are considered to be equal ("Server 1" and "Server_1" are the same role).
- RMI Port is the port on which the server listens for remote requests. Again, the port must be unique for the host; no two server instances on a host can share the same port.
- Data Directory is the directory in which this server instance stores local files. By default, this is the data/ directory within the server installation. It can be changed to separate the data from the server application.
- RAM Disk Directory is optional and allows the path of a RAM disk to be configured. This can improve the performance of some components like search indexes.

Please make sure that the IP address is always up to date; this value is important for the communication between the server instances. When a server instance starts up and detects that the configured IP address does not match the detected IP address, a warning message is logged.

Editing the properties of a server instance requires the System Management permission.

4.1.2 Adding and Removing Server Instances

It is strongly recommended that additional server instances are only added with the setup program. The Globalbrain Administration Client can neither generate start scripts nor register Windows services.

To add a server instance via the Globalbrain Administration Client: Select the Configuration module. Right-click the Server Instances node. Select New Server Instance.

To delete a server instance with the Globalbrain Administration Client: Select the Configuration module. Right-click the server instance in the Server Instances node of the tree. Select Delete. You cannot delete a server instance while any component is configured to run on it.

Adding and deleting server instances requires the System Management permission.
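To illustrate the uniqueness rules, two server instances running on the same host could be configured like this (all values are examples only):

Instance 1: Hostname gbhost01, Role "Crawler", RMI Port 1499, Data Directory D:\globalbrain\data-crawler
Instance 2: Hostname gbhost01, Role "Index", RMI Port 1500, Data Directory D:\globalbrain\data-index

The roles and RMI ports differ, so both instances can coexist on gbhost01. A further instance with the role "Crawler 2" would be allowed, whereas adding "Crawler_2" next to "Crawler 2" would not, because the two names only differ in non-alphanumeric characters.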

4.2 Starting and Stopping a Server Instance

There are two ways to start and stop the instances of the Globalbrain server:
- You can use the start and stop scripts (.bat or .sh files).
- If you have installed the server instances as Windows Services, you can start and stop them via the Services Administrative Tool.

4.2.1 Using Scripts

The bin/ directory of your server installation contains a start and a stop script for each server instance that is configured for the current machine.
- gbsrv-role.bat (Windows) or gbsrv-role.sh (UNIX) can be used to start the server instance.
- gbsrv-role-shutdown.bat (Windows) or gbsrv-role-shutdown.sh (UNIX) can be used to stop a running server instance.
Role is the name of the role that was assigned to the server instance. For example, the start and stop scripts for the server instance with the role Server would be gbsrv-server.bat and gbsrv-server-shutdown.bat.

The server instances are started as console applications and will print out logging messages. You can also find these logging messages in log files within the log/ directory. It may take a few seconds for the server instance to stop after the stop script has been invoked. If you try to start a server instance twice, the second instance will terminate quickly with an error message.

4.2.2 Using Windows Services

Figure 4-5 Server Instance as Windows Service

If you have chosen to create Windows services during the installation, it is recommended that the server instances are started and stopped via the Services control instead of using the scripts. The service names consist of Globalbrain plus the role.

4.3 Monitoring a Server Instance

Figure 4-6 Server Instances on the Dashboard

Use the Dashboard to get a quick overview of the availability of your server instances. The dashboard is available in the Monitoring module of the Globalbrain Administration Client and the Globalbrain Web Client.

Administration Guide 4 Server Instances The Server Instances section lists all configured server instances along with their status. Two status values are possible: - Available if the server instance is started and reachable - Unavailable if the server instance is either not started or not reachable due to network problems The status of a server instance is also displayed in the detail panel status bar when you select the server instance in the Globalbrain Administration client s Configuration module. Please note that this feature is not available in the Web Client. You can check the status of a server instance on the Dashboard. Figure 4-7 Status of a server instance in the Configuration module It can take up to 90 seconds before the status displays the correct value after you start a server instance. Monitoring a server instance requires the Server Monitoring permission. Globalbrain Page 19 of 181
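If a server instance is shown as Unavailable even though it appears to be running, a plain TCP check against the instance's host and RMI port can help distinguish a stopped instance from a network or firewall problem. The commands below are standard operating system tools, not part of Globalbrain, and the host name and port are only examples:

telnet gbhost01 1499 (Windows or UNIX; requires the telnet client to be installed)
Test-NetConnection gbhost01 -Port 1499 (Windows PowerShell)

If the port does not respond, check that the server instance has actually been started and that no firewall is blocking the configured RMI port.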

Administration Guide 5 Data Source 5 Data Source 5.1 Configuring a Data Source 5.1.1 Properties To edit the properties of a Data Source in the Globalbrain Administration Client: Select the Configuration module. Select the Data Source from the Data Sources node in the tree. Edit the properties in the displayed form and save. Figure 5-1 Data Source Properties Desktop Client Figure 5-2 Data Source Properties Web Client The following properties are available: - ID is an internal ID for this data source. It must be unique and is only editable for new Data Sources. - Name is a human readable name that represents the Data Source. - Host is the name of the host on which the database server is running. - Port is the port on which the database server is listening. - SID is the SID of the database (Oracle only). When using Oracle XE, this field may be left empty. - User is the user name used for login. - Password is the password for the user name configured above. - Initial Pool Size configures the number of database connections that are opened initially by the connection pool. More connections are opened on demand. Changing the properties of a data source requires the System Management permission. Globalbrain Page 20 of 181
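As an illustration of these properties, a data source for an Oracle database could use values like the following; a Microsoft SQL Server data source would typically differ only in the port and in leaving the SID empty. All values are examples, not defaults shipped with Globalbrain:

ID: ds_main
Name: Main Globalbrain Database
Host: dbserver01
Port: 1521 (Oracle) or 1433 (Microsoft SQL Server)
SID: ORCL (Oracle only)
User: globalbrain
Initial Pool Size: 5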

Administration Guide 5 Data Source During the installation a gbsys Data Source is created. If the access data for this database changes, you need to update the gbserver-config.xml files in the conf/ directories of your server installations and in the WEB-INF/classes/ directory of your web application first. After that, start one server instance and the Globalbrain Administration Client, and edit the Data Source properties. 5.1.2 Adding and Removing Data Sources Unlike the Globalbrain Setup, the Globalbrain Administration Client is not able to create a new database or database account. You can only configure access to a database that has already been created by the database administrator. Figure 5-3 Adding a Data Source To add a new Data Source via the Globalbrain Administration Client: Select the Configuration module. Right click the Data Sources node. Select New Data Source in the context menu and choose the database type from the sub menu. Edit the properties as described in the previous section and click on the Save button. To delete a Data Source: Select the Configuration module. Select the Data Source from the Data Sources node. Select Delete from the context menu. Creating or deleting a Data Source requires the System Management permission. You cannot delete a Data Source while it is referenced by another component. Globalbrain Page 21 of 181

6 Users & Security

6.1 Security Domains in Detail

6.1.1 Security Domain Types

Users, groups and permissions are organized in security domains. Each Globalbrain installation has at least one security domain. Having multiple security domains allows a strict separation of data and services; when you are logged into one security domain, the other security domains' data is invisible.

A security domain consists of two sub-components:
- The user repository contains the users and groups and is responsible for user authentication.
- The security manager stores the permissions that are granted to objects within Globalbrain. It is responsible for permission checks.

While a security manager always stores its data internally within Globalbrain, either in a relational database or in files within the local file system, there are two kinds of user repositories:
- Globalbrain can manage users and groups internally. This means users and groups need to be created and maintained within Globalbrain.
- Globalbrain can access an external user repository. This means users and groups are taken from that external repository and cannot be created within Globalbrain. The standard distribution supports the integration of LDAP servers, including Microsoft Active Directory Server. Support for other user repositories can be added by implementing a simple interface.

6.1.2 System Maintainer Domains

While most components like Document Repositories, Search Indexes, Crawlers, etc. are bound to a specific security domain and are not visible to members of other security domains, items like server instances, data sources and the license are used by all security domains. Only members of security domains that are marked as a system maintainer are allowed to configure these items; members of other domains can see them but cannot create, delete or modify a server node or a data source, or import a new license. At least one security domain within a Globalbrain installation must be marked as a system maintainer. For installations with multiple security domains, the recommended strategy is to create one master domain that is marked as a system maintainer but does not hold any data.

6.1.3 Anonymous Access

Security domains can be configured to support anonymous access. This means a user can, for instance, perform searches via the web client without needing to log in. As with a normal login, a user session is created for the anonymous user. The difference is that a special internal user account is used for these sessions. The user name is ANONYMOUS; it is not a member of any group and cannot be edited. Thus, this user only has the permissions that are explicitly granted to it. This allows precise control over what is visible without logging in.

6.2 Configuring a Security Domain

All actions described in this section require the Security Management permission.

Administration Guide 6 Users & Security 6.2.1 Internal User Management 6.2.1.1 Storing Data in a Database If your Globalbrain installation is configured for accessing a relational database and you do not want to integrate an external user repository, the preferred way to store users and groups is using a relational database. To add a new security domain of this type: Select the Configuration module. Right click on the Security Domains node. Select New Security Domain / Internal User Management in Database. Edit the properties as described below Click the Save button to confirm the changes. Figure 6-1 Security Domain with Database Desktop Client Figure 6-2 Security Domain with Database Web Client General Globalbrain Page 23 of 181

Administration Guide 6 Users & Security - Name is the name of your security domain. This should be a short string containing, for instance, the name of your company or the department that will use this domain. - Data Source is the data source in which the data will be stored. - Anonymous Access enables or disables anonymous access to this security domain (see section 6.1.3 Anonymous Access). - System Maintainer controls whether this security domain has system maintainer permissions. Password Policy See section 6.2.4 Password Policy. 6.2.1.2 Storing Data in Files The storage of user and group data in files is only recommended if your Globalbrain installation works without a relational database. To add a new security domain of this type: Select the Configuration module. Right click on the Security Domains node. Select New Security Domain / Internal User Management in Files. Edit the properties as described below Click the Save button to confirm the changes. Figure 6-3 Security Domain with Data Storage in Files General Name is the name of your security domain. This should be a short string containing, for instance, the name of your company or the department that will use this domain. Server Instance is the server instance that hosts the files that contain the data. Anonymous Access enables or disables anonymous access to this security domain (see section 6.1.3 Anonymous Access). System Maintainer controls whether this security domain has system maintainer permissions. Password Policy In addition to the general properties, you can configure the password policy (see section 6.2.4 Password Policy). It is not possible to login to Globalbrain if the server instance that hosts the files is not running. Globalbrain Page 24 of 181

6.2.2 Integrating Microsoft Active Directory

The integration of a Microsoft Active Directory Server is based on the LDAP protocol. Please make sure that your AD server is configured for LDAP access. If you have not changed the default configuration, you should do this before proceeding.

6.2.2.1 Adding a Security Domain

To add a new security domain that accesses an Active Directory Server: Select the Configuration module. Right-click on the Security Domains node. Select New Security Domain / Access LDAP Server / Active Directory. Edit the properties as described in the next sections. Click the Save button to confirm the changes.

6.2.2.2 General Settings

Figure 6-4 Active Directory Integration Desktop Client

Administration Guide 6 Users & Security Figure 6-5 Active Directory Integration Web Client General The General settings are similar to security domains with internal user management: - Name is the name of your security domain. This should be a short string containing, for instance, the name of your company or the department that will use this domain. - Anonymous Access enables or disables anonymous access to this security domain (see section 6.1.3 Anonymous Access). - System Maintainer controls whether this security domain has system maintainer permissions. User Repository See sections LDAP Connection, Users & Groups, and Advanced Settings about User Repository configuration. Security Manager Select the data source where the access control lists are stored from the dropdown list. The integration of an LDAP server is only available for installations with relational database access. For this type of security domain, the configuration of a password policy is unnecessary passwords are set outside of Globalbrain within the Active Directory server. The actual access to the Active Directory Server is configured in the User Repository section. It consists of three tabs: LDAP Connection, Users & Groups, and Advanced. Globalbrain Page 26 of 181

6.2.2.3 LDAP Connection

Figure 6-6 LDAP Connection Settings

Figure 6-7 Active Directory Domain Name

LDAP Connection contains the basic connection information:
- Host is the name or IP address of the host on which your Active Directory Server is running.
- Port is the port of the Active Directory Server; this is usually 389.
- Domain is the full name of your Active Directory domain as it is displayed in the tree of the AD management application (Figure 6-7).
- LDAP user is the DN (distinguished name) of a domain user that is able to read existing users and groups.
- Password is the password for this user.

Figure 6-8 Determine DN with dsquery

If you are not sure of the DN for LDAP user, use the dsquery tool on the machine that hosts your AD Server: dsquery user -name username gives you a list of users that match the username pattern.

If for some reason Globalbrain is not able to connect to your Active Directory server, you can use the LDAP user and the password to log in to Globalbrain and adapt the configuration settings.
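For example, on a domain controller of the (fictitious) domain example.com, the DN of a service account could be determined like this; the account and domain names are placeholders only:

dsquery user -name gbservice*
"CN=Globalbrain Service,CN=Users,DC=example,DC=com"

The returned DN would then be entered as LDAP user, together with, for instance, Host dc01.example.com, Port 389 and Domain example.com.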

Administration Guide 6 Users & Security 6.2.2.4 Users & Groups Figure 6-9 Users and Groups Users & Groups contains the configuration of users and groups that are available within Globalbrain: - Admin Group is the name of a group within the Active Directory server that is treated as the administrator group within Globalbrain. This is relevant for granting default permissions to secured objects. If this group does not exist, no user will have admin permissions within Globalbrain. - OU for Users is the organizational unit that contains the users that are available within Globalbrain. This can either contain the full distinguished name (DN) or simply the name of the OU, using backslash as a separator to reference a sub node (Figure 6-8). To configure more than one organization unit, separate multiple values with a semicolon. If the value is left empty users are taken from the Users node (Figure 6-11). - OU for Groups is the organizational unit that contains the groups that are available within Globalbrain. The format is the same as for OU for Users. If you leave this field empty, you should activate use built-in groups to make sure that groups are available at all. - Use built-in groups imports in addition to the OU for Groups the built-in groups of the AD server. Figure 6-10 Using Users from a specific OU Figure 6-11 Using Users from the default node Globalbrain Page 28 of 181
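To illustrate the OU syntax (the OU and domain names below are examples only), users located in the organizational unit Staff inside Engineering in the domain example.com could be configured in either of the following ways:

OU for Users: OU=Staff,OU=Engineering,DC=example,DC=com
OU for Users: Engineering\Staff

Several organizational units can be combined with semicolons, for instance Engineering\Staff;Sales.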

Administration Guide 6 Users & Security Figure 6-12 Using Built-in Groups 6.2.2.5 Advanced Settings Figure 6-13 Advanced Settings The properties that are available under the Advanced tab do not usually have to be edited. Please don t touch them if you are not absolutely sure of what you are doing. Search filter for Users is the filter that is used when searching for users. Classes for Groups are the classes that represent groups. Search Filter for Groups is the filter that is used when searching for groups. 6.2.3 Integrating other LDAP Servers 6.2.3.1 Adding a Security Domain To add a new security domain that accesses any kind of LDAP server: Select the Configuration module. Right click on the Security Domains node. Select New Security Domain / Access LDAP Server / Generic. Edit the properties as described in the next sections and click on the Save button. 6.2.3.2 General Settings The general settings for a security domain that accesses any LDAP server are the same as for security domains accessing Microsoft Active Directory server (see section 6.2.3.2). 6.2.3.3 LDAP Connection Figure 6-14 Connection Settings LDAP Connection contains the basic connection information: Globalbrain Page 29 of 181

Administration Guide 6 Users & Security - Host is the name or IP address of the host on which your LDAP server is running. - Port is the port of the LDAP server this is usually 389. - Domain is the root DN (distinguished name) which contains your user data. - LDAP user is the DN (distinguished name) of an LDAP user that is able to read existing users and groups. - Password is the password for this user. 6.2.3.4 Users & Groups Figure 6-15 Users and Groups Users & Groups contains the configuration of users and groups that are available within Globalbrain: - Admin Group is the name of a group within the LDAP server that is treated as the administrator group within Globalbrain. This is relevant for granting default permissions to secured objects. If this group does not exist, no user will have admin permissions within Globalbrain. - OU for Users is the organizational unit that contains the users that are available within Globalbrain. This must be entered as a DN. - OU for Groups is the organizational unit that contains the groups that are available within Globalbrain. This must be entered as a DN. If you leave this field empty, you should activate use built-in groups to make sure that groups are available. - use built-in groups imports the built-in groups of the LDAP server in addition to those imported by OU for Groups. 6.2.3.5 Advanced Settings Figure 6-16 Advanced Settings The properties that are available under the Advanced tab do not usually have to be edited. Please don t touch them if you are not absolutely sure what you are doing. - Search filter for Users is the filter that is used when searching for users. - Classes for Groups are the classes that represent groups. - Search Filter for Groups is the filter that is used when searching for groups. Globalbrain Page 30 of 181
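As an example of a generic LDAP configuration, a typical OpenLDAP-style directory might use values like the following. The exact object classes and attribute names depend on your directory's schema, so treat these purely as an illustration:

Domain: dc=example,dc=com
LDAP user: cn=globalbrain,ou=services,dc=example,dc=com
OU for Users: ou=people,dc=example,dc=com
OU for Groups: ou=groups,dc=example,dc=com
Search filter for Users: (objectClass=inetOrgPerson)
Classes for Groups: groupOfNames
Search Filter for Groups: (objectClass=groupOfNames)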

6.2.4 Password Policy

Figure 6-17 Password Policy

The Password Policy section that is displayed for some types of security domain allows you to configure the rules for user passwords.
- Expires after allows you to set a maximum age for passwords. Users must change their password after the configured number of days. A value of 0 means that passwords never expire.
- Max. Failed Logins allows you to configure after how many failed logins an account will be disabled. If an account gets disabled, the user cannot log in until the administrator has re-enabled the account (see section 6.3 Managing Users & Groups). A value of 0 means that there is no limitation.
- Minimum Length is the minimum length that passwords must have.

6.3 Managing Users & Groups

Note: Users and groups can only be edited when you are logged in to a security domain that uses Globalbrain's internal user management. For security domains using an external user repository, you can only browse users and groups and view some basic data. Please note that when a user is deleted from an external user repository, the associated information about that user in Globalbrain is not deleted immediately. The data is deleted by an asynchronous task that is executed every night.

All actions described in this section require the Security Management permission.

6.3.1 Searching for Users

Figure 6-18 Searching for Users Desktop Client

In the Globalbrain Administration Client: Select the Users module. Select the Users tab. Enter the user name or a part of the user name in the text field at the top; you may use * as a wildcard. Either press return or click on the Search button.
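For example, searching for sys* would match the default system account described in section 2.2, and m* would match every user whose name starts with "m" (the user names are, of course, only examples).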

Administration Guide 6 Users & Security Figure 6-19 Searching for Users Web Client In the Globalbrain Web Client: Select the User Management module. Select the Users tab. Enter the user name or a part of the user name in the text field at the top; you may use * as a wildcard. Either press return or click on the Search button. The result list will be filled with all matching users. You can now select the user you would like to edit; this user is displayed in the detail panel to the right of the result list. 6.3.2 User Properties Figure 6-20 User Properties Desktop Client Globalbrain Page 32 of 181

Administration Guide 6 Users & Security Figure 6-21 User Properties Web Client After selecting a user from the result list or after clicking on the New User button, you can edit the properties of the user: - Name is the login name of the user. This value cannot be changed once a new user account has been created. - Account is active allows you to enable or disable this account. The account may have been disabled by the system due to too many failed logins (see section 6.2.4 Password Policy). - Failed Logins displays the number of failed logins for this user. You can reset this value by clicking on the button. - Password and Password (Repeat) allow you to set a new password for the user. This only needs to be done when creating a new user account. For existing users, you only need to fill these fields to assign a new password - if you leave them empty, the password will not be changed. - Groups allows you to assign the user to groups. A list of all groups is displayed; check those to which the user should be assigned. 6.3.3 Creating and Deleting Users To create a new user account in the Globalbrain Administration Client: Select the Users module. Select the Users tab. Click on the New User button. You can now edit the user's properties in the main panel and save your modifications. To delete a user: Select the Users module. Select the Users tab. Search for the user (see section 6.3.1 Searching for Users) and select it. Click on the Delete User button. Confirm the deletion in the message box that pops up. The steps for creating or deleting a user account via the Globalbrain Web Client are similar. The appropriate buttons can be found on the User panel of the User Management module. 6.3.4 Selecting a Group Figure 6-22 Selecting a Group (Administration Client) In the Globalbrain Administration Client: Select the Users module. Select the Groups tab. Select the group from the list of groups that is displayed. Globalbrain Page 33 of 181

Administration Guide 6 Users & Security Figure 6-1 Selecting a Group (Web Client) In the Globalbrain Web Client: Select the Users module. Select the Groups tab. Enter the group name or a part of the group's name in the text field at the top. Either press return or click on the Search button. By default, Globalbrain creates two groups, admin and user, and grants all permissions that are available on the level of the security domain to the group admin. You may delete the group admin, but make sure that you have granted those permissions to another group or user first! The group user can be deleted without any risk if you'd prefer to work with other groups. 6.3.5 Group Properties Figure 6-23 Group Properties Desktop Client Figure 6-24 Group Properties Web Client Groups only have a single property: the group name. The name cannot be changed once the group has been created. 6.3.6 Creating and Deleting Groups To create a new group in the Globalbrain Administration Client: Select the Users module. Select the Groups tab. Click on the New Group button. You can now edit the group properties in the main panel and save your modifications. To delete a group: Select the Users module. Select the Groups tab. Select the group. Click on the Delete Group button. Globalbrain Page 34 of 181

Administration Guide 6 Users & Security Confirm the deletion in the message box that pops up. The steps for creating or deleting a group via the Globalbrain Web Client are similar. The appropriate buttons can be found on the Groups tab of the User Management module. 6.4 Managing Permissions 6.4.1 Introduction Various kinds of objects can be secured within Globalbrain; for instance, Document Repositories, groups within a document repository, or Crawlers. For each kind of securable object, a list of available permissions is defined. These permissions can be granted to or revoked from users or groups. A user has a permission: - if it is granted directly to him or to one of the groups of which he is a member, and - if it is not denied to him or to one of the groups of which he is a member. Generally, a denial overrides a grant. For example, a user that is a member of the group users would be allowed to read documents from a document repository if the Read documents permission is granted to the users group. If, in addition to that, the Read documents permission is denied to him, he will not be able to read documents. A short illustrative sketch of this resolution rule can be found at the end of this chapter. 6.4.2 The Permissions Dialog Figure 6-25 Permissions Dialog - Desktop Client Globalbrain Page 35 of 181

Administration Guide 6 Users & Security Figure 6-26 Permissions Dialog - Web Client The dialog for editing permissions looks similar for all kinds of access-controlled objects; it only differs in the list of available permissions. It consists of two sections: - The upper part allows the permissions to be edited for the selected actor. - The lower part displays the full access control list. Figure 6-27 Selecting an Actor from the list Figure 6-28 Searching an Actor There are two ways to select an actor: Click on the arrow beside the text field and select one of the proposed actors. This list includes all groups and, if anonymous access is allowed, the ANONYMOUS user. Enter a user name or a part of a user name and hit return to get a list of matching users; select one of these. Once an actor is selected, permissions can be edited by checking the granted or denied checkbox. Since a permission can either be granted or denied, the second checkbox is disabled once the first is checked. Changing the permissions of a secured object requires the Security Management permission. 6.4.3 Domain-level Permissions System permissions are attached to the security domain. These permissions can be edited by clicking on the Edit Permissions icon in the Users tab of the Users module. The following permissions are available: Globalbrain Page 36 of 181

Administration Guide 6 Users & Security - Read Configuration is required to read entries from the configuration database. This allows a user to see which server instances, data sources, security domains, etc. are configured. This is a very basic permission; most of the other permissions make no sense if it is not granted. - Write Configuration is required to store entries in the configuration database. This permission is necessary to create or modify any kind of configuration; for instance, Document Repositories, Crawlers, etc. - Server Monitoring is required to monitor the status of servers and components; for instance, to monitor the status of a search index or a server instance. - Server Management is required to change the status of a server component; for instance, to rebuild a search index. - System Management is required to change configurations that affect the whole system; for instance, configuring data sources or server instances. - Session Management allows sessions to be viewed and killed. - Security Management is required to edit users or groups, and to grant permissions. Globalbrain Page 37 of 181
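As a closing illustration for this chapter, the following sketch shows the resolution rule from section 6.4.1: a permission is effective only if it is granted to the user or one of his groups and not denied to the user or any of his groups. This is illustrative only and not Globalbrain's actual implementation; the user and group names are made up.

import java.util.Set;

final class EffectivePermission {

    static boolean isEffective(Set<String> userAndGroups,
                               Set<String> grantedTo,
                               Set<String> deniedTo) {
        boolean granted = userAndGroups.stream().anyMatch(grantedTo::contains);
        boolean denied  = userAndGroups.stream().anyMatch(deniedTo::contains);
        return granted && !denied;          // a denial always overrides a grant
    }

    public static void main(String[] args) {
        Set<String> actors  = Set.of("jsmith", "users");  // the user and the group he belongs to
        Set<String> granted = Set.of("users");            // "Read documents" is granted to the group "users" ...
        Set<String> denied  = Set.of("jsmith");           // ... but explicitly denied to the user himself
        System.out.println(isEffective(actors, granted, denied)); // prints "false"
    }
}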

Administration Guide 7 Document Repositories 7 Document Repositories 7.1 Document Repositories in Detail 7.1.1 Documents and Document Groups Document Repositories store two types of data objects: documents and document groups. Globalbrain documents are a representation of files or other data items that exist somewhere outside of Globalbrain and contain textual data that is to be processed by Globalbrain. They may be office documents, PDFs, HTML pages, mail messages, or entries in a database table. A document has the following properties: - The URL is a reference to the original file or data object. This may be a path in a file system, the URL of a web page or a reference to an entry in a database. The URL of a document must be unique within a document repository; no two documents can share the same URL. - Text contains the pure text of the original data object. Any kind of layout information is removed. - Attributes is a list which contains all of the original object's metadata; for instance, author or description. Title is a special attribute which is used by clients when displaying the document in a list. - The last modification date is the time the original data object was last modified. Documents can be organized into groups; each document can be a member of one or more groups. It is possible to assign permissions to groups, which allows users to be given read permission for only a part of a Document Repository. Globalbrain supports two types of document group: Static document groups: Documents must be assigned to a static document group to be a member of that group. This is usually done by the crawler (see 10.4.1.9). Dynamic document groups: Member documents of dynamic document groups are resolved dynamically based on rules. You can, for example, create a group that contains only documents with a specific attribute or only documents from a given folder. 7.1.2 Document Repository Types Globalbrain basically supports two types of document repositories: - The data can be stored in a relational database. One or more search indexes (see chapter 8 Search Index) must be configured for the document repository to provide full text search. This type of document repository is only available if data sources have been configured. - The data is stored directly in a search index, which means the document repository and the search index are the same thing. This type is also available on installations that work without a relational database. When choosing the second type, you should be aware that a search index which stores its data in files does not provide the same level of robustness and flexibility as a relational database does. There are several disadvantages: - Limited scalability: the documents cannot be distributed to multiple search indexes on different machines and it is not possible to have multiple search indexes for these documents. This leads to an implicit size limit. - Limited support for concurrent access: Only one user (or the Crawler) can write at a time, during which no read or search operation is possible. This affects performance. Globalbrain Page 38 of 181

Administration Guide 7 Document Repositories Mixing write operations with searching is very inefficient because parts of the index need to be rebuilt after a modification, before a search can be done. On the other hand, there is no need to synchronize the search index with database modifications. For document repositories that don't change often and have a limited size, the standalone search index can be a good choice. In addition, an external database can be embedded directly as a document repository. You can configure search indexes for this document repository that pull the data directly out of the external database. The advantage of this approach is that no crawling is necessary; the document repository always reflects the current state of the data in your database. See section 7.2.3 for further information. 7.2 Configuring a Document Repository All of the actions described in the following sections require the Write Configuration permission. 7.2.1 Storing Documents in a Database 7.2.1.1 Creating a new Document Repository in a Database To create a new document repository that stores its documents in a database: Select the Configuration module. Right click the Document Repositories node. Select New Document Repository / Database from the context menu. Edit the properties as described in the following sections. Save your modifications. 7.2.1.2 Properties Figure 7-1 Document Repository Properties in a Database Desktop Client Globalbrain Page 39 of 181

Administration Guide 7 Document Repositories Figure 7-2 Document Repository Properties in a Database Web Client A document repository that stores the documents in a database has the following properties: - Name is a human readable name for the document repository. - Data Source is the data source in which the data is stored. This cannot be changed once a new document repository has been created. - Keep deleted docs for specifies the minimum number of hours that deleted documents are kept in the database before they are really deleted. During this time they are only marked as deleted but are not accessible anymore. Other components that need to synchronize against a document repository can still query for those documents and thus can find out which documents have been deleted. - Clean-Up Time is the time at which deleted documents are really removed from the database. Since this may cause some load on the database it is recommended that this is done at a time of the day when the system is not used much by normal users. - Security configures the security model that is used for the document repository. Select Globalbrain to use Globalbrain s default security model based on document repository and document group permissions. Select SharePoint (early filtering) or SharePoint (late filtering) if this document repository stores documents that have been crawled from SharePoint (see next section for details). Select Custom security connector and enter a class name to work with a custom security connector. Custom security connectors can be implemented to read security information for a document repository s documents from an external application, for example, from a document management system. Please refer to chapter 5.1 of the Globalbrain Customization Guide for details. When configuring search indexes for the document repository, make sure that the time span for keeping deleted documents is long enough to cover the time between two synchronizations of the search index. Otherwise you risk that the documents are not deleted from the search index. For details on search index synchronization, see section 8.1.5.2 Synchronizing and Rebuilding a Linked Search Index. You can rename an existing document repository. If you have search indexes configured for a document repository that you rename, they will be empty afterwards and you will need to rebuild them. 7.2.1.3 Configuring a SharePoint Security Connector Globalbrain provides a security connector for document repositories that only gives users access to those documents that they are also allowed to see in SharePoint. Use this security Globalbrain Page 40 of 181

Administration Guide 7 Document Repositories connector for document repositories that contain documents that have been crawled from SharePoint sites (see Access to a SharePoint server in section 10.4.1.2). Globalbrain provides two options for integrating the SharePoint security: Early filtering: The search indexes will download the access control lists from SharePoint during a rebuild or synchronization. The search is then performed on a pre-filtered selection of documents based on the downloaded security information. The search will be performed without a connection to SharePoint. Late filtering: First, the search will be performed, and then the results will be filtered by checking the permission for each document on the SharePoint server. This approach is slower and causes much more load on the SharePoint server, but ensures that you are working with the latest version of the access control lists. Please note that before you can use this connector, Globalbrain needs to be integrated within SharePoint. This requires administrative steps on the SharePoint site. Please refer to the Installation Guide for instructions. Figure 7-3 Document Repository Properties for SharePoint Web Client Both the early and late filtering versions of the security connector are configured with the same properties: - SharePoint URL: Enter the URL for the main site of your SharePoint server. - User: Enter a user account that the connector can use to log in to SharePoint. When using the early filtering option, this user account must be a SharePoint Site Collection administrator in order to have the permission to download the necessary security information. When using a domain account, please add the domain as part of the user name. - Password: Enter the password for the user account. - Domain: Enter the name of your Active Directory domain. 7.2.2 Storing Documents directly in a Search Index 7.2.2.1 Creating a new Document Repository as Standalone Search Index To create a new document repository that stores its documents in a search index: Select the Configuration module. Right click the Document Repositories node. Select New Document Repository / File from the context menu. Globalbrain Page 41 of 181

Administration Guide 7 Document Repositories Edit the properties as described in the following section. Save your modifications. 7.2.2.2 Properties Figure 7-4 Document Repository Properties Standalone Search Index A document repository that stores the documents directly in a search index has the following properties: - Name is the name of the document repository. The name cannot be changed once a new document repository has been created. - Server Instance is the server instance on which this document repository is hosted. - Open when server instance starts, Use RAM disk and Engine instances are search index properties and will be explained in detail in section 8.2.1.2. - Flush after configures after how many inserted characters the data is flushed to the underlying files. The default value is 1, which means that when adding or modifying documents, all changes are written to disk immediately; this is the safest way but also the slowest configuration. Users who want better performance can increase this value. If you do this, the latest modifications might be lost if the server doesn't terminate cleanly (for example, if the server crashes). Globalbrain Page 42 of 181

Administration Guide 7 Document Repositories 7.2.3 Embedding an External Database 7.2.3.1 Properties Figure 7-5 Embedding an external database The properties required to embed an external database as a document repository are discussed in detail in chapter 4.1 of the Globalbrain Customization Guide. It is strongly recommended that you use the Configuration Builder tool (introduced in the Customization Guide) to create and test a configuration. Once you've done that, import the configuration into your Globalbrain server using the Globalbrain Administration Client. To import a configuration, click on the Import button and select the file that you exported from the Configuration Builder. 7.2.3.2 Adding a new Document Repository To create a new document repository that embeds an external database: Select the Configuration module. Right click the Document Repositories node. Select New Document Repository / External Database from the context menu. Edit the properties as described in the previous section and save your modifications. Note: Document repositories that embed an external database are always read-only; this means you cannot add, modify or delete documents or create document groups. Globalbrain Page 43 of 181

Administration Guide 7 Document Repositories Select the document repository in the Document Repositories node. Select Delete from the context menu. 7.3 Browsing Document Repositories 7.3.1 Filtering Documents Figure 7-6 Document Repository in the Globalbrain Administration Client The document repository explorer allows you to check which documents are stored in your document repositories, and to perform several actions on them. To open the document repository explorer in the Globalbrain Administration Client: Select the Document Repository module. Select the document repository you want to explore from the tab bar at the bottom. Globalbrain Page 44 of 181

Administration Guide 7 Document Repositories Figure 7-7 Document Repository in the Web Application To open the document repository explorer in the Globalbrain Web Client: Select the Document Repository module. Select the Documents tab. Select the document repository you want to explore from the tab bar at the bottom. At the top of the panel are some fields to enter the document search criteria. The available fields are: - Group limits the search to documents that are assigned to the selected group. - URL limits the search to documents that match the given URL pattern. The URL can contain * and ? as wildcards. - Date limits the search to documents that have been modified within the given time span. If no minimum date but a maximum date is supplied, all documents older than the maximum date are selected. If no maximum date but a minimum date is supplied, all documents newer than the minimum date are selected. - Attribute limits the search to documents that have been assigned the selected attribute (only available in the Desktop Client). Click on the Search button to apply the filter settings. If you don't enter any values for the filter properties, all documents are selected. The matching documents are displayed in the table at the center of the panel. The results are displayed in pages. Use the navigation buttons at the bottom to switch between pages. Globalbrain Page 45 of 181

Administration Guide 7 Document Repositories To change the sorting of the documents, click on the header of the column you wish to sort by. Click again on the column header to toggle the sort order between ascending and descending. Note: To use the document repository explorer, the Full Control or the Read Documents permission is required for this document repository. 7.3.2 Viewing a Document Figure 7-8 Viewing a Document In the Globalbrain Administration Client: Double click on a document in the result table of the document repository explorer to display the document. This will open a document viewer. The window title shows the title of the document. At the top, the URL and the modification time are displayed. The main pane of the window contains the document text. For nfile, lfile, and http URLs, a click on the link on the top of the window opens the original document with its associated application. If this is supported, the URL appears underlined when the mouse hovers over it. The column at the right-hand side shows attributes of the document and groups to which it is assigned. In order to show the next or the previous documents from the list, use the next/previous buttons at the bottom of the window. In the Web Client: Viewing a document in the Web application is the same, except that the user has to click on the Attribute icon at the bottom to see the document attributes window. Globalbrain Page 46 of 181

Administration Guide 7 Document Repositories Figure 7-9 Viewing a Document in the Web Application 7.3.3 Actions on Documents The buttons at the bottom of the result table invoke actions on the documents. If you have selected some documents within the result table, the action will be performed on the selected documents. If you have not selected any documents, the action will be performed on all documents that match your filter properties. Note: All of the following actions require either the Full Control or the Write Documents permission on the document repository. See section 7.5.1. 7.3.3.1 Deleting Documents Click on the Delete button to delete the selected or matching documents. Confirm the deletion in the message box that pops up. 7.3.3.2 Assigning Documents to Groups Click on the Assign to group button to assign the selected or matching documents to a static document group. Select the group from the box that pops up and click on Ok. Desktop Client Web Client Globalbrain Page 47 of 181

Administration Guide 7 Document Repositories 7.3.3.3 Assigning an Attribute to Documents You can assign an attribute to multiple documents by clicking on the Add Attribute button. Enter the name and the value of the attribute in the box that pops up and click on Ok. Desktop Client Web Client 7.4 Managing Document Groups 7.4.1 Document Group Properties You can define a group to be static or dynamic. The document group's properties depend on whether it is a static or a dynamic group. 7.4.1.1 Static Groups The only property you can configure for static groups is the name. Figure 7-10 Desktop Client Document Group Properties static group Figure 7-11 Web Client Document Group Properties static group Globalbrain Page 48 of 181

Administration Guide 7 Document Repositories 7.4.1.2 Dynamic Groups Figure 7-12 Desktop Client Document Group Properties dynamic group Figure 7-13 Web Client Document Group Properties dynamic group Name: Create a meaningful name for your group based, for instance, on the attribute you're using to group the documents. Document Filter area: At least one of the Document Filter criteria has to be defined. URL: Enter a URL pattern to group the documents. It can be a web address or a path to your document storage system. You can use * as a wildcard. Attribute: Select an attribute from the dropdown list as the basis for grouping your documents. Enter a value in the field below. Modified: Define a date or a range of dates as the basis for grouping the documents. Use the Calendar buttons to make the selection. Example: You could group documents on a specific topic, edited by a specific author, and created/modified within a specific time period. URL: nfile://win-cvbiov7mmiu/test/data/docs/documentation/globalbrain+5.x/* Attribute: author Value: John Smith Modified: 2012-06-11 to 2012-06-19 Please note that the filter conditions are AND conditions. If you populate all of the filter criteria, all of the document group's filter conditions must be met before a document is added to the group (see the sketch below). So be careful and double-check the accuracy of your entries. Globalbrain Page 49 of 181
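The following sketch is illustrative only; Globalbrain resolves dynamic groups internally, and the class below merely mimics the AND rule using the example values from above. The class and method names are made up; only the URL, attribute and dates are taken from the example.

import java.time.LocalDate;
import java.util.Map;
import java.util.regex.Pattern;

final class DynamicGroupFilter {

    private final Pattern urlPattern;   // the URL field, with * treated as a wildcard
    private final String attrName;      // the Attribute field
    private final String attrValue;     // the value entered below the attribute
    private final LocalDate from;       // start of the Modified range
    private final LocalDate to;         // end of the Modified range

    DynamicGroupFilter(String urlGlob, String attrName, String attrValue,
                       LocalDate from, LocalDate to) {
        this.urlPattern = Pattern.compile(Pattern.quote(urlGlob).replace("*", "\\E.*\\Q"));
        this.attrName = attrName;
        this.attrValue = attrValue;
        this.from = from;
        this.to = to;
    }

    boolean matches(String url, Map<String, String> attributes, LocalDate modified) {
        return urlPattern.matcher(url).matches()                      // URL condition
                && attrValue.equals(attributes.get(attrName))         // attribute condition
                && !modified.isBefore(from) && !modified.isAfter(to); // date range condition
    }

    public static void main(String[] args) {
        DynamicGroupFilter filter = new DynamicGroupFilter(
                "nfile://win-cvbiov7mmiu/test/data/docs/documentation/globalbrain+5.x/*",
                "author", "John Smith",
                LocalDate.of(2012, 6, 11), LocalDate.of(2012, 6, 19));
        System.out.println(filter.matches(
                "nfile://win-cvbiov7mmiu/test/data/docs/documentation/globalbrain+5.x/setup.pdf",
                Map.of("author", "John Smith"),
                LocalDate.of(2012, 6, 15)));   // true - all three conditions are met
    }
}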

Administration Guide 7 Document Repositories 7.4.2 Creating and Removing Document Groups Static Group Figure 7-14 Creating a Document Group Desktop Client To create a static document group in the Globalbrain Administration Client: Select the Document Repository module. Select the document repository you want to work with from the tab bar at the bottom. Click on the Document Groups button at the top. In the window that pops up, click on the New Group button in the toolbar. Edit the properties and save your modifications. The only property available for static groups is the name. To remove a document group: Select the document repository you want to work with from the tab bar at the bottom. Click on the Document Groups button at the top. Select the document group from the list. Click on the Delete button in the toolbar. Figure 7-15 Creating Document Group in the Web Application To create a static document group in the Globalbrain Web Client: Select the Document Repository module. Select the Groups tab. Select the document repository from the tabs at the bottom. Globalbrain Page 50 of 181

Administration Guide 7 Document Repositories Click on the New Group button. Edit the properties (enter a name) and save your modifications. To remove a document group: Select the document repository from the tabs at the bottom. Select the document group you wish to remove. Click on the Delete button. Note: Document groups are not supported if the document repository is read-only (for example, when embedding an external database) or when using an external security connector. 7.4.3 Creating and Removing Document Groups Dynamic Group To create a dynamic document group in the Globalbrain Administration Client: Select the Document Repository module. Select the document repository you want to work with from the tab bar at the bottom. Click on the Document Groups button at the top. In the window that pops up, click on the New Group button in the toolbar. Edit the properties and save your modifications. Please see section 7.4.1.2 Dynamic Groups for detailed information. To create a dynamic document group in the Globalbrain Web Client: Select the Document Repository module. Select the Groups tab. Select the document repository from the tabs at the bottom. Click on the New Group button. Edit the properties and save your modifications. Please see section 7.4.1.2 Dynamic Groups for detailed information. 7.5 Permissions on Document Repositories and Groups 7.5.1 Document Repository Permissions To edit the permissions of a document repository in the Globalbrain Administration Client: Globalbrain Page 51 of 181

Administration Guide 7 Document Repositories Select the Document Repository module. Select the tab for the document repository you want to edit. Click on the Edit Permissions button in the toolbar. To do the same in the Globalbrain Web Client: Select the Document Repository module. Select the Groups & Permissions tab. Select the document repository you want to edit from the tab bar. Click on the Edit Permissions button. The following permissions are available: - Full Control gives the actor full control including reading all documents, writing documents, and managing document groups. - View is a basic permission that allows the actor to see that this document repository exists. Without additional permissions, the actor can't read any documents. - Read Documents allows the actor to read all documents within the document repository regardless of document group permissions or permissions resolved via an external security connector (such as the SharePoint connector). - Write Documents allows the actor to create, modify and delete documents in this document repository and to assign them to document groups. 7.5.2 Document Group Permissions To edit the permissions of a document group in the Globalbrain Administration Client: Select the Document Repository module. Select the document repository you want to edit from the tab bar at the bottom. Click on the Document Groups button at the top. Select the group you want to edit. Click on the Edit Permissions button in the toolbar of the document group detail panel. To do the same in the Globalbrain Web Client: Select the Document Repository module. Select the Groups & Permissions tab. Select the document repository you want to edit from the tab bar at the bottom. Select the group you want to edit. Click on the Edit Permissions button in the toolbar of the document group detail panel. There is only one permission: - Read Documents gives the actor the permission to read documents that are assigned to this group. 7.5.3 Summary Let's summarize the way read permissions are granted: - Users that have the Full Control permission can read all documents in the document repository. - Users that have the Read Documents permission at the document repository level can also read all documents in the document repository. - Users that have the View permission at the document repository level and the Read Documents permission at the document group level can only read documents that are assigned to that group. Globalbrain Page 52 of 181

Administration Guide 7 Document Repositories - Users that have the View permission on a document repository with an external security connector (such as the SharePoint connector) can only read documents that have been identified as visible for them by the security connector. 7.6 Document Rendition Repositories 7.6.1 Document Rendition Repositories in Detail Document repositories store crawled documents with their text and metadata. In addition to this, Globalbrain can also store one or multiple renditions of the original documents. Technically, renditions of documents are collections of binary files that are attached to a document that is stored in a document repository. Document renditions are used - to store documents in a format that allows them to be displayed in their original layout within the web client - to keep copies of crawled documents within Globalbrain Document renditions are primarily created by the Globalbrain crawler (see section 10.1.2.4). It is also possible to store renditions via the Globalbrain API, so this may also be used by custom tools. Document renditions are stored in a document rendition repository. Each security domain can have only one document rendition repository; this rendition repository stores document renditions for all document repositories of this domain. While document repositories usually store their data in a relational database, document rendition repositories store their data directly in the file system. Document rendition repositories either: Store the data in a shared directory in a Windows network, or Store the data in a local directory of the machine that hosts one of the Globalbrain server instances. The advantage of the shared directory is that all server instances and the Globalbrain web client can access the directory directly. The directory does not necessarily have to be on a machine on which Globalbrain is installed; it can also be on a central file server. When storing the data in a local directory of a server instance, document renditions can only be accessed and stored when this server instance is online. 7.6.2 Configuring a Document Rendition Repository 7.6.2.1 Storing Files in a Network Directory 7-16 Document Rendition Repository in Network Directory To configure a document rendition repository that stores the data in a shared directory in a Windows network: Select the Configuration module. Right-click on the Document Repositories node and select Add Document Rendition Repository / Network Directory. There is just one property available: Globalbrain Page 53 of 181

Administration Guide 7 Document Repositories - Root Directory is the path to the shared directory in which files can be stored. Once the document rendition repository has been created, it appears as Document Rendition Repository in the Document Repositories tree node. Make sure that all accounts under which Globalbrain server instances are running have write permission to this directory. If you are running the Globalbrain server instances as Windows services, start these services under a domain account and grant write permission on the configured directory to this account. 7.6.2.2 Storing Files in a Local Directory 7-17 Document Rendition Repository in Local Directory To configure a document rendition repository that stores the data in a local directory of a server instance: Select the Configuration module. Right-click on the Document Repositories node and select Add Document Rendition Repository / Local Directory. The following properties are available: - Server Instance is the server instance that hosts the rendition repository - Root Directory is a local directory on the machine that hosts the configured server instance Once the document rendition repository has been created it appears as a Document Rendition Repository in the Document Repositories tree node. 7.6.3 Deleting a Document Rendition Repository To delete a document rendition repository: Select the Configuration module Right-click on the Document Repositories/Document Rendition Repository node and select Delete Globalbrain Page 54 of 181

Administration Guide 8 Search Index 8 Search Index 8.1 Search Index in Detail 8.1.1 What is a Search Index? A search index is responsible for making the content of a document repository searchable. The core feature is to search for text that is contained in the documents. Such a search can be combined with other criteria which help the user to narrow down the search result set to the documents of interest. Internally, a search index indexes the following information for each document: - the document text - document attributes according to the configured search index settings - the timestamp of the document - the URL - document group membership Let's take a closer look at the document attributes. Which of them are searchable, and how can they be searched? This can be configured for each document repository via the General Search Index Configuration (see section 8.3.1). There are several options on how attributes can be indexed: - Attributes can be stored as fields in the Brainware search engine. They are indexed like document text and are available for full text search. When performing a search for some text, hits may either refer to the document text or to a field. - Attributes can be stored as non-full text attributes within the Brainware search engine. These attributes are kept in memory and are very fast to search but having too many of them may lead to memory problems. Engine attributes are the basis for facet statistics; this will be discussed in the next section. - Attributes can be indexed in a BTree index. This is the kind of index used in databases. It is a good alternative if you have many different values for an attribute or plan to search by ranges. Attributes that are not declared as searchable will not be indexed. 8.1.2 Facets Attributes can be declared as facet attributes. This allows statistics on these attributes to be retrieved as part of the search result so that the user knows which attribute values occurred most often. As an example, let's assume you have indexed some books. If a user has searched for something and got many results, he may wish to narrow the search result by selecting an author. Declare author as a facet to get a list of those authors that occur most often in the search results together with information on how often they occur. The web client will now display a dropdown box for author, containing the authors which occur most frequently in the search results. The user can easily select the author he is interested in. Note: Although the search index delivers exact frequency values, clients should not usually display these values. Unless you try to scale your search such that it tries to find all possible hits (which is very expensive), it is very likely that you will get inconsistent values when narrowing the search: The facet statistics may say that there are 40 hits for Shakespeare in the 250 results that were requested. Now you are searching again for 250 results, but with author = Shakespeare as a condition. If there are more than 250 results for your search term in the index, you may get more than the 40 hits because some of the additional hits are included now. Globalbrain Page 55 of 181
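To make the facet mechanism more concrete, here is a small illustrative sketch (not Globalbrain code; the class name and the sample data are made up). It counts the values of one attribute over a returned result set; because the counts refer only to the documents actually returned for the query, the absolute numbers can differ when the search is narrowed, as described in the note above.

import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FacetCount {

    /** Counts how often each value of the given attribute occurs in the result set. */
    static Map<String, Integer> facet(List<Map<String, String>> results, String attribute) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map<String, String> doc : results) {
            String value = doc.get(attribute);
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map<String, String>> results = List.of(
                Map.of("author", "Shakespeare"),
                Map.of("author", "Shakespeare"),
                Map.of("author", "Goethe"));
        facet(results, "author").entrySet().stream()
                .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
                .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
        // prints "Shakespeare: 2" before "Goethe: 1"
    }
}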

Administration Guide 8 Search Index The Globalbrain Web Application displays facet values in a tag cloud style, using a larger font for more frequent values. 8.1.3 Configuring the Search Engine At the heart of a Globalbrain search is Brainware's powerful ASSA search engine. A search within this engine usually consists of two phases: a very quick, CPU-intensive first phase in which result candidates are identified, and a slower second phase in which the engine needs to take a closer look at the candidates. During this second phase, it needs to load some information about the candidate documents from files. To speed the second phase up, you may make use of a RAM disk. When the search index is configured to use a RAM disk, the files that are performance-critical are copied onto the RAM disk. This takes more memory but leads to a significant increase in performance. A search engine instance can only process one request at a time. This means that the engine is locked while the search is in progress; search requests from other threads need to wait until the previous request has been completed. To allow multiple requests at the same time, you can configure a search index to use multiple search engines. Incoming requests are then processed by the next available engine. This lessens the likelihood that the thread has to wait while a previous search is in progress. The number of search engines you configure for a search index should not be greater than the number of CPUs or CPU cores of the machine that hosts the index. Configuring a higher value would only mean that the engine instances block each other when trying to access the CPU. It is recommended that multiple engine instances are only used in combination with a RAM disk; otherwise the benefit of using multiple CPUs can be lost by making the hard disk the bottleneck. 8.1.4 Search Index Types 8.1.4.1 Linked Indexes vs. Standalone Indexes As already mentioned in section 7.1.2, instead of storing documents in a database and creating a search index based on the database content, documents can be stored directly in a search index (see section 7.2.2). In Globalbrain, a linked search index is where the documents that need to be indexed are copied from a document repository. The index is linked to the document repository and cannot exist without it. With a standalone search index, documents are stored directly in the search index and it does not need an external document repository. 8.1.4.2 Single-Server Indexes vs. Multi-Server Indexes With linked search indexes, a search index can either store the documents of a document repository on a single server instance or distribute them to multiple server instances. When distributing the documents to n server instances, each of the part indexes on the server instances indexes approximately 1/n-th of the documents. Distributing documents to multiple server instances helps to maintain high performance even with a very large amount of text. Globalbrain Page 56 of 181

Administration Guide 8 Search Index Working with search index pools allows more parallel search requests to be performed. 8.1.5 Actions on Search Indexes 8.1.5.1 Opening and Closing a Search Index Search indexes can be open or closed. An open search index requires some memory as each index holds data in memory to be able to process queries with an acceptable performance. An index that gets closed releases all the memory and file system resources it has allocated. A closed index cannot be searched. If all the indexes of a document repository are closed, the document repository is not listed as searchable. By default, search indexes are opened as soon as the service that hosts them starts up. 8.1.5.2 Synchronizing and Rebuilding a Linked Search Index For linked search indexes, the data stored in the document repository needs to be kept consistent with the data that is indexed in the search index. This can be done by rebuilding the search index, which means that the index completely drops all of its data and reinitializes itself based on the current data in the document repository. A more efficient way is to synchronize periodically. This adds all documents to the index that have been added or modified in the document repository since the last rebuild or synchronization, and deletes those documents from the search index that have been deleted from the document repository in that time. A full rebuild may still be necessary after changing the configuration for attribute indexing in the search indexes of a document repository. During a rebuild or synchronization, the search index is locked and unavailable for searches. When rebuilding or synchronizing a search index pool, the search indexes are processed one after the other to ensure that as many indexes as possible are kept available for searching. 8.1.6 Search Index Status A search index has two different statuses: - The search index status describes the status of the index itself; for instance, if it is open, closed or unavailable. - The content status describes the status of the content of the index; for instance, if it is ok, if a rebuild is required, etc. Let's take a closer look at the possible values for the search index status: - Unavailable means that the index cannot be accessed currently. This is usually the case if the index is hosted by a remote server instance which is currently not reachable. - Closed means that the search index is closed. - Closing is used as a temporary status while the index is closing. The status changes to closed as soon as this is completed. - Open means that the search index is currently open and available for searching. - Opening is used as a temporary status while the index is opening. The index changes to open as soon as this has completed successfully. - Synchronizing means that a rebuild or synchronization is in progress on a linked search index. During that time, the index is not available for searching. These are the values for the content status: - Ok means that everything is ok. Globalbrain Page 57 of 181

Administration Guide 8 Search Index - Attributes inconsistent can occur after the attribute indexing configuration has changed; for instance, if engine attributes or fields have been added. This can only be resolved by rebuilding the index. - Incomplete means that a rebuild or synchronization operation has been aborted. You should rebuild the index. - Corrupted indicates a serious corruption of internal files. The index is no longer able to serve search requests. You need to rebuild the index. - Partially unavailable can occur for multi-server indexes if not all of their parts are available. 8.2 Configuring a Linked Search Index Note: For configuring search indexes, you need the Read Configuration and Write Configuration permissions for your security domain. 8.2.1 Single-Server Index 8.2.1.1 Creating and Removing an Index To add a single-server index: Select the Configuration module. Right click the required document repository. Select New Search Index / Single-Server Index. Edit the properties as described in the following section. To delete a single-server index: Select the Configuration module. Right click the index you wish to delete. Select Delete from the context menu. 8.2.1.2 Properties Figure 8-1 Single-Server Index Properties Desktop Client Globalbrain Page 58 of 181

Administration Guide 8 Search Index Figure 8-2 Single-Server Index Properties Web Client The following properties are available for a single-server search index: - Server Instance is the server instance the index is running on. You can change this value, but you need to rebuild the index afterwards. - Open when server instance starts configures if the search index is opened when the server instance that hosts it starts. - Use RAM disk configures whether or not the search index should copy files that are critical for performance onto a RAM disk. This can only be done if a RAM disk directory was configured for the server instance. - Engine Instances is the number of engine instances that are used by this search index. See section 8.1.3 Configuring the Search Engine for details. - Splitting determines whether another search index will be created when the defined size limit is reached. - Size Limit for Splitting defines the size limit for a single search index. - Synchronization is the time when the index synchronizes against the document repository. - Rebuild is the time when a full rebuild of a search index is done. Figure 8-3 Search Index Statistics Optionally, statistics about the number of documents and the size of the search index can be displayed. To enable this, click on the Statistics button in the toolbar. Please note that this feature is not available in the Web Client. The values that are displayed are: - Total Document Size: the total size of the document text of the indexed documents. - Documents: the number of documents in the index. - Deleted Documents: the number of documents that are marked as deleted in the index. Documents are marked as deleted either by an explicit delete action or when a modified version is indexed. Documents cannot really be removed from the index. Since the deleted documents still use some memory during a search, you should consider rebuilding your search index if more than 5,000 documents, or more than 30% of the documents, are marked as deleted. Globalbrain Page 59 of 181
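The rule of thumb above can be expressed as a small check. This is an illustrative sketch only (the class name is made up), and it assumes that the Documents and Deleted Documents statistics are separate counts.

final class RebuildCheck {

    static boolean rebuildRecommended(long documents, long deletedDocuments) {
        long total = documents + deletedDocuments;   // assumption: the two counts do not overlap
        if (total == 0) {
            return false;
        }
        // more than 5,000 deleted documents, or more than 30% of all documents deleted
        return deletedDocuments > 5_000 || (deletedDocuments * 100.0 / total) > 30.0;
    }

    public static void main(String[] args) {
        System.out.println(rebuildRecommended(10_000, 4_500)); // true  - about 31% are marked as deleted
        System.out.println(rebuildRecommended(10_000, 1_000)); // false - about 9% are marked as deleted
    }
}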

Administration Guide 8 Search Index 8.2.2 Multi-Server Index 8.2.2.1 Creating and Removing a Multi-server Index To add a multi-server index: Select the Configuration module. Right click the required document repository. Select New Search Index / Multi-Server Index. Edit the properties as described in the following section. To delete a multi-server index: Select the Configuration module. Right click the index you wish to delete. Select Delete from the context menu. 8.2.2.2 Properties To configure the overall properties of a multi-server index, select the Search Index node within the Document Repository tree. Figure 8-4 Selecting the overall Multi-Server Index Properties A multi-server search index provides the following properties: Figure 8-5 Multi-Server Index Properties Desktop Client Figure 8-6 Multi-Server Index Properties Web Client - Server Instances are the server instances on which the parts of the index are hosted. After changing this configuration, you need to rebuild the search index. Globalbrain Page 60 of 181

Administration Guide 8 Search Index - open when server instance starts configures if the search index is opened when the server instance that hosts it starts. - Synchronization is the time at which the index synchronizes against the document repository. - Rebuild is the time at which a full rebuild of a search index is done. For each server instance you select, a search index part is created. To configure the properties for one of the part indexes, select the appropriate part index under the Search Index node. Figure 8-7 Properties for a part index of a multi-server index The following properties are available: Figure 8-8 Search Index Part Properties Desktop Client Figure 8-9 Search Index Part Properties Web Client - Server Instance displays the server instance on which this index is running. - open when server instance starts configures if the search index is opened when the server instance that hosts it starts. - Use RAM disk configures whether or not the search index should copy files that are critical for performance onto a RAM disk. This can only be done if a RAM disk directory was configured for the server instance. - Engine Instances is the number of engine instances that are used by this search index. See section 8.1.3 Configuring the Search Engine for details. - Splitting determines whether another search index will be created when the defined size limit is reached. - Size Limit for Splitting defines the size limit for a single search index. Globalbrain Page 61 of 181

Administration Guide 8 Search Index 8.2.3 Search Index Pool 8.2.3.1 Creating and Removing a Search Index Pool A Search Index Pool will be created automatically as soon as you add a second search index for a document repository. If you delete a search index and there is only one index remaining, the pool will be removed. 8.2.3.2 Properties Figure 8-10 Search Index Pool Properties To configure the overall properties of a search index pool, select the Search Index Pool node in the Document Repository tree. Figure 8-11 Search Index Pool Overall Properties The following properties are available: - Synchronization is the time when the member indexes synchronize against the document repository. - Rebuild is the time when a full rebuild of the member search indexes is done. 8.3 General Search Index Settings Figure 8-12 General Search Index Configuration Desktop Client Globalbrain Page 62 of 181

Administration Guide 8 Search Index Figure 8-13 General Search Index Configuration Web Client 8.3.1 Editing the General Search Index Configuration As soon as a search index exists for a document repository, a general search index configuration is created for the document repository. This contains configuration settings that apply to all search indexes of that document repository. To edit these settings: Select the Configuration module. Select the Search Index Configuration node of the document repository you wish to edit. Note: You need the Read Configuration and Write Configuration permission to change the general search index configuration. See section 6.4. 8.3.2 Attribute Configuration The basics of attributes in search indexes have already been discussed in section 8.1 Search Index in Detail. You can add an attribute configuration by clicking on the Add button below the Attributes table. You can delete an attribute configuration by selecting it and clicking on the Delete button. For each configuration, the following settings are available: - Name is the name of the attribute to be indexed. - Data Type is the data type of the values (string, int, float, or date). This is important when comparing values with > or <. - Index Type configures how the attribute is indexed. Values are: - Full text stores the attribute values in a field and makes them searchable like document text - BTree stores the attribute values in a BTree index. - None does no special indexing. - Format is only used for fields of data type date. As attribute values are stored as strings, values need to be parsed to interpret them as dates. You can configure a preferred format 1. If you don t configure a format or if you configure an incorrect format, Globalbrain will try parsing the value with some common date formats. - Searchable marks this attribute as searchable. This is information for clients. - Sortable means that search results can be sorted by the values of this attribute. Only attributes that use the BTree as index type can be marked as sortable. Please be aware that every sortable attribute requires additional memory in the search index. 1 see http://java.sun.com/j2se/1.5.0/docs/api/java/text/simpledateformat.html for a description of the format string Globalbrain Page 63 of 181

Administration Guide 8 Search Index - Facet marks this attribute as a facet attribute. This means its values are stored in the engine and can be used for statistics as described in section 8.1.2 Facets. - Weight can be used to give a hit in a full text attribute a higher or lower weight in the result scoring. For example, if you have configured "title" as a full text attribute and want documents that have a hit in the title listed before documents that have a hit in the main document text, you would give "title" a higher weight. The document text always has a weight of 0. Weights must be set relative to the weight of the text: a value greater than 0 gives the attribute a higher weight, a value lower than 0 gives it a lower weight. Please be aware that when working with weights, only documents that have an exact hit in the attribute with the highest weight can get a score of 100%. The enable sorting by last modification date checkbox allows sorting by the last modification date to be enabled or disabled (see Figure 8-12). Enabling this requires some additional memory in the search index. Note: When an attribute is declared as a facet attribute, the values are stored inside the engine, so this could be seen as an additional index type like Full text and BTree. It is offered as an additional setting so that you can configure a combination of Full text and Facet or BTree and Facet. In these cases, the values are stored in the engine to get facet statistics, and they are also stored as a field or in the BTree index. 8.3.3 Scoring Configuration For queries with a search type of Phrase, the Word Order Weight controls the effect that the correct word order has on the score. It decides by how much the score is reduced if the order of the words in a document is different to the order of the words in the query. If you set a lower value, the influence of the word order is reduced. With the value set to 0, there would be no difference between the "Phrase" and "Words" search types. Figure 8-14 Word Order Weight 8.4 Search Index Administration and Monitoring Note: You need the Server Monitoring permission to view the status and the statistics of a search index. In addition, you need the Server Management permission to invoke the actions. See section 6.4. Figure 8-15 Search Index Monitoring via the Dashboard Globalbrain Page 64 of 181

Administration Guide 8 Search Index You can get a quick overview of the status of your search indexes by using the Dashboard panel in the Monitoring module; the Document Repositories box displays a list of all the configured document repositories along with their search indexes. For each search index, the status (see section 8.1.6) is displayed. Figure 8-16 Search Index Details in Dashboard Clicking on a search index opens a detail panel. 8.4.1 Actions on Search Indexes Depending on the type of search index, some or all of the following action buttons are available: - Open: Open the search index. This is only enabled if the search index is available and currently closed. This button is not available at the Search Index Pool level. - Close: Close the search index. This is only enabled if the search index is available and currently open. - Synchronize: Executes the synchronization of the search index against the document repository. This is only enabled if the search index is available. - Rebuild: Performs a full rebuild of the search index. This is only enabled if the search index is available. - Statistics: Opens a separate window that displays runtime statistics of the search index. Globalbrain Page 65 of 181

Administration Guide 8 Search Index Figure 8-17 Search Index Statistics Current tab The following statistics are available: - Number of Requests is the total number of search requests that have been processed by the selected search index in the configured period of time. - Avg. Response Time is the average time it took to complete a request. - Avg. Time in Waiting Queue is the average time a request spent in the waiting queue before an engine instance could be allocated. This value will typically be 0 with a low number of concurrent searches but will climb as the number of concurrent searches increases. If this value is often in the multiple-second range, you should consider adding more engine instances or configuring additional search indexes. - Avg. Number of Results is the average number of search results that successful queries produced. The Current tab displays the values of all of these statistics for a selected period of time. To change the period, enter a new value or select another time unit; the statistics are refreshed immediately. The History tab charts one of these statistics over a period of time. The following configuration options are available: - Field changes the statistic that is displayed in the chart. - Time changes the period of time for which values are displayed. Move the mouse over a value in the chart to get a tooltip with a textual description of the value. Figure 8-18 Search Index Statistics History tab Runtime statistics are not preserved when restarting the server instance that hosts the search index. If the server instance was restarted 30 minutes ago, for example, no statistics older than 30 minutes are available. Globalbrain Page 66 of 181

Administration Guide 8 Search Index 8.5 Document Access Statistics 8.5.1 Introduction Figure 8-19 Visualization of Document Popularity in Search Results Globalbrain can count document accesses to identify the "popularity" of documents. When this feature is active, the popularity of documents is visualized in the search result lists that are displayed in the web client. Additionally, it is possible to sort search results by document popularity. To enable document access statistics, a document access statistics repository has to be configured for the security domain. Having enabled this, document accesses are counted when displaying search result details if the user: - views the document for more than a configured number of seconds (5 seconds by default) - switches to another page of the document - opens the original document - exports the document Each document is counted only once within a result set: if the user visits a document, switches to another document and comes back to the first document later, the first document won't be counted again. The popularity of a document is determined by comparing the number of accesses to this document with the number of accesses to other documents. The documents with the highest number of accesses always get the highest popularity value, independent of whether the absolute number of accesses for these documents was 100 or 10000. New document accesses are propagated to the search indexes when synchronizing or rebuilding; so, additional accesses do not change the popularity icon immediately. The time span for which accesses are used to determine the popularity value is configurable: You may decide to let the popularity be based just on the accesses of the last month or on the accesses of the last year. For performance reasons, accesses to a document are internally grouped: When configuring, for example, a granularity of a month, all accesses for a document within one month are stored in one data record. This has an impact on the expiration: Let's assume you have configured an expiration time span of 6 months and today is September 15th. Since Globalbrain stores all accesses for a given month in one data record, the accesses from March 1st (and not March 15th) are used when computing the popularity values. Globalbrain Page 67 of 181

Administration Guide 8 Search Index 8.5.2 Configuring an Access Statistics Repository 8.5.2.1 Storing Statistics in a Database Figure 8-20 Database-based Document Access Statistics Repository To configure a document access statistics repository that stores the data in a database: Select the Configuration module. Right-click on Document Repositories and select New Document Access Statistics Repository / Database. Edit the properties as described below. The following properties are available: - Data Source is the data source in which the data is stored. - Granularity is the time span for which accesses are grouped. - Expiration Timespan is the time after which accesses expire and are no longer taken into account for the popularity value. You should make sure that the time unit for Expiration Timespan is equal to or larger than the one for Granularity. Once a document access statistics repository has been configured, it is listed as Access Statistics under the Document Repositories node in the Configuration module. You can configure only one document access statistics repository per security domain. 8.5.2.2 Storing Statistics in Files Figure 8-21 File-based Document Access Statistics Repository To configure a document access statistics repository that stores the data in a file system: Select the Configuration module. Right-click on Document Repositories and select New Document Access Statistics Repository / File. Edit the properties as described below. The following properties are available: - Server Instance is the server instance which manages the data files. - Granularity is the time span for which accesses are grouped. - Expiration Timespan is the time after which accesses expire and are no longer taken into account for the popularity value. Globalbrain Page 68 of 181

Administration Guide 8 Search Index Once a document access statistics repository has been configured, it is listed as Access Statistics under the Document Repositories node in the Configuration module. You can configure only one document access statistics repository per security domain. 8.5.3 Customizing the Delay in the Web Client Within the Configuration module of the Web Client, you can configure how long a user must keep a document open before an access is counted. Open the Administration web client. Open the Configuration module. Select the Web Client tab. Open the Defaults tab. Select the Search element. Go to the Register document access (in seconds) field. Configure the time (in seconds) after which an access is counted for document popularity. 8.5.4 Monitoring Document Access Statistics Figure 8-22 Monitoring of Document Access Statistics in Web Client You can monitor which documents have the most accesses by selecting Monitoring / Document Access in the Globalbrain Administration Client or the administration section of the web client. By default, the 50 most frequently accessed documents are listed. To define the number of the documents to be listed: Change the value in the Top documents field. Click on Search button. If you double click a document, it opens in the document viewer. To remove statistics from the database that are beyond the expiration date, click on the Purge button at the bottom. Globalbrain Page 69 of 181

Administration Guide 9 Query Repositories 9 Query Repositories 9.1 Query Repositories in Detail Globalbrain supports the bookmarking of search queries: users can store their favorite queries on the server to have them available for later searches. To enable this feature, a query repository needs to be configured. Queries can either be stored as private or public queries: private queries are only visible for the user that stored this query; public queries are visible for all users. The security settings on a query repository allow the administrator to control which users are allowed to see the query repository and who is allowed to store private or public queries. It is possible to create multiple query repositories, with each one having its own security setting. This can be used to create different query repositories for different groups of users each of these user groups would then have its own set of public queries. Note: Query Repositories are only available when working with a Globalbrain installation that uses a relational database. 9.1.1 Configuring a Query Repository 9.1.1.1 Properties Figure 9-1 Query Repository Properties A query repository stores its data in a relational database. It has the following properties: - Name is a human readable name for the query repository. - Data Source is the data source in which the data is stored. This cannot be changed once a new query repository has been created. 9.1.1.2 Creating and Removing a Query Repository Figure 9-2 Adding Query Repository Desktop Client Globalbrain Page 70 of 181

Administration Guide 9 Query Repositories Figure 9-3 Adding Query Repository Web Client To add a Query Repository: Select the Configuration module. Right click Query Repositories. Select Add Query Repository. Edit the properties described in the previous section. To remove a Query Repository: Select Configuration module. Right click on the query repository you want to delete. Click on Delete. 9.1.2 Permissions on Query Repositories To edit the permissions of a query repository in the Globalbrain Administration Client: Select the Configuration module. Select the query repository in the Query Repositories node. Click on the Edit Permissions button in the toolbar of the detail panel. To edit the permissions in the Globalbrain Web Client: Select the User Data module. Select Bookmarked Queries. Click on the Edit Permissions button in the toolbar. The following permissions are available: - Full Control gives the user all permissions, including the permission to read, modify and delete queries of other users. - Read gives the user the permission to read their own private queries and public queries. - Write Private gives the user the permission to store private queries. - Write Public gives the user the permission to create, modify and delete public queries. Please note that this permission allows the user to modify any public query and is not limited to queries that have been stored by the current user. 9.1.3 Viewing and Deleting Queries Queries can be bookmarked via the Globalbrain Web Client. Users can manage their own private queries and if they have the Write Public permission public queries from the Subscriptions section of the web application. Globalbrain Page 71 of 181

Administration Guide 9 Query Repositories In addition to this, administrators that have the Full Control permission on a query repository are able to view and delete any user's queries from the Globalbrain Administration Client. To view Bookmarked Queries: Select the User Data module. Select the Bookmarked Queries tab. Select the query repository you want to view from the Repository drop down box at the top. Figure 9-4 Bookmarked Queries in Administration Client You can either view all subscriptions or filter on the following properties: - Name is the name under which the query was stored. - Text is the text of the query. You can use * as a wildcard here. - Owner is the owner of the query. This is either the name of the user account or #public in the case of public queries. To delete queries, select a single query or a set of queries, and then click on the Delete button. Globalbrain Page 72 of 181

Administration Guide 10 Crawler 10 Crawler 10.1 Crawler in Detail 10.1.1 Crawlers and Agents Figure 10-1 Crawlers and Agents The Crawler is the component of Globalbrain that is responsible for collecting documents or other data objects from a source - file systems, FTP servers, web servers, mail servers, etc. - and extracting text and attributes from them to create a centralized place for the user to view information from multiple sources. A Crawler is linked to a security domain: a Crawler that is configured for domain A is invisible to the members of domain B. Thus, each Globalbrain installation can have multiple Crawlers. It is possible to configure more than one Crawler for a single security domain. In this case, each Crawler has its own tasks that are kept separate from each other. Each Crawler has its own security settings. Different Crawlers can be configured to run at different times of the day. Each Crawler has at least one agent. The agent is the part of the Crawler that does the actual work: the retrieval of the original documents and the text extraction. Configuring a Crawler with multiple agents allows this work to be distributed over multiple server instances and multiple physical machines. This can be helpful when there are large amounts of data to crawl or when performing CPU-intensive processing such as OCR on image documents (see section 10.2.3 Configuring Crawl Agents). Each agent can start a number of threads. This allows a parallelization of jobs within one agent. This will be advantageous even with a single CPU: for instance, while thread 1 is retrieving data from a remote server, thread 2 can extract the text and attributes of another document. All the agents of a Crawler communicate with a central component called the job manager. The job manager knows about all the URLs that need to be crawled and decides which URL is to be crawled next. Globalbrain Page 73 of 181

Administration Guide 10 Crawler 10.1.2 Crawl Tasks To instruct the Crawler to crawl a specific directory, a web page or an FTP server, you need to create a crawl task (see section 10.4 Configuring Crawl Tasks). A crawl task specifies: - what to crawl - where to store the documents - which rules to apply - whether to keep the original documents/layout 10.1.2.1 What to crawl Documents can be crawled from various types of document sources. This may, for instance, be a file system, an FTP server, or a mail server. The type of document source determines which protocol needs to be used to access the documents. The following protocols are supported by Globalbrain out of the box: - lfile crawls a file system on a local machine where a crawl agent is running - nfile crawls Windows shares - ftp crawls FTP servers - http and https crawl web servers - imap and imaps crawl mail servers - nntp crawls news servers Globalbrain can be customized to handle additional protocols, for instance, to access databases or document management systems. Based on the selected protocol, a start URL for the crawl task needs to be configured. The start URL points to the location where the Crawler will start crawling. Usually, this is a directory or some other structure that contains references to further documents. By following these references, the Crawler drills down into the site and collects a larger number of documents. The format of the start URL depends on the protocol. In the client, you will usually not find an item on the form that is called start URL. Depending on the protocol you have chosen, several properties (like host and directory) will be requested, and the URL will be generated internally. 10.1.2.2 Where to store documents The crawled data is stored as documents in a Document Repository. The target Document Repository needs to be selected. 10.1.2.3 Which rules to apply A large set of available rules allows you to control what is crawled, what is ignored and how it is stored. You may, for instance, exclude certain directories from being crawled, or change the extracted attributes before they are stored. 10.1.2.4 Renditions The crawler can optionally - produce a rendition of the crawled document that can be used by the web client to display the document in its original layout instead of just the simple text layout - store a copy of the original document within Globalbrain Reasons to store a local copy of a document could be: - Preserve documents before they are deleted - Faster download in case the original location is only available through a slow network connection (e. g. a file server that is located in another office / city / country) Globalbrain Page 74 of 181

Administration Guide 10 Crawler - Provide access to original documents that are not accessible to normal users, e.g. because they are stored in a protected source (an FTP server with password protection, etc.). 10.1.2.5 Scheduling A crawl task can be configured so that it is processed only once (or at least only on demand) or processed periodically, for instance, every night. Crawl tasks that are processed periodically are called recurring crawl tasks. 10.1.3 Jobs For each URL that is to be crawled, the Crawler creates a crawl job internally. The first job for a crawl task is created for the start URL. Crawling the start URL will usually lead to additional jobs that need to be crawled. Crawling these additional jobs may again lead to further jobs that need to be crawled. Each job has a layer: the job for the start URL is layer 0. Jobs that have been generated while crawling the start URL are layer 1, jobs that have been generated while crawling layer 1 jobs are layer 2 and so on. The number of layers can be limited within the configuration of the crawl task when crawling sites that are organized non-hierarchically, e.g. web servers. Figure 10-2 Job Lifecycle Each job has one of the following status values: - Open: The job needs to be crawled. This means either it has never been crawled or it needs to be crawled because the task is to be processed again. If the job has never been crawled, the last crawl date of the job will not be set. - In progress: The job is currently being processed by one of the agents. - Crawled: The job has been crawled successfully. - Failed: The job has been crawled but crawling failed due to an error. In this case, the return code gives more detailed information about what went wrong. The typical basic lifecycle would be: a job changes from open to in progress and from in progress to crawled or failed. But what happens if a crawl task is resubmitted to the job manager to be crawled again? Crawled jobs change to status open if the update rules specified for the crawl task identify them as updatable jobs. Failed jobs change to status open if the maximum number of crawling attempts configured for the crawl task is not exceeded. In progress and open jobs are untouched. A job that has not been crawled since the crawl task was last submitted remains in the queue of open jobs. Globalbrain Page 75 of 181
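The resubmission behavior described above can be summarized in a short sketch. The following is illustrative pseudocode in Java syntax only; it is not Globalbrain's implementation, and the class, method and field names are hypothetical:

// Illustrative only: what happens to each job when a crawl task is resubmitted.
for (Job job : task.getJobs()) {
    switch (job.getStatus()) {
        case CRAWLED:
            // recrawled only if the task's update rules identify it as updatable
            if (task.getUpdateRules().isUpdatable(job)) {
                job.setStatus(Status.OPEN);
            }
            break;
        case FAILED:
            // retried only while the configured maximum number of attempts is not exceeded
            if (job.getAttempts() < task.getMaxRetries()) {
                job.setStatus(Status.OPEN);
            }
            break;
        case OPEN:
        case IN_PROGRESS:
            // untouched; open jobs simply remain in the queue of open jobs
            break;
    }
}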

Administration Guide 10 Crawler 10.1.4 Types of Crawlers There are two preconfigured types of Crawler: - When using a database, both the crawl tasks and the jobs managed by the job manager are stored in the database. (see section 10.2.1 Crawler Using a Database) - When working without an external relational database (Oracle, SQL Server), the crawler can store crawl tasks and jobs in a local Derby database. (see section 10.2.2 File-based Crawler) 10.2 Configuring a Crawler Note: All of the actions described below require the Read Configuration and Write Configuration permissions on the security domain. See section 6.4. 10.2.1 Crawler Using a Database 10.2.1.1 Creating a new Crawler To create a new Crawler that stores its crawl tasks in a database: Select the Configuration module. Right click the Crawlers node. Select New Crawler / Database-based Crawler. Edit the properties as described below (Figure 10-3 Crawler Properties (with database)). 10.2.1.2 Properties Figure 10-3 Crawler Properties (with database) A Crawler using a database stores the crawl tasks and the jobs in a relational database. It has the following properties: - Name is the name of the Crawler. The name cannot be changed once a new Crawler has been created. - Server Instance is the server instance on which the main component of the Crawler is running. If this server instance is not available for some reason, the entire Crawler will be unavailable. - Data Source is the Data Source in which the Crawler stores its data. This value cannot be changed. - Start Time is the time at which the Crawler starts automatically. - Stop Time is the time at which the Crawler stops automatically. - Suspend Tasks after connection errors configures the number of connection errors after which a task is suspended for a period of time. During this time, none of the task s Globalbrain Page 76 of 181

Administration Guide 10 Crawler jobs are processed. This stops the crawler from continually trying to crawl documents from sources which are temporarily unavailable. The default value is 10. The task is only suspended if this number of errors happens in a sequence - if a job succeeds the counter is reset. - Suspend Tasks for minutes configures for how long the task is suspended. This value is the base time and it is used the first time the task is suspended. If after that time the source is still not reachable, the time grows in steps up to five times the original value. So if the value is 5 (default), no jobs for this task will be crawled for 5 minutes. If after 5 minutes the source is still not reachable, the task is suspended for 10 minutes. If after these 10 minutes the source is still unreachable, it is suspended for 15 minutes and so on, up to 25 minutes. This time duration will be maintained until a successful connection. - Execution of Job Deletion Rules configures the time at which the job deletion rules are executed. See section 10.6.6 for more details. Note: The configuration of start and stop times is optional. If you don t configure them, you must start and stop the Crawler manually. If you want the Crawler to start and stop automatically, you need to configure both start and stop times. 10.2.2 File-based Crawler 10.2.2.1 Creating a new Crawler To create a new file-based crawler: Select the Configuration module. Select the Crawlers node. Open the context menu and select New Crawler / File-based Crawler. Edit the properties as described below. 10.2.2.2 Properties Figure 10-4 Crawler Properties (file-based) The file-based crawler has the following properties: - Name is the name of the Crawler. The name cannot be changed once a new Crawler has been created. - Server Instance is the server instance on which the main component of the Crawler runs. If this server instance is not available, the entire Crawler is unavailable. This value cannot be changed. - Start Time is the time at which the Crawler starts automatically. - Stop Time is the time at which the Crawler stops automatically. - Execution of Job Deletion Rules configures the time at which the job deletion rules are executed. See section 10.6.6 for more details. Globalbrain Page 77 of 181

Administration Guide 10 Crawler 10.2.3 Configuring Crawl Agents 10.2.3.1 Adding, Editing and Removing Agents When configuring a new Crawler, one agent is preconfigured automatically. To add further agents: Select the Configuration module. Right click the required Crawler in the Crawlers node. Select Add Agent from the context menu. Edit the properties as described in the following section (Figure 10-5 Agent Properties). To edit an existing agent: Select the Configuration module. Select the Crawlers section. Select the required agent in the Crawler node. Edit the properties as described in the following section (Figure 10-5 Agent Properties). To delete an agent: Select the Configuration module. Select the Crawlers section. Right click the required agent in the Crawler node. Select Delete from the context menu. Note: You can add and remove agents while the Crawler is running. New agents will start participating in the crawling process within two minutes. 10.2.3.2 Properties Figure 10-5 Agent Properties Crawler agents have the following properties: - Server Instance is the server instance on which the agent is started. - Threads is the number of threads that are started by this agent. - Protocols are the protocols that can be crawled by this agent. Note: If there is only one crawler agent and the server instance for the Crawler is changed, it is assumed that the Crawler will run completely on one server instance. The server instance for the agent is changed automatically to reflect this. To run the crawler agent on a different server instance you have to change this manually in the crawl agent properties. 10.2.4 Removing Crawlers To delete a Crawler: Select the Configuration module. Right click the required Crawler in the tree. Globalbrain Page 78 of 181

Administration Guide 10 Crawler Select Delete from the context menu. Note: When deleting a Crawler, all crawl tasks assigned to this Crawler will also be deleted. Documents that have been crawled by this Crawler are not affected. 10.2.5 Configuring a Proxy 10-1 Proxy Configuration If you need a proxy server to crawl web servers via HTTP, you can configure this by selecting Proxy Settings in the Configuration module. The following properties are available: - Use Proxy activates or deactivates the use of the proxy server. None of the other properties is editable until this checkbox is activated. - Host is the server name or IP address of the proxy server. - Port is the port of the proxy server. - Exceptions is a list of server names for which the proxy will not be used. If the configured proxy server requires authentication, please also enter a user name and a password. 10.3 Permissions on Crawlers To edit the permissions of a Crawler in the Globalbrain Administration Client: Select the Configuration module. Select the Crawler in the Crawlers node. Click on the Edit Permissions button in the toolbar of the detail panel. To edit the permissions of a Crawler in the Globalbrain Web Client: Select the Crawler module in the administration section. Select the Crawl Tasks tab for the appropriate crawler. Click on the Edit Permissions button in the toolbar at the bottom of the table. The following permissions are available: - Read Crawl Tasks allows the actor to read this crawler's crawl tasks. - Write Crawl Tasks allows the actor to create, modify and delete crawl tasks. - Crawler Monitoring allows the actor to monitor the status of the Crawler and its agents, and to view and modify jobs. Please refer to section 6.4.2 for instructions on how to assign permissions in the permissions dialog. Globalbrain Page 79 of 181

Administration Guide 10 Crawler 10.4 Configuring Crawl Tasks 10.4.1 Properties Not all of the properties that are available for crawl tasks make sense for all of the protocols. Depending on the type of the document source you have chosen, the Globalbrain Administration Client will only display those properties that make sense in this context. Only the generic form, which needs to be used for custom types of document sources, displays all properties. Figure 10-6 Crawl Task Properties The crawl task properties editor tool bar provides the following action buttons: - Save: Click this button to apply the changes you have made. - Reset: Click this button to undo your changes, and to restore default settings. - Validating Jobs: See section 10.8 Validating Jobs. - IMAP Wizard: Opens the IMAP Wizard to configure the connection to a mail server. See section 10.4.4 for detailed information on the different settings. The content of the tabs is described in the next section. Globalbrain Page 80 of 181

Administration Guide 10 Crawler 10.4.1.1 General Settings Figure 10-7 General Crawl Task Settings - Name is the human readable name of the crawl task. It is used when the crawl task is displayed in a list. - URL is the start URL of the crawl task. It is only visible as a field of its own for http and custom protocols. For other protocols, it is requested indirectly as described below. - Crawl into sets the Document Repository in which the documents are stored. The target Document Repository can be changed for existing crawl tasks; this means that all existing jobs are deleted and a new job for the start URL is created. This is the same as if the task were deleted and created again for the new Document Repository. - Max. Layer sets a limit on the maximum layer for jobs. This is only editable for http and custom document sources. For all other types, there is no limit. - Enabled allows the crawl task to be enabled or disabled. If the task is disabled, no jobs for this task will be crawled. As soon as the task is enabled again, the crawling task will continue. 10.4.1.2 Document Sources The part of the form that asks for the connection details (host, port, path, etc.) differs depending on the type of document source. Access to a Web Server (http) Figure 10-8 Web Server Configuration To configure access to a web server (http), enter the following information: - URL is the address of the web site to be crawled. Access to a Network file (nfile) Figure 10-9 Network File Configuration To configure access to a Windows share (nfile), enter the following information: Globalbrain Page 81 of 181

Administration Guide 10 Crawler - Path is the UNC path to the folder or file that is shared. Enter it the same way as you would enter it in Windows explorer (e. g. \\ourfileserver\documents\ ). Access to Local File System (lfile) Figure 10-10 Local File Configuration To configure access to a directory in a local file system (lfile), enter the following information: - Host is the name or IP address of the host that contains the directory to be crawled. - Directory is the directory to be crawled. Note: lfile crawl tasks can only be crawled if a crawl agent is available on the configured host. Only this agent can process this crawl task s jobs. (see Figure 10-5 Agent Properties) Access to an FTP Server (ftp) Figure 10-11 FTP Server Configuration To configure access to an FTP server (ftp), enter the following information: - FTP Server is the name or IP address of the FTP server. - Port is the port on which the FTP server is listening. FTP servers usually listen on port 21. - Directory is the directory on the FTP server to be crawled. Access to a Mail Server (imap) Figure 10-12 Mail Server Configuration Globalbrain Page 82 of 181

Administration Guide 10 Crawler To configure access to a mail server (imap), enter the following information: - Mail Server is the name or IP address of the mail server - Port is the port of the mail server. IMAP servers usually use port 143. - Folder is the folder to be crawled. To crawl the main folder of the account owner (specified in the Authentication section), use INBOX. - Use SSL allows you to access your mail server via SSL. Note, that the port will change as soon as you select this option. If you enable this option, the jobs of this crawl task will use the imaps rather than the imap protocol. Access to a SharePoint server (sp) Figure 10-13 SharePoint Server configuration To configure access to a SharePoint server (sp), enter the following information: - Server is the name of the SharePoint server. Select http or https from the drop down list. - Site is the target site within SharePoint. Leave this empty to crawl the main site. Please note that multi part site titles need to be provided without spaces according to the site s relative URL: Globalbrain Page 83 of 181

Administration Guide 10 Crawler Figure 10-14 Checking proper site name on SharePoint Path is a path within the site. This can either be a complete document library or just a folder within the document library. When you leave this empty, the full site is crawled. The Versions drop down list provides the options Most recent version only and All versions. For Username, enter the user account to be used to crawl SharePoint. This account must have read permission on all documents that are to be crawled. Please note that when working with a domain account, the domain must be part of the user name. Enter the appropriate password. Leave the default selection for Type. Access to a Newsgroup Server (nntp) Figure 10-15 Newsgroup Configuration To configure access to a newsgroup server (nntp), enter the following information: - Server is the name or IP address of the Usenet news server. - Port is the port of the server; this is usually port 119. - Newsgroup is the name of the newsgroup to be crawled. Globalbrain Page 84 of 181

Administration Guide 10 Crawler 10.4.1.3 Scheduling Figure 10-16 Crawl Task Scheduling In the Scheduling section you can configure if the task is processed only once, or periodically. - Process immediately submits the crawl task instantly to the job manager to be crawled / recrawled. This is independent of any scheduled crawling times. Non-recurring crawl tasks will be recrawled when this checkbox is selected. - Recurring marks the crawl task as recurring. When selecting this checkbox, you also need to configure a time at which the task is recrawled. - Time configures the execution time for recurring crawl tasks. 10.4.1.4 Authentication Figure 10-17 Crawl Task Authentication The Authentication section allows a user name and password to be specified if one is required to access the document source. In the majority of the cases, Type should be kept as Default. When crawling a web site that requires a form-based login, select HTML Form as the type. Figure 10-18 Form-based authentication HTML Form requires additional information. You will have to look at the HTML source code of the page that contains the login form to collect this information. URL is the address of the page that contains the login form. This is the page you would open in your browser to enter a user name and a password. Form name is the name of the form element. This value only has to be set if the page contains several forms with password fields. User Parameter is the name of the input field that contains the user name. Password Parameter is the name of the input field that contains the password. Globalbrain Page 85 of 181

Administration Guide 10 Crawler Success Condition is an optional OGNL expression that is used to evaluate if the login was successful. If no value is set, the crawler always assumes that the login succeeded. Within the expression, you can use three variables: redirecturl contains the URL of the page the request was redirected to; it is null if no Location header was sent by the server. text is the HTML source of the page that was loaded when the form was submitted. You may want to look for a string that indicates whether or not you are on the expected page. cookies is a map containing the cookies that were sent back by the server when submitting the form (if any). You may want to check this for the presence of a session id. There is a third option available for the authentication type: BeanShell Script, which assigns a BeanShell script that performs the authentication. This option is not required for any of the standard document sources Globalbrain supports but may be useful for custom protocols. Please refer to the Globalbrain Customization Guide for details about this. Note: The Authentication section is not available for crawl tasks of type lfile and nfile. For http, the authentication information is optional and only needs to be entered when crawling a password-protected site. 10.4.1.5 Crawling Rules Figure 10-19 Rules The properties described in this section configure rules that specify what is crawled, what is ignored, and what will be updated if the task is crawled again. Some of these rules - for instance, those that specify which URLs will be updated - make use of filters that accept or reject URLs. - All URLs accepts all URLs. - Start URL and Folders accepts the start URL of the crawl task and URLs that are recognized as folders. - Start URL accepts only the start URL of the crawl task. Globalbrain Page 86 of 181

Administration Guide 10 Crawler If Start URL and Folders is selected in Only follow links from or All but Start URL and Folders is selected in Store text for, the Folder URLs section will be displayed to define what kind of URLs are to be recognized as folders. Figure 10-20 Folder URLs The simplest rule is to treat a URL that ends with a slash ( / ) as a folder; this will apply for most document sources like file systems, FTP servers and mail servers. The advanced option is to treat as folders any URLs that contain or do not contain a given string, or that match or do not match a given regular expression. This is relevant for crawling web sites. For many sites a pattern is required to describe what a folder (a page that contains only links to other pages) is. Storage and Update provides the following properties: - Store text for allows a URL filter to be set that determines for which URLs a document s text is stored and for which it is not. The following filters are available: All URLs, All but start URL and folders, and All but Start URL. Since the folders that occur in a file system or on an FTP server do not produce text as such, this is actually only relevant when crawling web sites and thus only displayed for http and custom crawl tasks. - Update allows URLs to be configured which are recrawled if the crawl task is processed again. You may want to limit that to folders if you are sure that the actual documents don t change. The Crawling section provides the following properties: - Only follow Links from configures a URL filter that specifies which URLs can act as a source for new links. This is only offered for http and allows you, for instance, to define that you accept links from the start page and folder pages, but not from pages containing articles. - Accepted Links filters links by their location in relation to the start URL of the crawl task. This is only displayed for http. - Allowed Extensions is a list of file extensions. Only resources that have one of these extensions will be crawled. You can use * or? as wild cards. Check accept URLs without file extensions to also crawl URLs that do not have a file extension. Please note: The extension configuration file has been corrected for OpenOffice file extensions. The erroneous file extension odc has been replaced with ods. If you open an existing crawl task configured to crawl Office documents which were created in a previous Globalbrain version, on the properties sheet, the Office button will appear unclicked. However, this does not affect the crawler s accuracy. Click the Office button, and you will also find the ods file extension in the list below. Globalbrain Page 87 of 181

Administration Guide 10 Crawler - Allowed Paths is a list of strings that URLs must contain if they are to be crawled. Multiple values can be separated by blanks. Values starting with ~ are treated as regular expressions. Values starting with "/" are treated as paths: For paths, parent directories are also accepted for crawling (e. g. "/a/b" would accept "/a/" or "/a/b/c.txt but not "/a/c"). - Forbidden Paths is a list of strings that must not occur in URLs if they are to be crawled. The format is the same as for allowed paths. Note: In the simplest case you will be using folder names for allowed paths and forbidden paths but you may also use fragments of folder names or file names. For instance, you may wish to exclude printable versions of articles on a website by adding print to the forbidden paths. 10.4.1.6 Extraction Globalbrain uses various extractors to extract the text and attributes from the original documents. Some of these extractors provide configuration options that can be set via the Extraction tab. Since the administration client cannot know which document types and thus which extractors will be used for the current crawl task, it is up to you to decide which settings are relevant for this task and which are not. When you extract plain text files, it is assumed that the files use the platform s default character set. This differs between the various operating systems. If you know that your files use a different character set for instance, UTF-8 you can configure this as Charset. Figure 10-21 PDF Extractor properties For PDF documents, three extraction modes are available: Text only only extracts text that is available as text content within the PDFs. This is the default option. OCR on Images converts the pages of the PDF document into images and performs an OCR on these images. Use this if your PDFs contain scanned images. Mixed tries to extract text content first and performs an OCR on images only if no or almost no text was extracted. Use this if your document source contains both, PDFs with text content and PDFs with scanned images. Globalbrain Page 88 of 181

Administration Guide 10 Crawler Note: The OCR on Images option also works for PDFs that contain text content but, since this is a very time-consuming operation, you should avoid using it unnecessarily. For the Mixed option, you can define a maximum number of characters as a threshold for the usage of OCR - OCR will only be used if the text that was extracted is smaller than the configured value. This setting is not relevant and disabled for the other modes. Figure 10-22 Mixed extraction mode Note: OCR on Images is only available on agents that run on a Windows operating system. Figure 10-23 HTML Extractor Properties When you extract an HTML document, the following properties are available: - Extract JavaScript Links configures the extraction of JavaScript links. Globalbrain can't really interpret JavaScript sections within an HTML page. It can look for patterns that typically contain a link. The following options are available: - Ignore JavaScript Links lets the extractor completely ignore JavaScript links. - Only when sure lets the extractor add links from JavaScript sections only if it is sure that they really are links. - Aggressive lets the extractor add everything that could be a link. This may lead to many illegal URLs during crawling. Globalbrain Page 89 of 181

Administration Guide 10 Crawler - Title Tag or Class allows you to specify what is interpreted as the document s title. This can be a tag name (title, h1, h2, ) or the name of a CSS class. If nothing is set, the default rules - which give the heading tags a higher priority over the title tag - are applied. - Content Classes can be set if the complete text content is included in a div or table use the name or id of this div or table as the value. - Skip provides various options for ignoring parts of the text. The problem with HTML is that the full text extract also contains the text of navigation elements, page headers and footers, links to new or related articles, etc. Having all this in the text extract often reduces the percentage of useful content to less than 30%. The options are: Forms skips everything that is included within an HTML form. Sections containing mostly links skips texts from sections (table cells, DIVs, etc.) that contain mostly links. This is helpful to get rid of navigation sections. Small tables skips text from tables that do not contain much text. Such tables often contain navigation elements. Small sections following a large section skips small tables or DIVs that follow much larger tables or DIVs. This is based on the assumption that the larger sections contain the relevant text and the smaller sections contain non-relevant data like page footers, etc. There is no configuration that works for all web sites. You need to test what works best for the site you want to crawl. To do this, enter an example URL in the URL field and see what happens when options are enabled or disabled. Figure 10-24 Images Extractor Properties When crawling images, the following properties are available: - Rendition Image Width - determines the width (in pixels) of the images that are generated for each rendition image in the web client - Rendition Color Depth - the number of colors used in the rendition images - Cairo Configuration: Allows a custom settings file for OCR processing to be uploaded. Please refer to section 18.2 CairoExtractor Designer to learn how to configure OCR processing using the CairoExtractor Designer tool. Globalbrain Page 90 of 181

Administration Guide 10 Crawler 10.4.1.7 Attribute Filter Figure 10-25 Attribute Filter Configuration The Attribute Filter configuration in the Transformation tab gives you some influence over how attributes that have been extracted from the original document are added to Globalbrain. If you do not configure anything, all the attributes the crawler extracts are taken as they are. You can: rename attributes before they are added to Globalbrain; accept only a predefined list of attributes and reject all others; and add static attributes with a fixed value. The same attribute may have different names in different types of document. To standardize this, you can configure a rule to rename attributes: Click on the Add button to create a new rule. Enter the name of the attribute you want to rename in Original Name. Enter the new name of the attribute in Rename to. Some document types may provide a lot of attributes that are of no interest to Globalbrain. This may, for instance, be the case for HTML documents from web servers where you may want to accept only certain attributes. To do this: Check the reject all unmapped attributes checkbox. For each attribute you wish to keep: Click on the Add button to create a new rule. Enter the name of the attribute you wish to keep in Original Name and in Rename to. You may wish to assign a certain attribute (e.g. source = "BBC") to all documents that have been added while crawling this crawl task. To do this: Click on the Add button to create a new rule. Enter the name of the attribute you wish to add in Original Name. Enter the value you wish to assign, in quotes, in Rename to. For example, "BBC". Of course you can combine all three kinds of settings in one rule set. Globalbrain Page 91 of 181
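To make the combination concrete, here is a small, purely illustrative rule set; the attribute names are invented for this example and are not predefined Globalbrain attributes:
- Original Name dc:title, Rename to title - renames the extracted title attribute.
- Original Name author, Rename to author - keeps the attribute unchanged, which is necessary when reject all unmapped attributes is checked.
- Original Name source, Rename to "BBC" - assigns the static value BBC to every document crawled by this task.
With reject all unmapped attributes checked, any other attribute the crawler extracts would be dropped.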

Administration Guide 10 Crawler 10.4.1.8 URL Transformation Figure 10-26 URL Transformation Rule Some web sites encode a session id in their URL. This is a dangerous issue for the crawler because each time it visits the site, it gets a new session id and thus it will treat all links as new and crawl them. As a result you may have the same document stored multiple times in your document repository using URLs that only differ in their session id. To address this problem, you can configure a URL Transformation rule in the Transformation tab: - Pattern is a regular expression to match the part of the URL that needs to be modified. - Replacement is a replacement string for the pattern above. Leave this empty if you want to remove the characters that match the pattern completely. Use $1, $2, etc. to insert sections from the match that have been placed in brackets in the regular expression. As an example, let s assume we have URLs like: http://www.somewebsite.com/page.php?sessionid=1a2b3c We wish to get rid of the session id. A pattern that describes the session id is sessionid=\w+. Since this can be eliminated completely, we do not need a replacement value. 10.4.1.9 Assigning Documents to Groups Figure 10-27 Assigning Documents to Groups The crawler can assign all documents that have been crawled by this crawl task to one or more document groups within the document repository. To do this, select the Assignment tab and check the groups you want the documents to be assigned to in the Document Groups section. 10.4.1.10 Renditions The Renditions tab is only enabled if you have configured a Document Rendition Repository (see section 7.6) Globalbrain Page 92 of 181

Administration Guide 10 Crawler Figure 10-28 Renditions Settings Original Layout The crawler can optionally create renditions of the crawled documents that allow the web client to display documents in their original layout. This is configured on a per-file-type level. To enable the creation of renditions in their original layout, either select single document types from the Original Layout list or use the buttons at the top to select groups of documents types (Office, PDF, Images). Original Documents The crawler can optionally copy crawled documents to the document rendition repository. This is also configured on a per-file-type level. To enable the storage of copies, either select single document types from the Original Documents list or use the buttons at the top to select groups of documents types (Office, PDF, Images). 10.4.1.11 Advanced Settings Figure 10-29 Advanced Settings There are some settings on the Advanced tab that do not fit into the other categories we ve described so far: - Priority is the priority of this crawl task compared with other crawl tasks. Jobs from crawl tasks with a higher priority are crawled completely before jobs from crawl tasks with a lower priority are crawled. - Max. Retries is the maximum number of attempts to crawl a job before it is classed as failed. If this limit is reached for a job, it will not be crawled if the crawl task is processed again.. - Ignore Server Policies instructs the crawler to ignore crawling rules that have been supplied by the server that is being crawled. Globalbrain Page 93 of 181

Administration Guide 10 Crawler An example is the robots.txt file that can be found on many web servers. Brainware recommends that this checkbox is left unchecked. 10.4.1.12 Configuring a Post Processor Figure 10-30 Post Processor Configuration A post processor for crawl tasks can modify the data that has been extracted from a document before it is stored to a document repository. This can be used to do custom modifications like: - matching additional attributes for a document from a database or a document management system (DMS) - adding, modifying or removing attributes based on custom rules - modifying the extracted text You can either implement a custom post processor as a Java class or as a BeanShell script, or use one of the standard post processors that are delivered with Globalbrain. Please consult the Globalbrain Customization Guide for details about implementing custom post processors. Figure 10-31 Search & Replace in Text To perform a search and replace in the text that was extracted from a document, select Search & Replace in Text as the type and enter the following information: Search for is a regular expression that captures the part of the text that is to be replaced. Replace by is the replacement for the text captured by the regular expression. If your regular expression contained groups, you can refer to them with $1, $2 etc. Leave this value empty if you only want to remove the matching text. This option can be useful if all of the crawled documents end with some standard text (e. g. a disclaimer in mails) that should be removed before the document is stored. Figure 10-32 Extract Attribute from URL When crawling a file system or a mail server, sometimes the folder names contain valuable information that is worth storing with the document. This can be done by selecting Extract Attribute from URL as the post processor type: - Attribute is the name of the attribute under which the extracted value will be stored - Path Element is the part of the path that will be used as the attribute value. It is given as an index which points to the path element that is to be used. Positive values count from the beginning of the path, negative values from the end. Globalbrain Page 94 of 181

Administration Guide 10 Crawler An example; let's assume you are crawling a directory that contains documents that are related to your customers. The folder structure is set up so that the last folder is the name of the customer. So you will have a URL like this as path: nfile://fileserver/invoices/customer1/xyz.doc To add a "customer" attribute that uses the last folder as value you would configure "customer" as Attribute and -2 as the Path Element. Figure 10-33 Using a Custom Java Class To use a custom post processor that is available as a Java class, select Custom Class as the type: - Class Name is the fully qualified name of the Java class. - Additionally, parameters can be configured that are assigned to the post processor after it has been instantiated. Click on the Add button to add an attribute and enter parameter name and value. The list of available parameters depends on the implementation. You need to request this information from the developer of this class. Figure 10-34 Using a BeanShell Script To add a custom processor as a BeanShell script, select BeanShell Script as the type and enter the source code of the script to the text area. Globalbrain Page 95 of 181
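As an illustration only, a BeanShell post processor might look like the sketch below. The document variable and the getAttribute/setAttribute/getText/setText methods are assumptions made for this example; the objects actually passed to a script are described in the Globalbrain Customization Guide and may differ:

// Hypothetical sketch of a BeanShell post processor. The 'document' variable and its
// methods are assumptions for illustration, not the documented Globalbrain scripting API.
String dept = document.getAttribute("department");    // normalize an extracted attribute
if (dept != null) {
    document.setAttribute("department", dept.trim().toUpperCase());
}
document.setAttribute("source", "Intranet");           // add a static attribute
String text = document.getText();
if (text != null && text.endsWith("Confidential")) {   // strip a standard trailing marker
    document.setText(text.substring(0, text.length() - "Confidential".length()));
}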

Administration Guide 10 Crawler 10.4.2 Viewing and Editing Crawl Tasks Figure 10-35 Crawl Tasks in Administration Client In the Globalbrain Administration Client, the Crawler module provides one tab for each configured crawler at the bottom of the panel. Tabs at the top allow you to switch between Tasks and Jobs. To view a list of configured crawl tasks, select the Tasks tab. Each task is displayed with: - the type of document source (nfile, http, ftp, ), represented by an icon. - the name of the crawl task. - the name of the document repository that documents of this task are stored in. - the time the crawl task was last processed. - whether this is a recurring crawl task or not recurring tasks are displayed with an icon. - whether this task is currently enabled or not enabled tasks are displayed with an icon. You can filter the list of displayed crawl tasks by: - the document repository the documents are stored in - a time span within which the tasks were last processed To view the configuration of single crawl task, double click on this task. The task details are displayed in another window. Globalbrain Page 96 of 181

Administration Guide 10 Crawler Figure 10-36 Crawl Tasks in the Web Client To view the crawl tasks of a crawler in the Globalbrain web client: Select the Crawler module. Select the Crawl Tasks tab. Select the desired crawler from the tabs at the bottom. The displayed information and the filtering options are exactly the same as in the Globalbrain Administration Client. 10.4.3 Creating and Removing a Crawl Task Figure 10-37 Adding a Crawl Task To create a new crawl task or to delete a crawl task, first switch to the crawl task list of the required crawler as described in the previous section. To create a new crawl task: Click on the Add a new crawl task button. Select the type of crawl task to create. Edit the properties as described in section 10.4.1 Properties Note: When you add an IMAP crawl task, a wizard opens that asks for the connection data and the folders that are to be crawled. To skip the wizard and proceed to the crawl task editor, click the Cancel button. You can get back to the wizard later by clicking the wand button in the toolbar on top of the crawl task editor (see section10.4.4). To delete a crawl task: Select the relevant crawl task. Click on the Delete button. Globalbrain Page 97 of 181

Administration Guide 10 Crawler When you delete a crawl task, a confirmation dialog will pop up. If you also want to delete all the documents that have been crawled with this crawl task, enable the checkbox. Note: Only documents that have been crawled with Globalbrain 5.3 or higher can be deleted. Figure 10-38 Confirmation Dialog of Crawl Task Deletion Note: You can also delete all crawl tasks that match the current filter settings by clicking on the Delete button when no crawl tasks are selected. The steps are the same in the Globalbrain Administration Client and the Globalbrain Web Client. Note: You need the Read Crawl Tasks permission to view crawl tasks. To create, edit or delete a crawl task you additionally need the Write Crawl Tasks permission on the crawler. 10.4.4 Using the IMAP Wizard Figure 10-39 IMAP Wizard in the Web Client The IMAP wizard helps you to configure crawl tasks that crawl mails via the IMAP protocol. It simplifies the selection of the folders that are to be crawled. The IMAP wizard pops up if you create a new crawl task of type Mail (imap). You can bypass it by clicking on Cancel or by closing the dialog this takes you directly to the standard crawl task dialog. You can open the IMAP wizard for an existing crawl task by clicking on the IMAP Wizard button in the toolbar of the crawl task dialog. Globalbrain Page 98 of 181

Administration Guide 10 Crawler For a new crawl task, enter the following information: - Mail Server is the name or IP address of the mail server - Port is the port of the mail server. IMAP servers usually use port 143. - Use SSL allows you to access your mail server via SSL. Note: the port will change as soon as you select this option. If you enable this option, the jobs of this crawl task will use the imaps rather than the imap protocol. - User Name is the user name that is used to log in - Password is the password that is used to login Click on Connect to create a connection to the mail server. After the connection is established, the tree in the lower half of the dialog is filled with a list of available top level folders. Expand a folder node to list the available subfolders. Activate the checkboxes for the folders you want to crawl. Once you have finished, click on Ok. This takes you to the main crawl task dialog where you can edit additional settings for the crawl task. Note: When using the IMAP Wizard from within the Globalbrain Administration Client, the client tries to establish a connection the mail server. If firewall settings prevent this, the wizard cannot be used. 10.4.5 Importing and Exporting Crawl Tasks Crawl tasks can be exported to a file. This allows crawl task configurations to be exchanged between different Globalbrain installations (e. g. from a test system to a production system) or to backup your crawl tasks. To import or export crawl tasks, first switch to the crawl task list of the requested crawler as described in the previous section. To export crawl tasks: Select the crawl tasks you want to export from the crawl task list. Click on the Export button. Select a folder and specify a file name in the file dialog. To import crawl tasks: Click on the Import button. Select the file you want to import from the file dialog; this opens up a new dialog that displays import options. Select the crawl tasks you want to import. Select the document repository to which unbound crawl tasks will be assigned these are crawl tasks for which there is no document repository with the same name as the document repository for which they were originally created. Specify if you wish to deactivate imported crawl tasks; doing this gives you the chance to edit the crawl tasks before the crawler starts processing them. Click on Ok Globalbrain Page 99 of 181

Administration Guide 10 Crawler 10.4.6 Changing Properties of Multiple Crawl Tasks Figure 10-40 Crawl Task Properties in Globalbrain Administration Client You can easily change some properties for multiple crawl tasks in one step. The properties that are available for this are: - enabling / disabling of tasks - task scheduling - attribute filter To edit the properties of multiple crawl tasks: Select one or multiple crawl tasks from the crawl tasks list. Click on the Edit selected crawl task properties button. For each of the three sections (General, Scheduling and Attribute Filter) you can decide whether or not these settings are overwritten for the selected crawl tasks: enable the checkbox for a section to overwrite its settings. Globalbrain Page 100 of 181

Administration Guide 10 Crawler Figure 10-41 Crawl Task Properties in Web Client 10.4.7 Force Recrawling of Crawl Tasks You can easily force one or multiple crawl tasks to be recrawled: Select the Crawler module. Select the crawler and activate the Tasks tab. Select the crawl tasks to be recrawled. Click on the Recrawl button. This is similar to opening each of the tasks and setting the Process immediately checkbox in the Scheduling section. See section 10.4.1.1. 10.5 Monitoring the Crawlers The following actions require the Crawler Monitoring permission (see section 10.3 Permissions on crawlers). 10.5.1 Status Monitoring Figure 10-42 Crawler Monitoring in the Dashboard You can get a quick overview of the status of your crawlers by using the Dashboard panel in the Monitoring module: The Crawlers box displays a list of all crawlers along with their agents. The current status is displayed for each crawler and agent. Globalbrain Page 101 of 181

Administration Guide 10 Crawler 10.5.2 Starting and Stopping a Crawler Figure 10-43 Crawler Actions If you have not configured times to start and stop the crawler automatically, you need to start and stop them manually. This is done in the Dashboard panel by clicking on the crawler you want to start or stop. This opens up a box with crawler details and action buttons. Click on the Start button to start the crawler. It is disabled if the crawler is already running. Click on the Stop button to stop the crawler. It is only enabled if the crawler is running. In the Globalbrain Administration Client, the start and stop buttons are also displayed in the toolbar at the top of the crawler detail panel in the Crawler module. Figure 10-44 Starting and Stopping via Crawler module Note: If the crawler is configured to start and stop automatically, you should not start or stop the crawler manually. It works, but the crawler may restore the old state. 10.6 Monitoring Jobs All of the actions described in the following sections need the Crawler Monitoring and the Read Crawl Tasks permissions for the selected crawler (see section 10.3 Permissions on crawlers). Globalbrain Page 102 of 181

Administration Guide 10 Crawler 10.6.1 Viewing Jobs Figure 10-45 Job Monitoring in the Desktop Application Figure 10-46 Job Monitoring in the Web Application The job monitoring section helps you to find out what has been crawled, what is still open, and what has failed. In the Globalbrain Administration Client: Select the Crawler module. Select the appropriate crawler from the tabs at the bottom. Select the Jobs tab at the top. Enter your filter criteria as described below and click on the Search button. In the Globalbrain Web Client: Select the Crawler module. Select the Jobs tab. Select the appropriate crawler from the Crawler tabs at the bottom. Enter your filter criteria as described below and click on the Search button. Globalbrain Page 103 of 181

Administration Guide 10 Crawler Figure 10-47 Selecting a crawl task The following filter criteria are available: - Task allows you to filter jobs by a single crawl task or all crawl tasks of a document repository. The drop down list contains both document repositories and crawl tasks, displayed in a hierarchy. If a document repository is selected, the filter covers all tasks which crawl into that repository (see figure 10-37). - URL allows you to filter by URL enter either a full URL or use * as a wildcard. - Status allows you to filter by the job status (open, in progress, crawled, or failed). - Return Code helps you to select failed jobs. The dropdown box provides predefined sets of return codes that describe more or less specific error situations. - Modified in allows you to select only jobs that have been crawled within the given time span. After performing the search, the table provides the following information: - Task is the crawl task to which this job belongs. - URL is the URL of the document that has been crawled or needs to be crawled. This value usually needs to be shortened. The row s tooltip contains the full URL. - Layer is the layer of this job. - Status shows the status of the job whether it is still open or if it has already been crawled. - RC is the return code of this job. This shows if the job was crawled successfully or what kind of error occurred. The values are based on the return codes of the HTTP protocol. Successful jobs have a return code of 200, and failed jobs have a return code greater than 400. Return codes beyond 900 are proprietary codes used only within Globalbrain. The job details (see below) contain a description of the return code. - Retries displays the number of retries for failed jobs. A value is only shown if at least one retry has been made. - Crawled at is the time at which this job was last crawled. Jobs that have not been crawled yet don t have a value. Figure 10-48 Job Details in Globalbrain Administration Client When a job has been selected, some details are displayed below the table in the Globalbrain Administration Client. This covers the full URL and a description of the status, including the return code. The Web Client displays the same information as tooltip. For nfile, lfile, and http URLs, a click on the link opens the original document. If this is supported, the URL appears underlined when the mouse hovers over it. To open the crawled document in the Globalbrain Administration Client, click on the Open button. In the Web Client, double click on the entry in the job list. If the job has produced multiple documents (for example when crawling a ZIP archive or a mail with attachments), the document viewer is opened showing the first document. You can use the previous / next buttons at the bottom to navigate to other documents that belong to this job. If only one document belongs to the job these buttons are not enabled. Globalbrain Page 104 of 181

Administration Guide 10 Crawler Figure 10-49 Job Details in the Web Application 10.6.2 Deleting Jobs Figure 10-50 Confirmation Dialog for Job Deletion To delete jobs from the job manager: Select one or more jobs from the table. Click on the Delete button. Select Delete documents that have been crawled via these jobs if you also want to delete the documents that were crawled with these jobs; otherwise just the jobs are deleted and the documents are left unchanged. Confirm the deletion by clicking on Yes. When activating the "Delete documents " option, only documents that were crawled with Globalbrain 5.3 or higher can be deleted. It is usually not necessary to delete jobs. If an error occurred, you can force the job to be recrawled as described in the next section. 10.6.3 Enforce Recrawling of Jobs After an error, the crawler will try to recrawl the job several times according to the maximum number of retries configured for the crawl task. The next attempt will be made after the crawl task is next submitted. You can force jobs to be recrawled by selecting them and clicking on the Recrawl button. This works even for jobs where the number of retries has been exceeded. Technically, the status values of the jobs are reset so that the crawler treats them as uncrawled, open jobs. 10.6.4 Exporting Crawl Jobs Click the Export button to export the crawl jobs. The jobs can be exported to a CSV or an Excel file. Figure 10-51 Exporting crawl jobs Once you click one of the options, the appropriate file will be downloaded by your internet browser. Globalbrain Page 105 of 181

Administration Guide 10 Crawler Figure 10-52 Downloaded crawl jobs files Select the file from the Download area of your internet browser and save it, or open it. 10.6.5 Displaying the Crawl Path Figure 10-53 Crawl Path Select a job and click on the Show Crawl Path button to open a dialog that shows the path the crawler took to this job. This can be helpful to understand why the crawler has picked up a specific document, especially in non-hierarchical document sources like web sites. 10.6.6 Job Deletion Rules Figure 10-54 Job Deletion Rules The crawler can delete jobs based on configured rules. For example, it makes sense if all jobs that point to documents that no longer exist are periodically deleted automatically. To edit the job deletion rules in the Globalbrain Administration Client, click on the button. A dialog opens that allows you to create, edit and delete the rules. Globalbrain Page 106 of 181

Administration Guide 10 Crawler To create a new rule, click on the New button and edit the properties described below. To edit a rule, simply select it from the list and edit the properties described below. To delete a rule, select it from the list and click on the Delete button. The following properties are available: - Name is the name under which the rule is displayed in the list. - Task allows you to filter jobs by crawl tasks. You can select either a single crawl task or all the crawl tasks in a document repository. - URL allows you to filter jobs by URL: enter either a full URL or use * as a wildcard. - Status allows you to filter jobs by the job status (open, in progress, crawled, or failed). - Return Code allows you to filter jobs by a predefined set of return codes. - Jobs older than allows you to configure a minimum age for jobs. For example, setting this to "3 days" means that no job that has been crawled within the last 3 days will be deleted. After you have edited your settings, click on the Save button to persist the rule. 10.7 Viewing Statistics Figure 10-55 Crawl Task Statistics The crawl task statistics give you a quick summary of the crawl tasks that have been crawled within a given timespan or that have open jobs. This helps you to check if all of your configured crawl tasks are still working. To view the statistics in the Administration Client: Select the Crawler module. Select the appropriate crawler from the tabs at the bottom. Select the Statistics tab. In the Globalbrain Web Client: Select the Crawler module. Select the Statistics tab. Select the appropriate crawler from the tabs at the bottom. The statistics table displays the following information for each task: - Task is the name of the task. - Open is the number of open jobs. - Crawled is the number of jobs that have been successfully crawled within the configured time span. - Failed is the number of jobs that have failed within the configured time span. - In progress is the number of jobs that are currently being crawled. - Last Job crawled At is the time at which the last job for the task was crawled. - Last Processing Time is the time at which the task was last processed. Globalbrain Page 107 of 181

Administration Guide 10 Crawler The following filter criteria are available: - Task allows you to filter by a single task or all tasks of a document repository. - URL allows you to filter jobs by a URL pattern you can use * as wild card. This means only jobs are counted for the statistics that match this pattern. - Modified in allows you to filter jobs by their crawl date. That means only jobs are counted for the Crawled / Failed / In progress statistics that have been crawled within the configured time span. Notice that this limit is not applied for open jobs. Click on the Search button to apply the filter settings or to refresh the statistics. Please notice that only tasks are displayed in the table for which at least one open / crawled / failed / in progress job exists. 10.8 Validating Jobs Figure 10-56 Job Validation Results The Globalbrain Administration Client provides a tool that compares files from a local or shared directory with the jobs the crawler has created for these files and the documents that have been created in Globalbrain. A list is provided that contains the files that have not been crawled and an explanation why they have not been crawled. This helps you to verify that the crawl task configuration is working as expected. To run job validation: Select the Crawler module. Select the appropriate crawler from the tabs at the bottom. Select the Tasks tab. Double click on the crawl task to be validated. Click on the Validate jobs button in the toolbar of the crawl task dialog. The job validation is only available if the crawl task - has already been processed or - crawls network files (nfile) or local files (lfile) of the current machine. After the validation has been completed, you see a list of URLs that have not been crawled or for which crawling has failed. The table contains the following information: - URL is the URL of the file that has not been crawled. Globalbrain Page 108 of 181

Administration Guide 10 Crawler - RC is the return code that was produced when crawling this URL. If the URL has not been crawled at all the value is 0. - Reason displays the reason why the URL has not been crawled, or information about why crawling failed. You can sort the list by clicking on the head of a column. You can filter the list by any of these properties. For the URL, * is allowed as wild card. Globalbrain Page 109 of 181

Administration Guide 11 Classification 11 Classification 11.1 Classification in Detail 11.1.1 What the classification service can do Globalbrain s classification service is another way of helping you to find the documents you are interested in. The basic idea is that Globalbrain is able to assign documents to categories that you have previously defined with example texts. What you can do is to create a view on your documents: each view consists of multiple classes, with each class representing a subject or some kind of content you are interested in. Let s say you are interested in sports: you would create a view called Sports with classes like Football, Basketball, Baseball, etc. You would then provide example texts for each of those classes texts about football, basketball, baseball, - to tell Globalbrain what kind of content you are expecting. After the view has been learned by one of Globalbrain s classifiers, Globalbrain is able to classify an unknown text and can tell you in which of the classes it fits. Globalbrain can classify all the documents of a document repository against a view and store the results. You can then browse through the results and see which (new) documents exist in which category. Optionally, Globalbrain can store the name of the classes to which a document has been assigned as an attribute of this document. You will need to configure this attribute as a facet attribute in the general search index settings. Then, the views can be displayed as facets in the search, and you can easily limit the search to documents belonging to a given class. 11.1.2 Available Classifiers Globalbrain provides two classification engines: - The SVM classifier is technically a support vector machine. It compares the words it finds in a document with the words that occurred in the example texts of the classes. If it finds combinations that are typical for a class it assigns the document to that class. - The N-Gram classifier splits the words of a document up into segments of n characters (n-grams). It compares the frequency of the n-grams it finds in the document with the frequencies of these n-grams in the example texts and assigns the document to the class that is most similar. 11.1.3 View Repositories Views are organized in view repositories. You need to create at least one view repository within your security domain to make the classification functionality available in the clients. You can create multiple view repositories with each view repository having its own security settings; so you might create different view repositories for different user groups. For each view repository, you need to configure at least one classification server. A classification server runs on a designated server instance and is responsible for learning views and classifying documents. Please be aware that classifying a large number of documents can cause a high CPU load. You can configure multiple classification servers if you want to distribute the work on multiple machines. To support the automatic classification of all documents in a document repository against a given view in the background, you need to configure the auto classification service for the view repository. Globalbrain Page 110 of 181

Administration Guide 11 Classification Once this is configured, you are able to create classification tasks which configure which documents are classified against which views. 11.2 Configuring a View Repository All of the actions described below require the Read Configuration and Write Configuration permissions on the security domain. The configuration of a view repository consists of three parts: - the configuration of the actual view repository. - the configuration of one or more classification servers. - the configuration of the autoclassification service (optional). 11.2.1 View Repository 11.2.1.1 Properties Figure 11-1 Configuring a View Repository View repositories are only available when using a Globalbrain installation that works with a relational database. A view repository has the following properties: - Name is a human readable name for the view repository. - Data Source is the data source in which the data is stored. This cannot be changed once a new view repository has been created. 11.2.1.2 Creating and Deleting View Repositories To create a new view repository: Select Configuration module. Right click View Repositories. Select Add View Repository. Edit the properties as described in the previous section. To delete a view repository: Select Configuration module. Right click on the view repository in the View Repositories node. Select Delete. Confirm the deletion. 11.2.2 Classification Server 11.2.2.1 Properties Figure 11-2 Classification Server Properties Globalbrain Page 111 of 181

Administration Guide 11 Classification A classification server has the following properties: - Server Instance is the server instance on which the server runs. - Cache Timeout is the time in minutes that the server caches the internal representations of views in memory after the last access. The caching is done for performance reasons to avoid having to load the data for each set of documents that is classified. On the other hand, the data should not be kept in memory if the classifier is no longer used. 11.2.2.2 Creating and Removing Classification Servers Figure 11-3 Adding Classification Server To add a classification server: Select Configuration module. Right click the required view repository. Select Add Classification Server. Edit the properties as described in the previous section. To remove a classification server: Select Configuration module. Right click the classification server in the View Repositories node. Select Delete. 11.2.3 Autoclassification Service 11.2.3.1 Properties Figure 11-4 Configure an Autoclassification Service From the Server Instance dropdown list, select the server instance on which the autoclassification service is running. Document Attribute Synchronization configures the time at which class attributes are added to the documents. This is a background process that should run before the affected search indexes are synchronized. This is only relevant if you have configured classification tasks that store the classes as document attributes (see section 11.4.2). 11.2.3.2 Configuring and Deleting the Autoclassification Service To configure the autoclassification service: Select Configuration module. Right click the required view repository. Select Add Autoclassification Service. Edit the properties described in the previous section. Globalbrain Page 112 of 181

Administration Guide 11 Classification To delete the autoclassification service: Select Configuration module. Right click the Autoclassification Service you want to remove. Select Delete. Note: An autoclassification service can only be configured once for a view repository. Once you have configured it, the "Add Autoclassification Service" option will be disabled. 11.3 Working with Views Before a document can be classified against a view, the classifier needs to build its internal representation of the learnset. This process is called "learning". It needs to be invoked after the view has been created with its classes and examples, and after each modification of the learnset or the classifier configuration. Globalbrain provides two classification engines: - The SVM classifier is technically a support vector machine. It compares words found in a document with the words that occurred in the sample texts of the classes. If it finds combinations that are typical for a class, it will assign the document to that class. - N-Gram Classifier splits words of a document into segments of n characters (n-grams). It compares the frequency of n-grams found in a document with the frequency of these n-grams in the sample texts of the classes and assigns the document to the class which is most similar. Please be aware that at least two classes are required for a view before it can be learned. You can create, export, and import views in both administration desktop and web applications. 11.3.1 Creating a View The creation of a view consists of four steps: - creation and configuration of the actual view, see section 11.3.1 Creating a View - adding classes to the view, see section 11.3.1.3 Adding and Removing Classes - adding example documents to the classes, see section 11.3.1.4 Adding and Removing Example Texts - learning the view, see section 11.3.1.5 Learning the View 11.3.1.1 Properties of a View Figure 11-5 Properties of a View with N-Gram Globalbrain Page 113 of 181

Administration Guide 11 Classification A view's detail panel consists of three sections: Toolbar, General, and Classifier. Toolbar The Toolbar provides you with the following options:
- Save: Click this button to apply the changes you have made.
- Reset: Click this button to undo your changes, and to restore default settings.
- Learn: You can learn the View by clicking this button. Note: learning will fail if there are insufficient sample texts.
- Import Learn set from Directory or Archive: You can browse for a file containing your learn set.
- Export View: Select the target location for your view data here.
- Permissions: Grant permissions for the currently selected view. For instance, you can restrict users from writing to the view. Please refer to section 11.5 for further information.
General In the General section, the following properties can be set: - Name is a human readable name of the view. - Classifier Type selects the classification engine that will be used for this view. See section 11.1.2 Available Classifiers for a general overview of the available classifiers. Classifier The content of the Classifier section depends on the selected classifier type. For an SVM classifier, the following properties are available: Figure 11-6 Properties of a View with SVM Classifier selected - Min. Word Size configures the minimum length for words that can be taken into the internal dictionary. The dictionary contains the list of words that are used for the classification. For example, when configuring a minimum size of three, words like "a" or "at" would be ignored completely. If your example texts contain relevant words that are very short (e.g. abbreviations) you need to adjust this value to make sure that they are included. - Min. Word Frequency configures how often a word must occur in the example texts before it is included in the dictionary. This prevents words that only occur a few times from being added to the dictionary. - Max. Words in Dictionary configures the size of the dictionary. If the example texts contain more words that match the conditions for being added to the dictionary, the engine will choose the words it considers most relevant. - Use Numbers in Dictionary configures whether or not numbers are included in the dictionary. In most cases, numbers have no relevance. For an N-Gram classifier, it is important to understand that word fragments (n-grams), or more precisely their frequencies, rather than whole words are relevant for the classification. During the classification, the document being classified is segmented into blocks; it is these blocks and not the full documents that are compared with the example texts. Figure 11-7 Properties of a View with N-Gram Classifier selected - Ngram Size configures the size of the word fragments that are used for the classification. - Word Window Size configures the number of words that are contained in a text block during classification. - Word Overlap configures the number of words two neighboring text blocks have in common. 11.3.1.2 Creating a View To create a new view in the Globalbrain Administration Client: Select the Classification module. Right click Views. Select New View. Edit the properties described in the previous section. To create a new view in the Globalbrain Web Client: Select the Classification module. Select the relevant view repository from the Repository drop down box. Click on the New View button. Edit the properties described in the previous section. Globalbrain Page 115 of 181

Administration Guide 11 Classification Figure 11-8 Creating a View in the Web Client 11.3.1.3 Adding and Removing Classes Figure 11-9 Class Configuration A class has the following properties: - Name is a human readable name for the class - Min. Relevance configures a minimum score that classified documents must have before they are considered to belong to this class. Documents with a lower relevance are rejected. Figure 11-10 Adding a New Class To add a class to a view in the Globalbrain Administration Client: Right click on the appropriate view. Select New Class. Edit the properties described above. To delete a class, select Delete from the context menu of the class. To add a class in the Globalbrain Web Client: Select the appropriate view repository in the Repository drop down box. Select the relevant view in the View drop down box. Make sure the Classes tab is selected. Click on the New Class button. Globalbrain Page 116 of 181

Administration Guide 11 Classification Edit the properties described above. To delete a class, select the class and click on the Delete button. 11.3.1.4 Adding and Removing Example Texts Figure 11-11 Adding Example Texts To add or remove an example text, first select the class in the tree (Globalbrain Administration Client) or in the Classes list (Globalbrain Web Client) so that its details are displayed in the main panel. The Example Texts section displays the existing example texts. One text is displayed at a time. Use the navigation buttons to move through the example text. To add a new example text: Click on the New Example button. Enter some text or copy the text from the clipboard. Click on the Save button to store the text. To delete an example text: Navigate to the example text to be deleted. Click on the Delete button. To modify an existing example text: Navigate to the text to be edited. Edit the text. Click on the Save button to store the modified text. Figure 11-12 Activating the Classification Toolbar in the Administration Client Another way to assign example texts to classes is the classification toolbar in the document viewer. Globalbrain Page 117 of 181

Administration Guide 11 Classification - To activate the toolbar in the Globalbrain Administration Client, select Document Viewer Toolbars / Classification from the Settings menu. - To activate the toolbar in the Globalbrain Web Client, select Classification from the Commands menu at the upper right corner of the window. Once the toolbar is activated it is displayed whenever a document is displayed in the Administration Client or the Web Client. Figure 11-13 Classification Toolbar in the Administration Client Figure 11-14 Classification Toolbar in the Web Client To assign the currently displayed document as an example to a class, select the view repository and the view in the toolbar. If the view has already been learned, the name of the best class will be displayed along with its relevance. To assign the current document as example text to this or any other class, click on the field that contains the class a list with available classes pops up. Click on the class you want to assign the document to. If the currently selected view contains the minimum number of example documents (at least two documents per class and at least two classes), the view is relearned automatically and the toolbar shows the new relevance. If classification tasks exist for the selected view, you will be asked if the tasks should be suspended. If you accept this, the tasks will be suspended for 15 minutes. This prevents the auto classification service from kicking in and starting to reclassify all documents while you are still adding examples. You can unsuspend the tasks at any time manually (see 11.4.2). Note: You should have at least two example documents assigned to each class. The more example texts you have added, the more precise the classification will be. 11.3.1.5 Learning the View Before texts can be classified against a view, the view needs to be learned. "Learning" is a process in which the classification engine creates an internal representation of the classes and example texts as it is required to perform the classification. To learn a view: Globalbrain Page 118 of 181

Administration Guide 11 Classification Select the Classification module. Select the view you want to learn. Click on the Learn button in the toolbar of the details panel. 11.3.2 Exporting and Importing Views You can export and import views for backup purposes or to exchange views between different Globalbrain installations. 11.3.2.1 Exporting a View To export a View Select Classification module. Select the view to export. Click on the Export View button. Select the export location. 11.3.2.2 Importing a View Figure 11-15 Importing a View To import a View: Select the Classification module. Right click Views and select Import View from the menu. Select the required view file. Click Open. To merge two views or two versions of a view, you can also import classes and example texts to an existing view. In this case, only classes and texts that do not exist are added to the view. To do this: Select the Classification module. Select the relevant view. Click on the Import Learn Set button. Click Open. 11.4 Using the Autoclassification Service 11.4.1 Creating and Deleting Classification Tasks For the documents of a document repository to be classified against a view automatically, a classification task must be created that tells the autoclassification service how to do this. Globalbrain Page 119 of 181

Administration Guide 11 Classification Figure 11-16 Adding a Classification Task (Administration Client) To create a new classification task in the Globalbrain Administration Client: Select the Classification module. Expand the node for the required view. Right click on Classification Tasks. Select New Classification Task. Edit the properties as described in the following section. To delete an existing task: Right click on the task in the Classification Tasks node. Select Delete. Figure 11-17 Adding a Classification Task (Web Client) To create a new classification task in the Globalbrain Web Client: Select the Classification module. Select the appropriate view repository from the drop down list. Select the appropriate view. Select the Tasks tab. Click on the New Classification Task button. Edit the properties as described in the following section. To delete an existing task: Select the task from the Tasks list. Click on the Delete button. Note: The Classification Tasks node in the Globalbrain Administration Client and the Tasks tab in the Globalbrain Web Client are only available if the autoclassification service has been configured for the selected view repository. Globalbrain Page 120 of 181

Administration Guide 11 Classification 11.4.2 Properties of a Classification Task Figure 11-18 Classification Task Properties A classification task is configured in the context of a view and has the following properties: - Source is the document repository that contains the documents that are to be classified. Only one repository can be selected here. If you want to classify the documents of several document repositories against a view, create multiple classification tasks for the view. - Max. Text Age optionally configures a maximum age for documents that can be classified. If you want to classify only documents that have been created or modified in the last week, enter 7 as the value. If you leave this field empty, all documents will be classified. - Max. Results / Class is the maximum number of classification results that are stored for a class of a view. This setting prevents the database from being flooded with results if a large number of documents is assigned to a class due to a poor quality learnset. - Keep best results / newest results configures which results are kept when the maximum number of results has been reached. - Class Attributes: This section will be available as soon as you have selected a writeable document repository from the Source dropdown list. Enter the name of the document attribute under which Globalbrain will store the classes a document was assigned to. Leave this field empty if you don t want Globalbrain to add the class attribute. Figure 11-19 Suspended Classification Task If a classification task has been suspended while adding example documents to the view via the classification toolbar (see 11.3.1.4), a message is displayed in the General section. You can unsuspend the task manually by clicking on the Reset button beside the message. Globalbrain Page 121 of 181

Administration Guide 11 Classification 11.4.3 Browsing Classification Results Figure 11-20 Classification Results (Globalbrain Administration Client) For normal users, classification results are available via the Classification section of the Globalbrain Web Client. Please consult the Globalbrain User Guide for more details. Within the Globalbrain Administration Client, you can browse the classification results by selecting the Classification module and selecting the view repository. The following information is displayed: - the document repository that contains the document. - the view and class to which the document was assigned. - the document with its title. - the relevance of this document for the class. - the last modification date of the document. You can filter the results by: - document repository. - view. - class (only if a view is selected). - last modification date of the document. To change the sort order, click on the header of the column you wish to use for sorting. Click again to toggle between ascending and descending order. Double click on an entry to open the document in the document viewer. Note: You may be wondering why the Globalbrain Administration Client displays the number of results that match the filter while the Globalbrain Web Client does not. This is for performance reasons: only the results that refer to documents that the user has the read permission for are displayed and counted. Counting the matching results can be a very expensive operation: this is the case when security settings on the level of document groups are used. We assume that the Globalbrain Administration Client is Globalbrain Page 122 of 181

Administration Guide 11 Classification mostly used by administrators who have all permissions so in this case there are no performance risks. 11.5 Permissions on View Repositories and Views Permissions can be granted at the view repositories level and optionally also at the single view level. To edit the permissions of a view repository in the Globalbrain Administration Client: Select the Configuration module. Select the view repository in the View Repositories node. Click on the Edit Permissions button in the detail panel toolbar. To edit the permissions in the Globalbrain Web Client: Select the Classification module Select the view repository from the tabs at the bottom Click on Edit Permissions button in the Views toolbar The following permissions are available: - Full Control grants all permissions. This includes the permission to view all the views of this view repository, independent of the permissions that are granted at the single view level. - Read Views allows views to be read from this repository, as long as they do not have their own protection or the view ACL allows it. This is a basic permission. If neither Read Views nor Full Control is granted to a user, the view repository is not visible to them. - Write Views allows new views to be created, and existing views with no specific protection to be edited or deleted. - Write Classification Tasks allows classification tasks to be created, edited and deleted. This is only relevant if the autoclassification service is configured for this view repository. To edit the permissions of a single view: Select the view in the Classification module of the Globalbrain Administration Client or the Globalbrain Web Client. Click on the Edit Permissions button in the detail panel toolbar. The following permissions are available: - Read View grants the permission to read the view. - Write View grants the permission to edit or delete the view. An example: user usr1 and user usr2 both have the Read Views permission on the view repository. User usr3 has the Full Control permission. View A has no security settings of its own. All three users can see the view. View B has its own security settings; only u2 has the Read View permission. usr1 cannot see view B. usr2 can see the view because the permission was explicitly granted to them. usr3 can see the view because of their Full Control permission. 11.6 Monitoring of Classification Servers Note: The following actions require the System Management permission on the security domain. See section 6.4. Globalbrain Page 123 of 181

Administration Guide 11 Classification Figure 11-21 Monitoring of Classification Servers and the Autoclassification Service The Dashboard panel in the Monitoring module displays all view repositories along with their associated classification servers and autoclassification services. Figure 11-22 Starting and Stopping the Autoclassification Service The autoclassification service needs to be started and stopped manually. To do this, click on the Autoclassification Service entry. This opens a box with detail information and action buttons. Click on the Start button to start the service. It is disabled if the service is already running. Click on the Stop button to stop the service. It is only enabled if the service is running. Globalbrain Page 124 of 181

Administration Guide 12 Subscriptions 12 Subscriptions 12.1 Subscriptions in Detail Subscriptions are a way of notifying users about new documents they might be interested in. Users can subscribe to: - bookmarked queries the latest documents that match the subscribed query will be listed. - views the latest documents that have been classified into any class in the subscribed view will be listed. - single classes of views the latest documents that have been classified into the subscribed class will be listed. There are different ways channels that a user can be notified: - The user can visit the Subscriptions page of the web application and view recently arrived, matching documents. This channel is configured automatically when the Globalbrain Web Client starts the first time. - Globalbrain can periodically for example, once a day send an e-mail with the recently arrived, matching documents. This channel needs to be configured explicitly. Future versions will support additional channels. Generally, Globalbrain distinguishes between push channels and poll channels: - Push channels are channels that actively deliver the information to the user. The Email channel is an example. - Poll channels are channels that only provide information on request. The Subscriptions page of the web application is an example: the user only sees the information if they visit the page. If push channels are used, a special background service - the push manager - is required. It is responsible for triggering the delivery of notifications at predefined times. 12.2 Configuring Subscriptions Note: All of the actions described below require the Read Configuration and Write Configuration permissions on the security domain. See section 6.4. 12.2.1 Subscription Manager Figure 12-1 Subscription Manager Subscriptions need to be enabled for each security domain. To do this, a subscription manager needs to be configured. Generally, this is only possible on Globalbrain installations that use a relational database. The configuration of a subscription manager consists of two parts: Globalbrain Page 125 of 181

Administration Guide 12 Subscriptions - Subscription Repository: Select the data source where the subscriptions will be stored. Once the configuration has been saved, this value cannot be changed. - Push Service: Select a server instance to host the push manager. This is the server instance from which, for example, notification e-mails are sent. You can leave this empty if you don t plan to use push channels. To add a subscription manager configuration: Select the Configuration module. Right click on Subscriptions. Select Configure Subscription Manager. Edit the properties as described above. Once the subscription manager has been configured, the option in the context menu is disabled. Select Subscription Manager in the Subscriptions node to edit an existing subscription manager configuration. 12.2.2 Channels Channels are shared by all security domains of a Globalbrain installation. To add or edit channels, you must be logged in to a security domain that has the system maintainer privilege. 12.2.2.1 Creating a Poll Channel Note: When working only with the standard clients delivered with Globalbrain there is no need for you to touch the configuration of poll channels. The Globalbrain Web Client automatically creates a poll channel named "Globalbrain Web Application" the first time it s started. It is strongly recommended that you do not change this configuration. The configuration of additional poll channels may be relevant to you if you plan to access subscriptions with custom clients. Only one type of poll channel can be configured via the Globalbrain Administration Client: a simple poll channel only has a name property. To add a simple poll channel: Select the Configuration module. Right click on Subscriptions. Select Add Channel / Simple Poll Channel. Enter the name. Save the changes. 12.2.2.2 Creating an Email Channel Figure 12-2 SMTP Account Configuration Before you can configure an Email channel, you need to configure an SMTP account. The SMTP account configuration tells Globalbrain which mail server it can use to send mails. This server must support the SMTP protocol. Globalbrain Page 126 of 181

Administration Guide 12 Subscriptions An SMTP account is configured with the following properties: - Host is the name or IP address of the SMTP server. - Port is the port number used for SMTP. - User is the user name Globalbrain can use to log in to the server. - Password is the password that Globalbrain can use to log in to the server. In the current Globalbrain version, this is only used for the email channel of the subscriptions service. Future Globalbrain versions will use the SMTP account for other purposes, so the configuration is located under the Miscellaneous node of the Configuration module. To configure an SMTP account: Select the Configuration module. Right click on the Miscellaneous node. Select Add SMTP Account. Edit the properties as described above. Once the account has been configured, the option in the context menu of Miscellaneous will be disabled. To edit the account settings, click on SMTP in the Miscellaneous node. Figure 12-3 Email Channel Configuration An Email channel has the following properties: - Name is a human readable name of the channel. This configures how the channel is displayed in the subscription dialog. - Send Mails at is the time at which mails are sent. - Sender is the e-mail address that is used as the sender for notification mails. Enter either an e-mail address, or a sender name and the e-mail address in angle brackets ("Globalbrain <gb@ourcompany.com>"). - Subject is a template for the subject string of the mail. - Date Format configures how dates are displayed. For a description of the format, please check the documentation for the Java SimpleDateFormat class. - The text area contains the message template; this determines what the notification e-mails look like. The template must produce an HTML document. Globalbrain Page 127 of 181
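The following is a minimal sketch of what such a message template might look like. It is not the default template shipped with Globalbrain; it only illustrates an HTML template that loops over the Velocity variables documented in the next section. The property-style accessors ($group.name, $item.title, and so on) are assumptions for this sketch; the recommended approach is to adapt the shipped default template, which already uses the correct accessors. For the Date Format field, a pattern such as dd.MM.yyyy HH:mm (see the Java SimpleDateFormat documentation) could be used.

  <html>
    <body>
      <p>Hello $name,</p>
      <p>these documents arrived for your subscriptions ($date):</p>
      #foreach($group in $messageitemgroups)
        <h3>$group.name</h3>
        <ul>
        #foreach($item in $group.entries)
          <!-- title linked to the document, followed by relevance and a text excerpt -->
          <li><a href="$item.link">$item.title</a> ($item.relevanceinpercent%) - $item.text</li>
        #end
        </ul>
      #end
    </body>
  </html>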

Administration Guide 12 Subscriptions It is recommended that you adapt the default template for your needs rather than creating a new template. The templates for subject and the message text use the Velocity template language. In both templates, the following variables can be used: - $date will be replaced by the current date. - $name will be replaced by the real name the user entered in their subscription settings. For the message text, information about the matching documents is available. The information about a single matching document is called a message item. Message items are organized in message item groups a group represents a subscribed query, view or class. - $messageitemgroups contains a list of the available groups. - For each group, $name contains the group name. $entries provides a list of message items for the group. - For each message item, the following fields are available: $subscribable is the name of the subscribed query, view or class. $title is the title of the document. $url is the URL of the document as it is stored in Globalbrain. $link is a representation of the URL that can be used as link. For example, nfile URLs are provided as a UNC path. $text is a small text excerpt of the document. $date is the last modification date of the document. $relevanceinpercent is the relevance of the document as value between 0 and 100. To create a new Email channel: Select the Configuration module. Right click on Subscriptions. Select Add Channel / Email Channel. Enter the properties as described above. 12.2.2.3 Deleting a Channel To delete a channel: Select the Configuration module. Right click on the channel in the Subscription node Select Delete. Note: When a channel gets deleted, it will be removed from all subscriptions. If this channel was the only channel that was chosen for a subscription, the subscription will be deleted. 12.3 Permissions on the Subscription Manager The permissions granted to the subscription manager determine which users are allowed to use the subscriptions service. To edit the permissions in the Globalbrain Administration Client: Select the Configuration module. Select Subscription Manager from the Subscriptions & Search Lists node. Click on the Edit Permissions button in the detail panel toolbar. To edit the permissions in the Globalbrain Web Client: Select the User Data module. Select Subscriptions. Click on the Edit Permissions button in the toolbar. Globalbrain Page 128 of 181

Administration Guide 12 Subscriptions The following permissions are available: - Full Control grants all permissions including reading, editing, and deleting other users subscriptions. - Subscribe allows subscriptions to be created. Users that don t have the Subscribe or the Full Control permission are not allowed to use the subscription service at all. Please refer to section 6.4.2 The Permissions Dialog for instructions on how to assign permissions in the permissions dialog. 12.4 Monitoring the Push Manager Note: The following actions require the System Management permission. See section 6.4. Figure 12-4 Push Manager Status The push manager the service that is responsible for starting the delivery of notification messages by e-mail needs to be started manually. You can check its current status via the Dashboard panel of the Monitoring module; it is displayed in the Subscriptions & Search Lists box. Figure 12-5 Starting and Stopping the Push Manager Click on the entry to open a box with detail information and action buttons. Click on the Start button to start the push manager. It is disabled if the push manager is already running. Click on the is running. Stop button to stop the push manager. It is only enabled if the push manager 12.5 Viewing and Deleting Subscriptions Subscriptions are created with the Globalbrain Web Client. Users can manage their own subscriptions from the Subscriptions page. In addition to this, administrators that have the Full Control permission on the subscription manager are able to view and delete other users subscriptions in the Globalbrain Administration Client. To view subscriptions: Select the User Data module. Select the Subscriptions tab. Globalbrain Page 129 of 181

Administration Guide 12 Subscriptions Figure 12-6 Subscriptions view in Administration Client You can either display all subscriptions or select specific criteria from the drop down list: - Source is the source of the subscriptions these are the available query repositories and view repositories. - Subscribed to is the subscribed query, view or class a subscription has been created for. - Channel refers to the channels that are used for the subscriptions. You can select any subscription and remove it by clicking on the Delete button. Globalbrain Page 130 of 181

Administration Guide 13 Search Lists 13 Search Lists 13.1 Search List Processing in Detail 13.1.1 Introduction Globalbrain can generate queries from an external list or database (a "search list") and search with these queries on selected document repositories. Possible use cases are: - Query an archive of historical invoices against a list of prohibited vendors, for example from the Master Vendor File, with each entry (vendor name, address, and account number) acting as one query. - Query an archive of historical invoices against a list of items that company employees are restricted from purchasing. - Query incoming résumés against a list containing known criminal aliases in order to prevent your Human Resources department from looking at a candidate who has been convicted of a felony. - Query incoming résumés against a list containing on the one hand positive attributes that your Human Resources department is directly looking for in a candidate for a particular job and, on the other hand, attributes that disqualify candidates 13.1.2 General Workflow The search list processing service operates in the background and processes search list tasks either at scheduled times or based on events like the completion of a crawl task. When processing a search list task, the service reads data records from the configured source (CSV file, database) and generates zero, one or multiple queries for each record based on the configured templates and filtering rules runs the generated search queries against the selected document repositories stores the search results in a database If a search list processing task is started multiple times, the results of all executions are kept in the database. 13.1.3 Search Lists vs. Subscriptions How does search list processing compare with subscriptions to queries? Subscriptions allow each user to find out which new documents have arrived that match their favorite queries. The focus is on presenting the results of single queries. Search lists give administrative users the ability to fire a large number of queries against document repositories in order to find out if there are any matches at all. The question here is not "Which new documents exist for my query", the question is "For which queries did Globalbrain find results?". For subscriptions, queries have to be entered manually and are stored and maintained within Globalbrain. For search lists, queries are generated on the fly based on an external list that can change every day. For subscriptions, results are always generated on the fly. There is no history. For search lists the results of every run of a task are stored. 13.2 Configuring a Search List Processing Service Search List Processing is only available on installations that use a relational database. Globalbrain Page 131 of 181

Administration Guide 13 Search Lists 13.2.1 Creating and Deleting a Search List Processing Service To configure a search list processing service: Select the Configuration module. Right-click on Subscriptions & Search Lists and select Configure Search List Processing Service. Configure the properties as described below. You can configure only one search list processing service for each security domain. Once it has been configured, it appears as Search List Processing Service in the Subscriptions & Search Lists node in the Configuration module. 13.2.2 Properties Figure 13-1Configuration of Search List Processing Service In the General section the following properties are available: - Server Instance is the server instance on which the service is running - Check Tasks every minutes configures how often the service checks if tasks are due. In the Result Repository section, the following properties are available: - Data Source is the data source in which results are stored. 13.3 Permissions for Search List Processing The permissions granted to the search list processing service determine which users are allowed to view results and manage the tasks. To edit the permissions in the Globalbrain Administration Client: Select the Configuration module. Select Search List Processing Service from the Subscriptions & Search Lists node. Click on the Edit Permissions button in the detail panel toolbar. To edit permissions in the admin section of the web client: Select the Search Lists module Select the Tasks tab Click on the Edit Permissions button in the toolbar at the bottom of the table The following permissions are available: - Task Management grants the permission to create, edit and delete search list processing tasks - Result Access grants the permission to view search list results. Please refer to section 6.4.2 The Permissions Dialog for instructions on how to assign permissions in the permissions dialog. Globalbrain Page 132 of 181

Administration Guide 13 Search Lists Note: users with the Result Access permission are allowed to view the complete result list; no filtering on permissions on the documents is done. If a user does not have access to view a document for which a search list result exists they will see the result with the hot spot in the list but they cannot open the full document. 13.4 Manage Search List Processing Tasks 13.4.1 Properties Figure 13-2 Search List Processing Task with CSV File as Source Globalbrain Page 133 of 181

Administration Guide 13 Search Lists Figure 13-3 Search task properties Web Client 13.4.1.1 General The General section contains general settings for the search list processing task: - Name is the name of the task under which it is listed in tables and other visual components. - The synchronize search indexes on start checkbox controls whether or not the search list processing service will synchronize all the search indexes that the task uses before it starts submitting queries; this ensures that the search indexes are up to date before the querying starts. 13.4.1.2 Query Source The Query Source section configures - how the list from which queries are generated is accessed - how queries are created based on the data records that are found in the list The upper part of the section depends on the type of the query source (CSV, database). The lower part with the query templates and filter expressions is similar for both types of query source. Globalbrain Page 134 of 181

Administration Guide 13 Search Lists CSV Search list processing tasks of type "CSV" can use any CSV file as their source. Such files can easily be created with spreadsheet applications like Microsoft Excel. To configure access to a CSV file, provide the following information: File name is the full path to the CSV file. This can either be a local file on the server instance that hosts the search list processing service, or a file in a shared directory on the Windows network. Please note, that you also can drag and drop a CSV file directly into the File name field. Delimiter is the delimiter that is used to separate fields from each other this is a comma, a semicolon or a tab. The CSV contains header checkbox specifies whether or not the CSV file contains a header line with labels for the fields. Each line of the CSV is considered as one data record. Each field can be referenced in query templates and filter expressions (see below). Depending on whether or not the CSV contains a header with field names, the fields are referenced either via their name as it appears in the header, but with all non-alphanumeric characters substituted by an underscore (Last_Name, Address, ) or via col<number> if the file has no header, with the number of the column as <number> (col1, col2, col3, ) Database Globalbrain can also use database tables or views as a query source. Any database for which a JDBC driver exists can act as source. To connect to a database, provide the following information: JDBC Driver is the class name of the JDBC driver that is used to connect to the database. If you select MS SQL Server or Oracle from the Database Type drop down box, an appropriate value is filled in. Consult the JDBC driver documentation when using other databases. URL is the URL the JDBC driver needs to connect to the database. If you select MS SQL Server or Oracle from the Database Type drop down box, a template is filled in you need to change the hostname and port number. Consult the JDBC driver documentation when using other databases. User name is the user name that is used to connect to the database Password is the password that is used to connect to the database SQL Query is the SQL statement that selects the data. If you are connecting to a database other than MS SQL Server or Oracle, you need to provide a JDBC driver for this database: Copy the.jar file to the lib directory of the server instance that hosts the search list processing service. Invoke java jar RebuildJars.jar from the command line in the tools directory of the server instance. Restart the server instance. Each row that is selected by the SQL query is treated as one data record, each column is a field that can be referenced in query templates and filter expressions. Globalbrain Page 135 of 181
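To make the field-referencing rules for CSV sources more concrete, here is a small, purely hypothetical vendor list using a semicolon as delimiter and containing a header line (the column names and values are invented for illustration); a database source behaves the same way, with each selected row as a record and each column as a field:

  Vendor Name;Address;Account Number
  Acme Tools Ltd.;12 Main Street, Springfield;4711
  Smith & Sons;8 Harbour Road, Dover;4712

Because the file contains a header, the fields would be referenced in query templates as $Vendor_Name, $Address and $Account_Number (non-alphanumeric characters such as the space are replaced by an underscore), and in filter expressions as Vendor_Name, Address and Account_Number (without the $ prefix). If the same file had no header line, the fields would instead be addressed as col1, col2 and col3.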

Administration Guide 13 Search Lists
Figure 13-4 Database as Query Source
Query Templates
A query template specifies how a data record from a CSV file or a database is transformed into a query. You need to configure at least one template. To generate multiple queries for each record, you can configure additional templates. If you create a task, the Query Templates field will contain a default list of query properties which can be adjusted to suit your needs:
Figure 13-5 Query properties defaults
To add an additional template, click on the + tab. To delete a template, delete its content. Queries are configured as a list of properties, using one key-value pair per line. Fields of the records can be referenced as variables of the type $fieldname. The following properties are supported:
- text: the query text
- minrelevance: the minimum relevance, with values between 0 and 100
- maxresultnumber: the maximum number of results to return
- lastmodificationdate: given as a value in hours; this limits the document modification date to between the execution time and x hours before it
- hotspotsenabled: whether hotspots are enabled - true or false
- textmarksenabled: whether text marks are enabled - true or false
- hottextlength: the number of characters for the hot text
- property.name: a custom query property
The text property is mandatory; queries that don't have any text are not executed.
Globalbrain Page 136 of 181
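To give a feel for how these properties fit together, here is a minimal sketch of a template (the field name $title is hypothetical and the values are only examples):

text = $title
minrelevance = 60
maxresultnumber = 25
lastmodificationdate = 168
hotspotsenabled = true

This would search for the content of the title field, return at most 25 results with a relevance of at least 60, and only consider documents modified within the 168 hours (seven days) before the execution time.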

Administration Guide 13 Search Lists
You can add any number of custom properties to the query using property.name. These properties are not evaluated by Globalbrain but provided as part of the result information when displaying search results. You can use this to add information that helps identify which data record produced the query. As an example, you might want to store an ID of a record as a custom property.
Besides the fields from the records, rownumber is supported as an additional field. It contains the number of the record that is currently being processed, starting at 1 for the first record that is processed.
Let's take a simple case as an example where three fields exist in a record: id, name1 and name2. name1 and name2 should go into the query while the id is stored as a custom property. The template would look like this:
text = $name1, $name2
minrelevance = 75
property.id = $id
With "1", "Miller" and "John" as values, this would be evaluated to
text = Miller, John
minrelevance = 75
property.id = 1
The value for minrelevance does not start with a $ and is thus not treated as a variable. 75 is a constant value that is used independently of any of the record's fields.
The template is evaluated with the Apache Velocity template engine, so you can do much more than just referencing variables. You can, for example, use #if directives for adding properties only under specific conditions. Please consult the Velocity User Guide for more information.
Filter Expressions
Additionally, you can configure filter expressions to exclude records from being processed. You can configure multiple expressions, with one expression per line. A record is only processed if it passes all the filter expressions. Filter expressions have to be given in OGNL. Fields are referenced simply by their name (no $ as prefix). Common operators like <, > and = are supported.
Let's assume we know that rows 1000 to 2000 of our search list contain data that cannot be used for searching. Because a record is only processed if the expression is true for it, this can be expressed like this using the artificial rownumber field:
rownumber < 1000 || rownumber > 2000
To exclude records in which name1 does not have at least 4 characters:
name1.length() >= 4
13.4.1.3 Document Repository
The Document Repository Selector section specifies on which document repositories the searches are performed. There are two options: you can either select a static list of document repositories, or you can specify a pattern that resolves the list of matching document repository names at the time the search list processing task is executed.
Static List of Document Repositories
Figure 13-6 Static List of Document Repositories
Globalbrain Page 137 of 181

Administration Guide 13 Search Lists
To let the search list processing task work with a static list of document repositories: Select Static List from the drop down box. Select one or multiple document repositories from the list.
Dynamic Selection with Name Pattern
Figure 13-7 Dynamic List of Document Repositories
To let the search list processing task resolve the list of document repositories dynamically at runtime based on a name pattern: Select Names matching from the drop down box. Enter a pattern for the document repository name. Within the name pattern, you can use
- * as a wildcard for any number of characters
- $date as a placeholder for the execution date of the task. By default, $date is replaced by the date in the format yyyy-MM-dd (e.g. 2011-09-30). To use another date format, append it in curly braces (e.g. $date{dd/MM/yyyy}). See the documentation of SimpleDateFormat for a description of date patterns.
13.4.1.4 Execution Trigger
The Execution Trigger section specifies when a search list processing task is executed automatically: tasks can either be executed at fixed, scheduled times or when a crawl task has been completed. You can also choose not to use an execution trigger. In this case you need to start the execution manually as described in section 13.4.4.
Fixed Scheduling Time
Figure 13-8 Task Execution at Fixed Time
To execute a search list processing task at a fixed time: Select Fixed scheduling time from the drop down box. Select when the task should be executed. Keep in mind that tasks may start later than scheduled if you have configured the search list processing service to check for due tasks less often than every minute.
On Crawl Task Completion
Figure 13-9 Task Execution on Crawl Task Completion
To execute a search list processing task as soon as a crawl task is completed: Select On crawl task completion from the drop down box. Enter the name of a crawl task or a name pattern; some illustrative patterns are shown below.
Globalbrain Page 138 of 181
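For illustration only (the names are hypothetical), patterns like the following can be used here and in the document repository selection described above:
- DailyNews_* matches DailyNews_Crawl, DailyNews_Archive and any other name that starts with DailyNews_
- DailyNews_$date matches DailyNews_2011-09-30 when the task is executed on 30 September 2011
- DailyNews_$date{dd/MM/yyyy} matches DailyNews_30/09/2011 on that same date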

Administration Guide 13 Search Lists
When configuring a name pattern, the syntax is the same as for document repository selection: * can be used as a wildcard and $date will be replaced by the execution date. A crawl task is considered to be complete if:
- there are no more jobs that are open or in progress and
- at least one job has been processed since the last execution date of the task
If a name pattern was given that has multiple matching crawl tasks, all crawl tasks must be completed before the search list processing task starts.
13.4.2 Viewing and Editing Tasks
Figure 13-10 Search List Tasks
To view the list of existing search list processing tasks: Select the Search Lists module in the Globalbrain Administration Client or the admin section of the web client. Select the Tasks tab. Each task is displayed with
- the type of the query source (CSV, Database)
- the name of the task
- the current status; possible values are Open, In Progress, Completed and Failed
- the time the task was last processed (if any)
Double click on a task to edit it; this opens up the property sheet. The following action buttons are displayed below the table:
- New Task: A popup menu opens to select the query source type (CSV, DB, ...) and then opens the search list processing task editor (see previous section).
- Delete Task: Deletes the selected tasks.
- Process immediately: Forces an immediate execution of the selected tasks (see section 13.4.4).
13.4.3 Creating and Removing Tasks
To create a new search list processing task: Select the Search Lists module. Select the Tasks tab.
Globalbrain Page 139 of 181

Administration Guide 13 Search Lists
Click on the New Task button and select either "CSV" or "Database" as the query source from the menu that pops up; a dialog is displayed where you can configure the details. Please see section 13.4.1 Properties for detailed information.
To delete search list processing tasks: Select one or multiple tasks from the table. Click on the Delete Task button and confirm the deletion in the dialog that pops up.
13.4.4 Force an Immediate Execution of Tasks
Figure 13-11 Processing a Task Immediately
You can force an immediate execution of a search list processing task as long as the task is not already in progress. To do this: Select the Search Lists module. Select the Tasks tab. Click on the Process immediately button. This opens up a dialog where you can configure the execution time for the run. This time
- is used as the execution time when listing results
- is used as the basis for the lastmodificationdate property in query templates; no documents older than that date will be selected when a last modification date restriction is used
- is used when resolving document repositories or crawl tasks via name patterns that use the $date placeholder.
After you have selected the execution time, click on Run to start the processing of the task. If you configure an execution time for which results already exist, the results of the previous run are deleted, allowing you to repeat a previous run if something has failed.
Globalbrain Page 140 of 181
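As a worked example (the date and settings are purely illustrative): if you start a run with 2011-09-30 12:00 as the execution time and a query template sets lastmodificationdate = 24, only documents modified between 2011-09-29 12:00 and the execution time are considered for that run, and a repository or crawl task pattern such as News_$date would resolve to News_2011-09-30 for the same run.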

Administration Guide 13 Search Lists 13.5 Viewing Search List Results 13.5.1 Result List Figure 13-12 Search List Results in Web Client To view search results that have been produced by search list processing tasks: Select the Search Lists module in the Globalbrain Administration Client or the admin section of the web client. Select the Search Results tab. The following filter criteria are available: - Task allows you to display only the results of a single search list processing task - Execution Time allows you to select an execution time for the currently selected task. This list is only filled in if a task is selected. - Query contains the text of a query if a query filter has been set. To limit the results to the results of a single query: select a result from the table click on the Filter button under the results table and select Show only hits for selected query from the menu that pops up - Document Repository allows you to limit the results to documents of a single document repository. - File contains the file name of the document if a document filter has been set. To limit the results to the results in a single document: select a result from the table click on the Filter button below the result table and select Show only hits for selected document from the menu that pops up - Min. Relevance allows you to select a minimum relevance for the displayed results. The table displays the matching results with the following information: - Task is the task which produced the results - Execution Time is the execution time of the task - Query is the text of the query that produced the result Globalbrain Page 141 of 181

Administration Guide 13 Search Lists
- Document Repository is the document repository that contains the result document.
- File is the file name of the result document.
- Relevance is the score that the query produced for this document.
A section under the table shows the most relevant information for the currently selected result:
- URL is the full URL of the document
- Query is the full query text
- Hot Text is an excerpt of the document text that contains the hit
13.5.2 Result Details
Figure 13-13 Query Information in the Document Viewer of the Web Client
Double-click on a result in the result list to open a document viewer that displays the result details. In addition to the normal document viewer information, the header section contains
- the query text that produced the hit
- any custom query properties that have been added in the query template
13.5.3 Exporting results
Click on the Export button to export the search results. You can export the results to a PDF or an Excel file.
13.5.4 Deleting Results
Click on the Delete button to delete either selected results or all matching results. Confirm the deletion in the dialog that pops up.
13.6 Monitoring the Search List Processing Service
In the Subscriptions & Search Lists section of the dashboard panel (Globalbrain administrative client and web client), the status of the service is displayed (stopped or running). By clicking on the service, a properties window pops up, providing toolbar action buttons to start or stop the service.
Figure 13-14 Properties window for Search List Processing Service on the Dashboard
Globalbrain Page 142 of 181

Administration Guide 14 Monitoring 14 Monitoring 14.1 Dashboard Figure 14-1 Dashboard in the Globalbrain Administration Client The Dashboard panel is available in the Monitoring module of the Globalbrain Administration Client and the Globalbrain Web Client. The dashboard gives you a quick overview of the "health" of your system by displaying all the relevant services along with their status. If a service is not running or is unavailable, its status is displayed in red. The dashboard has five sections: - Document Repositories lists the document repositories with their search indexes and the status of those indexes (for details see section 7, Document Repositories) - Server Instances lists the server instances with their status (see section 4.3, Monitoring a Server Instance). - Crawlers lists the crawlers with their agents, and the status of the crawler and agents (see 10.5, Monitoring the Crawlers). - Classification lists the view repositories with the associated classification servers and the autoclassification services, and the status of the servers and the autoclassification services (see 11.6, Monitoring). - Subscriptions & Search Lists show the push manager of the subscription manager with its status, and the status of Search List Processing Service (see section 12.4, Monitoring the Push Manager and section 13.6, Monitoring the Search List Processing Service) Globalbrain Page 143 of 181

Administration Guide 14 Monitoring Clicking on an entry opens up a box that provides more detailed information about the configuration and if any actions are available a toolbar with buttons for the available actions. Figure 14-2 Dashboard in the Globalbrain Web Client Note: You need the "Server Monitoring" and the "Read Configuration" permissions to be able to use the dashboard. The availability of the status information and actions is dependent on other permissions. For example, you need the "Crawler Monitoring" permission on a crawler to be able to view its status. See section 6.4. 14.2 Sessions When a user logs in, Globalbrain creates a new user session for them. A session is identified by a unique ID which is submitted to the server with each request that the user makes. Based on this session ID, the server can identify the user and the security domain, and check if they have the necessary permissions to perform the requested action. A session is destroyed when the user logs out. A session expires if the user has not made any requests to the server within a predefined time the default value is 30 minutes. This timeout can be changed when editing the Session Manager configuration which is the component that is responsible for managing the user sessions (see section 14.2.1 Configuring the Session Manager). Note: Unlike the web application, the Globalbrain Administration Client periodically contacts the server to prevent the session from expiring. A session manager is always created during setup Globalbrain won t work without a session manager. Depending on whether Globalbrain stores its configuration data in a database or in a file, an appropriate session manager type is chosen: Either the session manager stores sessions in the data source that contains the configuration, or sessions are only held in memory on the server instance that hosts the configuration. 14.2.1 Configuring the Session Manager To edit the configuration of the session manager: Select the Configuration module. Select the Session Manager node from the miscellaneous section of the tree. Edit the properties as described below and save. Note: You need the Read Configuration and Write Configuration permissions to edit the session manager configuration. Globalbrain Page 144 of 181

Administration Guide 14 Monitoring
14.2.1.1 Storing Sessions in a Database
Figure 14-3 Session Manager Properties
- Data Source displays the data source in which the sessions are stored. This value cannot be changed.
- Session Timeout is the timeout for sessions in minutes.
- Period for Expiration Checks is the time in minutes between two checks for expired sessions.
14.2.1.2 Holding Sessions in Memory
Figure 14-4 Session Manager Properties
- Server Instance is the server instance that hosts the session manager. This value can be changed, but doing so will cause all active sessions to be lost. It may take up to a minute before a new login is possible.
- Session Timeout is the timeout for sessions in minutes.
- Period for Expiration Checks is the time in minutes between two checks for expired sessions.
14.2.2 Viewing Sessions
Figure 14-5 Monitoring Sessions in the Desktop Application
Globalbrain Page 145 of 181

Administration Guide 14 Monitoring
Figure 14-6 Monitoring Sessions in the Web Application
To view a list of the current sessions: Select the Monitoring module. Select the Sessions tab. You can either display a list of all sessions, or filter by security domain and user name. The table displays the following: the session id, the security domain, and the user name of the session owner, and information about when the session was last active as both a timestamp and as the elapsed time in minutes. Your own session is displayed with a red session id.
You can select sessions and remove them by clicking on the Delete button. This can be useful if the number of sessions is close to the licensed limit. You cannot remove your own session.
Note: You need to have the Session Monitoring permission to be able to view and delete sessions. You need to be logged in to a security domain with system maintainer privileges to view sessions from other security domains.
14.3 System Messages and Client Error Reports
14.3.1 System Messages and Client Error Reports in Detail
14.3.1.1 System Messages
Each of Globalbrain's server instances writes log files to a directory within the local file system; usually this is the log/ directory within the installation directory. The log files contain all kinds of information about what happened on the server instances, what errors occurred, etc. These log files are not available via the Administration Client and need to be picked up from the machines where the server instances are running.
In addition to the log files, Globalbrain can log important events and errors as system messages to the System Message Repository. This covers events like:
- starting or stopping a crawler (either manually or automatically).
- synchronizing or rebuilding a search index.
- any serious error that occurred when trying to instantiate a configured component.
System messages are categorized into one of three levels of severity (INFO, WARN and ERROR messages). System messages are assigned to a server module like Crawler, Search or Security. As well as the message itself, the host on which it was created and the timestamp are logged. System messages can be viewed within the Administration Client and give a quick overview of the important events within Globalbrain.
System messages are only available for Globalbrain installations that work with a relational database.
Globalbrain Page 146 of 181

Administration Guide 14 Monitoring
14.3.1.2 Client Error Reports
Figure 14-7 Reporting a client error
If a user of the Globalbrain web client experiences a system error, they have the option to report it to an administrator. If the user decides to report the error, a client error report with the error information and basic context information is recorded. Like system messages, the client error reports are stored in the system message repository and can be viewed by administrators with the Administration Client or the Globalbrain web client. Administrators need the System Management permission in a security domain with system maintainer privileges to view client error reports.
Additionally, there are two ways in which Globalbrain can actively inform you about new reports:
- Notification pop up at login. Clicking on the View button will redirect you to the Error Report tab (please see section 14.3.4 Viewing Client Error Reports).
- Notification via email. Please see the following section for the email notification configuration.
Figure 14-8 Notification pop up
14.3.2 Configuring a System Message Repository
Figure 14-9 System Message Repository Properties
If a system message repository was not created during the installation of Globalbrain, you can do this later: Select the Configuration module. Right click the miscellaneous node. Select Add System Message Repository from the context menu. Edit the properties as described below.
Globalbrain Page 147 of 181

Administration Guide 14 Monitoring The following properties are available: - Data Source is the data source where the system messages are to be stored. This value can only be changed when configuring a new system message repository. - Max. Age for Messages is the maximum age for system messages in days. Older messages will be deleted when a purge is performed on the repository (see section 0). - Mail Recipient for Error Reports: Enter a target email address where client error reports should be sent. You can leave this empty if you do not wish to receive client error reports by mail - Pre-requisite: configuration of the SMTP server. Please see section 12.2.2.2 Creating an Email Channel. 14.3.3 Viewing System Messages Figure 14-10System Messages in the Desktop Application Globalbrain Page 148 of 181

Administration Guide 14 Monitoring Figure 14-11 System Messages in the Web Application To view system messages: Select the Monitoring module. Select the System Messages tab. You can either view all messages or filter on one or more of the following properties: - Level is the minimum level for displayed messages; this means messages with a higher level are also selected. So when the INFO level is selected, WARN and ERROR messages are selected too. - Host is the host on which the message was logged. - Module is the module the message belongs to. This allows, for instance, all messages that are related to search indexes to be selected. - Creation Date is when the message was logged. You can specify a minimum and / or a maximum date. The table displays the matching system messages. Double clicking on a message opens up a dialog for the selected message. This dialog may contain additional information for instance, about an exception that occurred, if this was logged by the server. Figure 14-12 System Message Details You can delete system message that have reached the maximum age by clicking on the Purge button below the table. Note: You need to be logged in to a security domain with system maintainer privileges and you need to have the System Monitoring permission to be able to view system messages. Globalbrain Page 149 of 181

Administration Guide 14 Monitoring
14.3.4 Viewing Client Error Reports
Figure 14-13 Client Error Reports in the Desktop Application
To view error messages: Select the Monitoring module. Select the Client Error Report tab. You can either view all reports or filter on one or more of the following properties:
- Status selects either all messages, unread messages or read messages. A message is considered read as soon as one administrator has opened it in the message details dialog or if it was explicitly marked as read using the Mark read button.
- User is the login name of the user that reported the error.
- Error Message is the message that was displayed to the user. You can use * as a wildcard here.
- Time is the time at which the error was reported.
The table displays unread messages in a bold font and read messages in a regular font. The table additionally contains the following information:
- Client is the client application via which the report was created. Unless you are using a custom client that writes its own reports, this will always be GBWebClient.
- Action is the action that has caused the error.
Globalbrain Page 150 of 181

Administration Guide 14 Monitoring Figure 14-14 Client Error Report Details Double clicking on a message opens up a dialog with detail information. If multiple messages are displayed in the table you can navigate to the next or previous message using the arrow buttons. You can select any number of messages in the table and mark them as read or unread in a single step by using the Mark read or Mark unread buttons below the table. 14.4 Audit Messages 14.4.1 Audit Messages in Detail Globalbrain s audit service allows you to track who accessed or changed something. The audit service can help you to answer questions like: - Who logged in? When or how often did logins fail? - Who changed the configuration? And when? - What queries were made? What were the most frequent queries? Depending on the configuration, Globalbrain logs audit messages into an audit repository. An audit message contains information about who called what, with which parameters and, in the case of an error, what error code was returned. Each message refers to an audit message type. An audit message type is linked to a function that can be called on one of the services of the Globalbrain server this may be login on the session service or search on the search service. The message type determines if these calls are logged at all and, if so, which values from the parameters are stored in the audit message. For instance, the message type for the search method configures that the query text and the attribute query of the query object are logged as part of the audit messages for search requests. Audit message types are organized in groups. Each message type can belong to multiple groups. For example, the group Administration contains all administrative actions while the group Crawler contains all actions related to the crawler. The action Configure crawl task would belong to both the Administration group as well as the Crawler group. A preconfigured list of audit message types is imported into the Globalbrain server when the audit repository is created. If Globalbrain is extended by custom services, additional audit message types can be added via the API. The audit service also allows you to request statistics on the parameters that were used when calling the server including the most frequent values that were used. As an example, Globalbrain Page 151 of 181

Administration Guide 14 Monitoring when requesting statistics on the first parameter of a search that is the query text you get a list of the query text that was most frequently used. The audit service is only available for Globalbrain installations that work with a relational database. 14.4.2 Configuring an Audit Repository Figure 14-15 Audit Repository Properties If you have not chosen to create an Audit Repository during the installation of Globalbrain, you can do it later: Select the Configuration module. Right click the miscellaneous node. Select Add Audit Repository from the context menu. Edit the properties as described below and save. The available properties are: - Data Source is the data source where audit messages are stored. This cannot be changed once the audit repository has been configured. - Max. Age for Messages is the maximum age in days that audit messages are stored. Older messages are deleted when the audit repository is purged. Note: You need to log in to a security domain with system maintainer privileges and you need the Read Configuration and Write Configuration permissions to configure an audit repository. 14.4.3 Activating Message Types Figure 14-16 Audit Message Type Activation By default, none of the audit message types are activated and no audit message entries are written. In order to activate audit messages: Select the Configuration module. Select the Message Types node within Miscellaneous / Audit. Select the message types you want to activate from the list, or click Activate all to activate all message types. Globalbrain Page 152 of 181

Administration Guide 14 Monitoring If you want to activate or deactivate only messages belonging to a specific group, you can filter the list by selecting the appropriate group from the dropdown list: Figure 14-17 Audit Message Type Groups You can deactivate single message types by clicking the checkbox. If you want to deactivate all of the selected message types, click Deactivate all. Click on Save to apply your changes. It may take up to a minute before the audit message logger recognizes the changes. Note: You need to log in to a security domain with system maintainer privileges and you need the System Management permission to activate or deactivate audit message types. 14.4.4 Viewing Audit Messages Figure 14-18 Audit Message in the Desktop Application Figure 14-19 Audit Message in the Web Application Globalbrain Page 153 of 181

Administration Guide 14 Monitoring To view audit messages: Select the Monitoring module. Select the Audit tab. Configure your search. You can either display all of the messages or filter by one or more of the following criteria: - Type is the audit message type of the message. The first dropdown list under Type provides the Groups selection. The second dropdown list contains the message types associated with the selected group. Changing the type also changes the names of the parameter columns, using meaningful names instead of the generic Parameter 1, Parameter 2, and Parameter 3. - Domain is the security domain for which messages are to be displayed. - User is the user name of the user that made the request. - Parameters are the parameter values that are stored for the audit message entries. Their meaning depends on the audit message types. - Error is the error code that was logged if the server call failed. Instead of an error code: - * can be used to select entries that have any error code which covers all entries for calls that failed. - can be used to select entries that have no error code which covers all entries for calls that succeeded. - Date is the time at which the calls were made. You can enter a minimum and / or a maximum value. Click on the Search button to apply your selection. The table displays the matching entries. Please note that you must activate the Message Types that you want to record within Configuration module (see previous section). Otherwise, you will not be able to get search results. To request statistics, click on the Statistics button at the bottom of the table and select the parameter for which you want to retrieve statistics. The statistics cover the audit messages that match the current filter settings. Figure 14-20 Parameter Statistics You can delete audit messages that have reached the maximum age by clicking on the Purge Expired Messages button under the table. 14.4.5 Generating Reports To generate a report in PDF or Excel format on the selected audit messages, click on the Export button. You will be presented with the options PDF and Excel: Globalbrain Page 154 of 181

Administration Guide 14 Monitoring Desktop Client Web Client This function is browser based. As soon as you select one of the options, a file will be generated and downloaded according to your browser settings. In most cases, it will be saved in the Download folder. From there, you can move the file to the desired location. You can also open it directly from the download area of your browser. Globalbrain Page 155 of 181

Administration Guide 15 License 15 License 15.1 Licensing in Globalbrain 15.1.1 The Product License A license for the Globalbrain server consists of a product license and several feature licenses. The product license is mandatory; none of the features are available without a valid product license. A product license has an expiration date. To ensure the uninterrupted availability of Globalbrain, you should import a new license before the current license expires. A new license can be imported after the previous license has expired. If no license has been imported, or if the current license has expired, the Globalbrain server instances will still start or continue working but none of the licensable features will be available. You can still log in and import a new license but you cannot search or start the Crawler. 15.1.2 The Feature Licenses Depending on the type of license you have purchased, certain features are available, unavailable or bound to limits. The following feature licenses are available: - Search enables the search capabilities of Globalbrain. If this feature license is not available you can neither configure search indexes nor if indexes have already been configured with another license perform searches. - Crawler enables the crawler. If this feature license is not available the crawlers cannot start. - Sessions limits the number of concurrent users that are allowed within your system. If the limit is reached, further login attempts will be rejected. - Size of Document Repositories limits the total size of all document texts stored in the document repositories. This covers all document repositories of all security domains. If the limit is reached, the crawlers will stop working and attempts to add new documents via the API are rejected. - Crawler Threads limits the number of threads that are available for crawler agents. This covers all agents of all crawlers. If you try to increase the number of threads for an agent such that the limit will be exceeded, the new configuration cannot be saved. If a higher number has been configured with a previous license, the crawlers will not start until the configuration has been changed. Globalbrain Page 156 of 181

Administration Guide 15 License 15.2 License Monitoring Figure 15-1 License Monitoring in Desktop Client To get an overview of your current license and the usage of the limited features: Select the Configuration module. Select License within Miscellaneous. The main panel displays the license information including the available features and information about licensing limits. To refresh the current values for the limited features, click on the Refresh button. 15.3 Importing a new License The first license is usually imported during the installation. If you have skipped this step or need to import a new license, you can do this from the Globalbrain Administration Client: Select the Configuration module. Select License within Miscellaneous. Click on the Open button. Select the license file you want to import in the file dialog. It may take a minute before all server instances have recognized the new license. There is no need to restart the service instances. 15.4 License Info in the Web Application The web application has its own licensing module under the Monitoring module called License info. It contains the same functionality as the desktop application refresh and import a new license. Globalbrain Page 157 of 181

Administration Guide 15 License Figure 15-2 License info in Web Application Globalbrain Page 158 of 181

Administration Guide 16 Configuration of the Web Application 16 Configuration of the Web Application The Configuration module of the Globalbrain Web Client allows the web application itself to be configured. The configuration page for the Configuration module contains two tabs. In a previous chapter, the System configuration has already been discussed. Click the Web Client tab to configure your web application. Figure 16-1 Web Client tab 16.1 Configuring Default Values The Configuration Defaults tab allows you to configure some application default values. These values are the defaults for the user preferences and are used if the user has not configured their own preferences. Globalbrain Page 159 of 181

Administration Guide 16 Configuration of the Web Application 16.1.1 Search Defaults Figure 16-2 Search Defaults in the Web Application The Search section defines default settings related to searching and viewing documents: - Fuzziness default value is the default fuzziness that is used in queries. This is a number between 0 and 100 with 0 meaning no fault tolerance and 100 a very high fault tolerance. In the Globalbrain Web Client, this is displayed with a slider. - Max. result number default is the default value for the maximum number of results. - Number of context words defines the number of words to be detected on the search results that are associated with the search word. These words will build the context for the search result. - Hits per page is the number of results that are displayed on a page of search results. - Page size for documents is the default size for text pages in the document viewer. This is used whenever the full document is displayed, for example, in the search result details or the classification result details. - Register document access (in seconds) configures the time (in seconds) after which an access is counted for document popularity. - Rendition Type sets the document type for renditions. The value is either Text or Original. - Hit navigation mode determines the navigation between hits in the document viewer. Available options are Text marks and Hotspots. You can also disable hit navigation completely. Figure 16-3 Hit navigation mode selection Please consult the Globalbrain User Guide for a detailed description of Fuzziness and Max. result number. All values except Hits per page can be overridden by the user for the lifetime of their session. Globalbrain Page 160 of 181

Administration Guide 16 Configuration of the Web Application 16.1.2 Markup Defaults Figure 16-4 Text Marks and Hotspots in the Web Application - Enable hotspots controls whether or not hotspots are displayed. - Enable text marks controls whether or not text marks are displayed. - Text mark mode defines which words are displayed as text marks: Ignore Frequent Words does not mark words that occur much less often than the most frequent word of the query. With this option, no text marks may be created for high-frequency words like "of" or "the". Mark Keywords is like Ignore Frequent Words, but guarantees that text marks for required keywords that have been marked with a + in the query are created. Mark all words creates text marks for all words regardless of their frequency. - Hotspot color defines the background color for highlighting hotspots. You may also configure hotspots to be displayed with a bold font. Changes are reflected in the Hotspot Preview section. - Text mark color configures the background colors for highlighting text marks: Select a start color that is used for text marks of level 1 (exact matches) and an end color that is used for text marks of level 10 (very distant matches). For the other levels, colors between the start and the end color are used. - Color levels allows you to define the number of color levels for the color gradient. The more levels you define, the more matches with a lower relevance will be highlighted. Changes to the Text mark color and Color levels are reflected in the Text Mark Preview section. Please note that the Preview areas will only be populated with examples if the appropriate option is selected: 16.1.3 Search Result Export The Search Result Export configuration option allows you to configure some application default values for the dialogs that are displayed when the user is exporting search result documents from a result list. These values are the defaults for the user preferences and are used if the user has not configured own preferences. Globalbrain Page 161 of 181

Administration Guide 16 Configuration of the Web Application Figure 16-5 Search Result Export 16.1.4 PDF Settings The PDF Settings section allows you to configure default values for the export of search result documents as a PDF document. Please see section 5.3.5.3 of the Globalbrain User Guide for details about exporting results as a PDF document. The following settings are available: - Page Size lets you select whether the PDF document should use Letter or A4 as its page size. - Highlight Hotspots controls if hotspots are highlighted in the text excerpts. - Highlight Text Marks controls if the text marks are highlighted in the text excerpts. - Size of Text Excerpt sets the size of the text excerpt that is included from each result document. 16.1.5 E-Mail Settings Please note, that this configuration option will only be available if the SMTP server is configured appropriately. Please see section 12.2.2.2 Creating an Email Channel for instructions. The E-mail Settings section allows you to configure default values that are offered when the user wants to send result documents by e-mail. Please see section 5.3.5.5 of the Globalbrain User Guide for details about sending documents by mail. The following settings are available: - Subject sets the subject of the e-mail - Mail Text sets the message text of the e-mail Globalbrain Page 162 of 181

Administration Guide 16 Configuration of the Web Application 16.1.6 Subscription Figure 16-6 Configuration Subscription in the Web Application The Subscription section allows the default time period to be set that is used when selecting matching documents in the Subscriptions screen (see chapter 7.2 of the Globalbrain User Guide). 16.2 Configuring Sticky Attributes Figure 16-7 Sticky Attributes on Search Screen By default, the user can add attributes like author or title to a query from the settings dialog on the search screen. Attributes usually only appear in the properties bar below the main input field if the user has entered a value for them. You can optionally configure "sticky" attributes that are always displayed in the settings bar, no matter if a value is configured for the attribute or not. They can be accessed and changed easily by the user without having to open the settings dialog. You can also control how a value is entered. Select the Configuration module and switch to the Sticky Attributes configuration page to view the list of configured sticky attributes and to add, remove or edit a sticky attribute. Globalbrain Page 163 of 181

Administration Guide 16 Configuration of the Web Application 16.2.1 Creating a Sticky Attribute Figure 16-8 Creating a Sticky Attribute Click on the Add button below the table to add a new attribute. This opens a dialog with the following properties: - Attribute is the attribute that is to be displayed permanently. The list only contains attributes that have been configured as searchable in at least one document repository (see section 8.3.2) and that have not already been configured as a sticky attribute. If you have already added all of the available attributes as sticky attributes, following error message will be displayed: - Data type is the expected data type for this attribute. This value is taken from the search index settings and cannot be changed. - Render as configures what kind of input field is displayed for the attribute. Possible values are: Text field displays a simple text field into which the user can enter a value. Select displays a list of a maximum of fifty values that currently exist for this attribute in the selected document repositories. The user can choose one value. Range displays two text fields, allowing the user to enter a minimum and maximum value. - Columns is the width of text fields. This is only available when the input field is rendered as a text field or a range. With sticky attributes of data type DATE, simple text fields or ranges are displayed with a button that allows the user to select a date from a calendar widget. DATE attributes can only be rendered as a Range. STRING attributes can be rendered as a Text field, Range or Select. FLOAT and INT attributes can be rendered as a Text field, Range or Select; we don t recommend that you use a Select with these types and if you do, a warning icon will appear in the dialog. Once you've entered your values, click on Save to store the configuration. You can edit a sticky attribute at any time by double clicking on the attribute in the table. Globalbrain Page 164 of 181
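For example (the attribute names are hypothetical and depend on which attributes are searchable in your document repositories), a DATE attribute such as modified_date would be configured with Render as = Range, giving the user two calendar fields for a minimum and maximum date, while a STRING attribute such as department could be configured with Render as = Select to offer a list of up to fifty values that currently exist for that attribute.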

Administration Guide 16 Configuration of the Web Application
16.2.2 Removing a Sticky Attribute
Figure 16-9 Removing a Sticky Attribute
To remove a sticky attribute, select the attribute you want to remove and then click on the Remove button.
16.2.3 Ordering Sticky Attributes
The sticky attributes are displayed on the search screen in the order in which they are displayed in the table. When several sticky attributes are configured, you can change the order by selecting an attribute and using the up/down arrow buttons to move it up or down in the list.
16.3 Configuring File System Access
Figure 16-10 Configuring File System Access
When the Globalbrain Web Client displays a document, it tries to offer a link that allows the user to download the original document. For documents that have been crawled from a Windows network file system (nfile), the files are delivered to the user by the web client: the web client picks the original document up from where it is stored and sends it to the user's browser. This means that the web application must have the permission to retrieve all of the files that have been crawled from file servers in the network. The easiest way to achieve this is to start the servlet container (Apache Tomcat) under a domain account that has all the necessary permissions. If you don't want to do that, you can set up rules that allow the web application to access certain servers or directories using the SMB protocol. This is done on the File Access tab.
A rule consists of the following properties:
- Domain and User name must contain a valid domain account that has the permission to access files within the requested directory.
- Password is the password for this account.
- Host is the name or IP address of the server that contains the files. This must match the name or IP address as it is used in the URLs that are stored in your document repositories.
- Path is the path on that server.
To add an access rule:
Globalbrain Page 165 of 181

Administration Guide 16 Configuration of the Web Application Click on the Add button under Authentication. Provide the properties as described above in the pop up window. Figure 16-11 Authentication Setting Multiple rules can be added as desired. Please ensure that the rules are non-ambiguous: No two rules must cover the same directory and no rule should point to a subdirectory of a directory that is already covered by another rule. 16.4 Activating NTLM Authentication The Globalbrain web client supports single sign-on with NTLM. This can be activated on a per-web-client basis with multiple Globalbrain web clients installed you can enable NTLM authentication for one client and keep it disabled for another one. 16.4.1 Preparing NTLM Authentication Figure 16-12 Jespa Setup Wizard Globalbrain uses the Jespa library to perform the NTLM authentication. Before you can configure NTLM authentication in your web client, you need to create a computer account in your Active Directory server which Jespa can use. The easiest way to do this is to use Jespa s setup wizard. You can find this wizard in the tools\jespa\ directory of your Globalbrain server installation. To run the wizard, double click on SetupWizard.vbs and follow the instructions. This wizard will generate the file SetupWizard.txt which will contain the properties you will need to populate in the NTLM settings window (see next section). Note: It is also possible to create an Active Directory account without the wizard. Please refer to the Jespa Operators Manual which is also located within the tools\jespa\ folder, look for section Alternative Step 1: Creating a Computer Account Manually. 16.4.2 Configuring NTLM Authenticaton To enable NTLM: Select the NTLM node within Configuration area. Activate the check box for NTLM authentication Globalbrain Page 166 of 181

Administration Guide 16 Configuration of the Web Application Figure 16-13 NTLM Settings Open the SetupWizard.txt file that was created by the Jespa Setup Wizard. Populate the fields with the information in the.txt file. Click on the Save button. 16.5 Configuring Ticket Authentication Figure 16-14 Ticket Authentication Globalbrain allows trusted external applications - such as SharePoint - to pass over a user that has already been authenticated by this external application without the user needing to log in to Globalbrain again. To achieve this, the external application sends an encrypted ticket with information about the authenticated user and a time stamp to Globalbrain. Globalbrain accepts this if it has been encrypted with the correct password. To set a ticket encryption password Select the Ticket node in the web client configuration Enter a value for Encryption password Save your settings Ticket authentication is disabled if you do not enter a password. When configuring a SharePoint web part for Globalbrain, make sure to use the same password (see chapter 15.3 of the Globalbrain Installation Guide). Globalbrain Page 167 of 181

Administration Guide 16 Configuration of the Web Application Figure 16-15 Web Part Configuration in SharePoint Globalbrain Page 168 of 181

Administration Guide 17 SOAP Access 17 SOAP Access 17.1 Introduction Globalbrain provides a SOAP interface that allows non-java clients to access the functionality of Globalbrain such as searching and inserting documents. The details of the SOAP interface are explained in the Developers Guide. This document only explains how to configure SOAP access. SOAP support is only available if the _GBSetupServer_SOAP_All.jar file was present when installing the server instances. You can verify this by checking if your server installation contains a lib/cxf/ directory. 17.2 Configuring SOAP Access To add SOAP support to your Globalbrain installation: Select the Configuration module. Select the Miscellaneous node. Select Add SOAP support from the context menu. Edit the properties as described below. Figure 17-1 SOAP Service Configuration The following properties are available: - Server Instance is the server instance on which the SOAP service will be started. Clients will need to know the host name or IP address of this machine. - Port is the port on which the SOAP service will be available. - Log SOAP Messages enables the logging of the full SOAP communication to Globalbrain s log files. Enable this only for debugging purposes. Once you have saved your configuration, a SOAP Service entry is available in the Miscellaneous section. Select this node to change the properties. Globalbrain Page 169 of 181

Administration Guide 18 Tools
18 Tools
18.1 Command line tools
18.1.1 Introduction: Using the Tools
The Globalbrain server comes with some command line tools. They either provide functionality that is not offered by the administration client at all (for instance because the tools access low-level APIs of Globalbrain that are not available to clients), or they make existing functionality available on the command line. You can find the tools in the tools/ directory of your server installation. The tools are supplied as executable JAR files. You can run them like this:
java -jar filename.jar parameter1 parameter2
The filenames are case sensitive.
18.1.2 Exporting and Importing a Document Repository
The content of a document repository (groups and documents) can be dumped to files, which can then be imported into an existing document repository. This may be used either for database-independent backups or to transfer documents from one system to another. The export is invoked as follows:
java -jar DocumentRepositoryExport.jar <DocumentRepositoryName> <targetdirectory>
This produces a couple of files in the supplied target directory with the name of the document repository as a prefix. The files can be imported as follows:
java -jar DocumentRepositoryImport.jar <DocumentRepositoryName> <sourcedirectory> <filenameprefix>
Note: This tool only works with document repositories that store the data in a database.
18.1.3 Crawling a Single URL
CrawlURL can be used to crawl a single URL and dump the crawl result to stdout. This is helpful when investigating problems that occur during crawling. It is invoked as follows:
java -jar CrawlURL.jar <url> [<username> <password>]
A user name and a password only need to be specified if the URL references a protected resource like a mail server or an FTP server.
18.1.4 Validating a Crawled Directory
When crawling directories from a local file system with the nfile or lfile protocols, ValidateCrawledDirectory can be used to examine which files weren't crawled and which caused errors during crawling. The list of missing documents is printed in the following format:
<status> <return code> <URL>
Status can be one of
Globalbrain Page 170 of 181

Administration Guide 18 Tools
- ILLEGAL_EXTENSION: the document has a file extension that is not covered by the allowed extensions configured for the crawl task.
- DISALLOWED: the document is located in a directory that has been excluded from crawling based on the allowed and forbidden paths configured for the crawl task.
- REJECTED: the document has not been crawled due to another exclusion rule configured for the crawl task.
- LAYER: the job has not been crawled due to the maximum layer limitation in the crawl task configuration. (Note: nfile and lfile crawl tasks that have been created via the Globalbrain Administration Client have an unlimited layer; this status can therefore only appear for crawl tasks that have been created or modified via the API.)
- MISSING_IN_REPOSITORY: the document has been crawled successfully but is no longer present in the document repository. It may have been deleted by an administrator.
- CRAWLING_ERROR: crawling the document failed. Please check the return code for more specific information.
- UNKNOWN: the document has not been crawled for an unknown reason.
The ValidateCrawledDirectory tool is invoked as follows:
java -jar ValidateCrawledDirectory.jar [domain@]user:password crawltask[@crawler] [port]
where:
- domain is the name of the security domain to log in to; this value is only required if multiple security domains are available.
- user is the user name to log in with.
- password is the password for the given user.
- crawltask is the name of the crawl task to be validated.
- crawler is the name of the crawler this crawl task is assigned to; this value is only required if multiple crawlers exist.
- port is the port on which the server instance is running. By default, port 1499 is used.
18.1.5 Changing the Logging Behavior
The Globalbrain server instances write log files based on the configuration in the log4j.properties file, which can be found in the conf/ directory. This includes the configuration of the log level, which determines how detailed the logging output is. The following levels are provided by the log4j framework that Globalbrain uses:
- ERROR messages contain information about errors that occurred on the server. The errors might be harmless, like a failed login, but might also indicate a serious issue like the loss of the database connection.
- WARNING messages indicate that something may be wrong on the server without an error having occurred yet.
- INFO messages provide information about activity on the server, for instance documents that are crawled or searches that have been performed.
- DEBUG messages give a more detailed insight into what's happening on the server. They provide additional information for support if you experience some kind of problem with your Globalbrain installation.
- TRACE messages contain even more detailed information on the steps that the server makes. This level is used only rarely within Globalbrain.
The default level is INFO, which means INFO, WARNING, and ERROR messages will be logged. If you need to change the log level of a server instance at runtime, for instance to get additional information on an issue, you can do this with the ChangeLogLevel tool by running:
Globalbrain Page 171 of 181

Administration Guide 18 Tools
java [-Dgb.serverNodeRole=<role>] -jar ChangeLogLevel.jar <logger> <level>
where:
- logger is the name of the logger. To change the log level for the full server, use com.brainware. You may also change the log level for parts of the server independently, for instance only for the crawler (com.brainware.gb.crawler) or for the search indexes (com.brainware.gb.search).
- level is one of the levels described above: ERROR, WARN, INFO, DEBUG or TRACE
- role is the role of the server instance; if this option is not given, server is used
The same tool can also be used to activate and configure the logger for the ASSA search engine. By default, logging at the engine level is turned off for performance reasons. To activate or change it, run ChangeLogLevel and use:
- engine as the logger
- INFO, DEBUG or OFF as the level
18.1.6 Configuration Explorer
Configuration Explorer is a tool that provides low-level access to Globalbrain's configuration database. It allows you to view the configuration entries and to modify existing entries. Please be aware that this tool is for Brainware Support or Professional Service personnel. Inappropriate usage can corrupt your Globalbrain installation. The configuration explorer can be started as follows:
java -jar ConfigExplorer.jar
This opens up the main window:
Figure 18-1 Crawler Configuration
The configuration is displayed in a tree structure with three nodes on the first level and several categories on the second level:
- Properties contains the system settings of the server and the clients. If you click on a category, the properties stored for this category are listed in a table.
Globalbrain Page 172 of 181

18.1.6 Configuration Explorer
Configuration Explorer is a tool that provides low-level access to Globalbrain's configuration database. It allows you to view the configuration entries and to modify existing entries. Please be aware that this tool is intended for Brainware Support or Professional Services personnel; inappropriate usage can corrupt your Globalbrain installation. The Configuration Explorer can be started as follows:
java -jar ConfigExplorer.jar
This opens up the main window:
Figure 18-1: Crawler Configuration
The configuration is displayed in a tree structure with three nodes on the first level and several categories on the second level:
- Properties contains the system settings of the server and the clients. If you click on a category, the properties stored for this category are listed in a table.
- Binary Objects contains binary data that is stored in the configuration database. This includes at least the license. Content can only be displayed if the stored object contains text data.
- Config Objects contains the configuration of the resources and services that are available, including Security Domains, Server Instances, Crawlers, etc. Click on an entry to view its details. Configuration objects can also contain other configuration objects:
  - Nested configuration objects are displayed as a label with a leading arrow; click on the label to view the nested object.
  - Use the navigation bar to step back to the parent object.
18.1.7 Script Runner
Script Runner is a small application that allows you to run BeanShell scripts against the Globalbrain server. Scripts allow operations to be run directly on top of the Globalbrain API. This can be useful:
- if you need to modify a large number of data objects (for example, crawl tasks), or
- if you want to perform operations that are not supported by the clients (for example, report generation).
To use the Script Runner,
- you must be familiar with the BeanShell scripting language; see http://www.beanshell.org for details,
- you must be familiar with the Globalbrain API; please consult the Globalbrain Developers Guide for details.
The Script Runner is started as follows:
java -jar ScriptRunner.jar
This opens up the main window:
Figure 18-2: Script Runner

You can:
- Type in the script code in the upper window.
- Load an existing script via the Load button.
You can store your script to a file using the Save button at any time.
Within a script, two predefined variables are available:
- sessionid contains a valid session id that can be used for API calls.
- serviceregistry contains an instance of a Globalbrain ServiceRegistry, which provides access to the services.
Before you run a script:
- Enter the name or IP address of the host on which your server is running; if the server uses a port other than 1499, please add the port to the host name (for example, myserver:1599).
- Enter your credentials for logging in to Globalbrain (domain, user name, and password).
Click on Run to start the execution of the script. Any output that is produced by the script is shown in the lower window.
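As a minimal sketch of a script, the following BeanShell snippet uses only the two predefined variables described above together with BeanShell's built-in print command; it does not call any particular Globalbrain service:

// Illustrative only: show the predefined variables provided by Script Runner.
print("Session id: " + sessionid);
print("Service registry: " + serviceregistry);

A real script would obtain individual services from serviceregistry and pass sessionid to their API calls; please consult the Globalbrain Developers Guide for the available services and their signatures.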

18.2 CairoExtractor Designer
Globalbrain uses an OCR tool called CairoExtractor, which is provided along with Globalbrain. Please refer to the Installation Guide for installation instructions. If you want to take full advantage of this tool, select the CairoExtractor Designer during installation. This tool provides you with extended settings to configure the OCR process. Furthermore, you will be able to assign specific OCR settings to different crawlers depending on the task to be performed.
18.3 Starting CairoExtractor Designer
To start the Designer tool:
- Click on Start.
- Open All Programs.
- Open the Brainware node.
- Select CairoExtractor 1.0.
You will be presented with the OCR Properties window.
18.4 Selecting Preprocessing Methods
Figure 18-3: Setting OCR Options
Use the Preprocessing tab to edit image preprocessing options. Only the selected preprocessing methods will be applied.
To select a preprocessing method, under Available Methods highlight the method and click the arrow pointing to the right to move it to Selected Methods. To remove a method, highlight it under Selected Methods and click the arrow pointing to the left. The up and down arrows change the order in which the selected preprocessing methods will be applied.
Only the methods that have been licensed are available and therefore visible in the Available Methods list.
The following preprocessing options are available:
- Binarisation: Converts grayscale and color images to bitonal images. Images must be bitonal for optimal OCR and barcode recognition.
- Box & Comb Removal: Removes black boxes or regular and irregular combs from a scanned image. This setting can be further configured.
- Clean Border: Removes dark borders from a document.
- Despeckle: Removes speckles. Speckles are made up of groups of black pixels surrounded by white pixels or vice versa.
- Lines Manager: Removes horizontal or vertical lines. This setting can be further configured.
- Invert: Reverses white-on-black pages to black-on-white pages.
A word on despeckling: this process, which removes dots and visual dust from your documents, sometimes detects decimal points and removes them. This, of course, can be undesirable. If your documents are generally light, do your first preprocessing run without despeckling and check the results. If the quality is still poor, enable Despeckle and preprocess the documents again.
18.4.1 Setting Specific OCR Tolerances
Still on the Preprocessing tab, you can set specific tolerances for the OCR methods. This is only available for the methods Box & Comb Removal and Lines Manager. To do this, click on a method in the Selected Methods column.
18.4.1.1 Box & Comb Removal
Figure 18-4: Box and Comb Removal Settings
Boxes, regular combs, and irregular combs can be removed in any combination. For example, you can remove boxes only, boxes and regular combs, or boxes and irregular combs. To have these objects removed without further configuration, select the Automatic checkbox. To specify the length of lines to be removed, uncheck the Automatic checkbox and enter values for:

- Horizontal: Enter a minimum line length in millimeters for line and black run.
- Vertical: Enter a minimum line length in millimeters for line and black run.
- Hor. Interval ± tol. (%): The text boxes will be disabled when irregular combs are to be removed.
18.4.1.2 Lines Manager
Figure 18-5: Lines Manager Settings
- Remove hor. lines: Removes horizontal lines. Enter a minimum line length in millimeters in the text box horizontal line and its thickness in pixels in the text box black run.
- Remove ver. lines: Removes vertical lines. Enter a minimum line length in millimeters in the text box vertical line and its thickness in pixels in the text box black run.
- Repair Characters: Select the checkbox to repair characters.
18.5 Recognition
Use the Recognition tab to configure OCR recognition. The OCR within Globalbrain is performed by the FineReader 10 engine, which is already preselected in the Available Engines field.
Figure 18-6: Recognition - General Tab

18.5.1 General Tab
On the General tab, the following options are available:
Recognition Trade Off determines the trade-off between performance and OCR accuracy:
- Most Accurate: If selected, recognition has a lower error rate but a longer processing time. This is the default mode.
- Balanced: If selected, recognition speed and accuracy are in between Most Accurate and Fast.
- Fast: If selected, recognition is quicker but less accurate.
Rotation
- Detect Orientation: Select this to automatically detect and correct the orientation of badly scanned pages.
- Deskew: Select this to automatically fix the alignment (skew) of badly scanned pages.
Enhancement
- Remove Lines: Select this to remove all internal lines, either horizontal or vertical.
- Remove Texture: Select this to remove background noise from the image for recognition.
- Despeckle: Select this option to remove speckles from your documents. Speckles are made up of a group of black pixels surrounded by white pixels or vice versa.
18.5.2 Recognition Tab
Figure 18-7: Recognition Tab
The Recognition tab provides the following options:
- Automatic: Automatically detects how the documents were created. Use this setting unless your document input is very homogeneous and either from a dot-matrix printer or a typewriter.
- Typewriter: For typewritten material.
- Matrix: For material printed with dot-matrix printers.
- OCR-A/OCR-B: OCR-A and OCR-B fonts are used where high OCR character recognition rates are needed. The fonts are optimized for machine character recognition; for example, they are used on passports and other security documents.
- MICR E13B: MICR is a character recognition technology that facilitates the processing of checks.
- Allow Superscript: For text containing superscript characters.
- Allow Subscript: For text containing subscript characters.

18.5.3 Languages Tab
Figure 18-8: Languages Tab
The Languages tab provides the following options:
- Installed: The set of languages installed on the system.
- In Use: The selected languages that are currently in use.
On the Languages tab, select one or more languages by moving them from the Installed list to the In Use list. Recognition results will be compared with entries in the corresponding dictionary. Select multiple languages only if you are processing multilingual documents.
FineReader 10 features extended support for additional languages, e.g. Hebrew. For more details, please refer to the ABBYY documentation. However, not all of the languages available with FineReader 10 are enabled; they will be provided on request.
18.6 Getting Path Information
To map a settings file to a specific crawler, the Designer tool allows you to display the location path, to copy it, and to paste it into the Administration Client.
- Open the File menu.
- Select Path.
Figure 18-9: Calling Path Information

Figure 18-10: Path Information Window
Click on Copy to ClipBoard. The full path name is now in the clipboard and can easily be pasted into the Administration Client. Please refer to section 10.4.1.6 Extraction for further steps.