FAST Enterprise Search Platform



FAST Enterprise Search Platform
Version: 5.2
Product Overview Guide
Document Number: ESP1000, Document Revision: A, April 3, 2008

Copyright

Copyright by Fast Search & Transfer ASA ("FAST"). Some portions may be copyrighted by FAST's licensors. All rights reserved. The documentation is protected by the copyright laws of Norway, the United States, and other countries and international treaties. No copyright notices may be removed from the documentation. No part of this document may be reproduced, modified, copied, stored in a retrieval system, or transmitted in any form or any means, electronic or mechanical, including photocopying and recording, for any purpose other than the purchaser's use, without the written permission of FAST. Information in this documentation is subject to change without notice. The software described in this document is furnished under a license agreement and may be used only in accordance with the terms of the agreement.

Trademarks

FAST ESP, the FAST logos, FAST Personal Search, FAST msearch, FAST InStream, FAST AdVisor, FAST Marketrac, FAST ProPublish, FAST Sentimeter, FAST Scope Search, FAST Live Analytics, FAST Contextual Insight, FAST Dynamic Merchandising, FAST SDA, FAST MetaWeb, FAST InPerspective, GetSmart, NXT, LivePublish, Folio, FAST Unity, and other FAST product names contained herein are either registered trademarks or trademarks of Fast Search & Transfer ASA in Norway, the United States and/or other countries. All rights reserved. This documentation is published in the United States and/or other countries.

Sun, Sun Microsystems, the Sun Logo, all SPARC trademarks, Java, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. Netscape is a registered trademark of Netscape Communications Corporation in the United States and other countries. Microsoft, Windows, Visual Basic, and Internet Explorer are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. Red Hat is a registered trademark of Red Hat, Inc. UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries. AIX and IBM Classes for Unicode are registered trademarks or trademarks of International Business Machines Corporation in the United States, other countries, or both. HP and the names of HP products referenced herein are either registered trademarks or service marks, or trademarks or service marks, of Hewlett-Packard Company in the United States and/or other countries. Remedy is a registered trademark, and Magic is a trademark, of BMC Software, Inc. in the United States and/or other countries. XML Parser is a trademark of The Apache Software Foundation. All other company, product, and service names are the property of their respective holders and may be registered trademarks or trademarks in the United States and/or other countries.

Restricted Rights Legend

The documentation and accompanying software are provided to the U.S. government in a transaction subject to the Federal Acquisition Regulations with Restricted Rights. Use, duplication, or disclosure of the documentation and software by the government is subject to restrictions as set forth in FAR Commercial Computer Software-Restricted Rights (June 1987).

Contact Us

Web Site
Please visit us at:

Contacting FAST
Fast Search & Transfer, Inc.
Cutler Lake Corporate Center
117 Kendrick Street, Suite 100
Needham, MA USA
Tel: +1 (781) (8:30am - 5:30pm EST)
Fax: +1 (781)

Technical Support and Licensing Procedures
Technical support for customers with active FAST Maintenance and Support agreements: tech-support@fastsearch.com
For obtaining FAST licenses or software, contact your FAST Account Manager or customerservice@fastsearch.com
For evaluations, contact your FAST Sales Representative or FAST Sales Engineer.

Product Training
fastuniversity@fastsearch.com

Sales
sales@fastsearch.com

Contents

Preface
  Copyright
  Contact Us

Chapter 1: The FAST ESP Documentation Set
  Standard FAST ESP Product Documentation
  FAST ESP Product Overview
  FAST ESP Installation
  FAST ESP Configuration
  FAST ESP Operations
  FAST ESP Advanced Linguistics
  FAST ESP Troubleshooting Guide
  FAST Home
  FAST Search Business Center
  FAST ESP Deployment Planning
  FAST ESP Migration Guide
  FAST Search Front End (SFE) Users Guide
  File Traverser
  FAST Classifier
  FAST ESP Query Languages and Parameters
  FAST ESP WebAnalyzer
  Additional FAST ESP Components and Documentation
  FAST Enterprise Crawler
  FAST ESP Software Development Kit (SDK)

Chapter 2: FAST ESP at a glance
  Introduction to FAST ESP
  System Architecture
  Data Flow Overview
  Module Overview

Chapter 3: Basic Concepts
  Content and Flow
  Collections
  Search Profiles
  Document and Document Elements
  Index Schema, Index Profile
  Search Rows, Columns, and Clusters

Chapter 4: Integrating FAST ESP with your Content and Query Infrastructure
  Retrieving and Processing Content
  Integrating FAST ESP on the Content Side
  Using the Crawler
  Using the File Traverser
  Pushing Content to the Content API
  Using a FAST Content Connector
  Integrating FAST ESP on the Query Side
  Custom Integration
  Administration and Installation Integration

Chapter 5: Processing Documents
  Document Processing Overview
  Document Processing Engine, Stages, Pipelines
  Custom Document Processing
  Entity Extraction

Chapter 6: Making Documents Searchable
  Indexing Documents
  The FAST Search Engine
  Search Engine Clusters
  Search Columns and Search Rows
  Defining How Documents are Searchable
  Document Processing, Index Profile, and Search Engine Cluster
  Index Profile Structure
  Including Metadata
  Executing Search Queries and Returning Results
  Partial Document Updates
  Query Highlighting in Dynamic Teasers
  Query Highlighting in Source Documentation

Chapter 7: Concepts of Relevancy
  Components of Search Relevancy
  Contextual Insight
  Ranking Concept
  Freshness
  Authority
  Quality
  Proximity and Context
  Freshness Boosting
  Analyzing linked web pages using the WebAnalyzer
  Tools to Modify Rank for Individual Documents
  Search Business Center
  Boost Bulk Tool
  Boosting Mechanisms
  Relevancy Modifications Based on Business Rules
  Proximity Ranking and Matching
  Explicit Proximity
  Implicit Proximity
  Sorting Overview
  Full Text Sorting
  Multi-Level Sorting
  Sorting on Geographical Coordinates
  Field Collapsing
  Controlling Ranking and Sorting of Query Results
  Boundary Matching
  Duplicate Removal
  Dynamic (Result-Side) Duplicate Removal

Chapter 8: Processing Queries and Results
  Query and Results Server Overview
  Query Concepts
  Query and Result Server
  Query and Components
  Query Processing
  Query Modifications
  Query Resubmission
  FAST Query Language
  Result Processing
  Result Views
  The FAST Search Front End (SFE)

Chapter 9: Geo Search
  GEO Search Overview

Chapter 10: Scope Search and Dynamic XML Indexing
  Scope Search Overview
  Definition of a Scope
  Example of Using a Scope Search
  How Scope Search Works and Why It is Used
  Scope Search vs. Fielded Search
  Scope Search Concepts and Capabilities
  Scope Fields
  Scope Data Types
  Query Language in Scope Search
  Return Matching Scopes
  Scope Boosting
  Dynamic Document Summary (Teasers)
  Linguistics and Scope Search
  Partial Updates
  Dynamic XML Indexing

Chapter 11: Taxonomy and Navigation
  Taxonomy and Navigation Overview
  Navigators
  Field Navigators
  Deep and Shallow Navigators
  Contextual Navigators
  Field Navigators for Values in Scope Fields
  Taxonomy
  FAST Taxonomy Explorer
  FAST Classifier
  Unsupervised Clustering
  Creating Taxonomy on the Fly

Chapter 12: Advanced Linguistic Processing
  Linguistics Overview
  Linguistics and Relevancy
  Linguistics Concepts
  Dictionaries
  Automatic Language Detection
  Lemmatization
  What Lemmatization Means
  Advanced Phrase Recognition and Lemmatization
  Synonyms and Spell Variations
  Synonym Overview
  Dictionary Management
  Advanced Phrase Recognition
  Query Transformations
  Advanced Phrase Customization
  Advanced Phrase Recognition and Spell Checking
  Applying Advanced Phrase Recognition
  Spell Checking and Phrase Recognition Framework
  Phrase Recognition and Correction
  Spell Checking on Simple Terms
  Applying Spell Checking
  Required Dictionaries for Spell Checking
  Anti-Phrasing
  Required Dictionaries for Anti-Phrasing
  Supported Language for Anti-Phrasing
  Sub-String Search
  Sub-String Search Overview
  Application Scenarios
  Applying Sub-String Search
  Wildcard Search
  Special Characters and Accents

Chapter 13: Operation and System Administration
  Operation Overview
  ESP Administrator Interface
  Main Views
  FAST Home and Search Business Center
  Licensing
  Fault Tolerance
  Security

Chapter 14: Supported Document Formats
  Supported Formats Overview
  Supported Input File Formats
  Word Processing Formats
  Desktop Publishing Formats
  Database Formats
  Spreadsheet Formats
  Presentation Formats
  Graphics Formats
  Compressed Formats
  Formats
  Other Formats

Chapter 15: Glossary
  ESP Term Definitions


Chapter 1: The FAST ESP Documentation Set

This chapter lists the components of the FAST ESP documentation set and explains how and when to use them.

Topics:
- Standard FAST ESP Product Documentation
- Additional FAST ESP Components and Documentation

Standard FAST ESP Product Documentation

The FAST ESP documentation set consists of both reference and task-oriented documentation. It includes the guides listed in this section.

Note: The FAST ESP documentation set covers both standard and optional FAST ESP features. Optional features are enabled with individual license keys. If you have not purchased these optional features, they will not be enabled in your installation of FAST ESP.

FAST ESP Product Overview

The FAST ESP Product Overview Guide explains the basic concepts of FAST ESP and describes its features. It contains a glossary of terms for FAST documentation and serves as an introduction to FAST ESP.

FAST ESP Installation

The FAST ESP Installation Guide describes the procedures needed to install FAST ESP.

FAST ESP Configuration

The FAST ESP Configuration Guide describes the basic procedures for creating a collection, configuring document processing, managing index profiles, configuring advanced linguistics, and other configuration information. In addition, it contains the DTDs used in FAST ESP.

FAST ESP Operations

The FAST ESP Operations Guide describes core operational procedures such as starting and stopping the system and back-up procedures. It is a task-oriented guide that answers the question "How", as opposed to the Configuration Guide, which answers the question "What". The Operations Guide is for skilled users and does not provide descriptions of concepts or features.

FAST ESP Advanced Linguistics

The Advanced Linguistics Guide provides information about advanced linguistic processing in ESP. It provides descriptions of and configuration information for linguistics features, as well as the procedures required to perform linguistics customizations such as creating your own dictionaries, customizing existing dictionaries, or configuring advanced tokenization. Basic conceptual information about linguistic processing is described in the Product Overview Guide. The Advanced Linguistics Guide is for advanced users and system administrators.

FAST ESP Troubleshooting Guide

The FAST ESP Troubleshooting Guide is a task-oriented guide intended to help you out of a bind while working with ESP. It provides scenarios and possible solutions to those scenarios. It also lists log errors and log messages.

FAST Home

The FAST Home Guide describes the FAST Home graphical user interface. FAST Home is the Business Manager's personal portal to the FAST ESP installation. FAST Home is where you create and set up the initial search profiles, and where you manage the users and groups that should have access to work with the search profiles. FAST Home has links to other FAST applications, such as Search Business Center and the Administration GUI.

FAST Search Business Center

The Search Business Center Guide describes the Search Business Center application. Business Managers use Search Business Center to manage ranking, relevancy, synonyms, navigators and more. Search Business Center is where you tune and configure the search experience for the search profile before you publish it to your production environment. In Search Business Center you can monitor the end users' query behavior (query logs and reports). You can make changes to the search profile settings and test them in the internal Preview before publishing the changes to the Published Search Front End.

FAST ESP Deployment Planning

The FAST ESP Deployment Planning Guide describes what to consider before installing the product. For example, it describes the concepts and overall principles for system dimensioning, fault-tolerance setups, and component optimization.

FAST ESP Migration Guide

The FAST ESP Migration Guide describes what you need to take into consideration when migrating FAST ESP 5.0 to FAST ESP 5.1. It provides migration scenarios, reference information about requirements, and procedures for successful migration. The Migration Guide provides information for the following types of users:
- Managers and supervisors, who need to understand what is involved in the migration
- System administrators and operators, who need detailed information on how to perform the migration

FAST Search Front End (SFE) Users Guide

The Search Front End User's Guide describes how to use the default Search Front End in FAST ESP. The guide describes how to search using the different search types (Simple, Contextual, Similarity and Fielded search), how to understand and navigate in your search results, and what you can do to improve relevancy and search effectiveness to get the right results. It also describes how you can customize the Search Front End using Search Business Center.

File Traverser

The File Traverser Guide describes the File Traverser and how to configure it.

FAST Classifier

The FAST Classifier Guide describes how to use the FAST Classifier. It provides an overview of what the FAST Classifier is and how to use it through either the command line or the GUI.

FAST ESP Query Languages and Parameters

The Query Language and Parameters Guide describes the parameters available for controlling query submission, transformation and result gathering. Parameter interfaces are presented for both the API and the FAST Query Language (FQL) directly.

FAST ESP WebAnalyzer

The WebAnalyzer is a FAST ESP module that uses links between documents to improve search relevancy. The WebAnalyzer Guide describes the feature and provides installation, configuration, operations, and troubleshooting information.

Additional FAST ESP Components and Documentation

FAST provides the following additional components with separate documentation sets.

FAST Enterprise Crawler

The FAST Enterprise Crawler Guide related to ESP describes the Enterprise Crawler for version 6.6, and includes migration, installation, configuration, and operational information. The guide also includes some deployment planning information that is specific to the Crawler.

FAST ESP Software Development Kit (SDK)

The SDK contains additional integration tools for query, content and document processing integration, and Search Front End development.

Content Integration
The FAST ESP Content Integration Guide describes the available programming interfaces along with the steps needed to integrate FAST ESP with your content sources.

Query Integration
The FAST Query Integration Guide describes the available programming interfaces along with the steps needed to set up a customized search interface towards your FAST ESP implementation.

Indexing Database Content and XML
The FAST ESP Indexing Database Content and XML Guide provides an overview of how to set up a FAST ESP installation for structured data.

Document Processor Integration
The FAST ESP Document Processor Integration Guide describes how to create your own document processors.

Document Hit Highlighting
The FAST ESP Document Hit Highlighting Guide describes how to configure and use the document hit highlighting feature in FAST ESP. The feature enables you to view the matching sections of the full document with the matching query terms highlighted.

Search Front End Developers Guide
The SFE Developer's Guide explains how the Search Front End (SFE) and Search Front End API (SFEAPI) interact, and how the SFE and SFEAPI can be customized using the Java Struts framework and Velocity templates. This guide is for people using the SFE and SFEAPI as the basis for building custom search front ends. It is also useful for people building a search front end from scratch, as the SFE provides examples of how different parts of the search front end can be implemented and how the Search API can be used.

Content Connector Tool Kit (CCTK)
The FAST ESP Content Connector Toolkit Guide describes the Content Connector Toolkit (CCTK), a framework that makes it easier to develop connectors for ESP and InStream. This guide provides conceptual information about content integration and connector development, guidelines for connector development, explanations of the content connector framework and architecture, and procedures for using the CCTK.

Application Integration
This guide describes the ESP Administration integration architecture and how to integrate applications with the ESP Administration and Application services. The services include component management, collection management, search profile management and query log handling.

Query Reporting Framework
The Query Reporting Framework Guide describes what the query reporting framework is and how to use it.

Chapter 2: FAST ESP at a glance

This chapter gives you an overview of the FAST ESP product, its system architecture, features, and modules.

Topics:
- Introduction to FAST ESP


Introduction to FAST ESP

FAST ESP is an integrated software application that provides a platform for searching and filtering services. It is a distributed system that enables information retrieval from any type of content. ESP combines real-time searching, advanced linguistics, and a variety of content access options into a modular, scalable product suite.

FAST ESP does the following:
1. Retrieves or accepts content from web sites, file servers, application-specific content systems, and direct import via API
2. Transforms all content into an internal document representation
3. Analyzes and processes these documents to allow for enhanced relevancy
4. Indexes the documents and makes them searchable
5. Processes search queries against these documents
6. Applies algorithms or business rule-based ranking to the results
7. Presents the results along with the navigation options

System Architecture

Figure 1: FAST ESP System Architecture

Data Flow Overview

The data flow through the FAST ESP system consists of the following basic steps:
1. Submitting Content
2. Analyzing and Processing Documents
3. Matching Documents and Search Queries
4. Matching Documents and Triggers
5. Managing and Tuning

Submitting Content

Content is submitted using one of the following:
- one of the Content Connectors that are included with FAST ESP
- one of the FAST Content Connectors, available as separate software packages
- the FAST Content API to push content directly to FAST ESP

For detailed concepts about submitting content, refer to Integrating FAST ESP with your Content and Query Infrastructure.

Analyzing and Processing Documents

Once a content entity has been submitted to the FAST ESP system, it is converted to a document that complies with the FAST ESP internal document format. Each document goes through a set of document processing steps performed by the FAST Document Processing Engine. The purpose of document processing is to extract additional information, such as the language of the content, and to add that information to the document to improve search relevancy. For detailed concepts about document processing, refer to Processing Documents.

Processed documents are passed on to the FAST Search Engine.

Matching Documents and Search Queries

As new documents arrive at the FAST Search Engine, the Search Engine generates search indices from them.
1. The end-user or external query application submits search queries through a search front end or directly to the FAST Query API.
2. The Query API in turn sends the query to the FAST Query & Result Server, which pre-processes the search queries to improve the relevancy of the results returned. Examples of such pre-processing are spell-checking or proper name recognition (see Advanced Linguistic Processing).
3. After pre-processing, the search queries are sent to the FAST Search Engine. The engine matches them against its indices and returns a list of resulting documents along with result set navigation options that let the user further refine the search. The FAST Query & Result Server can then perform post-processing on the result list, such as category result grouping, sorting, or adding navigators for dynamic drill-down.
4. Finally, the result list is returned through the FAST Query API back to the Search Front End (SFE) or the external query application.

For detailed concepts about matching documents and search queries, refer to Making Documents Searchable, section Defining How Documents are Searchable, and to Processing Queries and Results.

Managing and Tuning

1. The FAST Administrator Interface (also referred to as the Admin GUI) allows you to manage and monitor your FAST ESP implementation. It displays status messages from a range of administrative modules such as the FAST Log Server or the FAST Configuration Server. Monitoring via SNMP enables ESP to be monitored from other systems such as IBM Tivoli and HP OpenView, and supports ESP status reads (component, indexing, document processing, and query statuses).
2. The Taxonomy Explorer allows you to manage categories and categorization rules for grouping the search results. For details about taxonomy management, refer to Taxonomy and Navigation, and to the Taxonomy Explorer Guide.
3. The FAST Home and Search Business Center applications are used for setting up and tuning search sites, creating users and user access, and accessing other FAST interfaces. Refer to the FAST Home and Search Business Center Guides.

Module Overview

The FAST ESP system consists of different types of modules, which can be categorized by purpose: in other words, by what the module does in the system, such as matching and query/result processing.

Modules by category (Category, Module, Description):

Category: Data Sources
  FAST Crawler: Locates and retrieves files on Web servers.
  File Traverser: Traverses and retrieves files from directories on file servers.

Category: Document Processing
  FAST Document Processing Engine: Performs document processing tasks for format conversion and document relevancy such as language detection, Asian language tokenization, and lemmatization.

Category: Matching and Query/Result Processing
  FAST Search Engine: Performs the indexing and searching tasks within FAST ESP. It indexes new documents coming from the FAST Document Processing Engine, matches them against search queries submitted by the Query & Result Server, and returns a list of resulting documents and result set navigation options to the Query & Result Server.
  FAST Query & Result Server: Processes search queries and search results to enable relevancy-focused searching and result presentation. It provides linguistic query processing features like spell checking, and results processing features like result clustering.

Category: APIs
  FAST Content API: Allows the standard data source modules of FAST ESP, as well as custom applications, to push content to the FAST Content Distributor. The API is available in Java, C++, and .NET.
  FAST Search API: Allows external search front end systems to submit their queries and receive result sets in return. The API is available in Java, C++, and .NET.

Category: Administration
  FAST ESP Administrator Interface (Admin GUI): Provides a browser-based graphical user interface that allows the system administrator to monitor and configure FAST ESP.
  License Manager: License server for all components controlled by the licensing scheme.

Category: Relevancy Tuning
  AdminServer: Allows system administrators and business users to monitor the end users' query behavior and to fine-tune the ranking of individual documents based on the monitoring results.
  Boost Bulk Tool: Can be used to import rank boosting specifications for individual documents into an existing search index. It reads boost records from an XML file and applies the boosts to the specified documents.

Category: Additional Modules (distributed separately)
  FAST Content Connectors: Various FAST Content Connectors allow you to submit content from databases such as DB2, Oracle, or SQL Server, and other specific applications.
  FAST SDK: Provides integration tools, in addition to the standard APIs, for query, content, and document processing integration and search front end development.
  FAST Taxonomy Explorer: Used to create taxonomies for document organization and/or use concept extraction.
  FAST Security Access Module: Provides application level security when integrating FAST ESP with security environments such as Active Directory.
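The query path described under Matching Documents and Search Queries (pre-process the query, match it against the index, post-process the result list) can be illustrated with a small, self-contained sketch. It is a toy model only: every name in it (CORRECTIONS, build_index, run_query, and so on) is hypothetical and is not part of the FAST Query API or the Query & Result Server, and the "spell check" is just a lookup table assumed for illustration.

```python
# Toy model of the query flow; all names are hypothetical, not FAST APIs.
from collections import defaultdict

CORRECTIONS = {"serach": "search"}  # assumed stand-in for a spell-check dictionary


def build_index(docs):
    """Map each lowercase term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def preprocess(query):
    """Query-side pre-processing: lowercase and apply spelling corrections."""
    return [CORRECTIONS.get(term, term) for term in query.lower().split()]


def match(index, terms):
    """Return the IDs of documents that contain every query term."""
    if not terms:
        return set()
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return hits


def postprocess(hits):
    """Result-side post-processing: here, only a sort by document ID."""
    return sorted(hits)


def run_query(index, query):
    return postprocess(match(index, preprocess(query)))


docs = {
    "doc1": "enterprise search platform",
    "doc2": "search engine clusters",
    "doc3": "license manager",
}
index = build_index(docs)
results = run_query(index, "Serach platform")
```

In this sketch the misspelled query term is corrected before matching, so the query still finds the one document that contains both terms.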

Chapter 3: Basic Concepts

This chapter introduces you to the basic concepts of FAST ESP.

Topics:
- Content and Flow
- Collections
- Search Profiles
- Document and Document Elements
- Index Schema, Index Profile
- Search Rows, Columns, and Clusters


Content and Flow

Data that has not yet been submitted to the FAST ESP system is called content. Searchable content entities are called documents. Examples of content are MS Word files, HTML pages, or database entries.

Content that is submitted to and flows through FAST ESP undergoes different steps of normalization, document processing, and indexing before it is available for searching.

Collections

Content is retrieved, processed, made searchable, and then grouped into collections in ESP. Collections allow you to treat different groups of content differently, specifying for each collection the way in which its documents are to be processed and indexed. Grouping of content into collections is typically based on criteria like:
- Different views of the content seen from the end-user application, such as product data, Web site pages and news. (See Search Profiles.)
- Content ownership, such as intranet versus extranet content
- Special processing rules, such as metadata handling

Grouping content enables end-users or external query applications to narrow down the scope of a search to specific types of documents. In addition, the collection concept allows you to specify the order in which different types of content are to be processed during document processing by prioritizing individual collections.


Figure 2: Content Refinement Showing Different Collections

The collection concept does not imply any physical partitioning of the index. FAST ESP can effectively support very large numbers of collections with only a minor performance impact. You can also partition the index by collection (collection-based routing).

A collection is set up by defining the content source, for example a set of Web domains, and the document processing rules (see Processing Documents) to be applied. For procedural details on how to do this, see Basic Setup, section Creating a Basic Web Collection, in the Configuration Guide.

Search Profiles

While collections group the documents and/or other indexed content, search profiles define what to search and how your queries and results should be processed and displayed. Search profiles are created through FAST Home, and are monitored and tuned through the Search Business Center. Refer to the FAST Home and Search Business Center Guides for more information.
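The idea that a collection is a logical grouping rather than a physical partition can be pictured with a short sketch: one shared index in which each document carries a collection label, and a search that optionally narrows its scope to selected collections. This is an illustration under stated assumptions, not how FAST ESP stores collections internally; Document, INDEX, and search are hypothetical names.

```python
# Illustrative only: collections modeled as a label on documents in one shared index.
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    collection: str
    text: str


# A single index; no per-collection storage is assumed.
INDEX = [
    Document("d1", "products", "red running shoes"),
    Document("d2", "news", "new running event announced"),
    Document("d3", "intranet", "running the quarterly report"),
]


def search(term, collections=None):
    """Return IDs of documents containing term, optionally limited to some collections."""
    hits = []
    for doc in INDEX:
        if collections is not None and doc.collection not in collections:
            continue  # document lies outside the requested search scope
        if term in doc.text.split():
            hits.append(doc.doc_id)
    return hits


all_hits = search("running")
news_hits = search("running", collections={"news"})
```

Searching without a collection filter covers the whole index, while passing a set of collection names narrows the scope, which mirrors how end-users can restrict a search to specific types of documents.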

Document and Document Elements

Content is submitted to the FAST ESP system and converted into documents. A document represents the content entity as a set of data elements. These elements contain information extracted from the original content entity, such as the information contained in the title or body section of an HTML page.

A document represents a searchable entity within the FAST ESP Index. Generally, there is a one-to-one relation between a content entity and a document. The definition of a content entity depends on the way your content is structured.

This document representation is used for the processing performed prior to indexing. Refer to the chapter Processing Documents for more information. In addition to the information included in the original document, information improving search relevancy is added to the document. Examples of elements are:
- title text
- author
- body text
- an ID that uniquely identifies the document
- the language the document is written in

The conversion preserves the structure of the documents, as well as metadata if embedded in the documents. By default, text elements are assumed to be encoded according to UTF-8 (Unicode).

The document concept is independent of the type of data being added to the system. For example, if the content source is a database table, each row of information from a table or view may become a document. For both search and filtering, each document is treated as one searchable item and is listed as such in the result list.

Each document has a document identifier that is unique across the entire set of documents handled by the FAST ESP system. Note that the document identifier is not necessarily a URL. It may be a constructed URI representing, for example, the exact location of a record in a database. There are no restrictions on the format of the document identifier. However, for crawled content, it makes sense to use the URL of the crawled document. For content pushed to the system from a custom application using the Content API, the client pushing the document into FAST ESP needs to supply the URI. In this case, the document identifier may, for example, be the key for storing and loading documents in external storage.

Index Schema, Index Profile

This topic explains the relationship between fields and the index profile.

Prior to indexing a document, the FAST Search Engine maps the document's elements to fields. Fields are defined document elements that are to be searchable. Defining fields allows the end-user or external query application to specify searches that cover only individual parts of a document such as the title or body part. You define fields by creating and specifying an Index Profile.

FAST ESP supports text, signed and unsigned integer, float, double, and datetime fields. Text fields may contain words or numbers, and queries can be specified for single words, phrases, or a combination of these. Integer, float, and double fields contain numerical values that can be matched against a query by using numerical comparisons such as less than, greater than, and equal to.
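As a rough illustration of the mapping from document elements to typed fields, the sketch below uses a plain dictionary as a stand-in for an index profile and shows a text match restricted to one field alongside a numerical comparison. The field names, the INDEX_PROFILE structure, and the helper functions are assumptions made for this example; they do not reflect the actual Index Profile format, which is described in the Configuration Guide.

```python
# Hypothetical stand-in for an index profile: field name -> Python type.
import operator

INDEX_PROFILE = {"title": str, "body": str, "price": float, "pages": int}

COMPARATORS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}


def map_to_fields(elements):
    """Keep only elements named in the profile, converted to the field's type."""
    return {
        name: field_type(elements[name])
        for name, field_type in INDEX_PROFILE.items()
        if name in elements
    }


def text_match(fields, field, word):
    """True if word occurs in the named text field (other fields are ignored)."""
    return word.lower() in fields.get(field, "").lower().split()


def numeric_match(fields, field, op, value):
    """Compare a numeric field against value using <, >, or =."""
    return field in fields and COMPARATORS[op](fields[field], value)


doc = map_to_fields({
    "title": "Search Basics",
    "body": "An intro to indexing",
    "price": "19.5",
    "internal_note": "not part of the profile, so not searchable",
})
```

Elements that the profile does not name are dropped, so a query against the title field cannot match words that only appear in the body, and numeric fields are compared as numbers rather than as text.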

Multiple fields may be grouped into composite fields, allowing a query to be executed on several fields at the same time.

Scope Fields are special field types that support dynamic indexing and searching in hierarchical content, such as XML. Refer to Scope Search and Dynamic XML Indexing for details.

For details on the Index Profile structure, refer to the Configuration Guide.

Search Rows, Columns, and Clusters

FAST ESP uses the concepts of clusters, rows, and columns, and the relationships between them, to organize Search Engine instances.

Search Engine instances are grouped into Search Clusters. Each Search Cluster shares a common Index Profile; that is, it must be possible to represent the content within a common index schema. Within each Search Cluster any number of Search Engine Nodes may exist, organized in rows and columns. Rows are used for query scaling and columns are used for document volume scaling.

Figure 3: Multi Node Search Installation Showing Columns and Rows
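The row and column arrangement can be illustrated with a minimal Python sketch. It is not FAST ESP code: the `SearchCluster` class, the CRC-based assignment of documents to columns, and the round-robin row selection are all assumptions made for illustration. The guide only states that columns scale document volume and rows scale query rate.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class SearchCluster:
    """Hypothetical model of a search cluster laid out as rows x columns."""
    rows: int
    columns: int
    # column index -> document ids; every row of a column holds the same set
    partitions: dict = field(default_factory=dict)
    _next_row: int = 0

    def column_for(self, doc_id: str) -> int:
        # Assumption: documents are spread over columns by a stable hash.
        return zlib.crc32(doc_id.encode("utf-8")) % self.columns

    def index(self, doc_id: str) -> None:
        # Adding columns lets the cluster hold more documents.
        self.partitions.setdefault(self.column_for(doc_id), set()).add(doc_id)

    def route_query(self) -> list:
        # A query goes to one row, i.e. one node in every column.
        # Assumption: rows are picked round-robin to spread query load.
        row = self._next_row
        self._next_row = (self._next_row + 1) % self.rows
        return [(row, col) for col in range(self.columns)]
```

With two rows and three columns, successive queries alternate between the rows, and each query touches all three columns so that the whole document set is searched.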

Chapter 4 Integrating FAST ESP with your Content and Query Infrastructure

This chapter introduces you to the basics of integrating FAST ESP into your content and query infrastructure.

Topics:
- Retrieving and Processing Content
- Integrating FAST ESP on the Content Side
- Integrating FAST ESP on the Query Side
- Administration and Installation Integration

Retrieving and Processing Content

Content can be retrieved from data sources in two ways: content pull and content push.

The content pull approach leverages content connectors to retrieve the information via standard APIs or interfaces provided by the source content repositories. This is the core technology of most search solutions, and includes retrieval of Internet-based information (Enterprise Crawler), databases and other enterprise applications (FAST Smart Connectors) or file server-based documents (File Traverser). The content connectors do not require integration programming towards the target data repositories.

The content push approach requires that the data repositories, applications or messaging middleware send the data directly to FAST ESP via the ESP Content API. This avoids the latency of crawling, but it requires a closer relationship between the content application and the search engine.

Integrating FAST ESP on the Content Side

The FAST ESP system accepts content submitted using one of its standard data source modules or pushed through the Content API.

Table 1: Content Access Options

Type of Content to be Submitted | Data Source Module to Use | Type of Module
Content stored on Internet, Intranet or Extranet Web servers | FAST Enterprise Crawler | Standard FAST ESP data source
Content stored on file servers, including XML data exported from databases | File Traverser | Standard FAST ESP data source
Other content | Pushing content through the FAST Content API | Customer application using the FAST Content API
Content stored in databases, or specific applications | FAST Content Connectors. Content Connectors may also be created using the Content Connector Toolkit available in the FAST ESP SDK. | Optional data source module

The content push approach implies that a custom application or third-party messaging middleware sends data directly to FAST ESP through the Content API.

Using the Crawler

You can access content on Web sites using the FAST Enterprise Crawler. The crawler scans specified web sites and follows hyperlinks, extracts the desired information and detects duplicates. The document processing converts the HTML into structured data as defined by the Web representation. This means separating heading and body, as well as extracting relevant meta-information from HTML pages.

The Enterprise Crawler usually begins from a single URL or list of URLs and follows every link from this set according to the configuration of the collection. FAST ESP enables specific parameters to be set on the crawler, such as crawling frequency, excluded documents, paths, and domains.

Intelligent loop detection keeps the crawler from repeatedly traversing the same page. Loop detection is insensitive to minor changes in URLs and time. During the crawling process, duplicate files are identified and excluded from the index.

Intranet, Extranet and Internet content from Web servers can easily be submitted using the FAST Crawler. It scans specified web sites by following links for appropriate content and extracting the relevant information.
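The combination of link following, loop detection, and duplicate detection can be sketched in Python as follows. This is an illustration only, not how the Enterprise Crawler is implemented: the `normalize` rules, the use of a SHA-256 content hash to spot duplicates, and the `fetch` callback (a hypothetical function returning a page body and its outgoing links) are all assumptions.

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Reduce minor URL variations so the same page is not visited twice."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Lower-case scheme and host, and drop the fragment.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))


def crawl(start_urls, fetch):
    """Breadth-first crawl; fetch(url) -> (body, links) is supplied by the caller."""
    seen_urls, seen_hashes, to_index = set(), set(), []
    queue = deque(normalize(u) for u in start_urls)
    while queue:
        url = queue.popleft()
        if url in seen_urls:          # loop detection
            continue
        seen_urls.add(url)
        body, links = fetch(url)
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:  # duplicate detection by content
            seen_hashes.add(digest)
            to_index.append(url)
        queue.extend(normalize(link) for link in links)
    return to_index
```

Passing the page source in through `fetch` keeps the sketch independent of any network library; a real crawler would also honor request rates, exclusion rules, and refresh intervals per collection.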

The FAST Crawler:
- Allows crawling based on an unlimited number of start URLs.
- Scales in a cost-efficient manner with total content size, number of documents, and number of different sites being crawled.
- Allows you to specify sub collections within collections with separate request rates and refreshes. This enables you to crawl individual subdomains of sites differently.
- Enables incremental crawling. The FAST Crawler can be configured to focus on retrieving new content only, or detecting modified or deleted items in previously retrieved content.
- Allows you to specify the types of files to be crawled by adding the MIME type through the FAST ESP Administrator Interface, telling the FAST Crawler to recognize and bring back the desired file types only.
- Detects whether content on a Web server has been deleted. When a previously detected document has not been seen for a given period of time, the FAST Crawler regards it as deleted. This document is deleted from the collection(s) it belongs to.
- Enables specific crawling parameters per collection, such as crawling frequency, excluded documents, paths, and domains.
- Retrieves both static and dynamically generated web content.
- Allows you to manually activate crawling of specific URIs, sites, or collections.

The crawling process consists of two steps: content retrieval and post processing. During content retrieval, Web content is retrieved. During post processing, the retrieved content is analyzed to determine new or modified content and the parts of the content on the crawled Web server that have been deleted. In addition, during this step, the FAST Crawler detects duplicates within a collection.

The FAST Crawler interfaces directly with the Content API to submit the content.

Note: To retrieve content from locations other than Web servers, you can use the File Traverser for regular file servers.
Or you can purchase one of the FAST Connectors, allowing you to retrieve content from specific applications like Microsoft Exchange or Documentum. For purchase information, contact FAST Support.

For details on the features of the FAST Crawler and how it is configured, refer to the Crawler Guide.

Using the File Traverser

You can retrieve files from a file server using the File Traverser. The File Traverser scans specified file directories on file servers, retrieves content of various formats, and submits it to a collection in your FAST ESP installation.

The File Traverser:
- Works on any reachable file server.
- Allows you to locate individual types of files by specifying individual file extensions, for example html, htm, pdf, and doc.
- Sends the located files to the Content API in batches. The size of the batches is configurable by two parameters: total file size and number of files.
- Allows you to locate files incrementally by reporting only those files that have changed since the last run (mods_only mode). Typically, file servers contain a lot of static content: there are many documents that do not change frequently. If the File Traverser is run in mods_only mode, it will only submit content that has changed since the last run. This saves your FAST ESP installation from processing documents that it has processed before, and helps to increase system performance while ensuring index freshness.
- Allows you to determine the files that have been deleted between two runs of the File Traverser (dels_only mode) and to delete them from their collection(s).
- Can be run without actually performing any operations (report mode). This allows you to verify your File Traverser configuration.
- Traverses and submits any XML files, including FastXML.

- Can run independently from FAST ESP on a separate node.

Retrieving Macromedia Flash Files

The Enterprise Crawler includes functionality to retrieve Flash files, and they are indexed as separate files within the searchable index. FAST ESP includes the ability to follow hyperlinks and index textual content from Macromedia Flash files. The following document processing pipelines are used: Generic, SiteSearch and NewsSearch. For more information, refer to the Configuration Guide, and to the Enterprise Crawler Guide.

Internal Process and Data Flow

This topic explains how files are processed with the File Traverser.

The File Traverser is a command line tool. It works on any reachable file server, recursively locating any files associated with the top directory specified in the command line. It processes files that match specified file extensions like .html, .htm, .pdf or .doc. Furthermore, you can configure the File Traverser to map file names to URLs based on a given URI prefix.

There is also a GUI-based configuration option in FAST ESP. You can configure the File Traverser via the Data Sources Admin GUI tab. This is to be activated through the Connector Controller. Refer to GUI-based operation via Connector Controller for more information.

Interfacing with File Traverser

The File Traverser interfaces directly with the Content API to submit content.

Monitoring and Logging with File Traverser

The File Traverser logs to the FAST Log Server. You can monitor its log messages in the FAST ESP Administrator Interface (also known as the Admin GUI). In addition, the File Traverser logs output to the shell it is started in.

To retrieve content from locations other than file servers, you can use the FAST Crawler for Web servers, which is included in your FAST ESP distribution.
Or you can use one of the FAST Connectors, allowing you to retrieve content from applications like Microsoft Exchange or Documentum (see section Using a FAST Content Connector). Contact your FAST Account Manager or FAST Technical Support for purchase information.

For details about the File Traverser features, refer to the File Traverser Guide.

Optional Data Source Modules

In addition to the FAST Crawler and the File Traverser, the optional FAST Connectors provide support for extracting and submitting content from databases such as DB2 and individual content management systems. For purchase information, contact your FAST Account Manager or FAST Technical Support.

GUI-based Operation via Connector Controller

The Connector Controller acts as a proxy between the File Traverser and the Administration Interface (Admin GUI), and as a proxy between some connectors and the Admin GUI. It allows for configuration and operation of the File Traverser and connectors. With the proper configuration, the selected connector or File Traverser appears in the Admin GUI as a Data Source which, if selected, will enable you to work with the File Traverser or connector settings through the user interface. See the Configuration Guide, Integrating the File Traverser Connector Controller.

The process is similar for connectors as for the File Traverser, but there are some variations. Refer to the Connectors Guide for the specific connector you are using for information on how to install and configure the connector controller.

Note: Not all connectors support the Connector Controller in FAST ESP.
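The incremental behavior of the File Traverser described in this chapter (reporting only changed files, reporting deleted files, and sending files in batches limited by file count and total size) can be sketched as follows. The function names and the use of a path-to-modification-time snapshot are assumptions for illustration; the guide does not describe how the File Traverser actually detects changes.

```python
def diff_runs(previous, current):
    """Compare two snapshots mapping file path -> modification time.

    Returns (modified_or_new, deleted), in the spirit of the
    mods_only and dels_only modes.
    """
    modified = sorted(p for p, mtime in current.items()
                      if previous.get(p) != mtime)
    deleted = sorted(p for p in previous if p not in current)
    return modified, deleted


def make_batches(files, max_files, max_bytes):
    """Group (path, size) pairs into batches bounded by count and total size."""
    batch, batch_bytes = [], 0
    for path, size in files:
        # Close the current batch when adding this file would exceed a limit.
        if batch and (len(batch) >= max_files or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(path)
        batch_bytes += size
    if batch:
        yield batch
```

A file larger than `max_bytes` still gets a batch of its own in this sketch, so no file is silently dropped.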

Pushing Content to the Content API

If the content you want to submit is not retrievable from a Web server, a file server, or one of the applications covered by the optional FAST Connectors, you may use the Content API directly to push your content to FAST ESP.

The FAST Content API:
- Allows submission of content and attached meta data.
- Packages the raw data and submits it to the Document Processing Engine.
- Allows for passing the content entity as such or passing a URL pointing to the content.
- Allows the standard data sources and the custom application to add, remove, and update content within the FAST Search Engines.
- Is provided for Java, .NET and C++.

The FAST Content API allows the standard data sources of FAST ESP as well as custom applications to push content to the FAST Document Processing Engine. This implies improved freshness, as content may be submitted when published, and allows integration with applications not supported by the standard FAST ESP data source modules or one of the FAST Connectors.

You can use the Content API to submit all types of content formats compliant with FAST ESP. When content is pushed to FAST ESP through the Content API, the structure of the retrieved content may be preserved and mapped to Document Elements. XML content that is already coded according to the FastXML structure is processed and mapped directly into the index. Other XML dialects are converted during document processing (see the chapter Processing Documents) using a built-in XML Mapper stage.

The Content API uses HTTP as the underlying transport mechanism between the API client and FAST ESP.

For details about the FAST Content API, refer to the Content Integration Guide.

Allowed Content Formats

FAST ESP allows you to submit content in the following formats:
- One of the multiple document formats that the FAST Document Processing Engine is able to handle.
  For details, refer to Appendix A Supported Document Formats.
- Directly from an application using the Content API. The API enables you to submit structured data that can be mapped to the FAST ESP Document Model.
- A format complying with the FastXML DTD.
- Any XML format. In this case the mapping from XML to the FAST ESP Document Model can be performed using scope search or an XPath-based conversion stage.

Using a FAST Content Connector

In addition to crawling internet sites, traversing file servers, and using the Content API, FAST ESP allows you to submit content from other specific applications using the respective FAST Content Connectors.

A content connector is a program that extracts content from some source system, maps the content from the source document model to the document model of FAST ESP, and feeds the documents to FAST ESP for indexing. There are Connectors for databases, content management systems, portal servers, and applications.

FAST Content Connectors are optional modules. For purchase details, contact FAST Technical Support. Refer to the individual connector guides for information related to a particular connector.

The FAST ESP SDK also provides a Content Connector Toolkit which helps you create your own connector application. Refer to the Content Connector Toolkit Guide for information on how to use it.
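The extract, map, and feed steps of a connector can be sketched for a database source, where each table row becomes one document with a constructed identifier. This sketch does not use the Content API or the Content Connector Toolkit: the function name, the `db://` identifier scheme, and the document layout are hypothetical, and SQLite merely stands in for any source database.

```python
import sqlite3


def rows_to_documents(conn, table, key_column, id_prefix):
    """Turn each row of a table into a document with a unique identifier.

    `table` and `key_column` are assumed to come from trusted configuration,
    because they are interpolated into the SQL statement.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    names = [column[0] for column in cursor.description]
    for row in cursor:
        record = dict(zip(names, row))
        # Constructed URI that locates the record in the source system.
        doc_id = f"{id_prefix}/{table}/{record[key_column]}"
        elements = {k: v for k, v in record.items() if k != key_column}
        yield {"id": doc_id, "elements": elements}
```

A feeding step would then hand each yielded document to whatever submission mechanism is in use; that step is deliberately left out because its real interface is documented in the Content Integration Guide, not here.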

Integrating FAST ESP on the Query Side

FAST ESP provides the following application programming interfaces (APIs) for creating search interfaces and integrating FAST ESP on the query side:
- Search API, available in Java, C++, and .NET
- HTTP-based Query Interface
- FAST Web Service interface

The FAST Search API handles the search query and result traffic between the Search Front End and the FAST QR Server.

The Search API:
- takes search queries sent by the end-user and passes them to the QR Server.
- takes results coming from the QR Server and provides these as query result objects to the API application.
- provides abstraction layer interfaces for handling query result features such as Result Clustering and Dynamic Drill-down.

For detailed deployment information, refer to the Query Integration Guide and the Query Language and Query Parameters Guide.

The Search API uses HTTP as the underlying transport mechanism between the API client and FAST ESP.

For details about how to use the APIs, refer to the Query Integration Guide.

Custom Integration

FAST ESP technology uses a modular approach with well-defined APIs for customer integration. A variety of content types can be retrieved using APIs, specialized connectors, and other tools. Here are some examples.

Content Interface

The Content API supports integration of applications via C++, Java, and .NET. A Java-based Content Connector Toolkit provides a set of integration tools that simplifies the development of connectors. Refer to the Content Connector Toolkit Guide for more information.

Search Interface

FAST ESP is typically integrated into an existing Web site through the Query API. You may also use a SOAP/WSDL-based Web Services interface for query integration. Refer to the Query Integration Guide for more information.

Document Processing Interface

FAST ESP provides an interface for inclusion of customer-defined document processors, e.g. for advanced text analysis.
Query/Result Processing Interface

FAST ESP provides an interface for dynamic linking of custom query and result processors, for example for custom query analysis/re-write and result parsing. Refer to the Query Integration Guide for more information.

Administration Interface

FAST ESP supports API integration for system administration and collection configuration. You can use either a Java-based API or a command-line tool. Refer to the ESP Application Integration Guide for more information.

Security Integration

The Security Access Module provides document-level security capabilities for integration with your content and portal infrastructure. Refer to the separate Security Access Module (SAM) documentation for more information.

SDKs

The ESP Content SDK, Search SDK, and Application SDK provide various interfacing capabilities.
- ESP Content SDK provides integration capabilities for interfacing your content applications with FAST ESP.
- ESP Search SDK provides programmatic API and Web Services integration capabilities for your search application.
- ESP Application SDK supports a Java/Web-based SDK for interfacing to a set of core services in the FAST ESP platform. Examples are reporting and rank tuning.

Refer to the SDK documentation set for details on how these work.

Web Services Interface

Web services are a collection of standards and protocols that allow computers to communicate across the internet using XML and the ubiquitous HTTP protocol. Web services interfaces are particularly popular because they eliminate typical barriers to technical integration, such as differences in hardware platform, operating system, and software language. For more information on web services, refer to Using the FAST Web Services Query Interface in the Query Integration Guide.

Administration and Installation Integration

Administration and Installation Integration can be performed in ESP using the View Admin Tool, the FAST ESP Installer, and Application SDKs.

View Admin Tool

The View Admin Tool can be used if the client administration system is not able to utilize the Java API. The tool can be used to perform FAST ESP administrative tasks, including collection management, from a UNIX or Windows command line. The tool can be executed on any FAST ESP node. For more information on the View Administration Tool, refer to the Operations Guide.
Installation Integration

You can configure and invoke the FAST ESP Installer from another application (OEM installation).

Application SDK Integration

With this SDK it is possible to interface to a set of core services in the FAST ESP platform, including reporting and rank tuning.

Chapter 5 Processing Documents

This chapter introduces you to the basic concepts of processing documents.

Topics:
- Document Processing Overview
- Document Processing Engine, Stages, Pipelines
- Custom Document Processing
- Entity Extraction

Document Processing Overview

After content has been retrieved, submitted via the FAST Content API, and converted to documents, these documents are processed within the FAST Document Processing Engine for format conversion and relevancy enhancement.

As explained in Basic Concepts, Documents and Document Elements, a document consists of a set of named elements, which contain values such as text strings or integers. Within the Document Processing Engine, these element values are read, analyzed and modified when required. New values can be added to empty elements.

How document processing is performed is defined per collection.

Document Processing Engine, Stages, Pipelines

The Document Processing Engine provides linguistic processing of documents through customizable document processing pipelines. These consist of multiple document processing stages.

The Document Processing Engine also:
- allows customers to modify document processing pipelines.
- allows customers to write specific document processors with a minimum of constraints and plug them into arbitrary points in any pipeline.
- provides support for entity extraction.

The document processing stages in a pipeline read element values of the document to be processed, compute analyses on them, and modify or add elements to the document.

The Document Processing Engine consists of multiple document processing pipelines. Any incoming document is sent through a specified document processing pipeline.

A document processing stage performs a particular document processing task and can modify, remove, or add elements to a document. It takes one or more document elements as input, and its output is new or modified elements that may be further processed. With each document processing stage focusing on one particular area of document processing, document processing stages can be reused in a multitude of settings.
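The relationship between stages and pipelines can be illustrated with a minimal Python sketch. It is not the FAST document processor interface (that is covered in the Document Processor Integration Guide): the stage functions, the element names `body`, `tokens`, and `summary`, and the `PIPELINES` table are hypothetical.

```python
def tokenize(doc):
    """Stage: read the 'body' element and add a 'tokens' element."""
    doc["tokens"] = doc.get("body", "").split()
    return doc


def summarize(doc, max_tokens=5):
    """Stage: build a short 'summary' element from the first tokens."""
    doc["summary"] = " ".join(doc.get("tokens", [])[:max_tokens])
    return doc


def run_pipeline(stages, doc):
    """Send a document through the stages in their configured order."""
    doc = dict(doc)  # work on a copy so the caller's document is untouched
    for stage in stages:
        doc = stage(doc)
    return doc


# Hypothetical pipeline table: pipeline name -> ordered list of stages.
PIPELINES = {
    "news": [tokenize, summarize],
    "tokens-only": [tokenize],
}
```

Because each stage only reads and writes named elements, the same `tokenize` stage is reused in both pipelines, which mirrors the reuse argument made above. The order matters: `summarize` depends on the element that `tokenize` adds.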
When you configure one of the data sources provided with FAST ESP, you specify the collection(s) to which the data source submits documents. Then you assign the collection to a unique document processing pipeline that defines how the collection's documents are processed prior to indexing.

Document processing pipelines are configurable through the FAST ESP Administrator Interface. For details about configuring document processing pipelines, refer to the Configuration Guide. You can define new document processing pipelines from the interface, as well as specify the document processing stages to be involved and the sequence of execution within each pipeline.

A typical document processing pipeline for web-retrieved information consists of the following stages:
- format detection to detect the MIME type of the document and determine if a format conversion is required.
- format conversion to convert the document's format from one of a whole range of external formats to the internal FAST ESP document structure.
- HTML parsing to extract structure from HTML documents such as title or body.
- language and encoding detection to enable language dependent processing and narrowing the scope of a search.
- unification of character encoding to UTF-8 Unicode representation.

- tokenization.
- special tokenization for Asian languages.
- extraction of document summary.
- lemmatization.

The Document Processing Engine also includes a Content Distributor which is responsible for dispatching incoming documents to the right document processing pipelines by controlling processor servers. The Content Distributor sends the current document to the processor server along with a pipeline request, and the processor server executes the stages in the requested pipeline on the document.

The Document Processing Engine interfaces with data sources or the Content API for input and with the Search Engine for output. The Document Processing Engine sends its log messages to the Log Server. The Document Processing Engine can be monitored through the FAST ESP Administrator Interface (Admin GUI).

The FAST Document Processing Engine supports a large variety of document formats.

Custom Document Processing

If you want to apply custom document processing to a set of documents without using and customizing one of the document processors provided with FAST ESP, you can use the ExternalDataFilterTimeout document processor as an interface to which you can output documents and from which you can read them back in. It is also possible to develop custom document processing stages using the FAST SDK. Refer to the Document Processor Integration Guide for details.

Entity Extraction

Document processing also includes entity extraction. Entity extraction means detecting, extracting, and normalizing entities, such as names of people or companies, from documents. This makes unstructured data more structured, and makes navigation or relevancy enhancements possible on specific entities.

Both pre-defined and customized entities shipped with FAST ESP can be detected and extracted. Extraction of pre-defined entities is supported out-of-the-box for English, German, French, Spanish, Portuguese, Japanese, Italian and Dutch.
Examples of pre-defined entities are: person company location job title newspaper university sentence date paragraph price measure upper acronym

airline car file name ISBN phone zip code ticker time quotation

Entity extraction is, by default, part of the NewsSearch processing pipeline for extracting entities on document level and the Semantic pipeline for extracting entities on scope level. Entity extraction can, however, be used in custom document processing pipelines as well.

Extraction of other entities is possible by using the Admin GUI to specify additional extractors via a regular expression document processor, which supports entity extraction based on regular expressions. The default configuration of this document processor supports extraction of addresses and US locations. Additional regular expressions can be defined, for example for extracting product names or customer specific information.

Refer to Creating Entity Extractors in the Document Processor Integration Guide for more information. For support on extending the entity extraction feature, contact FAST Technical Support.
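Regular-expression-based extraction of customer specific entities can be sketched in Python as follows. This illustrates the general technique only; it is not the configuration format of the FAST regular expression document processor. The entity names and both patterns (a made-up product code format and a simple US ZIP code pattern) are assumptions.

```python
import re

# Hypothetical extractor table: entity name -> compiled pattern.
EXTRACTORS = {
    # Made-up product code format: two capital letters, hyphen, four digits.
    "product": re.compile(r"\b[A-Z]{2}-\d{4}\b"),
    # US ZIP code, optionally with a four-digit extension.
    "zipcode": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}


def extract_entities(text, extractors=EXTRACTORS):
    """Return entity name -> list of distinct matches in order of appearance."""
    found = {}
    for name, pattern in extractors.items():
        matches = []
        for match in pattern.finditer(text):
            value = match.group(0)
            if value not in matches:  # keep each entity once per document
                matches.append(value)
        found[name] = matches
    return found
```

The extracted values could then be stored as additional document elements and mapped to fields used for navigation, which is the purpose the section above describes.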

Chapter 6 Making Documents Searchable

This chapter introduces you to the basic concepts of indexing documents to make them searchable.

Topics:
- Indexing Documents
- Defining How Documents are Searchable

Indexing Documents

This topic explains how the FAST Search Engine, Search Clusters, and Search Columns and Rows affect document indexing.

The FAST Search Engine

The FAST Search Engine receives processed documents from the Document Processing Engine and makes them available for searching. The Search Engine consists of three sub-modules:
- the RTS Indexer: It indexes all documents arriving from the Document Processing Engine and stores the index.
- the RTS Searcher: It runs queries submitted by the end-user or external query application against the index stored by the RTS Indexer.
- the RTS Dispatcher: It distributes queries to different Search Columns, selects Search Rows based on load balancing and merges search results from different Search Columns and Search Partitions within the Columns.

On the content side, the Search Engine interacts directly with the document processing pipelines. On the query side, the Search Engine interfaces with the Query & Result Server.

Both RTS Indexer and RTS Searcher may run on one or more machines. They may be spread across columns and rows to balance load and network traffic. For details about how to arrange RTS Indexer and RTS Searcher instances, refer to the Deployment Planning Guide.

Search Engine Clusters

Search Engine instances are grouped into search engine clusters. A search engine cluster is a group of Search Engine instances that share the same index schema, which is provided by an index profile.

A search engine cluster has a number of collections (logical groups of content) assigned to it. One collection resides inside one search engine cluster, but may be spread across multiple search columns. Since all Search Engine instances in one cluster share the same index profile, all collections assigned to this cluster are indexed in the same way.
There is a one-to-one relationship between an index profile and a search engine cluster: each search cluster in your system needs one index profile. That means, if you want all content fed to your FAST ESP system to be handled according to one index profile, only one search cluster is required. If the content fed to your FAST ESP system consists of different types of content, where each content type requires a separate index profile, several search clusters are needed.

As a rule of thumb, select a single cluster configuration whenever possible, especially if you want to be able to integrate results from Web and other sources for the same query within a common result list sorted by relevance. Each cluster requires its own instance of the QR Server.

Defining multiple clusters normally requires some support from FAST Solution Services. Consult FAST Technical Support for details. For details on how to deploy the FAST Search Engine, refer to the Deployment Planning Guide.

Search Columns and Search Rows

Within one search cluster, multiple Search Engine instances can be arranged in search columns and search rows to distribute query traffic and document load.

Sets of indexed documents are stored in all Search Engine instances within a search column to scale data volume. That means that member rows of a search column share the same set of indexed documents.

Queries are shared among all Search Engine instances within a search row to scale query rate. This means that when a query is sent to the search engine cluster, it is sent to all members of one search row (one node within each column) within this cluster to be matched against all sets of indexed documents.

Defining How Documents are Searchable

In the process of creating the search index, FAST ESP uses an index profile. An index profile is an XML-based configuration file. It is an index schema that defines the way documents are searchable. It specifies search properties like:
- which document elements are to become searchable fields
- which document elements are to become fields that are returned as part of a result
- how to calculate values that are used for sorting and ranking

The purpose of an index profile is, to some extent, similar to that of a database schema. Each document arriving at the FAST Search Engine is parsed and indexed based on the document's elements. These elements are mapped to the fields given in the index profile. Once the document resides in the index, you can search directly on these fields.

You can set up and use several index profiles to address different types of content, for example Web pages and product database entries. Setting up several index profiles is done by defining multiple search clusters.

When you install FAST ESP, you can choose between standard index profiles or load a custom index profile. All default index profile files are located in $FASTSEARCH/index-profiles/ (UNIX) or %FASTSEARCH%\index-profiles\ (Windows), with the $FASTSEARCH and %FASTSEARCH% environment variables set to the directory where FAST ESP is installed.

Document Processing, Index Profile, and Search Engine Cluster

The index profile concept is closely tied to the concepts of document processing and search engine clusters.
During document processing, each document is represented by a set of elements that can be processed and later mapped to searchable fields related to the index profile. Both elements and fields represent content parts and attributes related to the document, for example, body, title, heading, URI, author, and category.

Figure 4: The relationship between Document Processing, Indexing and Search Engine Clusters
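The mapping from document elements to typed index fields can be sketched in Python. A real index profile is an XML configuration file whose structure is described in the Configuration Guide; the dictionary used here is only a stand-in, and the field names and the `map_to_fields` function are hypothetical. The type names string, int32, and float are taken from the field types this guide lists.

```python
# Stand-in for an index profile: field name -> field type.
FIELDS = {
    "title": "string",
    "body": "string",
    "price": "float",
    "pages": "int32",
}

# How each field type converts an element value.
CASTS = {"string": str, "int32": int, "float": float}


def map_to_fields(elements, fields=FIELDS):
    """Keep only elements that the profile defines, converted to the field type."""
    record = {}
    for name, field_type in fields.items():
        if name in elements:
            record[name] = CASTS[field_type](elements[name])
    return record
```

Elements that the profile does not name (an internal note, for instance) are simply not indexed, while numeric fields end up as numbers so that comparisons such as less than or greater than can be applied at query time.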

The index profile defines the layout of the searchable index, and specifies how fields are to be treated by query and result processing. Each search engine cluster has an associated index profile. The index profile also includes one or more result views. A result view defines alternative ways for a query front end to view the index with relation to queries.

Index Profile Structure

The structure of the index profile is the composite of fields and attributes. The index profile can be configured to allow different features in ESP.

Fields

The basic entity of an index profile is a field with its attributes. A field is searchable by default, and is also the basic entity in the result presentation. Typical field attributes are name (the field's name), type (the type of content the field holds), and index (whether the field should be searchable).

Scope Fields

The FAST ESP Indexer is based on a field structure that defines the schema of the indexed content. The schema is defined using the Index Profile. The Scope Search feature is facilitated by introducing a new field type in the FAST ESP index, named scope field. A scope-enabled index may include the following types of fields:
- Basic field. A basic field may be of type string (any textual content), int32 (32-bit signed integer), uint32, float, double or datetime (representing a date/time value as a numeric value in the index).
- Composite field. A composite field includes a set of basic string fields that can be matched using the built-in dynamic ranking mechanisms in FAST ESP.
- Scope field. A scope field contains hierarchical scope content. The individual subscopes of a scope field may be of any data type supported by FAST ESP (string, int32, float, double or datetime). For textual scopes, a subset of the dynamic ranking mechanisms as provided for composite fields will apply.
When defining a scope field, there is no need to define the actual scope structure within the scope field in advance.

A FAST ESP index profile may contain a combination of one or more fields, composite fields and scope fields. Hence, it is possible to combine schema-based content in fields with scoped dynamic content in one index. In the query language you may specify individual fields, composite fields or scopes to limit the scope of a query. For scope queries, the scope specification in the query must include the scope field name (also called the root scope) and sub-scopes within the indexed scope structure. A scope field may include a hierarchy of scopes of arbitrary depth.

Scope indexing is generic in the sense that it does not require any specific content input format. FAST ESP supports XML as an input format; other input formats may be supported by creating custom document processors.

Composite Fields

Composite fields allow you to group individual source fields by referencing the source fields through field-ref tags or field-ref-group tags. You can use this feature to apply a common rank score to a group of fields or to make them searchable as a unit.

Features Enabled by the Index Profile

The Index profile is configurable to enable different features in ESP. The following features are enabled by being specified in the Index Profile:

- ranking
- sorting
- tokenization
- lemmatization

- teaser generation
- navigation (dynamic drill-down)

For procedural details about how to configure an index profile, refer to the Configuration Guide.

Including Metadata

ESP offers different ways to include metadata about content in the indexing process. Metadata can be included in the following ways:

- You may push metadata along with the content using the Content API. The metadata is treated as any other element of the content entity and transformed into a document element after the content has been submitted. You then need to design the index profile accordingly to catch any document elements you want to include in searching or result presentation, regardless of whether they originate from content metadata or not.
- If your content consists of HTML files, the built-in HTML parser extracts all HTML <meta> tags as document elements whose names are prefixed with meta_. For example, with the HTML fragment <meta name= DC.Identifier content= >, the internal document representation of this HTML file includes an attribute called meta_dc.identifier whose value is taken from the content attribute. Meta_xxx attributes are text chunks instead of plain strings. Refer to the Document Processor Integration Guide for information on what text chunks are and how to handle them. Meta names are lower cased, that is, <meta name="KEYwordS" content="abc, 123, abc, 123"> yields meta_keywords = abc, 123, abc, 123.
- For other content that complies with one of the formats the FAST Document Processing Engine is able to handle (see Processing Documents), the Document Processing Engine is able to detect metadata. For content in MS Word format, for example, it extracts metadata using the MS Word Properties field.

Executing Search Queries and Returning Results

The FAST Search Engine receives queries from the FAST Query & Result Server, which may have pre-processed the query, for instance to perform spell checking.
When a query is received, the FAST Search Engine matches it against the search index to identify a list of documents that match the query. The ordering of the list is based either on field-based sorting or on ranking (see Concepts of Relevancy, section Sorting Results). From this ordered list, the FAST Search Engine returns a set of fields from the top documents to the FAST Query & Result Server. The number of documents to return can be specified as part of the query. Which fields to return for each document is defined in the Index Profile. Finally, the FAST Query & Result Server may perform post-processing on the result before it is returned to the end-user.

Partial Document Updates

In some situations, updates to a document can be frequent, but each update consists of a single changed value. Examples can be temperature measurements or bids in an auction. In this case, performing a regular update operation on the document in question will result in the new value being unavailable for several seconds, as it has to wait for the smallest index to be re-indexed. This may be an unacceptable delay, in which case a partial update scheme can be used to improve the freshness of small update operations. The update may be performed without re-indexing the entire document.

Partial updates can be performed via the Content API. Fields to be updated can be, but do not have to be, updated as real-time properties. Real-time property fields can be configured in the Index Profile using the field attribute latency=low.

Query Highlighting in Dynamic Teasers

Query Highlighting in Dynamic Teasers extracts a range of the document centered on representative occurrences of the query terms.

The document body is stripped of markup during document processing and stored. A maximum of 64 KB is stored. For each document on the result page, this document extract is retrieved and text segments are generated that include the best matches of the query in that document. The query highlighting supports advanced query operators such as proximity (NEAR/ONEAR), and also supports linguistic processing such as lemmatization and spell check.

Query Highlighting in Source Documentation

FAST ESP supports query highlighting in source documentation. The Document Hit Highlighting feature enables you to create a search application where the end-user may browse through the query hits within the full context of a matching document.

When using this feature, FAST ESP keeps an HTML representation of the original document. If the original document is an HTML page, a copy of the HTML is stored in this field. If the original document is in a different format (for example MS Office or PDF), a dedicated document processing stage converts the document to a similar HTML representation, which is stored and used for the hit highlighting on the client side.

As part of the regular search results, the dynamic teaser contains HTML links that lead you to the HTML representation of the source document. The link from the dynamic teaser brings you to the most relevant query match within the document. From there you can browse through the query hits within the document by relevance.

Figure 5: Components and features in Document Hit Highlighting

Refer to the SDK Document Highlighting Guide for more information.
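The dynamic teaser behavior described under Query Highlighting in Dynamic Teasers can be pictured with a small sketch. This is not FAST ESP code: the function name make_teaser, the radius parameter and the <b> markers are assumptions made for illustration only. The sketch truncates the stored body to the 64 KB limit stated above, centers a window on the first query-term hit, and marks every hit inside that window.

```python
import re

# The guide states that at most 64 KB of the stripped document body is stored.
MAX_STORED_BYTES = 64 * 1024


def make_teaser(body, query_terms, radius=5):
    """Return a short extract centred on the first query-term hit.

    Hypothetical helper: name, signature and <b> markers are illustrative.
    """
    # Keep only the portion of the body that would have been stored.
    stored = body.encode("utf-8")[:MAX_STORED_BYTES].decode("utf-8", errors="ignore")
    words = stored.split()
    terms = {t.lower() for t in query_terms}

    def is_hit(word):
        # Compare on a lower-cased form without punctuation.
        return re.sub(r"\W+", "", word).lower() in terms

    hits = [i for i, w in enumerate(words) if is_hit(w)]
    if not hits:
        # No match in the stored extract: fall back to the start of the text.
        return " ".join(words[: 2 * radius + 1])

    centre = hits[0]
    start = max(0, centre - radius)
    end = min(len(words), centre + radius + 1)
    return " ".join(f"<b>{w}</b>" if is_hit(w) else w for w in words[start:end])


text = "The quick brown fox jumps over the lazy dog near the river bank"
print(make_teaser(text, ["fox"], radius=2))
```

A real teaser generator would pick the best-matching segments rather than the first hit, and would take lemmatization and proximity operators into account, as the guide describes.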

Chapter 7
Concepts of Relevancy

Topics:
- Components of Search Relevancy
- Contextual Insight
- Ranking Concept
- Freshness Boosting
- Analyzing linked web pages using the WebAnalyzer
- Tools to Modify Rank for Individual Documents
- Boosting Mechanisms
- Relevancy Modifications Based on Business Rules
- Proximity Ranking and Matching
- Sorting Overview
- Boundary Matching
- Duplicate Removal

This chapter introduces you to the basic concepts of relevancy features and tuning.

Components of Search Relevancy

Relevancy is the measure of how well a set of results answers or addresses the intent of a given query. FAST ESP supports search relevancy through the following key steps:

- Data mining: The document processing framework provides support for extensive data mining to perform real-time content relevancy refinement. This includes embedded relevancy tools and integration points for 3rd party modules. For details on document processing, refer to Processing Documents.
- Linguistic processing: Multiple linguistic processing features provide a number of approximate matching techniques to improve query recall. This includes automatic spell check, matching with inflectional variations of terms (lemmatization), thesaurus (synonym) matching and natural language support (anti-phrasing). The advanced linguistics features are described in further detail in Advanced Linguistic Processing, and in the Advanced Linguistics Guide.
- Sorting: Sorting results based on individual document elements allows for highly relevant result presentation. For details on sorting refer to section Sorting Results.
- Rank value calculations: The calculation of a rank value based on the FAST ESP ranking model provides a multi-faceted measurement of the quality of the match between the query and a candidate result document. This rank value consists of query dependent and query independent parameters.
- Query context analysis: Query context analysis refers to the ability to present the information from the query results in context of the query. FAST ESP supports dynamic document summaries that display the segments of the matching document that provide the most relevant match with the query.
- Navigation: Data driven navigation provides drill-down into the query result or related areas.
Drill-down queries may be based on document similarities, category, entity, or terminology information extracted from the documents, parametric drill-down into multiple dimensions of the query result (dynamic drill-down) and drill-down into content domains (for example, all documents from a given site). This feature is further described in Taxonomy and Navigation.

Contextual Insight

Contextual Insight is the next generation for search relevancy, drastically improving precision without sacrificing recall.

Using Contextual Insight you can create fact-finding applications enabling queries like "when was NN born" or "where is the next winter Olympic games". Conventional search engines will return links to documents that include the name NN or the terms winter Olympic games. With Contextual Insight you can also detect the intent of the query, search for the terms or phrases, and return the requested entities that appear in the context of the matching text, in this case dates or cities.

Contextual Insight is based on the following ESP features:

- Contextual processing: Detection and markup of text flow such as sentences, paragraphs and other semantic structures in unstructured content. This is mapped to a hierarchical scope structure which enables search within different contexts of the document. This enables you to limit your search to paragraphs or other semantic elements in the text.

- Context aware entity extraction: Automatically detects entities in text and annotates the detected semantic structures with normalized entities. Entities include such things as people's names, phone numbers, geographic locations, and in our example, company names.
- Scope Search: Enables efficient indexing and search in a contextual decomposition of the documents.
- Contextual navigation (also sometimes referred to as scope navigation): Previously, successful navigation has been limited to global metadata. Contextual Navigation unlocks the semantic meaning of contextual metadata in the form of extracted entities. It is able to extract textual entities from the results of the previous search. Unlike taxonomies and facets, entity extraction draws its navigators directly from the results and is contextually aware, so that you can return navigation entries that appear in the context of the matching sentences or paragraphs.
- Natural language: A natural language query processor is available to enable creation of natural language query rules, enabling semantic query transformation for, for example, where, when, what, and who types of queries.

Ranking Concept

FAST ESP ranking is based on a multi-faceted measurement of the quality of the match between the query and a candidate result document. The relevancy of a document with respect to a query is represented by a ranking value.

In the index profile, you specify one or more rank profiles. A rank profile specifies the relative weight of each rank component for a given query. This enables individual relevance tuning of different query applications using a FAST ESP installation.

The FAST ranking model is based on the individual tuning of the ranking parameters of freshness, authority, quality, proximity and context.

Freshness

Freshness denotes the age of a document compared to the point in time when the query is issued. Refer to section Freshness Boosting for more information.
Authority

Authority denotes the importance of a document as determined by the links from other documents to the document in question. To determine a document's authority, FAST ESP detects links from other documents and uses the anchor texts associated with these links to compute an authority rank component. Refer to section Anchor Text Analysis for more information.

Quality

Quality denotes the assigned importance of a document. Since quality metrics are assigned to individual documents or groups of documents directly, the quality of a document is query independent. FAST ESP provides a set of business manager tools that allow you to assign quality metrics to individual documents or groups of documents. It is also possible to apply quality metrics through metadata when submitting content via the Content API or Content Connectors. This is further described in the Content Integration Guide.

The source (element name) and weight of the quality component can be specified in the Rank Profile. The Quality ranking component may also be referred to as Static Rank.

Note: The term quality here refers to document quality. This should not be confused with the word quality when talking about Result Quality, which refers to quality criteria in the search result itself.

Proximity and Context

Proximity and context measurements determine how well the content of a document matches the query. This is based on the following aspects of a query match:

- The number of query terms matching a document within the result set (for an OR type query).
- Query term weighting. Different relevancy weights may be applied to different terms in a query.
- Proximity. When a query contains multiple terms that are not detected as known phrases, the ranking process takes the relative position of the terms and determines the most relevant results based on the proximity the matching terms in the document have to each other. Proximity denotes the distance between, and location of, query terms in the documents.
- Frequency of query terms occurring in a matching document, compared to the global frequency of the terms in the index. More occurrences in the matching document imply a higher ranking value. However, if the term has high frequency over the total index, this will reduce the ranking value.
- Context based Relevance Tuning. Different document fields, for example title, body, description, price, or type, may be assigned different relevance weights. This allows you to specify, for example, that a match in the title field of a document contributes more to the document's ranking value than a match in the body field.

The proximity and context parameters of the rank profile control these statistical metrics, except for the query term weighting, which is selected at query time.

Proximity and context metrics only apply to composite fields, and not to query terms with wildcards.

Freshness Boosting

The freshness rank boosting feature controls to what extent the relative age of the documents impacts the rank (relevance score). If enabled, newer documents will appear higher up in the result set. The date of a document is set when processing a document.
The date source may be the content source itself (for example, an application submitting documents via the API), date information from file servers or web servers, or the time of processing within FAST ESP. The Crawler, File Traverser and Content Connectors will set the time stamp automatically for the document when submitting to FAST ESP. The Content Integration Guide describes how you can apply a custom time/date source for documents submitted via the API.

When performing a query, the document date/time value is converted to a 'freshness' parameter that reflects the age of the document from the time of a query. The age is scaled to reflect the perceived importance of age (the difference between 1 and 5 days age may reflect the same relevance difference as between 1 and 12 months age).

The freshness boost feature is controlled using the Rank Profile feature of the Index Profile and by query parameters. This feature can be controlled on a per query basis by:

- Selecting rank profile for the query by a query parameter. Multiple rank profiles (defined in the Index Profile) may have different weight on the freshness boost within the total rank.
- Selecting the time base for calculating the freshness boost. The freshness boost is calculated based on the relative age of each document compared to the given time base. Default time base is the current time when performing the query.

Note: Boosting can also be managed through the Search Business Center. Refer to Managing Boosts & Blocks in the Search Business Center Guide.
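The age scaling and time base described above can be illustrated with a minimal sketch. The guide does not publish the actual scaling function, so the logarithmic curve used here is an assumption chosen only because it compresses differences between older documents. The function names freshness and boosted_rank and the freshness_weight parameter are likewise hypothetical and are not part of any FAST ESP API.

```python
import math
from datetime import datetime, timedelta


def freshness(doc_time, time_base=None):
    """Map document age to a value in (0, 1]; newer documents score higher.

    The logarithmic scaling is an assumed stand-in for the scaling used by
    the product, which the guide does not specify.
    """
    if time_base is None:
        # Default time base: the current time when performing the query.
        time_base = datetime.now()
    age_days = max((time_base - doc_time).total_seconds() / 86400.0, 0.0)
    return 1.0 / (1.0 + math.log1p(age_days))


def boosted_rank(base_rank, doc_time, freshness_weight, time_base=None):
    """Add a weighted freshness component to a rank value (assumed additive)."""
    return base_rank + freshness_weight * freshness(doc_time, time_base)


query_time = datetime(2008, 4, 3)
for days in (1, 5, 30, 365):
    doc_time = query_time - timedelta(days=days)
    print(days, round(freshness(doc_time, query_time), 3))
```

With this curve, the gap between a 1-day-old and a 5-day-old document is far larger than the gap between a 30-day-old and a 34-day-old one, which mirrors the perceived importance of age mentioned above. Passing a different time_base shifts the reference point, and a different freshness_weight plays the role of a rank profile that weights freshness more or less heavily.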

Analyzing linked web pages using the WebAnalyzer

The hyperlink structure of Web pages may provide valuable information about the importance of a web page. A web page to which a high number of other web pages refer is assumed to be more important than a web page to which only few or no other web pages refer. In particular, links from what are referred to as good pages (pages that are referenced by many other good pages) indicate that the linked page is important.

The WebAnalyzer is a FAST ESP module that uses links between documents to improve search relevancy.

The WebAnalyzer Guide describes the feature and provides installation, configuration, operations, and troubleshooting information, including procedural information about the tasks you can perform from the WebAnalyzer Overview tab.

Tools to Modify Rank for Individual Documents

There are tools that enable you to perform Absolute Query Boost, Relative Query Boost or Relative Document Boost for given documents in the index. An example could be a product database where it might be desirable to boost products with the highest profit margins, boost products related to campaigns, and so on.

The following tools exist for this purpose: Search Business Center and Boost Bulk tool.

Search Business Center

Search Business Center (SBC) is a GUI-based administrative tool that enables rank tuning on a document level. The boost value may also be negative, in order to keep pages from appearing at the top of a result list.

In Search Business Center you can boost or block documents to change their rankings for a specific query. Note that only query-side boosting and blocking can be handled using the Search Business Center.
Using the SBC you can change the ranking for each query using three different methods:

- Top Ten: positions the document in one of ten reserved places that will be returned at the top of the results list.
- Add boost points: adds a value to a document to increase its relevancy relative to the other documents returned in the search results. You can also add negative boost points to a document.
- Block from query: prevents the blocked document from appearing in the search results for the query.

Refer to the Search Business Center guide for details.

Boost Bulk Tool

This is a standard FAST ESP tool that enables you to perform the same rank tuning as the SBC, using an XML file as input. The XML file contains a specification of the rank modifications to be performed. This approach is preferred if you have the ability to extract the rank boost information from other data or other applications.

Refer to Boostbt Tool in the Operations Guide for information on how to use this feature.

Boosting Mechanisms

FAST ESP supports the following types of boosting mechanisms: Absolute Query Boosting, Relative Query Boosting, and Relative Document Boosting.

Absolute Query Boosting

Suppose you want a document to be consistently displayed at a given position in the result set, for example at position one, when a user searches with a specific query. Then you can specify a document-query combination and assign a fixed absolute ranking position that the specified document is to get within the result list whenever a user searches with the specified query.

Absolute Query Boosting also allows you to exclude individual documents from being displayed at all when a user searches with a specific query.

Relative Query Boosting

Suppose you want to ensure that a particular document is always displayed among the first 20 documents in the result list, provided a user searches with a specific query. For all other queries, the ranking position of the document shall not be impacted by any boost. In this case, you specify a document-query combination and assign an amount of ranking points by which the document's overall ranking value is to be increased whenever a user searches with the specified query.

Relative Document Boosting

Suppose you want to ensure that a particular document is always displayed within the first 20 documents in the result list, no matter which query a user has submitted. At the same time you do not want to assign a fixed result list position to the document. For this purpose you can specify that the overall ranking value of the particular document be increased by a certain amount of ranking points.

Relevancy Modifications Based on Business Rules

Analyzing the impact of business rules makes it possible to impact the relevancy model and direct search end-users to business-generating results.

Organizations are governed by business rules and workflows. Business rules should be adjusted in line with market trends, analytic regression patterns, etc. to meet the needs of the business and the market.
An example of a business rule could be that a credit check is not necessary for returning customers. FAST ESP lets you apply business rules at various stages of a search.

FAST ESP allows you to impact or override the automatic ranking of documents based on business related rules, such as to direct the end-users to business-generating pages. The following tools are available for this purpose:

- Search Business Center
- Boost Bulk Tool

Proximity Ranking and Matching

The term proximity denotes the degree to which a query and a document match, based on the distance between the query terms within a document. The calculated proximity value of a query contributes to the overall ranking value of a document within a result set. In general, a document in which the query terms are located close to each other is expected to be more relevant to the query than a document in which the query terms are located far from each other. A high proximity value boosts the overall ranking value of the respective document.

Proximity only has an effect on queries with multiple words, and is likely to have a higher impact on the result set the more terms the query includes. Like the overall ranking value, proximity does not change the total number

of resulting documents for a query, but will improve the ranking order of the result presentation for searches that return multiple results.

FAST ESP supports proximity as a selection and ranking criterion in two main ways:

- Explicit Proximity
- Implicit Proximity

Explicit Proximity

Explicit proximity denotes the fact that you can restrict a query by combining query terms with special proximity operators. These operators are:

- NEAR: This operator returns documents that contain the two terms combined by the NEAR operator with no more than n words separating them. Each term may be a single word or a phrase (enclosed in double quotes). The order of the query terms does not matter for the matching, only the distance.
- ONEAR: The ONEAR operator provides ordered near-functionality. This means that the query terms combined by the ONEAR operator must have the same order in the matching document section as in the query. This query expression returns documents that contain the first and second term with no more than n words separating them. Furthermore, the first term must appear before the second term in the matching section of the document.

Adding explicit proximity constraints to a query will improve the precision of the result by eliminating irrelevant results from the result set.

For details about applying explicit proximity, refer to Proximity relevance features in the Query Language and Parameters Guide.

Implicit Proximity

Implicit proximity denotes the fact that documents get a higher rank value the closer the query terms they contain are to each other. Implicit proximity will not change the total result set, but will improve the ranking precision of the result set, as documents that contain the query terms close to each other are ranked higher than documents that contain the query terms less close to each other. As such, implicit proximity provides much of the effect of explicit proximity constraints.
Implicit proximity is only one part of the entire ranking of a matching document. Matching text segments in a document are assessed along the following criteria (in decreasing order of significance):

- Completeness: The higher the number of query terms present in the same element of a matching document, the higher the document's ranking value gets. In addition, important query terms, that is, words that are not stopwords, add a higher boost to the ranking value of the document than stopwords.
- Distance: Query terms occurring very near to each other add more to a document's rank value than query terms that are less near to each other.
- Position: The earlier a query term occurs in a document, the higher the document's rank value gets.

A built-in feature provides a rank boost if the query terms occur close to each other within the first 255 words of a field or composite field. This feature is applied on index level, that is, based on indexed proximity information. This feature is not configurable.

In FAST ESP, implicit proximity boosting features are applied as part of the core matching within the FAST Search Engine. Proximity boosting features are applied to AND, OR, NEAR, and ONEAR query expressions. This means, for example, that the query expression a AND b gives higher relevancy to a document where a appears closer to b than in other documents.
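The distance rules for explicit and implicit proximity described in this section can be sketched over a tokenized text. The helper names near, onear and min_distance are hypothetical and this is not the FAST Search Engine implementation. The sketch handles single-word terms only and interprets "no more than n words separating them" as at most n tokens between the two matches.

```python
def positions(tokens, term):
    """Return the token positions at which term occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def near(tokens, a, b, n):
    """NEAR-style check: a and b occur with at most n words between them, in any order."""
    return any(
        abs(i - j) - 1 <= n
        for i in positions(tokens, a)
        for j in positions(tokens, b)
        if i != j
    )


def onear(tokens, a, b, n):
    """ONEAR-style check: as near(), but a must occur before b."""
    return any(
        0 < j - i <= n + 1
        for i in positions(tokens, a)
        for j in positions(tokens, b)
    )


def min_distance(tokens, a, b):
    """Smallest number of words separating a and b, or None if either is absent.

    A smaller value would earn a higher implicit proximity contribution.
    """
    gaps = [
        abs(i - j) - 1
        for i in positions(tokens, a)
        for j in positions(tokens, b)
        if i != j
    ]
    return min(gaps) if gaps else None


doc = "enterprise search platform for large scale search".split()
print(near(doc, "platform", "enterprise", 1))   # order does not matter
print(onear(doc, "platform", "enterprise", 1))  # order matters
print(min_distance(doc, "search", "scale"))
```

The explicit operators act as filters (a document either satisfies the constraint or is excluded), whereas a measure such as min_distance only feeds into ranking and leaves the result set unchanged, which is the distinction this section draws between explicit and implicit proximity.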

Note: Proximity boosting is not applied to query expressions using the ANY operator.

Sorting Overview

The term sorting denotes the ability to order search results according to a value in one or more index fields. Sorting search results depends only on the fields in a document. The position or frequency of the query terms in the matching documents does not influence sorting.

Which field to use for sorting is specified as part of the query. For details on how to enable this, refer to the Query Language and Parameters Guide.

FAST ESP supports sorting along the following data types:

- field values (numeric and full text)
- rank (relevancy and score)
- geographical distance (geo field)

Full Text Sorting

FAST ESP supports sorting on full text. This means that you may sort on a configurable number of characters, without any limitations on the text string. Full text sorting includes national text sorting rules. FAST ESP allows you to sort results in ascending or descending order.

Multi-Level Sorting

Sorting may be defined for either single or multiple fields. Specifying multiple fields allows for multi-level sorting. This way, you may for example sort a result set by product name, then by price, and then by date.

Multi-level sorting enables database-type sorting schemes with a list of fields to be used for sorting.

Combining Ranking and Sorting

Multi-level sorting allows you to combine ranking and pure sorting, as the rank field may be one of the sort levels. Multi-level sorting is supported for any field that has been defined for sorting in the index profile. Ascending and descending sort order is available for all fields as part of multi-level sorting. The sort order is specified at query time.

Sorting on Geographical Coordinates

The GEO Search feature enables sorting based on distance from the end-user location. Refer to Geo Search.

Field Collapsing

A feature related to sorting is field collapsing.
Field collapsing allows for folding of results with identical value for a given result field. You can use this feature in order to collapse results with given attributes. There are two kinds of field collapsing:

- field collapsing which removes collapsed documents
- field collapsing which does not remove collapsed documents (default)

Refer also to Hit and Navigator Count in the Query Language and Parameters Guide for information on field collapsing.

Field Collapsing without Document Removal

You may for example want to collapse all results with the same product name or code in the result set. The result will be re-sorted in such a way that the collapsed results are presented last.
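The re-sorting effect of field collapsing without document removal can be shown with a minimal sketch. The function name collapse_keep and the dictionary-based result records are assumptions for illustration and do not correspond to a FAST ESP API. The first result for each field value keeps its place in the original order, and later results with the same value are moved to the end.

```python
def collapse_keep(results, field):
    """Re-sort results so that collapsed documents are presented last.

    results is assumed to be ordered by rank or sort order already; the first
    document seen for each field value is kept in place, later ones are
    appended after all first occurrences. No document is removed.
    """
    seen = set()
    leaders, collapsed = [], []
    for doc in results:
        key = doc.get(field)
        if key in seen:
            collapsed.append(doc)
        else:
            seen.add(key)
            leaders.append(doc)
    return leaders + collapsed


hits = [
    {"id": 1, "product": "Laptop A"},
    {"id": 2, "product": "Laptop A"},
    {"id": 3, "product": "Phone B"},
    {"id": 4, "product": "Laptop A"},
]
print([doc["id"] for doc in collapse_keep(hits, "product")])
```

Because nothing is dropped, the total hit count is unchanged; only the presentation order differs.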

This type of field collapsing can be enabled/disabled at query-time. However, the result specification area of the index profile must be specified to support the feature.

Field Collapsing Including Document Removal, Query-Side Collapsing

This type of field collapsing is controlled at query time and does not require additional index profile support, which makes it possible to select collapse fields on a per query basis. Unlike the default option, this type of field collapsing makes it possible to remove documents from the result set. (Collapsed documents are always removed.)

The following options are possible with field collapsing including document removal:

- simple collapse, which removes collapsed documents from the result set.
- collapse on specified numeric fields, where fields for collapsing can be determined at query time, and thus do not need to be specified in the index profile.
- collapse and keep N number of collapsed documents, where it is possible to keep a specified number of documents for each collapsed group.

More information on field collapsing can be found in the Query Language and Parameters Guide.

Controlling Ranking and Sorting of Query Results

ESP lets you control ranking and sorting of query results in several different ways. Controlling ranking and sorting can be done by:

- Specifying multiple rank profiles in the index profile.
- Specifying sorting attributes for individual fields in the index profile. This will define which sorting attributes are available.
- Controlling result sorting on a per query basis. By default the result is sorted based on the default rank profile. Query parameters enable you to specify an alternative rank profile for the query, or a set of fields that the result set is to be sorted by.

For details on specifying rank profiles and sorting attributes in the index profile, refer to the Configuration Guide.
How to use the result sorting query parameters is described in further detail in the Query Language and Parameters Guide.

Boundary Matching

FAST ESP provides support for boundary-sensitive matching. This means that you may search for words at the start or end of a field, as well as for an exact token match with a field. Boundary matching can be applied to fields of type string.

Use case examples may be a product name field where the full name of one product is a substring of another product name, or a field containing a list of string values, for example, a list of names. In this case it may be desirable to be able to match the exact content of each string, and to avoid query matches across string boundaries.

Boundary matching is applied on the tokenized text. This means that it is not a true exact match including upper/lower case etc. Exact matching is enabled per index profile field.

For more information about Boundary Matching refer to the Query Language and Query Parameters Guide.

Applying Boundary Matching

Applying boundary matching requires you to configure the relevant field in the Index Profile.

Refer to the Configuration Guide for details on how to configure the Index Profile accordingly. Refer to the Query Language and Query Parameters Guide for details on how to apply boundary matching in queries.

Duplicate Removal

FAST ESP provides different ways of detecting and removing duplicate documents:
- Crawler Duplicate Removal: The FAST Crawler is able to detect duplicates within collections. This duplicate removal may be configured to exclude metadata in the HTML document. Refer to the Crawler Guide for more information.
- Dynamic (Result-Side) Duplicate Removal: A result-side duplicate removal feature may be used to detect and remove duplicates across collections, and also to enable a more flexible definition of perceived duplicates.
- Field Collapsing, which does not remove duplicates, but re-ranks documents based on similar values for a given field.

Dynamic (Result-Side) Duplicate Removal

The result-side duplicate removal feature may be used to detect duplicates across collections, and also to enable a more flexible definition of perceived duplicates. This feature is called dynamic duplicate removal.

The dynamic duplicate removal feature ensures that, within a result set, duplicate documents are represented by one single document only. This document is the one that has the highest relevancy ranking within the set of duplicate documents.

A field to which you would typically apply dynamic duplicate removal is the field of a document containing its URI. When this is specified in the index profile, only documents with different URIs will be included in the result set.

In general, basic duplicate removal based on URI is performed prior to indexing by the data sources. However, in certain cases the same document, that is, one specific URI, may appear in different collections within your FAST ESP installation.
As queries may be applied to selected collections only, it is not possible to detect or remove such duplicates prior to indexing. In such cases, dynamic duplicate removal allows you to filter out duplicates from the current result set.

There are two ways to use this feature:
- Activating the feature in the Index Profile Result Specification. For details, refer to the Configuration Guide. The result set of a query will then only list those documents that have different values in this field.
- Activating the feature on a per-query basis. For details, refer to the Query Language and Parameters Guide. In this case the index profile configuration is optional.
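The behavior described above can be illustrated with a minimal sketch. This is not FAST ESP code: the hit dictionaries, the `rank` key and the `remove_duplicates` function are hypothetical names chosen for illustration, and the sketch only assumes what the text states, namely that each group of duplicates is represented by its highest-ranked document.

```python
from typing import Dict, List


def remove_duplicates(hits: List[dict], field: str) -> List[dict]:
    """Keep only the highest-ranked hit for each distinct value of `field`."""
    best: Dict[object, dict] = {}
    for hit in hits:
        key = hit[field]
        # Within each group of duplicates, keep the hit with the highest rank.
        if key not in best or hit["rank"] > best[key]["rank"]:
            best[key] = hit
    # Present the survivors in descending rank order, like a normal result list.
    return sorted(best.values(), key=lambda h: h["rank"], reverse=True)


# The same URI appears in two collections (illustrative data only).
hits = [
    {"uri": "http://example.com/a", "collection": "news", "rank": 0.72},
    {"uri": "http://example.com/a", "collection": "archive", "rank": 0.91},
    {"uri": "http://example.com/b", "collection": "news", "rank": 0.55},
]
deduped = remove_duplicates(hits, "uri")
```

Here the two hits sharing a URI collapse into the copy from the `archive` collection, because it carries the higher rank value.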

Chapter 8
Processing Queries and Results

Topics:
- Query and Results Server Overview
- Query Concepts
- Query Processing
- Result Processing
- The FAST Search Front End (SFE)

This chapter introduces you to the basic concepts of processing queries and results.

Query and Results Server Overview

The FAST Query & Result Server (QR Server) provides query and result processing prior to submitting the queries to the Search Engines and presenting the result list on the search interface. It receives search queries from the Search API, analyzes them, and, if required, transforms them. It distributes them to the appropriate Search Engine nodes and generates feedback about the outcome of the query analysis and the search results obtained. Depending on configuration, this feedback is sent back to the end-user, ignored, or used for automatic query resubmission. Furthermore, the Query & Result Server receives search results from the Search Engine nodes, processes them and forwards them to the Search API. For more details about the Search API, refer to the Query Integration Guide.

Query Concepts

Query and Result Server

The FAST Query & Result Server provides query and result processing prior to submitting the queries to the Search Engines and presenting the result list on the search interface. The Query & Result Server contains multiple transformers that perform specific query and result processing tasks. There are two types of processors:
- Processors that contribute to query processing. They form the query transformation framework. Their names have the format qtf_*.
- Processors that contribute to result processing. They form the result processing framework. Their names have the format rpf_* or rff_*.

The Query & Result Server provides:
- Linguistic query processing such as spell checking and anti-phrasing
- Result Clustering
- Navigation
- Find Similar
- Dynamic Duplicate Removal

Query Components

A query submitted to the FAST ESP system consists of two main components:
- a natural language query component, which is subject to linguistic query processing and proximity/context ranking; and
- a structured filter component, which is not modified during query processing.
The search query must comply with one of the supported Query Languages. Refer to the Query Language and Parameters Guide for a detailed description of the supported query languages and query features.

Query Processing

When an end-user submits a query to the FAST ESP system, the query is subject to query processing for relevancy enhancement in the FAST Query & Result Server, before the original or processed query is passed to the FAST Search Engine for execution. Query processing is based on linguistic analysis of the query string. It includes the following linguistic features, which are explained further in the Advanced Linguistics Processing chapter of this guide and in the Advanced Linguistics Guide:
- Proper name and phrase recognition
- Spell checking
- Anti-phrasing
- Lemmatization

Query Modifications

Query processing may be configured globally and per query, and there are several different ways to modify a query in FAST ESP. Query modifications may be applied in the following ways:
- as an automatic rewrite of the query before execution against the index. This is most useful for anti-phrasing, where common query parts, as in "Where do I find information about Japan", are removed and the query is reduced to the essential query string "information Japan".
- as a suggested rewrite, typically presented as a search tip on the result page. This is a more conservative approach that avoids any unexpected query rewrites that the end-user did not intend. It is most useful for proper name recognition, where the query string World Cup is detected as a phrase, and a search tip such as Did you mean "World Cup"? is returned. It is also useful for spell checking.
- as a combination of the two above: The query is first executed in its original form. In case of no hits, the query is automatically resubmitted using the automatic rewrite option, and the new result is presented to the user. This approach is transparent to the end-user. The resubmission parameter is set per query, and the result received on the API will also indicate the transformed query.
Query Resubmission

The resubmission parameter is set per query and defines which of these features are to be enabled if the original query returns no hits.

FAST ESP is able to perform a number of automatic or suggested transformations of the user's query, based on advanced linguistics. This includes spell checking, proper name recognition and anti-phrasing. There are three types of query transformation:
- Modify: The query term string is automatically modified using the transformation parameters. The modified query is executed and the result set is returned.
- Conditional Modify: The query term string is automatically modified if no hits are returned by the executed original query.
- Suggest: The executed query is not transformed, but a suggested transformed query is returned together with the result set, which is based on the original query.

This flexibility allows the application or the user to decide how to modify the query terms entered by the user. For details, refer to the Query Language and Parameters Guide.
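The difference between the three transformation types can be sketched as follows. This is an illustrative model only, not the Query & Result Server implementation or its API: the mode names, `run_query`, the toy index and the toy correction table are all hypothetical stand-ins for the real search and linguistic components.

```python
from typing import Callable, List, Optional, Tuple

Result = Tuple[List[str], str, Optional[str]]  # (hits, executed query, suggestion)


def run_query(
    query: str,
    mode: str,
    search: Callable[[str], List[str]],
    transform: Callable[[str], str],
) -> Result:
    """Apply one of the three transformation types to a query."""
    rewritten = transform(query)
    if mode == "modify":
        # Always execute the rewritten query.
        return search(rewritten), rewritten, None
    if mode == "conditional_modify":
        # Execute the original first; resubmit the rewrite only on zero hits.
        hits = search(query)
        if hits or rewritten == query:
            return hits, query, None
        return search(rewritten), rewritten, None
    if mode == "suggest":
        # Execute the original and return the rewrite as a search tip only.
        suggestion = rewritten if rewritten != query else None
        return search(query), query, suggestion
    raise ValueError(f"unknown mode: {mode}")


# Toy stand-ins for the index and the linguistic transformation (hypothetical).
TOY_INDEX = {
    "doc1": "travel information about japan",
    "doc2": "world cup results",
}
TOY_CORRECTIONS = {"japn": "japan"}


def toy_search(query: str) -> List[str]:
    terms = query.lower().split()
    return sorted(
        doc for doc, text in TOY_INDEX.items()
        if all(term in text.split() for term in terms)
    )


def toy_transform(query: str) -> str:
    return " ".join(TOY_CORRECTIONS.get(t, t) for t in query.lower().split())
```

With the misspelled query `japn`, the `modify` and `conditional_modify` modes both end up executing `japan`, while `suggest` executes the original query (finding nothing in this toy index) and returns `japan` as a suggestion for the application to present.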

FAST Query Language

The FAST Query Language (FQL) is used to express query terms, operators and query modes/options. This is further described in the Query Language and Parameters Guide.

In addition to the FAST Query Language (FQL), FAST ESP provides two alternative query languages, the Simple Query Language and the Advanced Query Language. These query languages are included for backwards compatibility and do not support all features provided in FAST ESP. Refer to Backwards Compatibility in the Query Language and Parameters Guide for information.

You can use the FAST Query Language to perform exact searches and to narrow the scope of your search to values belonging to a specific FAST ESP field, composite field or scope field.

A query language expression may contain a number of nested sub-expressions of one or more of the following types:
- query term: A query term consists of one or more words, strings or numeric values. (A query consists of one or more query terms.)
- scope specification: A scope specification limits the possible matching sections of the documents to a specific field, composite field or a scope structure within the field or composite field.
- operators: Operators may apply boolean operations (AND, OR, etc.), define certain constraints on the operands (for example, filter()), apply explicit proximity constraints (maximum word distance between matching terms), apply numeric range operations, or specify data types and attributes for the data (such as linguistics operations).

Result Processing

After the query sent by the end-user has been processed, it is passed on to the FAST Search Engine, which matches it against the index and returns the list of results to the FAST Query & Result Server.
Result processing includes the following features:
- Category result grouping
- Find Similar
- Field-based categorization
- Query highlighting through teasers
- Duplicate removal

Result Views

A result view includes the information that is returned with each search result. In its simplest form the result view is a short teaser summarizing the content of a document. However, the result view in FAST ESP is completely configurable and may contain a smaller or larger set of fields from the initial document. In certain cases, such as database indexing, it is convenient to provide all the indexed fields of each database record in the result view, so that the customer application may present the data in various ways without having to retrieve the database record once more.

When defining the index profile for a certain collection, you specify which fields are to be returned as part of result views. This configuration impacts the total size of the index, as this information will reside on disk within the index. Based on that, it is possible to define different result views that can be applied to a query. Which of the specified result views to apply when a result set is actually presented is specified by a query parameter.

The definition and selection of result views impacts the amount of data returned from a query. More information in the result view therefore implies more bandwidth used between FAST ESP and the customer application, and also carries a minor performance cost.

Query Result Highlighting through Teasers

The result view may include a teaser field. Teasers allow you to highlight important parts of a query result. A teaser is a summary field that is generated in order to be used as a general result summary of documents in the result set presentation. Two types of teasers are supported. For details about defining teasers in the index profile, refer to the Configuration Guide.

Type of Teaser: Static teaser
Description: This is a generated summary field that is convenient to use when presenting results from, for example, web pages or text documents. This teaser is created during document processing, and typically analyzes an HTML document, extracting a few lines of text that reflect the most relevant content of the document.

Type of Teaser: Dynamic teaser
Description: This is a generated summary field that enables presentation of a document extract in context with the search query. The text of the document body is used during result processing in order to retrieve the text segments that include the best matches of the query. In most cases the dynamic teaser provides a more relevant text for the result pages than the static teaser. The relevancy of a text segment is determined by (in decreasing order of significance):
1. phrase matching.
2. completeness: The more search terms a text segment contains, the more relevant this text segment is.
3. proximity: Text segments that contain query terms that occur near each other are more relevant than others.
4. position: The earlier a text segment containing one or more search terms occurs in the document, the more relevant it is compared to the others.

Which teasers to use in a result view is specified in the result view sections of the index profile for the respective search engine cluster. You can specify one teaser per result view; in addition, you may specify a field to be used as a fallback teaser field in case the generation of the original teaser field fails.
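As a rough illustration of how the four dynamic teaser criteria could be combined, the sketch below ranks candidate text segments with a lexicographic sort key, so that each criterion only breaks ties left by the more significant ones. This is an assumption about one possible reading of "decreasing order of significance", not a description of the FAST ESP algorithm; the function names and the tokenization are hypothetical.

```python
import re
from typing import List, Set, Tuple


def smallest_window(words: List[str], present: Set[str]) -> int:
    """Length of the shortest word window covering every matched term."""
    best = len(words) + 1
    for start, word in enumerate(words):
        if word not in present:
            continue
        seen: Set[str] = set()
        for end in range(start, len(words)):
            if words[end] in present:
                seen.add(words[end])
                if seen == present:
                    best = min(best, end - start + 1)
                    break
    return best


def segment_key(segment: str, position: int, terms: List[str]) -> Tuple[int, int, int, int]:
    """Sort key: phrase match, completeness, proximity, position."""
    words = re.findall(r"\w+", segment.lower())
    terms = [t.lower() for t in terms]
    n = len(terms)
    phrase = any(words[i:i + n] == terms for i in range(len(words) - n + 1))
    present = {t for t in terms if t in words}
    # Smaller keys sort first, so "more is better" criteria are negated.
    return (-int(phrase), -len(present), smallest_window(words, present), position)


def best_segments(segments: List[str], terms: List[str], count: int) -> List[str]:
    order = sorted(range(len(segments)), key=lambda i: segment_key(segments[i], i, terms))
    return [segments[i] for i in order[:count]]


segments = [
    "The report covers many topics in detail.",
    "The cup was held in a different part of the world.",
    "Tickets for the World Cup sold out quickly.",
]
ranked = best_segments(segments, ["world", "cup"], 3)
```

The segment containing the exact phrase is ranked first, the segment containing both terms far apart comes second, and the segment without any query term comes last.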
Query Result Highlighting in Source Document

FAST ESP also enables Query Result Highlighting in the source document. See Making Documents Searchable, section Query Highlighting in Source Documentation, for more information. Refer also to the FAST SDK Document Highlighting Guide for details.

The FAST Search Front End (SFE)

The Search View selection in the Administration interface (Admin GUI) allows you to view the default Search Front End (SFE) provided with FAST ESP. This front end lets you search the documents of your implementation for testing purposes.

Figure 6: Example of the Search Front End (SFE)

Refer to the SFE User's Guide for more information about how the SFE works. Refer also to Default Search Front End Features in the Query Integration Guide.

Chapter 9
Geo Search

Topics:
- GEO Search Overview

This chapter provides an overview of the FAST ESP Geo Search feature. The Geo Search feature provides capabilities for sorting and filtering query results based on geographical location.


GEO Search Overview

The Geo Search feature provides capabilities for filtering, sorting and boosting query results based on geographical location.

To enable the geo search feature, you must provide location-specific information for each document on the content access side. You can add several sets of coordinates for a single document, to imply that the document provides relevant information for all the specified locations. The location-specific information can be added as metadata from the content source, or added during document processing based on analysis of the content, mapping from URL, etc. In some cases the processing may require some interaction with an external, customer-specific application or database. Note that the location information must be in the form of one or more longitude/latitude pairs prior to indexing.

The geographical coordinate information is indexed using optimized geo index structures for high-performance searching. The fields/elements used for geographical coordinates are configured in the Index Profile.

On the query side it is possible to filter and sort the result set, using the end-user location and the geographical distance between the end-user position and the positions associated with each document in the index. The user can also specify an alternative center coordinate along with the end-user location. All sorting and filtering will then be performed using the end-user location, but the distances shown in the result set will be calculated using the alternative center coordinate. The latter approach may, for instance, be useful in cases where the displayed distance shows the distance from the current user location, while the sorting/filtering is based on a displayed map extract where the user is not located in the center of the map.

Filtering of the result set can be based on a radius (a circle) or a square box. The square box is typically used in association with a map presented on the result page.
Both the radius and the box size are configurable per query, and are specified using dedicated query parameters (in addition to the query string).

When sorting the result set based on the distance from the end-user location to the hits, the hits closest to the end-user position appear at the top of the list. When no sorting is specified, the result set is sorted according to the dynamic rank values. Use the geo boosting feature to combine the two: By boosting the dynamic rank values with an offset based on the distance, the most relevant hits (in terms of both dynamic rank and distance) will appear closer to the top.

Important: When using geo boosting in combination with the stopword threshold feature, it may happen that a hit very close in distance still ends up at the end of the result set. The reason is that the stopword threshold defines a limit, a maximum number of matching documents, beyond which the system stops computing a dynamic rank value for the given search term. Instead it sets the dynamic rank value to zero, diminishing the boosting effect. Refer to the Configuration documentation for details.

A typical location application would enable drilling down on distance from the end-user location. Such a drill-down would not be implemented using the FAST ESP Dynamic Drill-Down feature; instead, the query application may provide the end-user with +/- selections or similar, which map to a different restriction on distance and/or area.

Examples

In the illustration above, C1 indicates the end-user's geographical position, and the bullets 1-9 indicate the geographical positions associated with 9 different documents in the index. One document, 7, contains two different locations (for example, two different offices for the same company).

Filtering

When filtering using a radius is enabled, the user can specify a center coordinate C1 and a radius r, and only documents within the given radius will be included in the result set. In the illustration, this would include documents 4, 6, 7, and 8.

When filtering using a box is enabled, the user can specify an area defined by coordinates B1 and B2, and only documents with coordinates within the box will be included in the result set. In the illustration, this would include documents 2, 3, 4, and 6.

Sorting

When sorting on distance is enabled, the user can specify a center coordinate C1 in the query, and the result set will be sorted by either ascending or descending distance from C1. If a document contains more than one coordinate, the coordinate with the shortest distance to C1 will be used. In the illustration, document 7 has the shortest distance to C1 (location 7₁), followed by 6, 4, 8 and so on.

If you want the dynamic rank value, as well as the distance, to influence the sorting of the result set, you can use the boosting feature instead of regular sorting.
For example, if the documents 6 and 7 in the illustration are given the rank values 0 and 1 respectively, and documents 8 and 9 the value 2, this will be the result depending on the sorting criteria:

Sorting Criteria       Result Set Order
Distance               (7, 6, 8, 9) or (9, 8, 6, 7)
Dynamic rank values    (9, 8, 7, 6) or (8, 9, 7, 6). Note that the order of documents with the same rank value is not defined.
Boosted rank values    (7, 8, 9, 6). Note that the order of the documents depends on the weight of the boosting.

Using Alternative Center Coordinate

When the user specifies an alternative center coordinate C2 along with the coordinate C1, the distance D1 in the illustration (between C1 and document 6) would be used for sorting and filtering the result set, but it is distance D2 that will be displayed in the actual result.

Combining Features

It is possible to combine the two approaches to filtering as well as sorting. In the illustration, filtering using the radius r and the box defined by B1 and B2 includes only documents 4 and 6 in the result. You can then sort the filtered result set based on distance.
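The filtering and sorting rules in this chapter can be sketched in a few lines. This is an illustrative model only, not the FAST ESP geo index or its query parameters: the function names, the great-circle (haversine) distance formula, the Earth radius constant and the sample coordinates are assumptions made for the example. The one rule taken directly from the text is that a document with several coordinates is measured by its closest coordinate.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees
EARTH_RADIUS_KM = 6371.0     # mean Earth radius, an assumption of this sketch


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance between two coordinates."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def doc_distance(center: Coord, coords: List[Coord]) -> float:
    # A document with several coordinates uses the one closest to the center.
    return min(haversine_km(center, c) for c in coords)


def within_radius(docs: Dict[str, List[Coord]], center: Coord, radius_km: float):
    return {d: cs for d, cs in docs.items() if doc_distance(center, cs) <= radius_km}


def within_box(docs: Dict[str, List[Coord]], b1: Coord, b2: Coord):
    # Simple box test; it does not handle boxes crossing the 180th meridian.
    lat_lo, lat_hi = sorted((b1[0], b2[0]))
    lon_lo, lon_hi = sorted((b1[1], b2[1]))
    return {
        d: cs for d, cs in docs.items()
        if any(lat_lo <= lat <= lat_hi and lon_lo <= lon <= lon_hi for lat, lon in cs)
    }


def sort_by_distance(docs: Dict[str, List[Coord]], center: Coord, descending: bool = False):
    return sorted(docs, key=lambda d: doc_distance(center, docs[d]), reverse=descending)


# Illustrative data: one document carries two locations.
CENTER = (59.91, 10.75)
DOCS = {
    "oslo_office": [(59.92, 10.76)],
    "two_offices": [(60.39, 5.32), (59.95, 10.78)],
    "trondheim": [(63.43, 10.40)],
}
nearby = sort_by_distance(within_radius(DOCS, CENTER, 50.0), CENTER)
```

Combining both filters, as in the Combining Features example, amounts to applying `within_box` to the output of `within_radius` (or the other way round) before sorting.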

Chapter 10
Scope Search and Dynamic XML Indexing

Topics:
- Scope Search Overview
- Scope Search vs. Fielded Search
- Scope Search Concepts and Capabilities
- Dynamic XML Indexing

This chapter explains what Scope Search is and how it works. It also explains what dynamic XML indexing is and describes its core capabilities.


Scope Search Overview

Scope Search is a feature that enables search in hierarchical content structures without a need to know the index schema in advance.

Using Scope Search makes searches more precise than searching using a standard index schema. It allows you to specify hierarchies to be used as the basis for identifying exactly what kind of information you want to extract, and how you want it to be presented. When using scope search, it does not matter how this schema is defined.

Definition of a Scope

A scope is an entity or object that has a name and has content. A scope can have sub-scopes. When a search is performed on a scope, the search is done in the content of the scope and all its sub-scopes.

Fields can support nested scopes or sub-fields. A field is a top-level, root scope. Examples of fields are: content (body), authors, or metadata. Multiple fields can be defined. A nested scope or sub-field is an element within that root scope.

Figure 7: Scope Example

This example of the document scope structure shows a simple scope field within a searchable document. The indicated scope field in the Index Profile is named book, and contains Authors elements, which in turn contain one or more Author elements. Scope fields, normal (text/numeric) fields, and composite fields can be combined within the same Index Profile.

Example of Using a Scope Search

The following example shows an excerpt from the play Hamlet in XML form. The expanded elements have a - in the margin and the collapsed elements have a +. FQL syntax to get the famous speech from Hamlet is:

- <PLAY>
    <MAINTITLE>The Tragedy of Hamlet, Prince of Denmark</MAINTITLE>
  + <FM>
  + <PERSONAE>
    <SCNDESCR>SCENE Denmark.</SCNDESCR>
    <PLAYSUBT>HAMLET</PLAYSUBT>
  + <ACT>
  + <ACT>
  - <ACT>
      <TITLE>ACT III</TITLE>
    - <SCENE>
        <TITLE>SCENE I. A room in the castle.</TITLE>
        <STAGEDIR>Enter KING CLAUDIUS, QUEEN GERTRUDE, POLONIUS, OPHELIA, ROSENCRANTZ, and GUILDENSTERN</STAGEDIR>

      + <SPEECH>
      + <SPEECH>
      + <SPEECH>
      + <SPEECH>
      + <SPEECH>
      + <SPEECH>
      + <SPEECH>
      + <SPEECH>
      + <SPEECH>
      + <SPEECH>
      + <SPEECH>
      + <SPEECH>
        <STAGEDIR>Exeunt ROSENCRANTZ and GUILDENSTERN</STAGEDIR>
      + <SPEECH>
      + <SPEECH>
      + <SPEECH>
        <STAGEDIR>Exit QUEEN GERTRUDE</STAGEDIR>
      + <SPEECH>
      + <SPEECH>
      + <SPEECH>
        <STAGEDIR>Exeunt KING CLAUDIUS and POLONIUS</STAGEDIR>
        <STAGEDIR>Enter HAMLET</STAGEDIR>
      - <SPEECH>
          <SPEAKER>HAMLET</SPEAKER>
          <LINE>To be, or not to be: that is the question:</LINE>
          <LINE>Whether 'tis nobler in the mind to suffer</LINE>
          <LINE>The slings and arrows of outrageous fortune,</LINE>
          <LINE>Or to take arms against a sea of troubles,</LINE>
          <LINE>And by opposing end them? To die: to sleep;</LINE>
          <LINE>No more; and by a sleep to say we end</LINE>

How Scope Search Works and Why It is Used

Hierarchical content is represented as a hierarchy of scopes inside the FAST ESP index. Scope Search in FAST ESP is based on the following:
- Scope Indexing provides scope-aware indexing of content with hierarchical structure, enabling efficient search in scope structures. Scope Indexing is generic in the sense that it does not require any specific content input format. FAST ESP supports XML input format; other input may be supported by creating custom document processors.
- Dynamic XML Indexing provides a mapping from any XML to the internal FAST Scope structure.
- The FAST Query Language (FQL) is a query language which supports scope search queries.

Scope Search can be used for:
- Indexing customer XML content without any knowledge of the DTD/schema. FAST ESP includes a dynamic XML pipeline that maps submitted XML to one or more scope fields.
- Indexing a more dynamic field structure using the Scope Search framework. In this case it is possible to change the field structure without changing the Index Profile, and XML is used as an intermediate format in order to submit structured data to the system.

Scope Search vs. Fielded Search

Scope Search Strengths
- Entities tagged in their original position (instead of stored in global meta fields)
- Search and navigation precision
- search within context (sentence, paragraph, etc.)
- search for entity or existence of entity

- return specific scopes (for instance sentences) instead of teaser
- navigation menus generated from local context only
- full schema flexibility
- preserve existing hierarchical document structure, for instance XML docs

Normal Search Strengths
- Performance
- Deep navigation (accurate counts based on full index)
- Current relevancy model is really targeted for documents (not sentences or paragraphs)

Scope search and normal search are often used together, with normal global meta fields (such as document date), and the body content as a scope field. You can refer to both scope fields and normal fields in the same FQL query.

Scope Search Concepts and Capabilities

The core concepts and capabilities in Scope Search are described in this topic.

Scope Fields

The FAST ESP Indexer is based on a field structure that defines the schema of the indexed content. The schema is defined using the Index Profile. The Scope Search feature is facilitated by introducing a new field type in the FAST ESP index, named scope field. Hence, a scope-enabled index may include the following types of fields:
- Basic field. A basic field may be of type string (any textual content), int32 (32 bit signed integer), uint32, float, double or datetime (representing a date/time value as a numeric value in the index).
- Composite field. A composite field includes a set of basic string fields that can be matched using the built-in dynamic ranking mechanisms in FAST ESP.
- Scope field. A scope field contains hierarchical scope content. The individual sub-scopes of a scope field may be of any data type supported by FAST ESP (string, int32, float, double or datetime). For textual scopes, a subset of the dynamic ranking mechanisms provided for composite fields will apply.

When defining a scope field, there is no need to define the actual scope structure within the scope field in advance.
A FAST ESP index profile may contain a combination of one or more fields, composite fields and scope fields. Hence, it is possible to combine in one index both schema-based content in fields and dynamic scoped content.

In the query language you may specify individual fields, composite fields or scopes to limit the scope of a query. For scope queries the scope specification in the query must include the scope field name (also called the root scope) and sub-scopes within the indexed scope structure. A scope field may include a hierarchy of scopes of arbitrary depth.

Scope Indexing is generic in the sense that it does not require any specific content input format. FAST ESP supports XML input format; other input may be supported by creating custom document processors.

Navigation

Dynamic Drill-down (Navigation) in FAST ESP provides functionality for drilling down into the query result based on the value distribution of one or more individual fields.

Note: Dynamic Drill-down can only be applied to non-scope textual or numeric fields, that is, it is not possible to apply this feature to scope fields as such.

If you want to use a specific element from the scope structure for dynamic drill-down, it is possible to extract this element from the source content (for example, an XML element) during content processing prior to the scope mapping and assign it to an individual field in the index, with an associated Navigator specification. In this case the element may still be searchable within the scope structure, but may also be used for drill-down. Refer to section Mapping XML to One or More Scope Fields for further details.

It is also possible to apply a result-side navigator on an extended document summary. When this feature is enabled, dynamic extraction of entities and concepts is performed on the first n matching documents of the result set, where n is 100 by default. The entities are extracted from the sections of the documents where the query matches best, that is, in a way similar to the dynamic teaser.

Scope Data Types

Scopes may be of any supported data type. The data types are supported in a similar way as for individual fields. This means that string scopes support dynamic ranking mechanisms, linguistics, phrasing and wildcards as for composite fields. Numeric scopes support exact matching and range matching mechanisms as for individual numeric fields.

Numeric matching requires that the query term and the target numeric scope are of the same type. Matching a query term of type float with a scope of type double will not return any results. Therefore, it is required to apply consistent data typing in queries. Such data typing can be applied using explicit type conversion (for example, double(24.5)) or implicit default typing based on the term format.

When querying non-scope numeric fields the system knows the type of the field, and performs automatic type detection based on the indicated field. Refer to Numeric Operators in the Query Language and Parameters Guide for details on literals and explicit type conversion.
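The type-consistency rule for numeric scopes can be pictured with a small model. This is only a conceptual sketch under the assumption that a match requires both the declared type and the value to agree; `TypedValue`, `exact_match` and `range_match` are hypothetical names and do not correspond to FQL operators or to any FAST ESP API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypedValue:
    type_name: str  # for example "int32", "float" or "double"
    value: float


def exact_match(term: TypedValue, scope: TypedValue) -> bool:
    """A term matches a numeric scope only if both type and value agree."""
    return term.type_name == scope.type_name and term.value == scope.value


def range_match(low: TypedValue, high: TypedValue, scope: TypedValue) -> bool:
    """Range matching with the same type-consistency requirement."""
    same_type = low.type_name == high.type_name == scope.type_name
    return same_type and low.value <= scope.value <= high.value


# A scope indexed as a double (illustrative value).
price_scope = TypedValue("double", 24.5)
```

In this model a term typed as double with the value 24.5 matches `price_scope`, while a term typed as float with the same value does not, which mirrors why explicit type conversion in the query can be needed for scope fields.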
Query Language in Scope Search

Scope queries are only supported within the FAST Query Language (FQL). The Simple Query Language and the Advanced Query Language do not support scope queries, but may be used for individual fields and composite fields even if the index also includes scope fields. Refer to the Query Language and Query Parameters Guide for more information.

The scope query root:date:foo will search for the term foo in all scopes with the name date within the scope field named root, including all sub-scopes of such date scopes.

Return Matching Scopes

Using Scope Search, any scope in your query can be returned. This is referred to as Matching Scopes, and makes it possible to make retrieval more specific by returning a sub-section of a scope, if desired.

Using the example of the Hamlet play, you can have the entire play returned, or you can specify that you want to see the speech that contains the word "question", in which case only that speech is returned. Using Matching Scopes, you can search specifically for the speech instead of the entire play.

Return matching scopes can be used to provide data for Navigation as well. Refer to Taxonomy and Navigation, section Navigators, for more information.

Scope Boosting

Selected scopes are assigned a boost level during document processing. There are eight possible levels. The boost level is inherited by descendant scopes unless they are explicitly assigned a new boost level. At query time, the scope boost level is used when calculating the dynamic relevance score (ranking) for terms in a query, together with the term statistics (tf-idf).

For more information on document scope boosting, contact your FAST Account Manager. Refer to the Configuration Guide for more information on Scope Boosting.
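The inheritance rule for scope boost levels can be sketched as a simple tree walk. This is a conceptual illustration, not the FAST ESP document processing code: the `Scope` class, the numbering of the eight levels as 0 to 7 and the default level 0 for a root without an explicit boost are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Scope:
    name: str
    boost: Optional[int] = None  # explicit boost level, or None to inherit
    children: List["Scope"] = field(default_factory=list)


def effective_boosts(scope: Scope, inherited: int = 0, path: str = "") -> Dict[str, int]:
    """Map each scope path to its effective boost level.

    Assumes sibling scopes have distinct names, so that paths are unique.
    """
    level = scope.boost if scope.boost is not None else inherited
    here = f"{path}/{scope.name}"
    result = {here: level}
    for child in scope.children:
        # Descendant scopes inherit the level unless they set their own.
        result.update(effective_boosts(child, level, here))
    return result


book = Scope("book", children=[
    Scope("title", boost=5, children=[Scope("subtitle")]),
    Scope("chapter", children=[Scope("footnote", boost=1)]),
])
levels = effective_boosts(book)
```

In this example `subtitle` inherits level 5 from `title`, `chapter` keeps the assumed default of the root, and `footnote` overrides the inherited level with its own.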

Dynamic Document Summary (Teasers)

The dynamic document summary (teaser) is a short abstract of the matching document where the matching terms are highlighted within the context. FAST ESP supports creating dynamic document summaries for non-scope text fields and scope fields.

The dynamic document summary for a scope field only displays matching contexts with query terms that are within the same scope field. The dynamic document summary will, by default, highlight query matches within all string scopes of the scope field.

Full scope search is supported for FAST ESP. For scope fields the default is to return the matching scopes as valid XML, inside a <matches><match>...</match></matches> envelope. By default, up to the 100 most relevant matches are returned; this limit is configurable. This means that the document summary may highlight string scopes that do not match the sub-scope specification but match the query terms.

For normal fields based on scope field input, the default is to generate a normal teaser that is scope aware. For normal text fields without markup in the document summary source, teasers are not scope aware.

The dynamic document summary identifies sub-scope boundaries, so that each unique text segment within a document summary is within one scope.

Metadata within the query (using the filter operator) will be considered during matching, but not highlighted. In other words, matches that do not pass the filter will not be presented in the dynamic teaser.

A scope field can be configured to fall back to a static (query-independent) document summary. In addition, there are several index profile features (source-ref, default-result, dynamic-type) related to the dynamic summary generation. Refer to the Configuration Guide for more information.

Linguistics and Scope Search

Scope fields support the standard FAST ESP linguistic features. In addition, Word Stacking is supported for lemmatization and synonyms.
Refer to the Advanced Linguistics Guide for information on the FAST ESP linguistics features.

Word Stacking and Normalization

Word Stacking is supported for scope fields. The concepts of Word Stacking and Normalization are explained briefly here. Refer to the Advanced Linguistics Guide for more details.

The FAST ESP Scope Search Index supports indexing multiple variations of the same word (token) at the same word position within the index. This concept is called Word Stacking. Word Stacking enables you to index multiple normalization variants for the same word, for example the original form and the lemmatized, lowercased, and de-accentuated forms. This is a more flexible alternative to the traditional indexing approach, which applies a uniform normalization to all words in documents and queries.

Uniform normalization means that you need to select one level of normalization for all content (for example, lowercasing and accent removal). This works well in most cases, but removes the ability to select higher precision for individual queries (for example, case-sensitive search).

With Word Stacking it is possible to select the desired level of normalization on a per-query basis. This in turn enables the following advanced linguistics features:

• Phrase or proximity (NEAR/ONEAR) queries in association with wildcards, lemmatization and character normalization (accent normalization).
• Per-query selection of accent-sensitive or accent-insensitive, and case-sensitive or case-insensitive, matching.
• Efficient handling of lemmatization in a multi-lingual environment. If you do not know the language of the query, and the index contains content in multiple languages, then a simple normalization to the base form (based on the default language setting) is not appropriate, as this can introduce ambiguities. Instead, the linguistic query processing expands the query term to an OR between the original word and the base form of the word.
• Efficient handling of character normalization in a multi-lingual environment.
Example: French publications include all appropriate accents for French-language words, whereas English-language documents containing French words often omit the accents. Suppose a French document contains the phrase "Côte d'azur" and an English document contains the phrase "Cote d'azur" without the accent. In the first document, "Côte" would be indexed as the two variants "côte" and "cote". A user query of "Côte d'azur" would hit only the first document (if accent-sensitive search is selected), but a user query of "Cote d'azur" would hit both documents.

Normalization, which replaces a character or sequence of characters, is not always sufficient. If both versions were simply normalized to "cote", the result would be less acceptable to French-speaking users, because the accents differentiate "côte" meaning "coast" from "côté" meaning "side". In this case precision would be lost.

Refer to the Advanced Linguistics Guide and the Configuration Guide for information on how to configure linguistics normalization features.

Partial Updates

FAST ESP supports partial updates on scope fields as a whole, but not on sub-trees within the scope field. This means that it is possible to update a scope field and nothing else in the document, which results in less content being fed through the document processing pipeline. This can be particularly useful when using Connectors, for example.

Dynamic XML Indexing

Dynamic XML Indexing means mapping XML content to the FAST ESP scope indexing framework. FAST ESP provides document processors that can be configured to map any XML structure to a FAST ESP scope structure. The document processor can be configured to map one or more input document elements containing XML content to corresponding scope fields. The document processor does not take the DTD into consideration, but maps all XML elements and attributes to scopes and sub-scopes within the scope field.

By default the scope representation does not differentiate between XML attributes and sub-elements. Both are represented as sub-scopes.
Attribute names are given a prefix, which must be used when referring to the attribute name in queries. For details refer to Configuring the FAST Document Processing Engine in the Configuration Guide.

There are two document processing pipeline templates that support XML to scope mapping: one that feeds the XML structure as-is, and one that extracts entities and adds further scopes. They are called Lightweight XML and XML respectively. By default, the pipeline expects the XML to be included in the data document element. This is also the default option when using the File Traverser.

All XML elements and attributes are indexed as text (type string) by default. It is, however, possible to specify a data type for elements using a pre-defined attribute. The name of this attribute is configured in the document processor. The values for this attribute and the mapping to FAST data types are also configurable. This default data type support enables typing of elements, not attributes. All attributes are treated as string. Other custom data type handling may be implemented by creating a custom document processing stage.

For information on submitting XML and mapping XML, refer to the Content Integration Guide and the Configuration Guide.
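The mapping of XML elements and attributes to scopes and sub-scopes can be sketched as follows. This is only an illustration of the idea, not the FAST ESP document processor: the dictionary layout, the function name, and in particular the "attr_" prefix are hypothetical stand-ins (the actual attribute prefix is defined by the product configuration).

```python
import xml.etree.ElementTree as ET

# Hypothetical placeholder; the real prefix is defined by FAST ESP, not here.
ATTRIBUTE_PREFIX = "attr_"


def xml_to_scopes(element, attribute_prefix=ATTRIBUTE_PREFIX):
    """Map an XML element to a nested scope structure.

    Attributes and sub-elements both become sub-scopes; every value is kept
    as a string, mirroring the default "type string" behaviour described above.
    """
    scope = {
        "name": element.tag,
        "text": (element.text or "").strip(),
        "scopes": [],
    }
    for attr_name, attr_value in element.attrib.items():
        scope["scopes"].append(
            {"name": attribute_prefix + attr_name, "text": attr_value, "scopes": []}
        )
    for child in element:
        scope["scopes"].append(xml_to_scopes(child, attribute_prefix))
    return scope


doc = ET.fromstring(
    '<product id="42"><name>Laptop</name><price>999</price></product>'
)
scopes = xml_to_scopes(doc)
sub_scope_names = [s["name"] for s in scopes["scopes"]]
```

In this sketch the id attribute ends up next to name and price as an ordinary sub-scope, distinguishable only by its prefixed name.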

Chapter 11
Taxonomy and Navigation

Topics:
• Taxonomy and Navigation Overview
• Navigators
• Taxonomy
• Unsupervised Clustering

What appears in the result set of a search, and how it is displayed, depends on a number of factors. How data is structured, and how the system is set up to display results, affect what the user sees after submitting a search. Navigation, taxonomy, and unsupervised clustering make it possible for users to have different views on the result sets.

Taxonomy and Navigation Overview

What appears in the result set of a search, and how it is displayed, depends on a number of factors. How data is structured, and how the system is set up to display results, affect what the user sees after submitting a search. Navigation, taxonomy, and unsupervised clustering make it possible for users to have different views on the result sets.

Navigators allow users to view a list of values or ranges. Taxonomy is an organized classification structure that groups documents by category. Unsupervised clustering allows for automatic grouping of similar documents in the result set and suggested naming of these groups or clusters.

Navigators

Navigators provide functionality for drilling down into the query results based on the value distribution of one or more individual fields. It is possible to apply navigators to all fields or just some fields from, for example, a database or product catalog.

FAST ESP supports navigation on scope fields and non-scope document fields, both textual (such as product name) and numeric (such as price attributes). Different types of navigators can be applied depending on the field types. Refer to Navigators in the Query Integration Guide for more information.

Field Navigators

In FAST ESP, it is possible to perform multi-dimensional navigation in structured data based on facets of the content (such as database rows, product catalog descriptions, etc.). Navigators are used to limit overhead in search environments for e-commerce, Yellow Pages, supply chain, CRM, etc. Relevant results can be found faster using a combination of searching and browsing by parametric value and range.

Navigators can also be used on taxonomy fields to apply deep navigation (meaning navigation over the entire result set) into categories that occur within the results. When used with a taxonomy, each taxonomy node that appears within the result set appears as a navigation entry.
Refer to the Query Integration Guide and the Configuration Guide for details on navigators and how to configure them.

The left column displays the results in the usual ranked order, which is based on FAST ESP static and dynamic rank mechanisms, including the relevant parameter fields. The right column displays the drill-down and binning attributes.

The feature is dynamic. The range of numeric values per bin is computed on-the-fly, aiming for an even distribution of values across the displayed range categories. It is also possible to manually specify internal boundaries. For each field range, drill-down links are provided in order to navigate within the displayed value range, for example, "Lease Price 30-40". It is also possible to reverse an applied navigator, that is, to reverse the filtering criteria.

The navigation parameters are sometimes called faceted metadata, and may apply to applications such as:

• Product databases: attributes such as price, weight, color, country of origin and product type.
• Music store: songs have attributes such as artist, title, length, genre and date.
• Recipes: cuisine, main ingredients, cooking style and holiday.
• Travel site: articles have authors, dates, places, prices.
• Regulatory documents: product and part codes, machine types and expiration dates.
• Image collection: artist, date, style, type of image, major colors and theme.

An indexed field or attribute can be seen as a dimension in which the query can be refined. The search results are examined on the fly, and data is produced that can be rendered in the form of hyperlinks. This helps the user navigate to what he or she is looking for by modifying the query. This is especially relevant in the context of shopping search, where the searchable index is a database or product catalog. The fields indexed for each product may vary according to the type of the product. By supplying a navigational aid on top of the search engine that adapts to the search results for the user's query, relevant results can be found faster.

Deep and Shallow Navigators

Field navigators can be deep or shallow. Shallow navigators are based on values specified in flat fields and in scope fields. Deep navigators are based on values specified in flat fields only.

Deep navigators reflect the entire result set and usually require re-indexing when a new navigator is added. This type of navigator is recommended for all commonly used navigators associated with individual fields (not scope fields).

Shallow navigators work immediately after being defined. They are based on a smaller number of results than deep navigators. Shallow navigation is used when it is not convenient to keep aggregation data in main memory within the search nodes. Scope navigators are always provided as shallow navigators, as they are based on matching scopes only (which are not known at index time).

For more information refer to:

• The Configuration Guide for information on how to configure navigators.
• The Query Integration Guide, Search API Overview chapter, for information on navigator interfaces in the Search API.
• The Query Language and Parameters Guide, Query Parameters chapter, for information on navigator parameters in the Search API.

Contextual Navigators

Contextual navigation (also referred to as scope navigation) means applying navigation to scope fields. Scope fields represent the content in a hierarchical structure as opposed to a flat field. It is not necessary to know the index schema in advance.

Applying navigation on scopes lets you limit your search results by narrowing in on a scope such as a paragraph or sentence. The values that are shown in the navigators come from the scope used in the search and not the full document. Contextual navigators are shallow. It is possible to create navigators over structural elements in the matching scope, as well as on scope attribute values and the content of scopes. Refer to the Configuration Guide for information on configuring shallow navigators.

Field Navigators for Values in Scope Fields

When you want to use an element from a scope structure for navigation, it is possible to extract the element from the source content (such as an XML element) during content processing prior to the scope mapping. The extracted value can be assigned to an individual field in the index with an associated Navigator specification. The element is still searchable within the scope structure, and is also used for navigation (drill-down).

If you have, for example, scope fields for product codes, you can put all the product codes in a flat field in order to get navigators on them. Otherwise, the product codes would all have to occur in the same context in order to be seen.
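The product code example can be sketched in a few lines: values are pulled out of the XML source into a flat field during processing, and a navigator is then just a value distribution over that field. The function names, the product_codes field, and the sample documents are hypothetical; a real deployment would do this in a document processing stage and an index profile rather than in application code.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def extract_flat_field(xml_text, element_name):
    """Collect the text of every element_name element, however deeply nested."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(element_name) if el.text]


def navigator_counts(documents, field_name):
    """Count, per field value, how many documents in the result set carry it."""
    counts = Counter()
    for doc in documents:
        # A set ensures a document is counted once per distinct value.
        counts.update(set(doc.get(field_name, [])))
    return counts


# Hypothetical source documents: product codes occur in different contexts.
raw = [
    "<order><item><code>A-100</code></item><item><code>B-200</code></item></order>",
    "<order><note>Replaces <code>A-100</code></note></order>",
]
documents = [{"product_codes": extract_flat_field(x, "code")} for x in raw]
counts = navigator_counts(documents, "product_codes")
```

Because the codes are gathered into one flat field, both documents contribute to the A-100 entry even though the code appears in different parts of their structure.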


If you want to create a field navigator for values in a scope field, you can extract the values during document processing, place them in a flat field, and associate the navigators with that field.

Taxonomy

A taxonomy is an organized classification structure that groups documents by category. A document could, for example, belong to the category sports, or news.

Categorization is the process of mapping documents to specific categories. FAST ESP lets you configure and maintain taxonomies and the mapping of categories.

It is also possible to apply navigation to taxonomies. When a set of results is returned, a taxonomy tree is created, which lets you browse information by category.

Figure 8: Example of Taxonomy Tree

FAST Taxonomy Explorer

The FAST Taxonomy Explorer, an optional ESP taxonomy management tool, contains categorization based on advanced linguistic technologies which classify documents and organize information into a hierarchical or a flat set of categories. The categorization process inserts category tags into the documents prior to indexing. This is done in several ways. Refer to the Taxonomy Explorer Guide for more information.

When the documents in an index have been categorized, end users can restrict a query to a specific category in that index.

Figure 9: Example navigation using a taxonomy

Applying a taxonomy gives a category view of the result set.

FAST Classifier

The FAST Classifier provides a framework for training-based classification that can be used when there is a sufficient set of documents pre-tagged with category information. Refer to the FAST Classifier Guide for more information.

Unsupervised Clustering

If there is no taxonomy information associated with documents, it is possible to set up the system to automatically suggest a category for a document in the result set. This is referred to as unsupervised clustering. Unsupervised clustering is a kind of automatically generated taxonomy.

Creating Taxonomy on the Fly

Unsupervised clustering means that documents are clustered (grouped or categorized) based on how similar they are rather than using static taxonomy information. Similarity is calculated by comparing document vectors, which are lists of prominent words in the document.

Document vectors are representations of the unstructured textual content that is associated with a document. Vectorization is the process of computing document vectors and is performed as part of document processing using the Vectorizer document processor. This is a standard part of all document-related processing pipelines in FAST ESP.

When a set of documents has been put into a cluster, appropriate names or labels for the group are calculated based on the terms in the document vectors. Refer to Configuring Similarity Vector Creation in the Configuration Guide for information on vectors.
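The general idea of clustering by document vector similarity and then labelling each cluster from its vectors can be sketched as below. The actual FAST ESP clustering algorithm and vector format are not described here, so cosine similarity, the greedy grouping, the 0.5 threshold, and the sample vectors are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term -> weight dictionaries."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(vectors, threshold=0.5):
    """Greedy grouping: join the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list of indices into vectors
    for index, vector in enumerate(vectors):
        for members in clusters:
            if cosine(vector, vectors[members[0]]) >= threshold:
                members.append(index)
                break
        else:
            clusters.append([index])
    return clusters


def label(vectors, members, size=2):
    """Suggest a cluster name from the highest summed term weights."""
    totals = Counter()
    for index in members:
        totals.update(vectors[index])
    return [term for term, _ in totals.most_common(size)]


# Hypothetical document vectors (prominent words with weights).
vectors = [
    {"football": 0.9, "goal": 0.5},
    {"football": 0.8, "league": 0.4},
    {"election": 0.9, "vote": 0.6},
]
groups = cluster(vectors)
names = [label(vectors, members) for members in groups]
```

The two football documents share a prominent term and end up in one group, while the election document forms its own.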

Chapter 12
Advanced Linguistic Processing

Topics:
• Linguistics Overview
• Linguistics and Relevancy
• Dictionaries
• Automatic Language Detection
• Lemmatization
• Synonyms and Spell Variations
• Advanced Phrase Recognition
• Spell Checking and Phrase Recognition Framework
• Anti-Phrasing
• Sub-String Search
• Wildcard Search
• Special Characters and Accents

This chapter introduces you to the basic concepts of advanced linguistic processing.

Note: Refer to the Advanced Linguistics Guide for information on configuration and customization of linguistics features.

Linguistics Overview

Here you will find information that introduces you to the basic concepts of advanced linguistic processing. Refer to the Advanced Linguistics Guide for information on configuration and customization of linguistics features.

Linguistics and Relevancy

In search, linguistics is defined as the use of information about the structure and variation of languages so that users can more easily find relevant information. The document's relevancy with respect to a query is not necessarily decided on the basis of words common to both query and document, but rather by the extent to which its content satisfies the user's need for information.

Linguistics tools determine the intent behind keywords. For example, a user searching for MP3 player would be interested in a hit that matched iPod. If the site only shows results for the keywords MP3 and Player, a sale could be lost.

In order to achieve relevancy, linguistic processing is performed both at the document level during document processing and at the query level during query and result processing. On the query side, linguistic processing results in a query transformation. On the document side, linguistic processing results in document enrichment prior to indexing in order to cover grammatical forms and synonyms.

FAST ESP provides a comprehensive set of linguistic features.

Linguistics Concepts

There are a number of basic linguistics concepts that are used throughout the documentation. Understanding these concepts makes it easier to understand how relevancy with respect to linguistics works in FAST ESP. These concepts include: entity extraction, lemmatization, tokenization, normalization, synonym expansion, and spell checking.

Entity extraction is isolating known linguistic constructs, such as proper names or location designators.

Synonyms are words that are related in meaning, such as notebook and laptop.
In a search engine, synonym expansion can be performed at query time or index time. Synonym expansion at query time lets the search system administrator modify thesauri when necessary without the need to re-index.

Lemmatization is the aggregation of different word forms to enable search across different forms of the same word (such as products and product). Lemmatization enables searches to match documents with similar meaning, but different word forms in the document or the query. Lemmatization has similar effects to stemming, but is more precise, as it is based on dictionaries.

Tokenization (also called segmentation) is the detection of white space characters and symbols that separate words from each other and that are not relevant to the matching process. More complex tokenization is used for CJK languages. For Asian languages, tokenization and lemmatization (by reduction) are combined in one processing step.

Character normalization is the replacement of characters or character sequences with others to enable search across variants of words that differ in accents or other character properties. An example is the mapping of the French é to the unaccented e. Character normalization improves recall, but may have a negative impact on precision. It can be beneficial in languages that have accented characters and other non-ASCII characters that are used inconsistently or in different variations.

Phonetic normalization is normalization using phonetic matching rules and is performed on the query and document side. Terms that are written differently but sound the same can give the same result. For example, if searching for the name Eyvind, a user could type in Eyvind or Oyvind and get the same result. Contact your FAST Account Manager for configuration information.
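The recall and precision trade-off of character normalization can be seen in a short sketch. It uses Unicode decomposition from the Python standard library to remove accents; this is a generic technique chosen for illustration, not the normalization tables used by FAST ESP, and the function names are hypothetical.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(term):
    """Accent- and case-insensitive form of a term."""
    return strip_accents(term).lower()


# Recall improves: the accented and unaccented spellings now match.
same_word = normalize("Côte") == normalize("Cote")

# Precision may suffer: two different French words collapse to one form.
collapsed = normalize("côte") == normalize("côté")
```

Both comparisons evaluate to true, which is why uniform normalization helps with inconsistently accented content but cannot distinguish "côte" from "côté".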

The Offensive Content Filter is a document analysis tool to filter content regarded as offensive. The filter is implemented as a separate document processor that can be added to an ESP pipeline. Refer to the Advanced Linguistics Guide for more information.

Dictionaries

Some linguistic features depend on dictionaries. By default, FAST ESP provides dictionaries for lemmatization, entity extraction, spell checking including proper name and phrase recognition, synonym expansion, and variation expansion.

For details on how to edit dictionaries, refer to Configuring Linguistic Processing in the Configuration Guide, and to the Advanced Linguistics Guide (per feature).

Automatic Language Detection

During document processing, documents can be analyzed to detect the language in which they are written. This functionality is provided by the Automatic Language Detection feature. Detecting the language of a document is essential to all other linguistic analysis features, as the resulting language information is used to select language-specific dictionaries and algorithms during document processing and query processing.

During language detection, a given document is analyzed for all supported languages. For each language, a score is calculated based on the occurrences, number, and length of certain test strings. The language or languages that reach the highest score, and for which the score exceeds a preset threshold, are specified as the document languages.

Attention: For queries, the language has to be explicitly set by the end-user or search application, as the query itself generally provides too little context for determining the language it is written in.

Default Language

If the language of the document cannot be determined, a value of "unknown" is specified for the document element. The fallback value can be set in the parameter FallbackLanguage in the LanguageAndEncodingDetector.
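The scoring scheme described for language detection (test strings, a threshold, and a fallback value) can be sketched as follows. The test strings, the scoring formula, and the threshold are invented for illustration and are far smaller than anything a real detector would use; the fallback argument only plays the role that the FallbackLanguage parameter has in the product.

```python
# Hypothetical, tiny sets of test strings per language.
TEST_STRINGS = {
    "en": [" the ", " and ", " of "],
    "fr": [" le ", " et ", " les "],
    "de": [" der ", " und ", " die "],
}


def language_scores(text):
    """Score each language by occurrences of its test strings, weighted by length."""
    padded = " " + text.lower() + " "
    return {
        language: sum(padded.count(s) * len(s.strip()) for s in strings)
        for language, strings in TEST_STRINGS.items()
    }


def detect_language(text, threshold=4, fallback="unknown"):
    """Return the best-scoring language if its score exceeds the threshold."""
    scores = language_scores(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else fallback


english = detect_language("The cat and the dog of the house")
french = detect_language("Le chat et les chiens")
undetermined = detect_language("xyzzy")
```

A short string such as a typical query rarely contains enough test strings to exceed the threshold, which is consistent with the note that the query language has to be set explicitly.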
Required Custom Dictionaries

Language detection does not require any custom dictionaries.

Supported Languages

A list of supported languages for automatic language detection can be found in the Advanced Linguistics Guide.

Lemmatization

This section explains the concept of lemmatization. Refer to the Advanced Linguistics Guide for more details on how lemmatization works.

What Lemmatization Means

Generally speaking, lemmatization means the mapping of a word to its base form and/or all its other inflectional forms. Lemmatization can occur for:

• singular or plural for nouns,
• tense and person for verbs,
• positive, comparative, or superlative forms for adjectives.

Lemmatization makes it possible to submit a query for one form of a word and still get matches that contain a different form of the same word. This allows a user to search for a term like car and get both documents that contain the word car and documents that contain the word cars. In contrast to stemming or wildcard search, which would match all documents containing words starting with car, such as cared or career, lemmatization recognizes words as matching terms on the basis of their being inflectional variations of the query word. Lemmatization therefore also takes irregular inflections such as tooth and teeth into account.

Refer to the Advanced Linguistics Guide for more information on lemmatization.

Advanced Phrase Recognition and Lemmatization

Advanced Phrase Recognition is spell checking for phrases. Lemmatization and Advanced Phrase Recognition cannot be applied to the same query term at the same time.

Lemmatization is not applied to query terms that are recognized as proper names or phrases. These terms are matched only against the usual search index. For example, FAST Search may be included in the list of proper names, which would exclude the inflections fasts and searches in the lemmatized index. Likewise, a search for FAST, recognized as a proper name, is not expanded. This means that in a standard FAST ESP configuration and Search Front End, lemmatization is available for the default search index in the any word and all words query modes, but not in the exact phrase mode.
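The difference between lemmatization and prefix matching, and the protection of proper names, can be sketched with a toy dictionary. The dictionary contents and function names are hypothetical, and the sketch shows only query-side expansion; it does not represent how FAST ESP stores or applies its lemmatization dictionaries.

```python
# Hypothetical lemmatization dictionary: base form -> all inflectional forms.
LEMMA_FORMS = {
    "car": ["car", "cars"],
    "tooth": ["tooth", "teeth"],
}
# Reverse lookup so that any inflected form finds its base form.
FORM_TO_LEMMA = {
    form: lemma for lemma, forms in LEMMA_FORMS.items() for form in forms
}
# Hypothetical list of recognized proper names, which are never expanded.
PROPER_NAMES = {"fast"}


def expand_term(term):
    """Return the word forms a query term should match."""
    key = term.lower()
    if key in PROPER_NAMES:
        return [term]  # protected from lemmatization
    lemma = FORM_TO_LEMMA.get(key)
    return list(LEMMA_FORMS[lemma]) if lemma else [term]


def expand_query(terms):
    """Expand every term; each inner list is an OR over word forms."""
    return [expand_term(term) for term in terms]


expanded = expand_query(["teeth", "career", "FAST"])
```

The irregular form teeth is expanded to tooth and teeth, career is left alone even though it starts with car, and FAST stays unexpanded because it is treated as a proper name.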
When advanced phrase recognition and lemmatization are applied simultaneously to a query, advanced phrase recognition overrides lemmatization. This also applies if your Search Front End offers lemmatization and advanced phrase recognition as options that can be selected simultaneously rather than as mutually exclusive functions. It is therefore recommended to provide these two selections as mutually exclusive radio buttons on your Search Front End.

Synonyms and Spell Variations

Synonyms are words that have the same or similar meaning, for example, live and dwell. In ESP, spelling variations can be viewed as a special case of synonyms.

Synonym Overview

There are two available options for synonym handling:

• Query-side synonym expansion. This enables dictionary-based synonym expansion on the query side.
• Index-side synonym expansion. This feature enables synonym expansion in a way similar to lemmatization: a document to be indexed is expanded with a defined list of synonyms or spell variations of the words it originally contains.

As with lemmatization by expansion on the document side, the original document is indexed in the original search index, whereas the expanded document is indexed in a separate expanded index. This allows you to control synonym expansion on a per-query basis. You can decide whether a query is to be executed with synonym expansion, in which case the query is sent to the synonym index, or without synonym expansion, in which case the query is sent only to the original index.

Dictionary Management

Dictionaries can be edited with LingStudio or with the dictman tool, both explained here.

Synonym dictionaries in FAST ESP can be edited with LingStudio, an interface for advanced editing of dictionaries and lemmatization strategies. Online help for LingStudio is available through the LingStudio application itself.

You can also edit dictionaries with the Dictionary Management (dictman) Tool. It is a command-line based tool that lets you update, extend, and maintain your dictionaries. The tool can be run interactively or as a batch processor.

It is also possible to edit query-side synonym dictionaries using the Search Business Center.

For procedures on how to use the Dictionary Management Tool, refer to Configuring Linguistic Processing in the Configuration Guide, and to the Advanced Linguistics Guide.

Advanced Phrase Recognition

Advanced Phrase Recognition is based on a mapping of the query terms against a dictionary of names and phrases, whose content you can modify.

Advanced Phrase Recognition includes phrase and proper name detection. Typical proper names are product names, trademarks, product models, part numbers, promotion codes, or stock keeping units. In general, proper names are not part of a language's usual vocabulary, such as expressions like CJK-400ex. Furthermore, proper names or phrases can be words of a language that have a particular semantic value within the content, such as expressions like Data Search.
In either case, proper names and phrases are protected from lemmatization and anti-phrasing.

Restriction: Note that Advanced Phrase Recognition is not available for Chinese, Japanese, or Korean.

Query Transformations

There is one query transformer that handles the spell checking framework. The didyoumean QT handles the transformation of didyoumean queries, phrases, and words.


Figure 10: The Didyoumean QT is where Advanced Phrase Recognition is handled

Advanced Phrase Recognition applies the following transformations to a query:

• It detects implicit phrases and proper names in the query and phrases them explicitly by adding quotation marks ("the phrase"). This means that the detected phrase is protected from further query transformation. In addition, the query will return phrase matches only. By creating, for instance, a list of product names, you can ensure that queries are directed to the desired pages that match the implicit product name phrase.
• It detects and corrects misspelled phrases and words. Implicit phrases in the query are spell checked and corrected. If the dictionary contains the phrase "nissan micra", the query "nissan macra" is detected as a misspelling and corrected to "nissan micra". The spell check even detects the phrase if both terms are misspelled. The phrase "nisan macra", for instance, would then be corrected to "nissan micra".
• It detects and corrects query terms with alternative spell grouping. If the dictionary contains the term "thinkpad", a query "think pad" is corrected to "thinkpad". If the dictionary contains the term "alpha server", a query "alphaserver" is corrected to "alpha server".

Advanced Phrase Customization

Phrase dictionaries are customizable. In addition, a list of exceptions allows you to fine-tune terms that are close to both proper names and valid words in the supported languages.

Advanced Phrase Recognition may be applied in several sequential steps, which may use increasingly broader dictionaries. This way, you can, for example, apply Advanced Phrase Recognition starting with a narrow list of company specific product names followed by domain specific terms such as computing, pharmaceutical, or engineering terms.

For details about customizing Advanced Phrase Recognition, contact your FAST Account Manager or FAST Technical Support.
Advanced Phrase Recognition and Spell Checking

Advanced Phrase Recognition is also used for spell checking (see Spell Checking and Phrase Recognition Framework). A query term that is close to a proper name is replaced with the proper name. The spell checking is also applied to phrases, including word splitting and joining. Thus, "datasearch" or "ffast", for example, are recognized as "data search" and "fast" respectively if the dictionary of proper names includes those entries.

Advanced Phrase Recognition also provides a list of exceptions that avoid spell checking of words that are similar to the defined proper names. For example, assuming your content contains the product name eserver, the English word server should probably not be changed to eserver. In this case, the word server is added to the exception list for proper name and phrase recognition.

Tip: It is recommended that you modify the exception list on the basis of past queries.

Applying Advanced Phrase Recognition

Advanced Phrase Recognition is applied as part of the Advanced Spell Checking. The search string that the end-user submitted to the system is analyzed. Depending on the configuration of the FAST Query & Result Server, the result of the query analysis is either sent directly to the FAST Search Engine or sent back to the end-user as feedback.

For details about how to apply Advanced Phrase Recognition, refer to the Advanced Linguistics Guide.

Spell Checking and Phrase Recognition Framework

The purpose of spell checking is to improve the quality of the queries by comparing the query terms against dictionaries and identifying misspelled query terms. As a result of the spell checking process, FAST ESP either replaces the query terms automatically with the correct terms, or it suggests modifications to the query terms to the end-user. The latter is referred to as Didyoumean spell checking.

The spell checking algorithm operates on individual query segments. A query segment is a portion of the query that forms a syntactical entity of some kind. For example, if something within the query is put in quotes, that quoted part forms a query segment.

Spell checking a query is executed in two stages: first, an Advanced Spell Check is performed, followed by a Simple Spell Check.

Restriction: Note that spell checking is not available for Chinese, Japanese, or Korean.
Phrase Recognition and Correction

During the Advanced Spell Check stage, the query terms are run through Advanced Phrase Recognition. FAST ESP supplies a default dictionary containing names of persons, names of places, names of companies, and other common phrases. You can extend this dictionary with your own custom phrases, for instance product names. This stage combines phrase detection with spell checking. The Advanced Spell Check stage enables all query transformation capabilities included in Advanced Phrase Recognition. Refer to the section Spellchecking Framework in the Advanced Linguistics Guide for more information.

Note: For previous FDS 4.x users: this was previously referred to as Proper Name Recognition.

Spell Checking on Simple Terms

The Simple Spell Check stage supports spell checking of individual terms against language specific dictionaries. (See Supported Languages for Simple Spell Checking.) This spell check stage only detects misspellings of single words, not phrases. Simple spell checking does not protect the corrected terms from further processing.

Applying Spell Checking

Spell checking is applied during query processing. Spell checking is controlled by the license file (the feature itself and language support).

You activate the Simple Spell Check dictionaries as part of the installation process. You enable and configure Advanced Spell Check by adapting the required dictionary source files to your content and your end-users' needs, compiling the dictionaries, and configuring the appropriate query transformer. For details, refer to the Advanced Linguistics Guide. Spell checking can be controlled on a per query basis (on/off/suggest).

Required Dictionaries for Spell Checking

Both Advanced and Simple Spell Checking require a set of dictionaries. Refer to Configuring Linguistic Processing in the Configuration Guide for dictionary file locations.

Advanced Spell Check Dictionaries

The following dictionaries support advanced spell checking:

phrase dictionaries: FAST ESP supplies a phrase dictionary that contains common phrases such as names of famous persons (for instance "elvis presley"), names of places (for instance "san francisco"), and names of companies (for instance "kraft foods"). You may modify the supplied phrase dictionary by adding or removing terms. Alternatively, you may create separate phrase dictionaries that contain customer specific phrases only. If you create multiple phrase dictionaries, you can enable selection of a specific phrase dictionary for spell checking at query time. If a query phrase does not exactly match any entry in the selected or default dictionary, but is close to some dictionary entries, the phrase that is considered the closest match is suggested as a replacement for the original query phrase. If a query phrase matches an entry in the dictionary exactly, the phrase is protected by quotes and Simple Spell Check is not allowed to change the terms of the phrase. If no entries in the dictionary are close to matching the query phrase, the query phrase remains unchanged and is sent to the FAST Search Engine.
the phrase exception list: FAST ESP supplies a default phrase exception list that contains words that are not to be considered for spell checking. When a query term matches an entry in the exception list, the term is protected from spell checking changes. You can adapt this phrase exception list to suit your content.

Note: All phrase dictionaries are language-independent. However, the default phrase dictionaries supplied with FAST ESP are optimized for English.

Simple Spell Check Dictionaries

The following dictionaries support simple spell checking:

single word dictionaries: FAST supplies language specific dictionaries that contain common words for the particular language. If a query term does not match any entry in the dictionary, but is close to some dictionary entries, the term that is considered the closest match is suggested as a replacement for the original query term. If a query term exactly matches an entry in the dictionary, or no entries in the dictionary are considered close to matching the query term, the query term remains unchanged and is sent to the FAST Search Engine. You may modify the supplied single word dictionaries by adding or removing terms.

single word exception lists: Single word exception lists are dictionaries that contain words that are not to be considered for spell checking. When a query term exactly matches an entry in the exception list, the term is protected from Simple Spell Check.

Note: In contrast to the phrase exception list, the single word exception lists are language specific.

Supported Languages for Simple Spell Checking

Dictionaries for simple spell checking are provided for the following languages: Dutch, English, French, German, Italian, Norwegian, Portuguese, Spanish.

During installation, you select which of the supported languages to install. This is mainly in order to save disk space. To change this after installation, contact your FAST Account Manager or FAST Technical Support.

Dictionaries may also be provided for the following languages (contact your FAST Account Manager or FAST Technical Support for details): Arabic, Czech, Danish, Estonian, Finnish, Hungarian, Latvian, Lithuanian, Polish, Romanian, Russian, Swedish, Turkish, Ukrainian, Hebrew.

Anti-Phrasing

Anti-phrasing removes common phrases from query strings. These common phrases are defined in the anti-phrasing dictionary. This way, query strings like "Who is Miles Davis?" are reduced to "Miles Davis", which improves query recall, particularly for AND queries. Anti-phrasing has less effect on the results for OR queries. It may still enhance precision, because it may reduce the result rank of documents with many irrelevant occurrences of "who is" in parts of the document where "Miles Davis" does not appear.
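The reduction of "Who is Miles Davis?" to "Miles Davis" can be sketched as follows. This is a minimal illustration, not the FAST ESP implementation: the anti-phrase entries, the function name strip_anti_phrases, and the removal of a trailing question mark are assumptions.

```python
import re

# Hypothetical anti-phrasing entries; the product ships its own dictionary.
ANTI_PHRASES = ["who is", "what is", "where can i find"]


def strip_anti_phrases(query, anti_phrases=ANTI_PHRASES):
    """Remove whole anti-phrases from `query`, leaving single words untouched."""
    result = query
    # Remove longer phrases first so that overlapping entries behave predictably.
    for phrase in sorted(anti_phrases, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        result = re.sub(pattern, " ", result, flags=re.IGNORECASE)
    # Collapse the whitespace left behind and drop a trailing question mark
    # (the punctuation handling is an assumption of this sketch).
    result = re.sub(r"\s+", " ", result).strip()
    return result.rstrip("?").strip()


print(strip_anti_phrases("Who is Miles Davis?"))
print(strip_anti_phrases("whois lookup"))
```

Because only complete phrases are matched at word boundaries, a query such as "whois lookup" is left unchanged.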

Anti-phrasing is closely related to the concept of stopwords. In contrast to stopword removal, however, the anti-phrasing feature does not remove single words, only entire phrases. Removing single words carries the risk of removing important words that happen to be identical to stopwords. Phrases, in contrast, are less ambiguous and can therefore be removed from the query more safely.

Required Dictionaries for Anti-Phrasing

The following dictionary is involved (required or optional) in anti-phrasing: the default anti-phrasing dictionary. This is a common dictionary for all supported languages.

Supported Languages for Anti-Phrasing

Anti-phrasing is supported for the following languages: Dutch, English, French, German, Italian, Japanese, Korean, Norwegian, Portuguese, Spanish.

Anti-phrasing may also be provided for the following languages (contact your FAST Account Manager or FAST Technical Support for details): Arabic, Czech, Danish, Estonian, Finnish, Hungarian, Latvian, Lithuanian, Polish, Romanian, Russian, Swedish, Turkish, Ukrainian.

Sub-String Search

FAST ESP supports sub-string search, that is, searching for parts of a string as with a wildcard search ("*term*"). Sub-string search can also be used to enable n-gram matching for Chinese, Japanese, and Korean. Refer to the Advanced Linguistics Guide for more information.

Sub-String Search Overview

This section explains how sub-string search compares to wildcard search, and how a sub-string search works.

A wildcard search uses a wildcard character to substitute for any other character or characters in a string. Refer to Wildcard Search for more information.

Sub-string search is based on a specific composite field configuration within the index profile. By setting a composite field to be a sub-string field, you enable your end-users to search for sub-strings of arbitrary lengths and at arbitrary positions inside the indexed content. Wildcard search does not work with phrases; sub-string search does.

Restriction: Sub-string search is not available for scope fields.

When enabled, sub-string search is applied to both documents and queries. For a field in the index profile that is specified for sub-string search, each word or token (for Asian language documents) in the field is split into smaller entities, so-called sub-strings, consisting of a defined number of characters. For example, the word "midsummer" is split into the sub-strings "mids", "idsu", "dsum", "summ", "umme", and "mmer" if the specified sub-string length is four characters. This allows the end-user to submit, for example, the query "summer" and find a document that contains the word "midsummer". The end-user's query is split into the sub-strings "summ", "umme", and "mmer". During the matching process, the document containing the word "midsummer" and the query "summer" result in a match because they have sub-strings in common.

You may configure the length of the sub-strings into which a word or token is to be split. In addition, you may configure whether white space or other non-word characters functioning as word separators are to be stripped away, so that sub-strings across words are matched as well.
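The "midsummer" example can be reproduced with a short sketch. The function names substrings and matches are hypothetical, and the matching rule used here (every sub-string of the query must occur among the sub-strings of the document token) is an assumption of this sketch; the actual engine may also take positions into account.

```python
def substrings(token, n=4):
    """Split `token` into overlapping sub-strings of length `n`."""
    token = token.lower()
    if len(token) <= n:
        # Tokens no longer than n are kept whole in this sketch.
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def matches(query, document_token, n=4):
    """Assumed rule: all query sub-strings must occur in the document token."""
    doc_grams = set(substrings(document_token, n))
    return all(gram in doc_grams for gram in substrings(query, n))


print(substrings("midsummer"))
print(substrings("summer"))
print(matches("summer", "midsummer"))
```

The first call yields the six sub-strings listed in the text, the second yields "summ", "umme", and "mmer", and the third reports a match.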
Application Scenarios

Sub-string search is useful for application scenarios like the following:

For certain database applications it may be desirable to be able to search for sub-strings within individual fields, such as product code fields or name fields.

Many languages combine several individual words into new words. German, for instance, uses this mechanism a lot. Sub-string search allows your end-user to find documents containing the word "Staatsanwaltschaft" using the query "anwalt".

In text written in Chinese, Japanese, and Korean, there are commonly no spaces separating individual words. To tokenize documents that are written in these languages, FAST ESP uses a specific, language sensitive tokenizing document processor in order to find logical places for word boundaries. However, the process of finding word boundaries is sometimes ambiguous, and your end-users may want to search for words going beyond what this tokenizer can output. In these cases, the sub-string search functionality enables queries that are not sensitive to this tokenizer, but go across the word boundaries the tokenizer has come up with, thus matching any sequence of characters.

Besides Chinese, Japanese, and Korean, there are other languages that do not use spaces between words either. Dedicated tokenizers are not provided for these. In this case, sub-string search still allows the end-user to search for individual words.

In certain scenarios, documents lack the structure needed for useful word splitting. Examples are DNA strings and musical MIDI descriptions. In these cases, sub-string search allows the end-user to search through these types of documents.

Sometimes separating the characters in a word with spaces is used to emphasize the word, as in the phrase "His name was E L V I S". In this case, sub-string search allows the end-user to find a document containing this phrase by searching for "Elvis".

Some acronyms may have different spellings, like "D.N.A.", "DNA", or "dna".
Sub-string search can be one of the alternatives to allow the end-user to find documents containing "D.N.A." by searching for "dna", and vice versa.

A side effect of this tokenization is that word boundaries are not detected. This means that a query "*erni*" would also match the text "Midsummer Night". This can sometimes be desired, but may also create undesired matches. Sub-string search is therefore not always suitable for ordinary text documents of reasonable size, as the probability of such undesired matches across word boundaries increases with document size. For structured data and Asian language documents, though, sub-string search is a reasonable solution.

Applying Sub-String Search

Sub-string search is enabled by defining the relevant fields as subject to sub-string search in the index profile. You may configure the length of the sub-strings into which a word or token is to be split. For Western languages, the recommended length is at least four characters. For Asian languages, the recommended length is two to three characters. In addition, you may configure whether white space or other non-word characters functioning as word separators are to be stripped away, so that sub-strings across words are matched as well. Wildcards are implicit, meaning that you get the same results by searching for "summer" as you get when you search for "*summer*".

Wildcard Search

FAST ESP supports single character, prefix and suffix wildcards. With full wildcard support it is possible to use '*' and '?' when specifying a query term, where '*' matches any number of characters and '?' matches a single character. The wildcard characters may be anywhere in the query term. Sub-string search is a related feature. For details, refer to section Sub-String Search.

Wildcard search is enabled by defining the relevant fields as subject to wildcard search in the index profile. For details refer to the Configuration Guide. Wildcard support is defined for an individual string field or a composite field in the index profile. It must be configured explicitly because it has an impact on disk usage.

Restriction: Proximity and context based ranking does not apply to wildcard terms in queries. This means that if a query contains only wildcard terms, there will be no rank value in the result set.
If you have a query containing both normal terms (without wildcards) and wildcard terms, the ranking is based on the non-wildcard terms only.

Special Characters and Accents

By default, special characters, such as characters with accents or language specific characters, are preserved in documents, dictionaries, and queries. This means that words that contain special characters are treated as different words from their normalized variants. It is possible to configure FAST ESP to normalize words with respect to accents and special character sequences (such as "C++"). You can enable character normalization in the tokenizer configuration. For documents, you can enable character normalization by using the Normalizer document processor in the relevant pipeline. Refer to the Advanced Linguistics Guide for details.
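To make the idea of accent normalization concrete, the following sketch strips accents using Unicode decomposition. It is only an illustration of the general technique: it does not reproduce the Normalizer document processor or the tokenizer configuration, and the function name strip_accents is hypothetical.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("café"))
print(strip_accents("Ångström"))
```

With normalization of this kind applied to both documents and queries, "café" and "cafe" would be treated as the same word. Characters that have no Unicode decomposition, such as "ø", are left unchanged by this approach and would need explicit mapping rules.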

Chapter 13 Operation and System Administration

This chapter introduces you to the basic concepts of operating and administrating a FAST ESP installation.

Topics: Operation Overview; ESP Administrator Interface; FAST Home and Search Business Center; Licensing; Fault Tolerance; Security

Note: While this chapter gives you a conceptual view of operation and administration within FAST ESP, the Operations Guide and the Deployment Guide give you detailed procedural information about individual operational and administrative tasks.


Operation Overview

This section introduces you to the basic concepts of operating and administrating a FAST ESP installation. It gives you a conceptual view of operation and administration within FAST ESP. Refer to the Operations Guide and the Deployment Planning Guide for procedural information about individual operational and administrative tasks.

ESP Administrator Interface

FAST ESP is administrated through the Administrator Interface (also referred to as the Admin GUI). This is a graphical user interface that is accessible through a common Web browser.

Figure 11: FAST ESP Administrator Interface (Admin GUI)

Main Views

The FAST ESP Administrator Interface (Admin GUI) is a graphical user interface with different tabs that let you configure different areas of a search setup. The main views in the Admin GUI are: Collection Overview; System Overview; Document Processing; Logs; Search View, Search Front End; Data Sources; System Management; Matching Engines; WebAnalyzer.

Collection Overview

The Collection Overview selection in the Admin GUI allows you to monitor, create, configure, and delete collections running on your FAST ESP implementation. For details on the concept of collections, refer to Collections. For procedural information about the tasks you can perform within the Collection Overview selection, refer to Basic Setup in the Configuration Guide.


Document Processing

The Document Processing selection in the Admin GUI allows you to configure the document processing pipelines you want to use to process a collection. It displays statistics, such as host, port, status, stages, or pipelines, for each stage of a pipeline. You can view, create, add, edit, or remove stages of the pipeline through this selection. For details on the concepts related to document processing, refer to Processing Documents. For procedural information about the tasks you can perform within the Document Processing selection, refer to Configuring the FAST Document Processing Engine in the Configuration Guide.

Search View, Search Front End

The Search View selection in the Admin GUI allows you to view the default Search Front End (SFE) provided with FAST ESP. This front end lets you search the documents of your implementation for testing purposes.

Figure 12: Search Front End, showing Contextual Search tab

For information about the Search Front End, refer to Processing Queries and Results, section Query Processing, and to the Query Integration Guide. Refer also to the Search Front End User's Guide and the Search Front End Developer's Guide for information on the SFE.

System Management

The System Management selection allows you to view status information for any node controller that is configured in the system. Information includes the node name, date and time of creation, general system information such as the name of the home directory and memory currently being used on the disk, and a list of all installed modules. This selection allows you to stop or restart any or all of the installed modules as well as add or remove an available processor server or FAST Crawler. You can also stop an entire node from this page. For procedural information about the tasks you can perform within the System Management selection, refer to the Operations Guide.

Matching Engines

The Matching Engines selection in the Admin GUI allows you to view hostname, port number and type for each Search Engine in your FAST ESP installation, and to add a new Search Engine. For details on the FAST Search Engine, refer to Processing Queries and Results. For procedural information about how to configure the Search Engine, refer to Index Profile Management in the Configuration Guide.

Data Sources

The Data Sources selection in the Admin GUI provides a list of available data sources and allows you to view the collections a data source is associated with. For details on individual data source modules, refer to Processing Queries and Results. For details on how to configure individual data source modules, refer to Basic Setup in the Configuration Guide.

Logs

The Logs selection in the Admin GUI allows you to view all system generated log files by file name, category, module, or collection. Log entries include the time the entry was generated, the log level, the module, host and port where the activity occurred, the collection, and a text message describing the activity. Archived logs can also be accessed from this page. For logging information, refer to Configuring Logging in the Operations Guide, which explains how to configure logging, log levels, and log destinations, and to the Troubleshooting Guide for individual log and error messages.

System Overview

The System Overview selection in the Admin GUI gives you a total overview of your FAST ESP installation as well as status information for any modules that are configured in the system. Information includes the module name, host, port number, and status as well as the option to view detailed information for a specific module.

WebAnalyzer Overview

The WebAnalyzer is a module that uses links between documents to improve search relevancy. The WebAnalyzer Overview tab in the Admin GUI allows you to monitor, create, configure, and delete WebAnalyzer views running on your FAST ESP implementation.
Refer to the WebAnalyzer Guide for information about the WebAnalyzer, including procedural information about the tasks you can perform from the WebAnalyzer Overview tab.

FAST Home and Search Business Center

FAST Home and Search Business Center are applications for setting up and tuning search profiles, and for setting up and tuning the search experience for each of the search profiles. From a single FAST ESP installation, you can manage multiple search profiles and search experiences.

FAST Home is your personal portal to the FAST ESP installation, with links to the other FAST applications, such as Search Business Center and the Administration GUI. FAST Home is where you create and set up the initial search profiles, and where you manage the users and groups that should have access to work with the search profiles.


Figure 13: Example of the Search Business Center interface

Search Business Center is the central hub for all tuning, monitoring, administering, and reporting of your search environment. You can manage ranking, relevancy, synonyms, navigators and more. Search Business Center is where you tune and configure the search experience for the search profile before you publish it to your production environment.

In Search Business Center you can monitor the end-users' query behavior (query logs). You can make changes to the search profile settings and test them in the internal Preview before publishing the changes to the Published Search Front End. Once the search profile is up and running in your production environment, updated reporting information from the production system starts flowing back into Search Business Center, and you can see whether your changes have had the desired effect. This way, you can continuously tune and improve the search experience for your search users.

Refer to the FAST Home Guide and the Search Business Center Guide for information on these interfaces.

Licensing

FAST ESP is a system of individually licensed capabilities. These capabilities are features, modules, or data amount capacities. Some of these capabilities are included in the standard delivery of FAST ESP, while others are additional modules for which you can purchase separate licenses and include them in your FAST ESP solution.

Based on the agreement with the customer, FAST generates a license file, which, together with the FAST ESP license management system, ensures that the purchased capabilities are enabled. The license management system is based on FLEXlm from Macrovision Corporation.

Note: FAST ESP is provided to you with the understanding that only those components that have been purchased will be used. Time limited evaluation licenses for FAST ESP are available upon request.

The License Management System
