
Apache Solr 3 Enterprise Search Server
Enhance your search with faceted navigation, result highlighting, relevancy ranked sorting, and more
David Smiley, Eric Pugh
Packt Publishing: open source, community experience distilled
Birmingham - Mumbai

Preface

Chapter 1: Quick Starting Solr
An introduction to Solr
Lucene, the underlying engine
Solr, a Lucene-based search server
Comparison to database technology
Getting started
Solr's installation directory structure
Solr's home directory and Solr cores
Running Solr
A quick tour of Solr
Loading sample data
A simple query
Some statistics
The sample browse interface
Configuration files
Resources outside this book
Summary

Chapter 2: Schema and Text Analysis
MusicBrainz.org
One combined index or separate indices
One combined index
Problems with using a single combined index
Separate indices
Schema design
Step 1: Determine which searches are going to be powered by Solr
Step 2: Determine the entities returned from each search
Step 3: Denormalize related data
Denormalizing 'one-to-one' associated data
Denormalizing 'one-to-many' associated data
Step 4: (Optional) Omit the inclusion of fields only used in search results
The schema.xml file
Defining field types
Built-in field type classes
Numbers and dates
Geospatial
Field options
Field definitions
Dynamic field definitions
Our MusicBrainz field definitions
Copying fields
The unique key
The default search field and query operator
Text analysis
Configuration
Experimenting with text analysis
Character filters
Tokenization
WordDelimiterFilter
Stemming
Correcting and augmenting stemming
Synonyms
Index-time versus query-time, and to expand or not
Stop words
Phonetic sounds-like analysis
Substring indexing and wildcards
ReversedWildcardFilter
N-grams
N-gram costs
Sorting Text
Miscellaneous token filters
Summary

Chapter 3: Indexing Data
Communicating with Solr
Direct HTTP or a convenient client API
Push data to Solr or have Solr pull it
Data formats
HTTP POSTing options to Solr
Remote streaming
Solr's Update-XML format
Deleting documents
Commit, optimize, and rollback
Sending CSV formatted data to Solr
Configuration options
The Data Import Handler Framework
Setup
The development console
Writing a DIH configuration file
Data Sources
Entity processors
Fields and transformers
Example DIH configurations
Importing from databases
Importing XML from a file with XSLT
Importing multiple rich document files (crawling)
Importing commands
Delta imports
Indexing documents with Solr Cell
Extracting text and metadata from files
Configuring Solr
Solr Cell parameters
Extracting karaoke lyrics
Indexing richer documents
Update request processors
Summary

Chapter 4: Searching
Your first search, a walk-through
Solr's generic XML structured data representation
Solr's XML response format
Parsing the URL
Request handlers
Query parameters
Search criteria related parameters
Result pagination related parameters
Output related parameters
Diagnostic related parameters
Query parsers and local-params
Query syntax (the lucene query parser)
Matching all the documents
Mandatory, prohibited, and optional clauses
Boolean operators
Sub-queries
Limitations of prohibited clauses in sub-queries
Field qualifier
Phrase queries and term proximity
Wildcard queries
Fuzzy queries
Range queries
Date math
Score boosting
Existence (and non-existence) queries
Escaping special characters
The Dismax query parser (part 1)
Searching multiple fields
Limited query syntax
Min-should-match
Basic rules
Multiple rules
What to choose
A default search
Filtering
Sorting
Geospatial search
Indexing locations
Filtering by distance
Sorting by distance
Summary

Chapter 5: Search Relevancy
Scoring
Query-time and index-time boosting
Troubleshooting queries and scoring
Dismax query parser (part 2)
Lucene's DisjunctionMaxQuery
Boosting: Automatic phrase boosting
Configuring automatic phrase boosting
Phrase slop configuration
Partial phrase boosting
Boosting: Boost queries
Boosting: Boost functions
Add or multiply boosts?
Function queries
Field references
Function reference
Mathematical primitives
Other math
ord and rord
Miscellaneous functions
Function query boosting
Formula: Logarithm
Formula: Inverse reciprocal
Formula: Reciprocal
Formula: Linear
How to boost based on an increasing numeric field
Step by step...
External field values
How to boost based on recent dates
Step by step...
Summary

Chapter 6: Faceting
A quick example: Faceting release types
MusicBrainz schema changes
Field requirements
Types of faceting
Faceting field values
Alphabetic range bucketing
Faceting numeric and date ranges
Range facet parameters
Facet queries
Building a filter query from a facet
Field value filter queries
Facet range filter queries
Excluding filters (multi-select faceting)
Hierarchical faceting
Summary

Chapter 7: Search Components
About components
The Highlight component
A highlighting example
Highlighting configuration
The regex fragmenter
The fast vector highlighter with multi-colored highlighting
The SpellCheck component
Schema configuration
Configuration in solrconfig.xml
Configuring spellcheckers (dictionaries)
Processing of the q parameter
Processing of the spellcheck.q parameter
Building the dictionary from its source
Issuing spellcheck requests
Example usage for a misspelled query
Query complete / suggest
Query term completion via facet.prefix
Query term completion via the Suggester
Query term completion via the Terms component
The QueryElevation component
Configuration
The MoreLikeThis component
Configuration parameters
Parameters specific to the MLT search component
Parameters specific to the MLT request handler
Common MLT parameters
MLT results example
The Stats component
Configuring the stats component
Statistics on track durations
The Clustering component
Result grouping/field collapsing
Configuring result grouping
The TermVector component
Summary

Chapter 8: Deployment
Deployment methodology for Solr
Questions to ask
Installing Solr into a Servlet container
Differences between Servlet containers
Defining solr.home property
Logging
HTTP server request access logs
Solr application logging
Configuring logging output
Logging using Log4j
Jetty startup integration
Managing log levels at runtime
A SearchHandler per search interface?
Leveraging Solr cores
Configuring solr.xml
Property substitution
Include fragments of XML with XInclude
Managing cores
Why use multicore?
Monitoring Solr performance
Stats.jsp
JMX
Starting Solr with JMX
Securing Solr from prying eyes
Limiting server access
Securing public searches
Controlling JMX access
Securing index data
Controlling document access
Other things to look at
Summary

Chapter 9: Integrating Solr
Working with included examples
Inventory of examples
Solritas, the integrated search UI
Pros and Cons of Solritas
SolrJ: Simple Java interface
Using Heritrix to download artist pages
SolrJ-based client for Indexing HTML
SolrJ client API
Embedding Solr
Searching with SolrJ
Indexing
When should I use embedded Solr?
In-process indexing
Standalone desktop applications
Upgrading from legacy Lucene
Using JavaScript with Solr
Wait, what about security?
Building a Solr powered artists autocomplete widget with jQuery and JSONP
AJAX Solr
Using XSLT to expose Solr via OpenSearch
OpenSearch based Browse plugin
Installing the Search MB Artists plugin
Accessing Solr from PHP applications
solr-php-client
Drupal options
Apache Solr Search integration module
Hosted Solr by Acquia
Ruby on Rails integrations
The Ruby query response writer
sunspot_rails gem
Setting up MyFaves project
Populating MyFaves relational database from Solr
Build Solr indexes from a relational database
Complete MyFaves website
Which Rails/Ruby library should I use?
Nutch for crawling web pages
Maintaining document security with ManifoldCF
Connectors
Putting ManifoldCF to use
Summary

Chapter 10: Scaling Solr
Tuning complex systems
Testing Solr performance with SolrMeter
Optimizing a single Solr server (Scale up)
Configuring JVM settings to improve memory usage
MMapDirectoryFactory to leverage additional virtual memory
Enabling downstream HTTP caching
Solr caching
Tuning caches
Indexing performance
Designing the schema
Sending data to Solr in bulk
Don't overlap commits
Disabling unique key checking
Index optimization factors
Enhancing faceting performance
Using term vectors
Improving phrase search performance
Moving to multiple Solr servers (Scale horizontally)
Replication
Starting multiple Solr servers
Configuring replication
Load balancing searches across slaves
Indexing into the master server
Configuring slaves
Configuring load balancing
Sharding indexes
Assigning documents to shards
Searching across shards (distributed search)
Combining replication and sharding (Scale deep)
Near real time search
Where next for scaling Solr?
Summary

Appendix: Search Quick Reference
Quick reference

Index