Apache Solr 3 Enterprise Search Server
Enhance your search with faceted navigation, result highlighting, relevancy ranked sorting, and more
David Smiley
Eric Pugh
Packt Publishing
Birmingham - Mumbai
Table of Contents

Preface 1

Chapter 1: Quick Starting Solr 7
An introduction to Solr 7
Lucene, the underlying engine 8
Solr, a Lucene-based search server 9
Comparison to database technology 10
Getting started 11
Solr's installation directory structure 12
Solr's home directory and Solr cores 14
Running Solr 15
A quick tour of Solr 16
Loading sample data 18
A simple query 20
Some statistics 23
The sample browse interface 24
Configuration files 25
Resources outside this book 27
Summary 28

Chapter 2: Schema and Text Analysis 29
MusicBrainz.org 30
One combined index or separate indices 31
One combined index 32
Problems with using a single combined index 33
Separate indices 34
Schema design 35
Step 1: Determine which searches are going to be powered by Solr 36
Step 2: Determine the entities returned from each search 36
Step 3: Denormalize related data 37
Denormalizing 'one-to-one' associated data 37
Denormalizing 'one-to-many' associated data 38
Step 4: (Optional) Omit the inclusion of fields only used in search results 39
The schema.xml file 40
Defining field types 41
Built-in field type classes 42
Numbers and dates 42
Geospatial 43
Field options 43
Field definitions 44
Dynamic field definitions 45
Our MusicBrainz field definitions 46
Copying fields 48
The unique key 49
The default search field and query operator 49
Text analysis 50
Configuration 51
Experimenting with text analysis 54
Character filters 55
Tokenization 57
WordDelimiterFilter 59
Stemming 61
Correcting and augmenting stemming 62
Synonyms 63
Index-time versus query-time, and to expand or not 64
Stop words 65
Phonetic sounds-like analysis 66
Substring indexing and wildcards 67
ReversedWildcardFilter 68
N-grams 69
N-gram costs 70
Sorting Text 71
Miscellaneous token filters 72
Summary 73

Chapter 3: Indexing Data 75
Communicating with Solr 76
Direct HTTP or a convenient client API 76
Push data to Solr or have Solr pull it 76
Data formats 76
HTTP POSTing options to Solr 77
Remote streaming 79
Solr's Update-XML format 80
Deleting documents 81
Commit, optimize, and rollback 82
Sending CSV formatted data to Solr 84
Configuration options 86
The Data Import Handler Framework 87
Setup 88
The development console 89
Writing a DIH configuration file 90
Data Sources 90
Entity processors 91
Fields and transformers 92
Example DIH configurations 94
Importing from databases 94
Importing XML from a file with XSLT 96
Importing multiple rich document files (crawling) 97
Importing commands 98
Delta imports 99
Indexing documents with Solr Cell 100
Extracting text and metadata from files 100
Configuring Solr 101
Solr Cell parameters 102
Extracting karaoke lyrics 104
Indexing richer documents 106
Update request processors 109
Summary 110

Chapter 4: Searching 111
Your first search, a walk-through 112
Solr's generic XML structured data representation 114
Solr's XML response format 115
Parsing the URL 116
Request handlers 117
Query parameters 119
Search criteria related parameters 119
Result pagination related parameters 120
Output related parameters 121
Diagnostic related parameters 121
Query parsers and local-params 122
Query syntax (the lucene query parser) 123
Matching all the documents 125
Mandatory, prohibited, and optional clauses 125
Boolean operators 126
Sub-queries 127
Limitations of prohibited clauses in sub-queries 128
Field qualifier 128
Phrase queries and term proximity 129
Wildcard queries 129
Fuzzy queries 131
Range queries 131
Date math 132
Score boosting 133
Existence (and non-existence) queries 134
Escaping special characters 134
The Dismax query parser (part 1) 135
Searching multiple fields 137
Limited query syntax 137
Min-should-match 138
Basic rules 138
Multiple rules 139
What to choose 140
A default search 140
Filtering 141
Sorting 142
Geospatial search 143
Indexing locations 143
Filtering by distance 144
Sorting by distance 145
Summary 146

Chapter 5: Search Relevancy 147
Scoring 148
Query-time and index-time boosting 149
Troubleshooting queries and scoring 149
Dismax query parser (part 2) 151
Lucene's DisjunctionMaxQuery 152
Boosting: Automatic phrase boosting 153
Configuring automatic phrase boosting 153
Phrase slop configuration 154
Partial phrase boosting 154
Boosting: Boost queries 155
Boosting: Boost functions 156
Add or multiply boosts? 157
Function queries 158
Field references 159
Function reference 160
Mathematical primitives 161
Other math 161
ord and rord 162
Miscellaneous functions 162
Function query boosting 164
Formula: Logarithm 164
Formula: Inverse reciprocal 165
Formula: Reciprocal 167
Formula: Linear 168
How to boost based on an increasing numeric field 168
Step by step... 169
External field values 170
How to boost based on recent dates 170
Step by step... 170
Summary 171

Chapter 6: Faceting 173
A quick example: Faceting release types 174
MusicBrainz schema changes 176
Field requirements 178
Types of faceting 178
Faceting field values 179
Alphabetic range bucketing 181
Faceting numeric and date ranges 182
Range facet parameters 185
Facet queries 187
Building a filter query from a facet 188
Field value filter queries 189
Facet range filter queries 189
Excluding filters (multi-select faceting) 190
Hierarchical faceting 194
Summary 196

Chapter 7: Search Components 197
About components 198
The Highlight component 200
A highlighting example 200
Highlighting configuration 202
The regex fragmenter 205
The fast vector highlighter with multi-colored highlighting 205
The SpellCheck component 207
Schema configuration 208
Configuration in solrconfig.xml 209
Configuring spellcheckers (dictionaries) 211
Processing of the q parameter 213
Processing of the spellcheck.q parameter 213
Building the dictionary from its source 214
Issuing spellcheck requests 215
Example usage for a misspelled query 217
Query complete / suggest 219
Query term completion via facet.prefix 221
Query term completion via the Suggester 223
Query term completion via the Terms component 226
The QueryElevation component 227
Configuration 228
The MoreLikeThis component 230
Configuration parameters 231
Parameters specific to the MLT search component 231
Parameters specific to the MLT request handler 231
Common MLT parameters 232
MLT results example 234
The Stats component 236
Configuring the stats component 237
Statistics on track durations 237
The Clustering component 238
Result grouping/field collapsing 239
Configuring result grouping 241
The TermVector component 243
Summary 243

Chapter 8: Deployment 245
Deployment methodology for Solr 245
Questions to ask 246
Installing Solr into a Servlet container 247
Differences between Servlet containers 248
Defining solr.home property 248
Logging 249
HTTP server request access logs 250
Solr application logging 251
Configuring logging output 252
Logging using Log4j 253
Jetty startup integration 253
Managing log levels at runtime 254
A SearchHandler per search interface? 254
Leveraging Solr cores 256
Configuring solr.xml 256
Property substitution 258
Include fragments of XML with XInclude 259
Managing cores 259
Why use multicore? 261
Monitoring Solr performance 262
Stats.jsp 263
JMX 264
Starting Solr with JMX 265
Securing Solr from prying eyes 270
Limiting server access 270
Securing public searches 272
Controlling JMX access 273
Securing index data 273
Controlling document access 273
Other things to look at 274
Summary 275

Chapter 9: Integrating Solr 277
Working with included examples 278
Inventory of examples 278
Solritas, the integrated search UI 279
Pros and Cons of Solritas 281
SolrJ: Simple Java interface 283
Using Heritrix to download artist pages 283
SolrJ-based client for Indexing HTML 285
SolrJ client API 287
Embedding Solr 288
Searching with SolrJ 289
Indexing 290
When should I use embedded Solr? 294
In-process indexing 294
Standalone desktop applications 295
Upgrading from legacy Lucene 295
Using JavaScript with Solr 296
Wait, what about security? 297
Building a Solr powered artists autocomplete widget with jQuery and JSONP 298
AJAX Solr 303
Using XSLT to expose Solr via OpenSearch 305
OpenSearch based Browse plugin 306
Installing the Search MB Artists plugin 306
Accessing Solr from PHP applications 309
solr-php-client 310
Drupal options 311
Apache Solr Search integration module 312
Hosted Solr by Acquia 312
Ruby on Rails integrations 313
The Ruby query response writer 313
sunspot_rails gem 314
Setting up MyFaves project 315
Populating MyFaves relational database from Solr 316
Build Solr indexes from a relational database 318
Complete MyFaves website 320
Which Rails/Ruby library should I use? 322
Nutch for crawling web pages 323
Maintaining document security with ManifoldCF 324
Connectors 325
Putting ManifoldCF to use 325
Summary 328

Chapter 10: Scaling Solr 329
Tuning complex systems 330
Testing Solr performance with SolrMeter 332
Optimizing a single Solr server (Scale up) 334
Configuring JVM settings to improve memory usage 334
MMapDirectoryFactory to leverage additional virtual memory 335
Enabling downstream HTTP caching 335
Solr caching 338
Tuning caches 339
Indexing performance 340
Designing the schema 340
Sending data to Solr in bulk 341
Don't overlap commits 342
Disabling unique key checking 343
Index optimization factors 343
Enhancing faceting performance 345
Using term vectors 345
Improving phrase search performance 346
Moving to multiple Solr servers (Scale horizontally) 348
Replication 349
Starting multiple Solr servers 349
Configuring replication 351
Load balancing searches across slaves 352
Indexing into the master server 352
Configuring slaves 353
Configuring load balancing 354
Sharding indexes 356
Assigning documents to shards 357
Searching across shards (distributed search) 358
Combining replication and sharding (Scale deep) 360
Near real time search 362
Where next for scaling Solr? 363
Summary 364

Appendix: Search Quick Reference 365
Quick reference 366

Index 369