EPL660: Information Retrieval and Search Engines Lab 3


EPL660: Information Retrieval and Search Engines Lab 3. Pavlos Antoniou, Office: B109, ΘΕΕ01, University of Cyprus, Department of Computer Science

Apache Solr
Popular, fast, open-source search platform from the Apache Lucene project. Written in Java; runs as a full-text search server in standalone or distributed (SolrCloud) mode. Solr uses the Lucene Java search library at its core for full-text indexing and search.

Apache Solr Features
- XML/HTTP and JSON APIs
- Hit highlighting
- Faceted search and filtering
- Near real-time indexing
- Database integration
- Rich document (e.g., Word, PDF) handling
- Geospatial search
- Fast incremental updates and index replication
- Caching
- Web administration interface
- etc.

Apache Solr vs Apache Lucene
The relationship between Solr and Lucene is that of a car and its engine: you can't drive an engine, but you can drive a car. Lucene is a library which you can't use as-is, whereas Solr is a complete application which you can use out of the box. Unlike Lucene, Solr is a web application (historically packaged as a WAR) which can run in a servlet container such as Jetty, Tomcat or Resin. Solr can be installed and used easily by non-programmers; Lucene requires programming skills.

When to use Lucene?
When you need to embed search functionality into a desktop application, for example, or when you have very customized requirements requiring low-level access to the Lucene API classes. In such cases Solr may be more a hindrance than a help, since it is an extra layer of indirection.

SolrCloud
Apache Solr includes the ability to set up a cluster of Solr servers that combines fault tolerance and high availability: SolrCloud. SolrCloud allows for distributed search and indexing. SolrCloud features:
- Central configuration for the entire cluster
- Automatic load balancing and fail-over for queries
- ZooKeeper integration for cluster coordination and configuration

SolrCloud Concepts
A Cluster is made up of one or more Solr Nodes, which are running instances of the Solr server process.

SolrCloud Concepts
A Cluster can host multiple Collections of Solr Documents. A Collection can be partitioned into multiple Shards (pieces), each containing a subset of the Documents in the Collection. Each Shard can be replicated (Leader & Replicas).

SolrCloud Concepts
The number of Shards a Collection has determines:
- the theoretical limit to the number of Documents the Collection can reasonably contain
- the amount of parallelization possible for an individual search request
The number of Replicas each Shard has determines:
- the level of redundancy built into the Collection, i.e. how fault tolerant the Cluster can be if some Nodes become unavailable
- the theoretical limit on the number of concurrent search requests that can be processed under heavy load

Getting Started
Download Apache Solr from http://www.eu.apache.org/dist/lucene/solr/7.2.0/solr-7.2.0.tgz (or the zip for Windows). Extract the archive, go to the solr directory, open a terminal and type:
bin/solr start -e cloud -noprompt
This starts a SolrCloud cluster with embedded ZooKeeper (the cloud management service) on the local workstation, with 2 nodes. The first node listens on port 8983 and the second on port 7574. You can verify that Solr is running by loading http://localhost:8983/solr/ in your web browser.
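Once the cluster is running you can also reach it programmatically; a minimal Python sketch using only the standard library (the host and ports follow the defaults above):

```python
import urllib.request

def solr_url(port=8983, host="localhost"):
    """Build the base URL of a local Solr node's web interface."""
    return f"http://{host}:{port}/solr/"

# With the cluster from `bin/solr start -e cloud -noprompt` running,
# this would print an HTTP 200 status for each node:
# for port in (8983, 7574):
#     with urllib.request.urlopen(solr_url(port)) as resp:
#         print(port, resp.status)

print(solr_url())  # → http://localhost:8983/solr/
```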

Solr web interface

SolrCloud
Preview the collections in the Solr Admin UI. One collection, gettingstarted, is created automatically. The collection is partitioned into 2 shards; the first node stores the 2 leader shards and the second stores the 2 replicas. The Solr server is now up and running, with one collection but no data indexed.
Important configuration files: solrconfig.xml and managed-schema, found at
solr-dir/server/solr/configsets/_default/conf/solrconfig.xml
solr-dir/server/solr/configsets/_default/conf/managed-schema

How Solr Sees the World
Document: the basic unit of information, a set of data that describes something. A document about a person, for example, might contain the person's name, biography, favorite color, and shoe size. Documents are expected to be composed of fields, which are more specific pieces of information, e.g. "first_name":"pavlos", "shoe_size":42. Fields can contain different types of data: first_name is text, shoe_size a number. The user defines the type of each field; the field type tells Solr how to interpret the field and how it can be queried. When a document is added to a collection, Solr takes the values from the document's fields and adds them to the index. Queries consult the index and return matching documents.
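In JSON terms, the person document above is just a flat map of field names to typed values; a small illustrative sketch (the id value and exact field set are invented for the example):

```python
import json

# Illustrative Solr document: a set of named, typed fields.
person_doc = {
    "id": "person-1",          # uniqueKey field
    "first_name": "pavlos",    # text field
    "shoe_size": 42,           # numeric field
    "favorite_color": "blue",  # string field
}

# Documents are sent to (and returned from) Solr as JSON like this:
print(json.dumps(person_doc, indent=2))
```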

Field Analysis Process
How does Solr process document fields when building an index? Example: the biography field in a person document:
"biography": "He received his Ph.D. from Department of Computer Science of the University of Cyprus, in 2012"
Index every word of the biography in order to quickly find people whose lives have had anything to do with "university" or "computer". Any issues? What if the biography contains a lot of common words you don't really care about, like "he", "the", "a", "to", "for", "is" (stop words)? What if the biography contains the word "University" and a user queries for "university"? Solution: field analysis.

Field Analysis Process
For each field, you can tell Solr:
- how to break the text apart into words (tokenization), e.g. split at whitespace, commas, etc.
- to remove stop words (filtering)
- to lowercase the text (normalization)
- to remove accent marks
Read more here: Understanding Analyzers, Tokenizers, and Filters
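The steps above can be sketched as a toy analysis chain in Python (the regex tokenizer and the stop-word list are deliberate simplifications of what Solr's analyzers actually do):

```python
import re
import unicodedata

# Toy stop-word list; Solr ships much larger, language-specific lists.
STOP_WORDS = {"he", "the", "a", "to", "for", "is", "of", "in", "from", "his"}

def strip_accents(text):
    """Remove accent marks by decomposing characters and dropping combining marks."""
    return "".join(c for c in unicodedata.normalize("NFD", text)
                   if not unicodedata.combining(c))

def analyze(text):
    """Toy analysis chain: tokenize on word characters, lowercase,
    strip accents, then filter out stop words."""
    tokens = re.findall(r"\w+", text)
    tokens = [strip_accents(t.lower()) for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]

print(analyze("He received his Ph.D. from the University of Cyprus, in 2012"))
# → ['received', 'ph', 'd', 'university', 'cyprus', '2012']
```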

Schema files and manipulation
Solr stores details about the field types and fields it is expected to understand in a schema file:
- managed-schema is the name of the schema file Solr uses by default. It supports making schema changes at runtime via the Schema API (over HTTP) or Schemaless Mode, and avoids hand editing of the schema file.
- schema.xml is the traditional name for a schema file which can be edited manually by users of the ClassicIndexSchemaFactory.
If you are using SolrCloud you may not be able to find either file on the local filesystem; you will only be able to see the schema through the Schema API (if enabled) or through the Solr Admin UI's Cloud screens.
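Runtime schema changes go through the Schema API as JSON commands POSTed to the collection's /schema endpoint; a hedged sketch (the field name below is illustrative, and the POST itself assumes the running gettingstarted collection):

```python
import json
import urllib.request

def add_field_command(name, field_type, stored=True):
    """Build a Schema API 'add-field' command as a JSON request body."""
    return json.dumps(
        {"add-field": {"name": name, "type": field_type, "stored": stored}}
    ).encode("utf-8")

def post_schema_change(body, collection="gettingstarted",
                       base="http://localhost:8983/solr"):
    """POST a schema command to the collection's Schema API endpoint."""
    req = urllib.request.Request(
        f"{base}/{collection}/schema", data=body,
        headers={"Content-Type": "application/json"})
    return urllib.request.urlopen(req).read()

body = add_field_command("biography_txt", "text_en")
# post_schema_change(body)  # needs a running SolrCloud cluster
```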

Field Analysis
The schema defines:
- the kinds of fields available for indexing
- the type of analysis to be applied when indexing or querying each field
- the available field types, such as float, long, double, date, text
Explore the schema using the Schema tab (see next slide). Example: choose the *_txt field to see how Solr behaves for field names ending in _txt.

Field Analysis
Indexed fields are fields which pass through the analysis phase and are added to the index, so that they are searchable/sortable by queries. Stored fields are fields whose original text is stored in the index, so that it is retrievable by queries. (Schema tab)

Field Analysis
Go to the Analysis tab (see next slide) to see how a text value is broken down into words by index- and query-time analysis.
Field Value (Index): He received his Ph.D. from Department of Computer Science of the University of Cyprus, in 2012
Analyse Fieldname / FieldType: text_en

Field Analysis Insert text to Analyze Analysis tab

Field Analysis
The word "of" has been removed as a stop word.

Indexing XML Data
Solr includes a simple command line tool for POSTing various types of content to a Solr server: bin/post on UNIX; different usage on Windows. Let's first index two XML files.
UNIX (remain in the solr directory):
bin/post -c gettingstarted example/exampledocs/solr.xml example/exampledocs/monitor.xml
Windows (go to the example/exampledocs directory):
java -Dc=gettingstarted -jar post.jar solr.xml monitor.xml
You have now indexed two documents in Solr. Browse the documents indexed at http://localhost:8983/solr/gettingstarted/browse
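For reference, the files in example/exampledocs follow Solr's XML update format: an `<add>` element wrapping one or more `<doc>` elements of named fields. A trimmed sketch, abridged from solr.xml:

```xml
<add>
  <doc>
    <field name="id">SOLR1000</field>
    <field name="name">Solr, the Enterprise Search Server</field>
    <!-- further fields elided -->
  </doc>
</add>
```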

Collection browsing

Collection querying

Querying Data via Solr Admin UI
Solr can be queried via REST clients, curl, wget, Chrome POSTMAN, etc., as well as via native clients available for many programming languages. The Solr Admin UI includes a query builder interface: in the Admin interface choose the gettingstarted collection and, in the "Query" tab, click the button to display results. RequestHandlers are specified in solrconfig.xml:
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
  </lst>
</requestHandler>
<initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell,/browse">
  <lst name="defaults">
    <str name="df">_text_</str>
  </lst>
</initParams>
Default search field: _text_

Querying Data via Solr Admin UI
Enter "solr" in the "q" text box to search for "solr" in the index. Why are no results returned? The default field for searching is _text_, and no _text_ field contains "solr". Change df to name and press the button again. Results can also be previewed in the browser:
http://localhost:8983/solr/gettingstarted/select?q=solr&df=name (response in JSON format)
http://localhost:8983/solr/gettingstarted/select?q=solr&df=name&wt=xml (response in XML format)
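These RESTful URLs are easy to build programmatically; a minimal Python sketch of the /select URL construction used above:

```python
from urllib.parse import urlencode

def select_url(q, collection="gettingstarted",
               base="http://localhost:8983/solr", **params):
    """Build a Solr /select query URL; extra keyword arguments
    (df, wt, fl, ...) become request parameters."""
    return f"{base}/{collection}/select?" + urlencode({"q": q, **params})

print(select_url("solr", df="name"))
# → http://localhost:8983/solr/gettingstarted/select?q=solr&df=name
print(select_url("solr", df="name", wt="xml"))
```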

Querying Data via Solr Admin UI
RESTful URL to query Solr; it can be used when querying Solr from custom apps.

Querying Data
Index all .xml documents in example/exampledocs.
UNIX: bin/post -c gettingstarted example/exampledocs/*.xml
Windows: java -Dc=gettingstarted -jar post.jar *.xml
...and now you can search for all sorts of things using the default Solr Query Syntax (a superset of the Lucene query syntax):
video
name:*video*
address_s:*ist*
+video +price:[* TO 400] (docs having video in searchable fields and price up to 400)
-address_s:* (docs that do not have an address_s field)

Updating Data
Although solr.xml has been POSTed to the server twice, a search for q=solr returns only one document:
{ "numFound": 1, "start": 0, "docs": [ { "id": "SOLR1000", ... } ] }
Why? Because the example schema specifies a uniqueKey field called "id". Whenever you POST a command to Solr to add a document with the same value for the uniqueKey as an existing document, Solr automatically replaces it for you.

Updating Data
You can see that this has happened by looking at the values of numDocs and maxDoc in the "CORE"/searcher section of the statistics page:
http://localhost:8983/solr/index.html#/gettingstarted/plugins?entry=searcher&type=core

Deleting Data
You can delete data by POSTing a delete command to the update URL, specifying either the value of the document's unique key field or a query that matches multiple documents:
java -Dc=gettingstarted -Ddata=args -Dcommit=false -jar post.jar "<delete><id>SP2514N</id></delete>"
Delete documents that match a specific query:
java -Dc=gettingstarted -Dcommit=false -Ddata=args -jar post.jar "<delete><query>name:*DDR*</query></delete>"

Querying Data via REST API
Searches are done via HTTP GET on the select URL with the query string in the q parameter. You can pass a number of optional request parameters to the request handler to control what information is returned. Use the "fl" parameter to control which stored fields are returned, and whether the relevancy score is returned:
q=video&fl=name,id (return only name and id fields)
q=video&fl=name,id,score (return relevancy score as well)
q=video&fl=*,score (return all stored fields, as well as relevancy score)
q=video&sort=address_s desc&fl=name,id,price (add a sort specification: sort by address_s descending)
q=video&wt=json (return response in JSON format)

Sorting
Solr provides a simple method to sort on one or more indexed fields. Use the "sort" parameter to specify "field direction" pairs, separated by commas if there is more than one sort field:
q=video&sort=price desc
q=video&sort=price asc
q=video&sort=inStock asc, price desc
"score" can also be used as a field name when specifying a sort:
q=video&sort=score desc
q=video&sort=inStock asc, score desc
Complex functions may also be used to sort results:
q=video&sort=div(popularity,add(price,1)) desc
If no sort is specified, the default is score desc, returning the matches with the highest relevancy first.
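Conceptually, a multi-field sort such as sort=inStock asc, price desc orders results by the first field and breaks ties with the second; a local Python sketch over invented sample docs:

```python
docs = [
    {"name": "video A", "inStock": True,  "price": 399.0},
    {"name": "video B", "inStock": False, "price": 250.0},
    {"name": "video C", "inStock": True,  "price": 179.0},
]

# Emulate "sort=inStock asc, price desc": ascending on inStock,
# then descending on price among docs with equal inStock values.
ranked = sorted(docs, key=lambda d: (d["inStock"], -d["price"]))

print([d["name"] for d in ranked])
# → ['video B', 'video A', 'video C']
```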

Indexing Rich Data
Index local "rich" files, including HTML, PDF, Microsoft Office formats (such as MS Word), plain text and many other formats, found in docs/.
UNIX: bin/post -c gettingstarted docs/

Index Data
There are many other ways to import your data into Solr. One can:
- import records from a database using the Data Import Handler (DIH); see the tutorial here for MySQL or SQL Server database import
- load a CSV file (comma separated values), including those exported by Excel or MySQL
- POST JSON documents
- index binary documents such as Word and PDF with Solr Cell (ExtractingRequestHandler)
- use SolrJ for Java, or other Solr clients, to programmatically create documents to send to Solr
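The "POST JSON documents" path above can be sketched with the standard library alone (the /update/json/docs endpoint and commit=true parameter follow Solr's JSON update interface; the POST itself assumes the running gettingstarted collection):

```python
import json
import urllib.request

def docs_payload(docs):
    """Serialize a list of documents into the JSON body Solr expects."""
    return json.dumps(docs).encode("utf-8")

def index_json_docs(docs, collection="gettingstarted",
                    base="http://localhost:8983/solr"):
    """POST documents to Solr's JSON update endpoint and commit."""
    req = urllib.request.Request(
        f"{base}/{collection}/update/json/docs?commit=true",
        data=docs_payload(docs),
        headers={"Content-Type": "application/json"})
    return urllib.request.urlopen(req).read()

payload = docs_payload([{"id": "SOLR1000", "name": "Solr"}])
# index_json_docs([{"id": "SOLR1000", "name": "Solr"}])  # needs running Solr
```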

Stopping SolrCloud
Stop the SolrCloud nodes:
bin/solr stop -all
Delete the Solr home of the nodes (if needed):
rm -rf example/cloud/node1
rm -rf example/cloud/node2

Useful Links
http://lucene.apache.org/solr/index.html
http://lucene.apache.org/solr/quickstart.html
http://wiki.apache.org/solr/SolrResources
Next Week: ElasticSearch