Open Source Search. Andreas Pesenhofer. max.recall information systems GmbH Künstlergasse 11/1 A-1150 Wien Austria

Similar documents
Chapter 2. Architecture of a Search Engine

Hibernate Search Googling your persistence domain model. Emmanuel Bernard Doer JBoss, a division of Red Hat

EPL660: Information Retrieval and Search Engines Lab 2

A short introduction to the development and evaluation of Indexing systems

ElasticSearch in Production

Battle of the Giants Apache Solr 4.0 vs ElasticSearch 0.20 Rafał Kuć sematext.com

Chapter 27 Introduction to Information Retrieval and Web Search

Improving Drupal search experience with Apache Solr and Elasticsearch

LAB 7: Search engine: Apache Nutch + Solr + Lucene

Technical Deep Dive: Cassandra + Solr. Copyright 2012, Think Big Analy7cs, All Rights Reserved

Search Engine Architecture II

A Scotas white paper September Scotas Push Connector

Information Retrieval. (M&S Ch 15)

Information Retrieval

Apache Lucene 7 What s coming next?

Creating a Recommender System. An Elasticsearch & Apache Spark approach

VK Multimedia Information Systems

Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch

LucidWorks: Searching with curl October 1, 2012

Session 10: Information Retrieval

Information Retrieval

Information Retrieval

Chapter 6: Information Retrieval and Web Search. An introduction

Parallel SQL and Streaming Expressions in Apache Solr 6. Shalin Shekhar Lucidworks Inc.

DotNetNuke. Easy to Use Extensible Highly Scalable

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

Turbocharge your MySQL analytics with ElasticSearch. Guillaume Lefranc Data & Infrastructure Architect, Productsup GmbH Percona Live Europe 2017

RINGTAIL A COMPETITIVE ADVANTAGE FOR LAW FIRMS. Award-winning visual e-discovery software delivers faster insights and superior case strategies.

Apache Ignite TM - In- Memory Data Fabric Fast Data Meets Open Source

Realtime visitor analysis with Couchbase and Elasticsearch

modern database systems lecture 4 : information retrieval

Semantic Search at Bloomberg

EPL660: Information Retrieval and Search Engines Lab 3

I'm Charlie Hull, co-founder and Managing Director of Flax. We've been building open source search applications since 2001

The Billion Object Platform (BOP): a system to lower barriers to support big, streaming, spatio-temporal data sources

Natural Language Processing

COMP6237 Data Mining Searching and Ranking

Apache Lucene Eurocon: Preview

BM25 is so Yesterday. Modern Techniques for Better Search Relevance in Solr. Grant Ingersoll CTO Lucidworks Lucene/Solr/Mahout Committer

Searching the Web for Information

Viewpoint Review & Analytics

BA Insight Federator. How the BA Insight Federator Extends SharePoint Search

Using Elastic with Magento

Homework: Building an Apache-Solr based Search Engine for DARPA XDATA Employment Data Due: November 10 th, 12pm PT

Starting a Search Application. Mark Krellenstein CTO Lucid Imagination

Apache Solr Learning to Rank FTW!

Ninja Level Infrastructure Monitoring. Defensive Approach to Security Monitoring and Automation

Elementary IR: Scalable Boolean Text Search. (Compare with R & G )

60-538: Information Retrieval

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου

Information Retrieval

Evaluating Cloud Databases for ecommerce Applications. What you need to grow your ecommerce business

Midterm Exam Search Engines ( / ) October 20, 2015

Using Internet as a Data Source for Official Statistics: a Comparative Analysis of Web Scraping Technologies

Using ElasticSearch to Enable Stronger Query Support in Cassandra

Information Retrieval

Soir 1.4 Enterprise Search Server

Istat s Pilot Use Case 1

extensible Text Framework (XTF): Building a Digital Publishing Framework

CS54701: Information Retrieval

Accelerate MySQL for Demanding OLAP and OLTP Use Case with Apache Ignite December 7, 2016

JBoss Enterprise Middleware

Fusing Corporate Thesaurus Management with Linked Data using PoolParty

Search Engines and Time Series Databases

Web scraping. Donato Summa. 3 WP1 face to face meeting September 2017 Thessaloniki (EL)

Building the News Search Engine

vector space retrieval many slides courtesy James Amherst

Multi-domain Predictive AI. Correlated Cross-Occurrence with Apache Mahout and GPUs

Information Retrieval: Retrieval Models

Goal of this document: A simple yet effective

Apache Lucene 4. Robert Muir

rpaf ktl Pen Apache Solr 3 Enterprise Search Server J community exp<= highlighting, relevancy ranked sorting, and more source publishing""

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

MEAP Edition Manning Early Access Program Solr in Action version 1

Building A Billion Spatio-Temporal Object Search and Visualization Platform

Information Retrieval

Search and Time Series Databases

Indexing and Query Processing. What will we cover?

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

MuseKnowledge Hybrid Search

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO 2018

Information Retrieval

Text Analytics (Text Mining)

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES

REAL TIME BOM EXPLOSIONS WITH APACHE SOLR AND SPARK. Andreas Zitzelsberger

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Today we show how a search engine works

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

Apache Lucene - Overview

The Performance Study of Hyper Textual Medium Size Web Search Engine

Text Analytics (Text Mining)

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency

Efficiency vs. Effectiveness in Terabyte-Scale IR

The official TYPO3 partner program

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

CISC689/ Information Retrieval Midterm Exam

Optimal Query. Assume that the relevant set of documents C r. 1 N C r d j. d j. Where N is the total number of documents.

Distributed computing: index building and use

Transcription:

Open Source Search Andreas Pesenhofer max.recall information systems GmbH Künstlergasse 11/1 A-1150 Wien Austria

max.recall information systems max.recall is a software and consulting company enabling enterprises to capitalize on the hidden value in the rapidly growing amount of textual data Customized Solutions for Intelligent data analytics Vertical search Products and Services quantalyze: quantity analytics technology smart.coder: open-ended question coding tool for market researchers Founded 2010 and located in Vienna, Austria Operates worldwide with int l customers from sectors such as IP, market research, news and media, IT services 2

Recall and precision Recall Percent of relevant documents (items) returned 50 good answers in system, 25 returned = 50% recall Precision Percent of documents returned that are relevant 100 returned, 25 are relevant = 25% precision Ideal is 100% recall and 100% precision: return all relevant documents and only those 100% recall is easy return all documents, but precision is low, relevant documents can t be found Need adequate recall & enough precision for the task - that will vary by application (data & users) 3

How to get good recall Collect, index and search all the data Check for missing or corrupt data Index everything Search everything limit results by category AFTER the search (clustering/faceting) Normalize the data Convert to lower case, strip/handle special characters, stemming,... Use spell-checking, synonyms to match users vocabulary with content Adaptive spell-checking, application-specific synonyms Light (or real) natural language processing for abstract concepts 4

How to get good precision Term frequency (TF) more occurrences of query terms is better Inverse document frequency (IDF) rarer query terms are more important Phrase boost query terms near each other is better Field boost where the query term is in doc matters (e.g., in 'title' better) Length normalization avoid penalizing short docs Recency all things being equal, recent is better Authority items linked to, clicked on or bought by others may be better Implicit and explicit relevance feedback, more-like-this expand query Clustering/faceting intent is not specific Lots of data 5

Every minute http://www.domo.com/ 6

Groth of patent applications 7

Big Data Open Source Tools 8

Apache Lucene Apache Lucene TM is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Scalable, High-Performance Indexing over 150GB/hour on modern hardware small RAM requirements - only 1MB heap incremental indexing as fast as batch indexing index size roughly 20-30% the size of text indexed 9

Apache Lucene (2) Powerful, Accurate and Efficient Search Algorithms ranked searching - best results returned first many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more fielded searching (e.g. title, author, contents) sorting by any field multiple-index searching with merged results allows simultaneous update and searching flexible faceting, highlighting, joins and result grouping fast, memory-efficient and typo-tolerant suggesters pluggable ranking models, including the Vector Space Model and Okapi BM25 configurable storage engine (codecs) Cross-Platform Solution 10

Apache Lucene (3) Cross-Platform Solution Available as Open Source software under the Apache License - Lucene in both commercial and Open Source programs 100%-pure Java Implementations in other programming languages available, the index is compatible Apache Lucene 4.5.0 was released on October 5 th, 2013. 11

Apache SOLR Apache SOLR is an open source enterprise search platform from the Apache Lucene project. major features: full-text search hit highlighting faceted search dynamic clustering database integration handling of rich documents (e.g., Word, PDF) providing distributed search and index replication, Solr is highly scalable. Apache SOLR 4.5.0 was released on October 5 th, 2013. 12

elasticsearch elasticsearch is a distributed, RESTful, open source search server based on Apache Lucene. It is developed by Shay Banon and is released under the terms of the Apache License. major features: fully supports the near real-time search of Apache Lucene cluster setup needs no additional software features of Lucene are made available through the JSON and Java API JSON in / JSON out (and YAML) elasticsearch 0.90.5 was released on September 17 th, 2013, based on Lucene 4.4. 13

All Time Top Committers

Active Contributors

Lines of Code

The Mailing Lists

Interest over time

Case study - quantalyze patents.quantalyze.com Runs in your browser Filter and keyword search Physical quantity search Interval search Print view Visualization: Physical quantity distributions Cross-tabulations (e.g. concepts vs. quantity type) Different chart types 19

Case study - StumbleUpon Create a world-class customer experience A Stumble provides real-time recommendations to 30 million customers per day Intelligent search is key to providing fast and more informed recommendations Update your searches immediately with newly posted content Develop and scale easily Build in intelligent search to scale with millions of users and interactions Take advantage of powerful and flexible APIs for easy data integration Use easy to use but powerful solutions for your big data search and analytics needs 20

Strengths of open source search Best practice segmented index (like Google, Fast) Scalability Best practice, flexible ranking (term/field/doc boosts, function queries, custom scoring ) Best overall query performance and complete query capabilities (unlimited Boolean operations, wildcards, findsimilar, synonyms, spell-check ) Multilingual, query filters, geo search, memory mapped indexes, near real-time search, advanced proximity operators Rapid innovation Extensible architecture, complete control (open source) No license fees (open source) 21

Weaknesses of open source search Those typical of open source No formal support Limited access to training, consulting Lack of stringent integrated QA Speed of development and open source environment too complex for some (e.g., what version should I download? What patches? GUI?) Others Lucene/Solr/Elasicsearch development has tended to focus on core capabilities, so missing certain features for enterprise search (e.g., connectors, security, alerts, advanced query operations) 22

Addressing open source weaknesses Community Community has a wealth of information on web sites, wikis and mailing lists Community members usually respond quickly to questions Consultants May be especially helpful for systems integration or addressing gaps Commercialization Companies commercializing open source provide commercial support, certified versions, training and consulting Internal resources 23

Product strengths of commercial competitors Well established players tend to be full-featured Some organizations have focused on a particular application or domain (e.g., ecommerce, publishing, legal, help desk) Some competitors have focused on appliance 24

Weaknesses of top commercial competitors Usually expensive, especially at scale Platform or portability limitations Limited transparency Limited flexibility, especially for other than intended application or domain Limited customization, especially for appliance-like products Sometimes limited scalability Technical debt and/or lack of rapid innovation Customers are dependent on the company s continued business success 25

Competitive landscape Last years commercial companies have felt increasing competition from Lucene/Solr/Elasticsearch because of the combination of its capability and price Some competitors have responded with diversification Some have been acquired Need for good, affordable, flexible search remains 26

Questions

Credits 28