FAST & SCALABLE SYSTEMS WITH APACHE SOLR. Arnon Yogev, IBM Research


FAST & SCALABLE EMAIL SYSTEMS WITH APACHE SOLR
Arnon Yogev, IBM Research

Background
- IBM Verse is a cloud-based business email system

Background (cont.)
- The Verse backend is based on Apache Solr
- Almost every user interaction with the email system is translated into Solr queries:
  - Refresh inbox
  - Search by keyword / person
  - Filter emails with attachments
  - Compose or reply to an email
  - Delete an email

Mission
- Support IBM Verse's future growth in terms of stability, scalability and cost-effectiveness by optimizing its Solr infrastructure and usage
  - Improved performance
  - Increased server capacity
  - Reduced cost per user

Premise
- Email search ≠ web search
  - Mailboxes are private
  - Main interactions are performed on the latest emails
  - Results are typically sorted by date

SETUP & ENVIRONMENT

Benchmark Framework
- Email Data Generator: generates synthetic email data that simulates the characteristics and distributions of an email corpus
- Indexer Module: indexes data into Solr using different indexing strategies
- Apache JMeter: executes test scenarios based on real-life usage statistics
- Diagram labels: Email Data Generator, Mailboxes Store (Lucene index per owner), SolrCloud instances (per indexing strategy and number of users)

Benchmark Architecture
- Diagram labels: SolrCloud Cluster, Driver, Replication, ZooKeeper Ensemble

Approach
- Short, self-contained improvement cycles
- From low-hanging fruit to advanced functionality
- Cycle: Analyze, Assess, Implement

SCHEMA OPTIMIZATIONS

Schema Optimizations
- Replace wildcard queries with boolean (hasX) fields
  - Before: attachmentname:*
  - After: hasattachment:true
- Avoids iterating over millions of posting lists
- Result:
  - Some queries improved by 20X (e.g. hasurl)
  - Some queries were not affected (e.g. hasstatus)
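A minimal sketch of how such boolean fields could be derived at indexing time, so that queries can use hasattachment:true instead of attachmentname:*. The slides do not show the actual indexing code; the helper name add_has_flags and the url source field are assumptions made for illustration.

```python
def add_has_flags(doc):
    """Return a copy of an email document with boolean has* fields added.

    Hypothetical helper: 'attachmentname' comes from the slides, while the
    'url' source field is an assumed name.
    """
    enriched = dict(doc)
    # True only when the source field is present and non-empty.
    enriched["hasattachment"] = bool(doc.get("attachmentname"))
    enriched["hasurl"] = bool(doc.get("url"))
    return enriched
```

The flag is computed once per document when it is indexed, so the per-query cost of expanding a wildcard over every distinct attachment name is avoided.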

Schema Optimizations (cont.)
- DocValues for faceting / sorting
  - Densely packed, column-oriented fields
  - Reduces the memory consumed by the field cache
- Result: OOM problems solved!

Row-oriented (stored fields):
        Field1  Field2  Field3
Doc1    1       2       3
Doc2    2       3       4
Doc3    4       3       2

Column-oriented (DocValues):
        Doc1  Doc2  Doc3
Field1  1     2     4
Field2  2     3     3
Field3  3     4     2
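The difference between the two layouts can be illustrated with a small sketch that uses the values from the slide. This is only a conceptual model in plain Python, not how Lucene stores DocValues on disk; the function names are hypothetical.

```python
# Row-oriented layout: one record per document (like stored fields).
ROWS = {
    "Doc1": {"Field1": 1, "Field2": 2, "Field3": 3},
    "Doc2": {"Field1": 2, "Field2": 3, "Field3": 4},
    "Doc3": {"Field1": 4, "Field2": 3, "Field3": 2},
}


def to_columns(rows):
    """Pivot row-oriented records into one dense list per field."""
    columns = {}
    for fields in rows.values():
        for name, value in fields.items():
            columns.setdefault(name, []).append(value)
    return columns


def doc_ids_sorted_by(rows, columns, field):
    """Sort documents by a single field, touching only that field's column."""
    doc_ids = list(rows)
    column = columns[field]
    order = sorted(range(len(doc_ids)), key=column.__getitem__)
    return [doc_ids[i] for i in order]
```

Sorting or faceting on one field only needs that field's column, which is why a column-oriented structure suits these operations better than loading whole documents.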

QUERY OPTIMIZATIONS

Query Optimizations
- Use filter queries (fq) instead of the main query (q)
  - Cached independently in the filter cache
  - Score is not affected by the filtering action
  - Before: q=content:apache AND sender:bob AND deleted:false
  - After: q=content:apache&fq=sender:bob&fq=deleted:false
- Use the field list (fl)
  - Specifies the list of fields to be returned
  - Saves bandwidth and simplifies parsing
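As a rough illustration, the "After" form can be assembled as a request query string like this. The q, fq and fl parameters are standard Solr request parameters; the helper name and the returned field names (id, subject, date) are assumptions, not taken from the slides.

```python
from urllib.parse import urlencode


def build_select_params(keyword_query, filters, fields):
    """Build a query string with one q, one fq per filter, and an fl list."""
    params = [("q", keyword_query)]
    # Each filter gets its own fq so Solr can cache it independently.
    params += [("fq", f) for f in filters]
    # Only ask for the fields the client actually needs.
    params.append(("fl", ",".join(fields)))
    return urlencode(params)


query_string = build_select_params(
    "content:apache",
    ["sender:bob", "deleted:false"],
    ["id", "subject", "date"],  # assumed field names
)
```

Keeping sender:bob and deleted:false as separate fq clauses means a later query that reuses only one of them can still hit the filter cache.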

Query Optimizations (cont.)
- Avoid grouping queries when possible
  - Simple and useful (e.g. group emails by thread), yet time-consuming
- Consider query splitting
  - Before: q=content:apache&group.query=folder:inbox&group.query=*:*
  - After:
    - Query 1: q=content:apache&fq=folder:inbox
    - Query 2: q=content:apache
- Result: Query runtime improvement of up to 33%.
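A sketch of the splitting idea from the client's point of view: one grouped request is replaced by two plain requests whose results are labeled by the group they stand in for. The run_query callable is a hypothetical stand-in for whatever client sends parameters to Solr.

```python
def search_split(run_query, keyword_query, folder_filter):
    """Replace one group.query request with two ungrouped requests.

    run_query is an assumed callable that takes a dict of Solr
    parameters and returns the matching results.
    """
    return {
        # Query 1: keyword restricted to the folder via a filter query.
        folder_filter: run_query({"q": keyword_query, "fq": folder_filter}),
        # Query 2: keyword over the whole mailbox (the former *:* group).
        "*:*": run_query({"q": keyword_query}),
    }
```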

INDEXING STRATEGY

Indexing Strategy - Motivation
- Observation: Indexing emails is unique in the sense that mailboxes are private
- Data was indexed using various indexing strategies, and several parameters were measured:
  - Running time of indexing scenarios
  - Running time of search scenarios
  - Server-side behavior: CPU, memory, GC

User Mailboxes and Shards
- Data is divided into 5-node clusters
- Each user's emails are assigned to a single cluster
- Initial designs indexed a user mailbox across all shards
- Solution: Each mailbox is assigned to a single shard
  - Reduces load on each Solr node
  - Significant improvement in query runtime
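The slides do not say how a mailbox is mapped to its shard. One common approach, shown here purely as an assumption, is a stable hash of the mailbox ID, so that every email of a mailbox lands on the same shard and a query for that mailbox touches only one node.

```python
import hashlib


def assign_shard(mailbox_id, num_shards):
    """Map a mailbox ID to one shard index in [0, num_shards).

    Hypothetical scheme: a stable hash, so the mapping does not change
    between processes or restarts (unlike Python's built-in hash()).
    """
    digest = hashlib.sha1(mailbox_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

For example, assign_shard("alice@example.com", 5) always returns the same index, so both indexing and search for that mailbox can be directed to a single shard.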

Indexing Strategy Alternatives
- 1 Collection / 1 Shard
- Collection per Mailbox
- Shard per Mailbox
- 1 Collection / N Shards
- N Collections / 1 Shard

Indexing Strategy - Results
- 1 Collection / 1 Shard performed poorly
- Collection-per-mailbox and shard-per-mailbox strategies had the best performance, but failed to scale above ~2000 mailboxes
  - Solr / ZooKeeper limitations
- Multi-collection strategies performed better than multi-shard strategies and made better use of resources (CPU and memory)
  - Query running time difference of 8-32%

SORTED INDEX

Sorted Index - Background
- Observation: Most queries require sorting by date
  - Example: Display emails in inbox
  - Default sorting is by doc IDs
- Solution: Keep the index sorted by email date
  - Sorting is not performed at query time
  - Early termination when the requested number of results is reached
- The tradeoff: additional work during indexing
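The benefit of early termination can be seen in a toy model that contrasts the two approaches over an in-memory list. This is a conceptual sketch only, not Lucene's collector implementation; documents are assumed to be dicts with a date key.

```python
def top_k_unsorted(docs, matches, k):
    """Unsorted index: collect every match, sort at query time, cut to k."""
    hits = [doc for doc in docs if matches(doc)]
    hits.sort(key=lambda doc: doc["date"], reverse=True)
    return hits[:k]


def top_k_presorted(docs_newest_first, matches, k):
    """Index already sorted newest-first: stop once k matches are found.

    Returns the results and the number of documents examined.
    """
    results = []
    examined = 0
    for doc in docs_newest_first:
        examined += 1
        if matches(doc):
            results.append(doc)
            if len(results) == k:
                break  # early termination: older documents cannot rank higher
    return results, examined
```

Both functions return the same top-k results, but the pre-sorted variant skips the query-time sort and may examine only a small prefix of the documents.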

Sorted Index - Implementation
- Two modules were developed to extend Solr:
  1. Sorting MergePolicy
  2. Early termination (with / without grouping queries)
- Contributed to Solr, available in Solr 6
  - SOLR-5730: Make Lucene's SortingMergePolicy and EarlyTerminatingSortingCollector configurable in Solr
  - SOLR-8621: solrconfig.xml: deprecate/replace <mergePolicy> with <mergePolicyFactory>

Sorted Index - Results
- Runtime of queries that sort by date improved by 6-23%
- Early termination in grouping queries did not show significant improvement
- No observable indexing time slowdown

MULTI-TIERED INDEX

Multi-Tiered Index - Background
- Observation: Users are often interested in their recent emails
  - Example: Inbox refresh, closest people, etc.
- Solution: Index emails into tiers
  - Archive tier contains all emails
  - Recent tier contains recent emails
    - Last week, last month, last quarter, etc.
  - Tiers can be placed on different hardware

Non-Tiered Strategy
- N collections, single-shard (N Collections / 1 Shard)
- Each collection contains multiple mailboxes

Tiered Strategy
- N Archive collections + N Last Month collections (2N Collections / 1 Shard in total)
- Indexing is performed on both tiers
- Search is performed on the relevant tier, according to the query's date constraints
- Diagram labels: Archive (N Collections / 1 Shard), Recent (N Collections / 1 Shard)
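A minimal sketch of the routing rule described above, assuming a 30-day window for the "Last Month" tier. The tier names, the window length and the function names are illustrative assumptions; the slides do not show the routing code.

```python
from datetime import date, timedelta

RECENT_WINDOW = timedelta(days=30)  # assumed length of the "last month" tier


def tiers_for_indexing():
    """Every new email is written to both tiers."""
    return ["recent", "archive"]


def tier_for_query(oldest_date_needed, today):
    """Pick the tier that can answer a query with the given date constraint.

    oldest_date_needed is the earliest email date the query may return,
    or None when the query has no date constraint.
    """
    if oldest_date_needed is not None and today - oldest_date_needed <= RECENT_WINDOW:
        return "recent"
    return "archive"
```

A query without a date constraint, such as a plain keyword search, falls back to the archive tier under this rule, so only date-bounded queries benefit from the recent tier.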

Tiered Strategy - Alternatives
- Alternative 1: a single Recent collection alongside the Archive collections
  - Pros: Fewer collections to manage
  - Cons: Does not perform as well as a multi-collection recent tier
- Alternative 2: Archive without recent emails (Archive w/o Recent)
  - Pros: Saves disk space; each email is indexed into a single collection
  - Cons: Archive queries must merge results; emails have to be moved from the recent tier to the archive tier

Multi-Tiered Index - Results
- Queries that target the recent tier ran 2-15X faster
- Indexing scenarios experienced a ~2X slowdown. However, since indexing running time is < 25 ms, the effect on user experience is minor
- Client-side changes are required in order to have more queries use the recent tier
  - Example: Keyword search

ADDITIONAL RECOMMENDATIONS

Hardware Configuration
- Observation: SSD infrastructure leads to better performance, but higher server costs
- Since improved performance results in larger server capacity, and SSDs are getting cheaper, the question is which configuration optimizes cost per user
- Result: Storing the Solr instances on SSDs results in a significant runtime improvement, which translates to major cost-per-user savings

Solr Version
- Recommendation: Upgrade Solr (at least) with every major release
  - Ongoing support by the community
  - Bug fixes and improved stability
  - New features and APIs
  - Makes it possible to extend Solr and contribute new code
- Result: Improved stability and performance

SUMMARY

Monthly Cost per User
[Chart: monthly cost per user relative to a baseline x (axis marks: x, 0.5x, 0.375x, 0.25x, 0.1x) for the configurations Gen 1; Gen 2 (non-tiered, HDD); Tiered, HDD; Tiered, mixed HW; Non-tiered, SSD; Tiered, SSD; Tiered, HDD, recent tier only; Tiered, mixed HW, recent tier only]

Lessons Learned
- Make Solr perfect for your use case!
- Know your data
  - What does it contain?
  - How is it structured?
- Know your user
  - Who uses the system?
  - How is it used?
- Start simple, then go advanced
  - Schema & query optimizations
  - Indexing strategy
  - Advanced features
- Benchmark every change you make
  - Make an effort to build a reliable and flexible benchmarking framework

THANK YOU! ARNON YOGEV ARNONY@IL.IBM.COM