Apache Solr Learning to Rank FTW!


Apache Solr Learning to Rank FTW!
Berlin Buzzwords 2017, June 12, 2017
Diego Ceccarelli, Software Engineer, News Search, dceccarelli4@bloomberg.net
Michael Nilsson, Software Engineer, Unified Search, mnilsson23@bloomberg.net

News Search at Bloomberg

What we did in the last few years
Bye-bye, existing proprietary system
o Inflexible, no scalable relevance sorting
Enter Solr/Lucene!
o Rich in features, extensible, and actively maintained
o Free software; we are involved and contribute back!
o From-scratch alerting backend based on Lucene and Luwak
o Scalable with load and data: just add machines!
Learning to Rank plugin upstreamed in Apache Solr 6.4
o https://cwiki.apache.org/confluence/display/solr/learning+to+rank
o https://issues.apache.org/jira/browse/solr-8542

Learning to Rank? Machine Learned Ranking

Why Learning to Rank? score = 2.3 * BM25

Why Learning to Rank? score = 2.3 * BM25 + 4.5 * BM25(title) + 5.2 * BM25(desc)

Why Learning to Rank? score = 2.3 * BM25 + 4.5 * BM25(title) + 5.2 * BM25(desc) + 0.1 * doc-length + 1.3 * freshness
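In code, such a hand-tuned ranking function is just a weighted sum of per-document features; a minimal Python sketch (the feature names and weights here are illustrative, not taken from the talk):

# A hand-tuned ranking function is a weighted sum of per-document
# features; the weights are what a human has to keep tweaking by hand.
def handtuned_score(features):
    weights = {
        "bm25": 2.3,
        "bm25_title": 4.5,
        "bm25_desc": 5.2,
        "doc_length": 0.1,
        "freshness": 1.3,
    }
    return sum(w * features.get(name, 0.0) for name, w in weights.items())

print(handtuned_score({"bm25": 1.2, "bm25_title": 0.8, "freshness": 0.5}))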

Problem setup
It's hard to manually tweak the ranking
o You must be an expert in the domain
query = solr
query = lucene
query = london
query = bloomberg
query = ...
score = 2.3 * BM25 + 4.5 * BM25(title) + 5.2 * BM25(desc) + 0.1 * doc-length + 1.3 * freshness

Learning to Rank plugin: Goals
o Automatically optimize for relevancy using machine learning
o Make different machine learning models pluggable
o Access to rich internal Solr search functionality for feature building

Steps for Machine Learned Ranking
I. Collect query-document judgments [Offline]
II. Extract query-document features [Solr]
III. Train model with judgments + features [Offline]
IV. Deploy model [Solr]
V. Re-rank results [Solr]
VI. Evaluate results [Offline]

I. Collect Judgments
Curated relevance of documents per query
o Binary judgments (good/bad)
o Graded judgments (e.g. 5 stars: 3/5, 5/5, 0/5)

I. Collect Judgments
Explicit: judges assess search results manually
o Experts
o Crowd-sourced
Implicit: infer assessments from user behavior
o Aggregated result clicks (see the sketch after this list)
o Query reformulation
o Dwell time
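A rough Python sketch of how aggregated clicks could be turned into implicit graded judgments: the most-clicked document for a query gets the top grade and the rest are scaled relative to it. The log format and grading scheme are illustrative assumptions, not the method used in the talk.

from collections import Counter

def implicit_judgments(click_log, max_grade=5):
    # click_log: iterable of (query, clicked_doc_id) pairs
    clicks = Counter(click_log)
    grades = {}
    for query in {q for q, _ in clicks}:
        per_query = {d: c for (q, d), c in clicks.items() if q == query}
        top = max(per_query.values())
        for doc, count in per_query.items():
            grades[(query, doc)] = round(max_grade * count / top)
    return grades

print(implicit_judgments([("berlin", "d1"), ("berlin", "d1"), ("berlin", "d2")]))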

II. Extract Features
Signals that give an indication of a result's importance, for example:

Query matches the title | Freshness | Is the document from Bloomberg.com? | Popularity
0 | 0.7 | 0 | 3583
1 | 0.9 | 1 | 625
0 | 0.1 | 0 | 129

II. Extract Features
Define the features to extract in myfeatures.json
Deploy the feature definition file to Solr:

curl -XPUT 'http://localhost:8983/solr/mycollection/schema/feature-store' --data-binary "@/path/myfeatures.json" -H 'Content-type:application/json'

myfeatures.json:

[
  {
    "name": "matchtitle",
    "type": "org.apache.solr.ltr.feature.SolrFeature",
    "params": { "q": "{!field f=title}${text}" }
  },
  {
    "name": "freshness",
    "type": "org.apache.solr.ltr.feature.SolrFeature",
    "params": { "q": "{!func}recip(ms(NOW,timestamp),3.16e-11,1,1)" }
  },
  { "name": "isfrombloomberg", ... },
  { "name": "popularity", ... }
]
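The two entries above with elided params could be expressed with other feature classes shipped with the LTR module; for instance, a hypothetical popularity feature reading a stored numeric field via FieldValueFeature (the field name is an assumption for illustration):

{
  "name": "popularity",
  "type": "org.apache.solr.ltr.feature.FieldValueFeature",
  "params": { "field": "popularity" }
}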

II. Extract Features
Add the transformer to the Solr config:

<!-- Document transformer adding feature vectors with each retrieved document -->
<transformer name="features" class="org.apache.solr.ltr.response.transform.LTRFeatureLoggerTransformerFactory" />

Request features for a document:

http://localhost:8983/solr/mycollection/query?q=...&fl=*,[features efi.text="appl US"]

{
  "title": "Tim Cook",
  "url": "https://en.wikipedia.org/wiki/tim_cook",
  "[features]": "matchtitle:0.0, freshness:0.7, isfrombloomberg:0.0, popularity:3583.0"
}

III. Train Model
Combine query-document judgments & features into a training data file (example format below)
Train the ranking model offline:
o RankSVM [1] (liblinear)
o LambdaMART [2] (RankLib)

[1] T. Joachims, "Optimizing Search Engines Using Clickthrough Data", Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002.
[2] C.J.C. Burges, "From RankNet to LambdaRank to LambdaMART: An Overview", Microsoft Research Technical Report MSR-TR-2010-82, 2010.
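The training file for both RankSVM and RankLib is typically in the LETOR/SVMlight style: one line per query-document pair with the judgment grade, the query id, and the feature values. A sketch with made-up values, where the feature indices follow the four features defined earlier:

3 qid:1 1:0.0 2:0.7 3:0.0 4:3583.0 # query "berlin", doc 42
0 qid:1 1:1.0 2:0.1 3:0.0 4:129.0  # query "berlin", doc 7
5 qid:2 1:1.0 2:0.9 3:1.0 4:625.0  # query "lucene", doc 13

RankLib, for example, can train a LambdaMART model from such a file with a command along the lines of "java -jar RankLib.jar -train train.txt -ranker 6 -metric2t NDCG@10 -save mymodel.txt" (flags per RankLib's documentation); converting its output into the Solr JSON model format is handled by the demo scripts.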

IV. Deploy Model
Generate the trained output model as mymodel.json
Deploy the model definition file to Solr:

curl -XPUT 'http://localhost:8983/solr/techproducts/schema/model-store' --data-binary "@/path/mymodel.json" -H 'Content-type:application/json'

mymodel.json:

{
  "class": "org.apache.solr.ltr.model.MultipleAdditiveTreesModel",
  "name": "mymodelname",
  "features": [
    { "name": "freshness" },
    { "name": "matchtitle" }
  ],
  "params": {
    "trees": [
      {
        "weight": 1,
        "tree": {
          "feature": "matchtitle",
          "threshold": 0.5,
          "left": { "value": -100 },
          "right": {
            "feature": "freshness",
            "threshold": 0.5,
            "left": { "value": 50 },
            "right": { "value": 75 }
          }
        }
      }
    ]
  }
}
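The tree model above corresponds to LambdaMART; for a RankSVM-style linear model the plugin's LinearModel class can be used instead. A minimal sketch of what such a model file could look like (the weights are made up for illustration, not from the demo):

{
  "class": "org.apache.solr.ltr.model.LinearModel",
  "name": "mylinearmodel",
  "features": [
    { "name": "matchtitle" },
    { "name": "freshness" }
  ],
  "params": {
    "weights": {
      "matchtitle": 1.0,
      "freshness": 2.0
    }
  }
}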

V. Re-rank Results
Add the LTR query parser to the Solr config:

<!-- Query parser used to rerank top docs with a provided model -->
<queryParser name="ltr" class="org.apache.solr.ltr.search.LTRQParserPlugin" />

Search and re-rank results:

http://localhost:8983/solr/mycollection/query?q=...&rq={!ltr model="mymodelname" reRankDocs=100 efi.text="appl US"}

o ltr - name of the parser in the config
o model - name of the model in mymodel.json
o reRankDocs - number of top K documents to re-rank
o efi.<key> - arbitrary key/value pairs (external feature information) passed in to the features

VI. Evaluate quality of search
o Precision: the number of relevant results returned divided by the total number of results returned
o Recall: the number of relevant results returned divided by the total number of relevant results for the query
o F-Score
o NDCG (see the sketch below)
Image credit: https://commons.wikimedia.org/wiki/file:precisionrecall.svg by User:Walber
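A minimal Python sketch of how these metrics could be computed for a single query from graded judgments (the helper names are illustrative, not part of the demo code):

import math

def precision_recall_f1(returned, relevant):
    # returned: ranked list of returned doc ids; relevant: set of relevant doc ids
    hits = sum(1 for d in returned if d in relevant)
    p = hits / len(returned) if returned else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def ndcg(returned, grades, k=10):
    # grades: dict mapping doc id -> graded relevance (e.g. 0-5)
    def dcg(docs):
        return sum((2 ** grades.get(d, 0) - 1) / math.log2(i + 2)
                   for i, d in enumerate(docs[:k]))
    ideal = sorted(grades, key=grades.get, reverse=True)
    best = dcg(ideal)
    return dcg(returned) / best if best else 0.0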

How do I do this with real code?!? Demo time!
Code for all steps is available on GitHub: https://github.com/bloomberg/lucene-solr (branch: ltr-demo-lucene-solr)

Configure solrconfig.xml

<!-- Query parser used to rerank top docs with a provided model -->
<queryParser name="ltr" class="org.apache.solr.ltr.search.LTRQParserPlugin" />

<!-- Document transformer adding feature vectors with each retrieved document -->
<transformer name="features" class="org.apache.solr.ltr.response.transform.LTRFeatureLoggerTransformerFactory" />

Setup
o Simple Wikipedia JSON dump (~150k articles)
o Index it into Solr (a minimal indexing sketch follows below)
o Simple schema with a copy-field "text" containing all the text fields of an article
o Queries hit "text" by default
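One possible way to push such a dump into the wikipedia collection is Solr's JSON update handler; a minimal Python sketch (the file name, its structure as a list of documents, and the batching are assumptions; the demo ships its own indexing step):

import json
import requests

# Post documents to the 'wikipedia' collection in batches of 1000.
SOLR_UPDATE = "http://localhost:8983/solr/wikipedia/update?commit=true"

with open("simplewiki.json") as f:
    docs = json.load(f)

for i in range(0, len(docs), 1000):
    resp = requests.post(SOLR_UPDATE,
                         json=docs[i:i + 1000],
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()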

Example of a query
Document content: http://localhost:8983/solr/wikipedia/select?indent=on&q=berlin&wt=json
Top 10 results: http://localhost:8983/solr/wikipedia/select?indent=on&q=berlin&wt=json&fl=title,score

Collect query-document judgments
Run demo.py

Write a Solr feature description file
Provided in the demo (features.json)
Example of a feature:

{
  "name": "freshness",
  "type": "org.apache.solr.ltr.feature.SolrFeature",
  "params": { "q": "{!func}recip(ms(NOW,timestamp),3.16e-11,1,1)" }
}

Current features: http://localhost:8983/solr/wikipedia/schema/feature-store/_default_

Extract query-document features
Using the Learning to Rank doc transformer:
http://localhost:8983/solr/wikipedia/select?indent=on&q=berlin&wt=json&fl=title,score,[features efi.query=berlin]
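To turn the transformer output into training vectors, the "[features]" string can be parsed per document; a small Python sketch assuming the colon/comma format shown earlier (the helper and its name are illustrative, not part of the demo scripts):

import requests

def query_features(query):
    # Ask Solr for the top documents plus their extracted feature vectors.
    params = {
        "q": query,
        "wt": "json",
        "fl": "title,score,[features efi.query=%s]" % query,
    }
    resp = requests.get("http://localhost:8983/solr/wikipedia/select", params=params)
    resp.raise_for_status()
    vectors = []
    for doc in resp.json()["response"]["docs"]:
        # "[features]" looks like "matchtitle:0.0, freshness:0.7, ..."
        pairs = (kv.split(":") for kv in doc["[features]"].split(","))
        vectors.append({name.strip(): float(value) for name, value in pairs})
    return vectors

print(query_features("berlin")[:3])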

Train models, deploy, and evaluate results
o train_linear_model.py
o train_tree_model.py

Test a query outside the training set
Rome: http://localhost:8983/solr/wikipedia/select?indent=on&q=rome&wt=json&fl=title,score
LTR Rome: http://localhost:8983/solr/wikipedia/select?indent=on&q=rome&wt=json&fl=title,score&rq={!ltr model=... reRankDocs=30}

Q&A