Search and Time Series Databases

Università degli Studi di Roma Tor Vergata
Department of Civil Engineering and Computer Science Engineering
Search and Time Series Databases
Course: Systems and Architectures for Big Data, A.Y. 2016/17
Valeria Cardellini

The reference Big Data stack
- High-level Interfaces
- Data Processing
- Data Storage
- Resource Management
- Support / Integration
Valeria Cardellini - SABD 2016/17

Why search platforms?
- How to find documents that match queries?
  - With text search, faster than an RDBMS
- How to obtain specific features?
  - Such as highlighting, spatial search, suggestions, guided navigation, ...

Search engines
- Most popular search platforms:
  - Apache Solr
  - Elasticsearch
- ETL process

Apache Solr
- Scalable, highly reliable and open-source framework for searching data
- Built on Apache Lucene
  - Open-source library for indexing and search
  - Used by Solr for full-text search
- Can index documents written in XML, JSON, CSV and binary formats
- Runs as a Java Web application
- Provides a REST-like web service that exposes services to manage the lifecycle of documents in the index (indexing, querying, ...)
- Used by many popular Web apps (Apple, Instagram, LinkedIn, ...)
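To make the REST-like interface concrete, the following is a minimal sketch that only builds a query URL for Solr's /select request handler and sends nothing. The host, port and the core name "products" are assumptions chosen for illustration; q, rows, start, fl and wt are standard Solr query parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def solr_select_url(base_url, core, query, rows=10, start=0, fields=None):
    """Build a URL for Solr's /select request handler (no request is sent)."""
    params = {"q": query, "rows": rows, "start": start, "wt": "json"}
    if fields:
        # fl restricts the fields returned for each matching document.
        params["fl"] = ",".join(fields)
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Hypothetical local Solr instance and core name, for illustration only.
url = solr_select_url("http://localhost:8983", "products", "name:laptop",
                      rows=5, fields=["id", "name"])
print(url)
```

The resulting URL could be fetched with any HTTP client against a running Solr server.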

Solr: key features
- Faceting
  - Groups the results based on a specific field or defined criteria, providing the count of each subset
  - Example: a shopping site can provide facets to narrow search results by manufacturer or price
- Auto-suggest
  - Presents a list of possible query terms
- Spell check
  - Suggests corrected spellings of query terms
- Highlighting
- Document clustering
  - Groups related documents in the search results
- Spatial search
  - Filters search results based on location
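The idea behind faceting can be sketched in a few lines of plain Python. This is an illustration of the concept on an in-memory result list, not Solr's implementation; the documents and field names are made up.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count how many result documents fall under each value of `field`."""
    return Counter(doc[field] for doc in docs if field in doc)


# Hypothetical search results from a shopping site.
results = [
    {"id": 1, "manufacturer": "Acme", "price": 120},
    {"id": 2, "manufacturer": "Acme", "price": 340},
    {"id": 3, "manufacturer": "Globex", "price": 95},
]

by_manufacturer = facet_counts(results, "manufacturer")
print(dict(by_manufacturer))

# Selecting a facet value narrows the result set.
narrowed = [d for d in results if d["manufacturer"] == "Acme"]
```

A user interface would show the counts next to each manufacturer and apply the filter when one is clicked.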

Solr: key features (continued)
- Pagination and ranking of search results
- Results grouping
  - Groups the results based on a grouping field and returns the top documents in each group
- Near real-time search
  - Makes documents searchable immediately after they have been indexed; useful for apps with dynamically changing content (e.g., news)
- More Like This
  - Identifies other documents that are similar to one in a result set
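Results grouping and pagination can be illustrated on a small in-memory hit list. This is a conceptual sketch only: the "score" and "category" fields are assumptions, and Solr computes relevance scores and groups inside the engine.

```python
from collections import defaultdict


def group_results(docs, group_field, top_n=1):
    """Return the top_n highest-scoring documents for each group value."""
    groups = defaultdict(list)
    for doc in sorted(docs, key=lambda d: d["score"], reverse=True):
        bucket = groups[doc[group_field]]
        if len(bucket) < top_n:
            bucket.append(doc)
    return dict(groups)


def paginate(docs, start, rows):
    """Return one page of results, ranked by descending score."""
    ranked = sorted(docs, key=lambda d: d["score"], reverse=True)
    return ranked[start:start + rows]


# Hypothetical scored hits.
hits = [
    {"id": "a", "category": "news", "score": 2.5},
    {"id": "b", "category": "news", "score": 1.0},
    {"id": "c", "category": "sports", "score": 1.7},
    {"id": "d", "category": "sports", "score": 3.1},
]

grouped = group_results(hits, "category", top_n=1)
page = paginate(hits, start=1, rows=2)
```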

Solr feature example (figure)

Solr components (figure)

Solr components
- Request Handlers: handle a request at a URL
  - E.g.: /select
- Search Components: part of a Search Handler, a componentized request handler
  - Include: Query, Faceting, Highlighting, Debug
  - Distributed Search capable
- Update Handlers: handle an indexing request
- Update Processors chain: per-handler componentized chain that handles updates
- Query Parser plugins
  - Mix and match query types in a single request
  - Function plugins for Function Query
- Text Analysis plugins: Analyzers, Tokenizers, TokenFilters
- Response Writers: serialize and stream the response to the client

Scaling Solr: SolrCloud
- How to provide distributed indexing and search capabilities?
  - Up to millions of users and millions of indexed documents
- SolrCloud: deployment functionality of Solr that allows setting up clusters of Solr servers
- Enables and simplifies horizontal scaling of a search index through replication and sharding
  - Sharding: incoming queries are distributed to the shards in the collection, and the per-shard results are merged into a single response
  - Replication: handles higher concurrent query load by spreading the requests across multiple servers
- No master node to allocate nodes, shards and replicas
- SolrCloud uses ZooKeeper for storing shared configuration files and for coordination
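The sharding behaviour described above (route each document to one shard, query all shards, merge the partial results) can be sketched as follows. This is a toy model under stated assumptions: documents are routed with a CRC32 hash as a stand-in for SolrCloud's own router, and the score is a plain term count rather than Lucene relevance.

```python
import heapq
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard from a hash of the document id (stand-in routing rule)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


class Shard:
    """A toy shard holding its own slice of the index."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, text):
        self.docs[doc_id] = text

    def search(self, term, k):
        # Score = number of occurrences of the term (toy relevance).
        scored = [(text.lower().split().count(term), doc_id)
                  for doc_id, text in self.docs.items()]
        return heapq.nlargest(k, [hit for hit in scored if hit[0] > 0])


def distributed_search(shards, term, k=3):
    """Scatter the query to every shard, then merge the partial top-k lists."""
    partial = [hit for shard in shards for hit in shard.search(term, k)]
    return heapq.nlargest(k, partial)


corpus = {
    "d1": "solr solr search",
    "d2": "solr cloud",
    "d3": "time series",
    "d4": "solr solr solr",
}
shards = [Shard() for _ in range(3)]
for doc_id, text in corpus.items():
    shards[shard_for(doc_id, len(shards))].index(doc_id, text)

top = distributed_search(shards, "solr", k=2)
print(top)
```

Because each shard returns its own top k hits, the merged top k is the same as if all documents lived in a single index.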

Solr distributed architecture (figure)

Elasticsearch
- Distributed, multitenant-capable and scalable full-text search engine with a REST-based interface and schema-free JSON documents
- Search engine based on Apache Lucene
- Developed in Java
- Distributed
  - Indices can be divided into shards, and each shard can have zero or more replicas
  - Each server hosts one or more shards and acts as a coordinator to delegate operations to the correct shard(s)
  - Rebalancing and routing are done automatically
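As a sketch of how shards and replicas are declared, the snippet below builds the JSON body that could accompany an index-creation request (a PUT to a hypothetical index such as /logs). The helper names are illustrative; number_of_shards and number_of_replicas are the Elasticsearch index settings for the two quantities mentioned above.

```python
import json


def index_settings(num_shards, num_replicas):
    """Build the settings body for creating an index with the given layout."""
    if num_shards < 1 or num_replicas < 0:
        raise ValueError("need at least one shard and zero or more replicas")
    return {
        "settings": {
            "number_of_shards": num_shards,
            "number_of_replicas": num_replicas,
        }
    }


def total_shard_copies(num_shards, num_replicas):
    """Primary shards plus all their replica copies across the cluster."""
    return num_shards * (1 + num_replicas)


body = json.dumps(index_settings(3, 1))
print(body)
print(total_shard_copies(3, 1))
```

With 3 shards and 1 replica each, the cluster has to place 6 shard copies across its servers.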

Elastic (ELK) Stack
- Elasticsearch is closely integrated with Logstash and Kibana (Elastic Stack, previously known as ELK)
- Logstash
  - Server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to Elasticsearch
- Kibana
  - Data visualization platform

Solr vs. Elasticsearch
- Figure: Elasticsearch vs. Solr on Google Trends
- Solr
  - Mature, widely deployed product
  - Active and large developer community
  - Provides a highly detailed functional environment; a wide range of plug-ins is available
- Elasticsearch
  - Newer, but already very widely used
  - Focus on extracting value from data generally, and not just on search
  - Part of the ELK stack
  - Schema-free and document-oriented

Time series database (TSDB)
- How to analyze DevOps monitoring, application metrics, IoT sensor data?
- Time series databases (TSDBs) provide an effective and lightweight solution
- Optimized for handling high-volume time series data
  - Time series: a sequence of data points (arrays of numbers) indexed by time (a date time or a date time range), e.g.:
    - Time series of stock prices (price curve)
    - Time series of energy consumption (load profile)
    - Log of temperature values (temperature trace)
- Optimized for providing complex logic to analyze time series data
  - Queries over historical data, with time ranges, roll-ups and arbitrary time zone conversions, are difficult in a traditional DBMS
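A roll-up, one of the query types mentioned above, reduces many raw points to one aggregate per time bucket. The following is a minimal in-memory sketch with made-up temperature values; a TSDB performs the same kind of aggregation inside the storage engine.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean


def rollup_hourly(points):
    """Average (timestamp, value) points into one value per hour."""
    buckets = defaultdict(list)
    for ts, value in points:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


# Hypothetical temperature trace, timestamps in UTC.
trace = [
    (datetime(2017, 5, 1, 10, 5, tzinfo=timezone.utc), 20.0),
    (datetime(2017, 5, 1, 10, 35, tzinfo=timezone.utc), 22.0),
    (datetime(2017, 5, 1, 11, 10, tzinfo=timezone.utc), 25.0),
]

hourly = rollup_hourly(trace)
for hour, avg in hourly.items():
    print(hour.isoformat(), avg)
```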

TSDB: overview
- Create, enumerate, update and destroy various time series and organize them in some fashion
  - Series may be organized hierarchically and have companion metadata
- Provide basic calculations on a series as a whole (e.g., multiplying, adding, or combining various time series into a new time series)
- Filter on arbitrary patterns (e.g., day of the week, low value, high value)
- Provide additional statistical functions that are targeted to time series data
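The series-level operations listed above can be sketched with time series held as plain dictionaries keyed by date. The function names and the load values are illustrative assumptions, not the API of any particular TSDB.

```python
from datetime import date


def add_series(a, b):
    """Combine two series into a new one, summing values at shared timestamps."""
    return {t: a[t] + b[t] for t in sorted(a.keys() & b.keys())}


def scale_series(series, factor):
    """Multiply every value of a series by a constant."""
    return {t: v * factor for t, v in series.items()}


def filter_weekday(series, weekday):
    """Keep only points falling on a given weekday (0 = Monday)."""
    return {t: v for t, v in series.items() if t.weekday() == weekday}


# Hypothetical daily energy consumption of two buildings.
load_a = {date(2017, 5, 1): 10.0, date(2017, 5, 2): 12.0, date(2017, 5, 8): 11.0}
load_b = {date(2017, 5, 1): 5.0, date(2017, 5, 2): 6.0, date(2017, 5, 8): 4.0}

total = add_series(load_a, load_b)
mondays = filter_weekday(total, 0)
doubled = scale_series(load_b, 2)
```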

TSDB: some products
Some open-source products:
- CrateDB https://crate.io
- Chronix http://www.chronix.io
- Graphite https://graphiteapp.org
  - Stores numeric time-series data and renders graphs of this data on demand
- InfluxDB https://www.influxdata.com
- KairosDB https://kairosdb.github.io
  - Stores its time series in Cassandra
- OpenTSDB http://opentsdb.net
  - Stores its time series in HBase
- Riak-TS http://basho.com/products/riak-ts/

InfluxDB
- Written in Go
- Supports high write loads and large data set storage
- Conserves space through downsampling and by automatically expiring and deleting unwanted data
- Supports backup and restore
- Provides an easy-to-use SQL-like query language for interacting with data
- Provides simple, high-performing write and query HTTP(S) APIs, e.g.:
  - To create a database:
    curl -i -XPOST http://localhost:8086/query --data-urlencode "q=CREATE DATABASE mydb"
  - To write data:
    curl -i -XPOST 'http://localhost:8086/write?db=mydb' --data-binary 'cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000'
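The same two HTTP calls can be prepared from Python's standard library. This sketch only constructs the request objects and does not contact a server; the helper names are illustrative, while the /query and /write endpoints, the database name and the sample point come from the curl commands above.

```python
from urllib.parse import urlencode
from urllib.request import Request


def create_database_request(base_url, db_name):
    """Prepare the POST to /query that creates a database."""
    data = urlencode({"q": f"CREATE DATABASE {db_name}"}).encode("utf-8")
    return Request(f"{base_url}/query", data=data, method="POST")


def write_request(base_url, db_name, lines):
    """Prepare the POST to /write carrying one point per line."""
    url = f"{base_url}/write?{urlencode({'db': db_name})}"
    return Request(url, data="\n".join(lines).encode("utf-8"), method="POST")


create_req = create_database_request("http://localhost:8086", "mydb")
write_req = write_request(
    "http://localhost:8086",
    "mydb",
    ["cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000"],
)
print(create_req.get_method(), create_req.full_url)
print(write_req.get_method(), write_req.full_url)
```

Against a running InfluxDB 1.x instance, each request could be sent with urllib.request.urlopen.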

InfluxDB datastore
- Data is organized by time series, which contain a measured value, like cpu_load or temperature
- Time series have zero to many points, one for each discrete sample of the metric
- Points consist of:
  - time (a timestamp)
  - a measurement (e.g., cpu_load)
  - at least one key-value field (the measured value itself, e.g., value=0.64 or temperature=21.2)
  - zero to many key-value tags containing any metadata about the value (e.g., host=server01, region=emea, dc=frankfurt)

InfluxDB datastore
General format of points:
  <measurement>[,<tag-key>=<tag-value>...] <field-key>=<field-value>[,<field2-key>=<field2-value>...] [unix-nano-timestamp]
Examples of points:
  cpu,host=servera,region=us_west value=0.64
  payment,device=mobile,product=notepad,method=credit billed=33,licenses=3i 1434067467100293230
  stock,symbol=aapl bid=127.46,ask=127.48
  temperature,machine=unit42,type=assembly external=25,internal=37 1434067467000000000
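The point format above can be produced programmatically. The following is a simplified sketch of a formatter, assuming only the basic escaping rules (commas, spaces and equals signs); it is not the official client library, and the function names are illustrative.

```python
def _escape_measurement(name):
    """Escape commas and spaces in a measurement name."""
    return name.replace(",", r"\,").replace(" ", r"\ ")


def _escape_key(text):
    """Escape commas, equals signs and spaces in tag keys/values and field keys."""
    return text.replace(",", r"\,").replace("=", r"\=").replace(" ", r"\ ")


def _format_field(value):
    """Render a field value: booleans, integers (i suffix), floats, strings."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def format_point(measurement, fields, tags=None, timestamp_ns=None):
    """Format one point: measurement[,tags] fields [timestamp]."""
    if not fields:
        raise ValueError("a point needs at least one field")
    head = _escape_measurement(measurement)
    for key in sorted(tags or {}):  # sorted tag keys give a stable output
        head += f",{_escape_key(key)}={_escape_key(tags[key])}"
    body = ",".join(f"{_escape_key(k)}={_format_field(v)}"
                    for k, v in fields.items())
    line = f"{head} {body}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


print(format_point("cpu", {"value": 0.64},
                   tags={"region": "us_west", "host": "servera"}))
```

When the timestamp is omitted, as in the call above, the server assigns one on write.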

InfluxDB datastore
- A measurement is like a SQL table, where the primary index is time
- With respect to a DBMS:
  - No need to define schemas up-front
  - Null values are not stored
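The analogy can be illustrated with a toy in-memory structure: rows are kept ordered by timestamp, each row may carry different fields, and None values are dropped on insert. This is a conceptual sketch under those assumptions and says nothing about how InfluxDB actually lays out data on disk.

```python
import bisect


class Measurement:
    """A toy 'table' whose primary index is time and whose rows are schemaless."""

    def __init__(self):
        self._times = []  # sorted timestamps
        self._rows = []   # field dictionaries, aligned with _times

    def insert(self, timestamp, **fields):
        # Null values are simply not stored.
        row = {k: v for k, v in fields.items() if v is not None}
        pos = bisect.bisect_right(self._times, timestamp)
        self._times.insert(pos, timestamp)
        self._rows.insert(pos, row)

    def range(self, start, end):
        """Return points with start <= timestamp < end, in time order."""
        lo = bisect.bisect_left(self._times, start)
        hi = bisect.bisect_left(self._times, end)
        return list(zip(self._times[lo:hi], self._rows[lo:hi]))


temperature = Measurement()
temperature.insert(30, external=25, internal=37)
temperature.insert(10, external=24, internal=None)  # internal is not stored
temperature.insert(20, internal=36, humidity=0.4)   # new field, no schema change

window = temperature.range(10, 30)
print(window)
```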

InfluxDB stack
- Integrated with Telegraf, Chronograf and Kapacitor (TICK stack)

References
- Apache Solr Reference Guide, http://bit.ly/2scksqf
- InfluxDB Version 1.2 Documentation, http://bit.ly/2ryagft
- Dunning and Friedman, Time Series Databases, O'Reilly, 2015.