SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIME. Ryan Tabora - Think Big Analytics NoSQL Search Roadshow - June 6, 2013

WHO AM I?
- Ryan Tabora
- Think Big Analytics - Senior Data Engineer
- Lover of dachshunds, bass, and zombies

OVERVIEW
- Primers
- What are product logs?
- How do they apply to big data?
- Real use case
- Real issues and designs
- Conclusion

PRODUCT LOGS?
- Device data
- IT, Energy, Healthcare, Manufacturing, Telecom...
- These devices are pushing data back home (pull works too!)
- As more devices are sold/installed, more and more data comes back to home base

POWER OF DEVICE DATA
[Diagram: DEVICE DATA, linked to the following]
- Realtime Visualization
- Realtime Response
- Ad Hoc Analysis
- Full Historical Capture
- Blended Data Sets

TRADITIONAL APPROACHES
- SQL: PostgreSQL, MySQL, Oracle, Microsoft
- SQL provides many of the search features required for typical search applications
  - Joins, regex, group by, sorting, etc.
- But these technologies can only scale so far...

NEW TECHNIQUES: STORING DATA
- Hadoop
- HBase/Cassandra/Accumulo
- Search features are very limited
  - HBase: row scans, primary key index
  - Cassandra: limited secondary indexing

NEW TECHNIQUES: INDEXING DATA
- What is an index?
- Lucene
- Parallelizing index creation
  - MapReduce/Flume/Storm
- Real-time search
  - Searching before it hits disk
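To make the "What is an index?" question concrete, here is a toy in-memory inverted index: a map from each term to the documents that contain it. This is only a sketch of the idea, not how Lucene stores its index on disk; the function names and sample log lines are invented for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_all(index, *terms):
    """Return ids of documents containing every term (boolean AND)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical log lines keyed by document id.
docs = {
    1: "WARNING disk dead",
    2: "INFO disk healthy",
    3: "WARNING fan slow",
}
index = build_inverted_index(docs)
```

Looking up a term is then a dictionary access rather than a scan over every log line, which is what makes search fast at scale.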

NEW TECHNIQUES: SEARCHING DATA
- Solr/ElasticSearch
- Both build on top of Lucene
- Search servers
- RESTful HTTP APIs
- Easy to administer
- Add powerful text/numerical search capabilities

BASIC SEARCH FEATURES
- Boolean logic (AND, OR, +, -)
- Sorting and Group By
- Range queries
- Phrase/Prefix/Fuzzy queries

ADVANCED SEARCH FEATURES
- Custom ranking/scoring
- More like this
- Auto suggest
- Faceting/Highlighting
- Geospatial search

SCALING SEARCH
- ElasticSearch and SolrCloud both have distributed features built in
  - Auto-sharding
  - Replication
  - Query routing
  - Transaction log

USE CASE
- Problem
- Sample Solution
- Core Design Issues
- Other Solutions

THE PROBLEM
[Diagram: NetApp Filer devices at Client A, Client B, and Client C each send logs to Home Base, which holds all logs. Access paths shown: Latest Log, Full SQL Access, REST API, Flat File Access. Consumers shown: Applications, Engineers & Analysts.]

SEARCH APPLICATION FEATURES
- Find the last three days of raw logs from an entire cluster
- Group available capacity by machine serial number and show the largest capacities first
- Search all device header lines for FAILURE
- View all hard disk objects that have product number 2341AB
- Find all motherboards with an associated customer ticket
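Two of these features might translate into Solr query parameters roughly as follows. This is a hedged sketch: the field names (cluster_id, date_sent, header) are borrowed from the Solr document slide later in the deck, the host and the core name "logs" are placeholders, and the date range assumes date_sent is indexed as a Solr date field.

```python
from urllib.parse import urlencode

def solr_select_url(base_url, params):
    """Build a Solr /select URL from a dict of query parameters."""
    return f"{base_url}/select?{urlencode(params)}"

# Last three days of logs for one cluster (assumes date_sent is a date field).
recent_cluster_logs = {
    "q": 'cluster_id:"1333-2241-3411"',
    "fq": "date_sent:[NOW-3DAYS TO NOW]",
    "sort": "date_sent desc",
}

# All device header lines that mention FAILURE.
failed_headers = {"q": "header:FAILURE"}

# Placeholder host and core name, not taken from the talk.
url = solr_select_url("http://localhost:8983/solr/logs", failed_headers)
```

The grouping and join-style features (capacity by serial number, motherboards with tickets) depend on how the data is denormalized, so they are left out here.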

SAMPLE SOLUTION
[Diagram labels: Logs, Ingestion, HDFS, MapReduce, Parsing/Loading, Indexing, Custom RESTful Search API, Queries]

INGESTION

PARSING, LOADING, AND INDEXING
[Diagram: offsets within three sequence files]
  /ingestion/sequencefile1: 0, 1534, 4562, 5323, 7232
  /ingestion/sequencefile2: 0, 4601, 5105
  /ingestion/sequencefile3: 0, 1492, 2987, 4767, 5987
- Load HBase with parsed objects
- Store HBase ROW_ID
- Store pointer to raw file in HDFS
- Index a number of desired fields
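The four steps on this slide can be sketched in a few lines. Plain Python containers stand in for HBase and Solr here, and the row key scheme (file path plus offset) is an assumption made for illustration, not the scheme used in the talk.

```python
def index_records(sequencefile, records, hbase_table, solr_docs):
    """Load parsed records and index pointers back to the raw data.

    records: iterable of (offset, parsed) pairs, where parsed is a dict.
    hbase_table: dict standing in for an HBase table (row id -> object).
    solr_docs: list standing in for the Solr index.
    """
    for offset, parsed in records:
        row_id = f"{sequencefile}:{offset}"   # hypothetical row key scheme
        hbase_table[row_id] = parsed          # load HBase with the parsed object
        solr_docs.append({
            "rowkey": row_id,                 # store HBase ROW_ID
            "sequencefile": sequencefile,     # pointer to the raw file in HDFS
            "sfoffset": offset,               # where the record starts in it
            "header": parsed.get("header", ""),  # one of the indexed fields
        })

hbase_table, solr_docs = {}, []
index_records(
    "/ingestion/sequencefile2",
    [(0, {"header": "INFO: OK"}), (4601, {"header": "WARNING: DISK DEAD"})],
    hbase_table,
    solr_docs,
)
```

The index entry stays small because it carries pointers (row key, file, offset) rather than the full object or the raw log.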

INSIDE OF HBASE
[Table: columns rowkey, object1, object2, object3, object4, object5; cell contents elided]

THE SOLR DOCUMENT
Solr Document
  rowkey        ...
  sequencefile  /ingest/file2
  sfoffset      2343
  cluster_id    1333-2241-3411
  system_id     42ADFF-BZMM
  date_sent     2013/05/12
  file_name     configs.log
  contents      ...
  header        WARNING: DISK DEAD...

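SEARCH APPLICATION
[Diagram: numbered request flow through the search application]
1. User sends a query to the search application
2. Search application issues the query
3. Row_id and data locations come back
4-5. Stored object is fetched
6-7. Raw data location is followed to the raw data
8. Results are returned to the user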
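One reading of the numbered flow on this slide, sketched with in-memory stand-ins: the index returns a row key and a raw data location, the row key resolves to the stored object, and the file path plus offset resolves to the raw log line. The assignment of step numbers to Solr, HBase, and HDFS is an assumption, as are all the sample values.

```python
import io

def search(query_term, solr_docs, hbase_table, raw_files):
    """Resolve a query to stored objects and raw data via index pointers."""
    results = []
    for doc in solr_docs:                      # steps 2-3: query the index
        if query_term.lower() not in doc["header"].lower():
            continue
        stored = hbase_table[doc["rowkey"]]    # steps 4-5: fetch stored object
        raw = raw_files[doc["sequencefile"]]   # steps 6-7: follow the pointer
        raw.seek(doc["sfoffset"])
        results.append((stored, raw.readline().rstrip(b"\n")))
    return results                             # step 8: results to the user

# Hypothetical data: one indexed record pointing at byte offset 9 of a raw file.
raw_files = {"/ingest/file2": io.BytesIO(b"INFO: OK\nWARNING: DISK DEAD\n")}
hbase_table = {"r1": {"type": "disk"}}
solr_docs = [{
    "rowkey": "r1",
    "sequencefile": "/ingest/file2",
    "sfoffset": 9,
    "header": "WARNING: DISK DEAD",
}]
hits = search("disk dead", solr_docs, hbase_table, raw_files)
```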

CORE DESIGN ISSUES
- Changing the Solr schema (manual reindex)
- Elastic shard scaling (manual reindex)
- No distributed joining (denormalizing the data)
- Replication*
- Manually managing Solr partitioning/sharding*
- Write durability*

SOLRCLOUD
- Automatic shard creation, routing
- Replication
- Limited to a fixed number of shards defined on initial creation
- ZooKeeper for coordination
- Large community

ELASTICSEARCH
- Similar feature set to Solr
- Purpose-built for easily managing a distributed index
- Rapidly growing community
- Custom-built coordination mechanism
- JSON-based API

DATASTAX ENTERPRISE
- Integrates Cassandra and Solr
- Automatic indexing in Solr/storing in Cassandra
- Automatic partitioning
- Automatic reindexing
- Not limited to a fixed number of shards
- Proprietary and costs money

CONCLUSION
- Collecting and analyzing device data/product logs can be a very difficult challenge
- You can use NoSQL and search technologies like Solr or ElasticSearch in unison...
- ...but it is not always easy to integrate search with NoSQL

QUESTIONS?
Feel free to reach out if you have any questions or need help with big data/search!
- http://ryantabora.com
- http://thinkbiganalytics.com
- http://www.slideshare.net/ratabora
- ryan.tabora@thinkbiganalytics.com
- @ryantabora

BONUS SLIDES

HBASE AND SOLR
- Automatic partitioning/reindexing
- Automatic index updates on HBase inserts/deletes
- Mapping HBase cells to a Solr schema
- No perfect commercial/open source solution yet
- Many many many more...

HBASE + SOLR: AUTOMATIC INDEXING
- HBase coprocessors are like stored procedures/triggers
- New, powerful, and dangerous
- Triggers on HBase puts/deletes
- Mapping data to a schema?

HBASE + SOLR: WRITE DURABILITY
[Diagram: MapReduce Indexing Application -> HBase Table (SOLR_QUEUE) -> Solr Queue Reader -> Solr Shard 1, Solr Shard 2, Solr Shard 3]
1. Create SolrDocument from raw ASUP
2. Add SolrDocument to HBase for durability
3. Get oldest SolrDocument from HBase queue table
4. Use custom hash algorithm to determine which shard to add SolrDocument to
5. Query Solr; if SolrDocument was added, then remove it from the SOLR_QUEUE
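The five steps on this slide amount to a durable queue in front of the index: a document leaves the queue only after the index confirms it. A minimal sketch, with an ordered dict standing in for the SOLR_QUEUE table and plain dicts standing in for Solr shards. CRC32 is used here only as a stand-in for the "custom hash algorithm", which the slide does not specify.

```python
import zlib
from collections import OrderedDict

def shard_for(doc_id, num_shards):
    """Pick a shard with a stable hash (CRC32 stands in for the custom hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def drain_queue(solr_queue, shards):
    """Move documents from the durable queue into shards, oldest first."""
    for doc_id in list(solr_queue):            # step 3: oldest document first
        doc = solr_queue[doc_id]
        shard = shards[shard_for(doc_id, len(shards))]  # step 4: choose shard
        shard[doc_id] = doc
        if doc_id in shard:                    # step 5: confirm, then dequeue
            del solr_queue[doc_id]

solr_queue = OrderedDict()                     # stands in for HBase SOLR_QUEUE
solr_queue["doc-1"] = {"header": "WARNING: DISK DEAD"}   # steps 1-2
solr_queue["doc-2"] = {"header": "INFO: OK"}
shards = [{}, {}, {}]                          # three Solr shards
drain_queue(solr_queue, shards)
```

If the reader crashes between the add and the dequeue, the document is still in the queue and is simply added again on the next pass, so writes are not lost.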

HBASE + SOLR: ELASTIC SHARDING
- HBase's distribution mechanism uses the concept of regions to split data across many nodes
- Region splitting can be automatic or manual (performance degrades as regions split)
- Piggybacking Solr sharding on HBase region splitting
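The slide does not say how the piggybacking would work. One possible interpretation is that each HBase region gets its own Solr shard, so a document's shard is found the same way HBase finds a row's region: by locating the row key among the sorted region start keys. The sketch below illustrates only that lookup; the keys are made up.

```python
import bisect

def region_for(row_key, region_start_keys):
    """Return the index of the region whose key range contains row_key.

    region_start_keys is sorted; the first region starts at the empty key.
    """
    return bisect.bisect_right(region_start_keys, row_key) - 1

start_keys = ["", "m"]                 # two regions: ["", "m") and ["m", end)
before = region_for("q123", start_keys)

# Splitting the second region at "t" adds a start key and, under the
# one-shard-per-region assumption, a new shard.
bisect.insort(start_keys, "t")
after = region_for("q123", start_keys)
moved = region_for("x999", start_keys)
```

Only keys at or above the split point change region, so under this scheme a split would require reindexing one shard's worth of documents rather than the whole index.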