Putting together the platform: Riak, Redis, Solr and Spark. Bryan Hunt

Similar documents
#IoT #BigData. 10/31/14

DEMYSTIFYING BIG DATA WITH RIAK USE CASES. Martin Schneider Basho Technologies!

relational Relational to Riak Why Move From Relational to Riak? Introduction High Availability Riak At-a-Glance

Python, PySpark and Riak TS. Stephen Etheridge Lead Solution Architect, EMEA

What am I? Bryan Hunt Basho Client Services Engineer Erlang neophyte JVM refugee Be gentle

Spotify. Scaling storage to million of users world wide. Jimmy Mårdell October 14, 2014

Advanced Databases ( CIS 6930) Fall Instructor: Dr. Markus Schneider. Group 17 Anirudh Sarma Bhaskara Sreeharsha Poluru Ameya Devbhankar

From Relational to Riak

ZooKeeper & Curator. CS 475, Spring 2018 Concurrent & Distributed Systems

CIB Session 12th NoSQL Databases Structures

Apache Hadoop Goes Realtime at Facebook. Himanshu Sharma

Key-Value Stores: RiakKV

Introduction to Database Services

Key-Value Stores: RiakKV

Relational databases

Introduction to MySQL InnoDB Cluster

Index. Raul Estrada and Isaac Ruiz 2016 R. Estrada and I. Ruiz, Big Data SMACK, DOI /

Fluentd + MongoDB + Spark = Awesome Sauce

How do we build TiDB. a Distributed, Consistent, Scalable, SQL Database

A Survey Paper on NoSQL Databases: Key-Value Data Stores and Document Stores

Jargons, Concepts, Scope and Systems. Key Value Stores, Document Stores, Extensible Record Stores. Overview of different scalable relational systems

Key-Value Stores: RiakKV

Which compute option is designed for the above scenario? A. OpenWhisk B. Containers C. Virtual Servers D. Cloud Foundry

Shen PingCAP 2017

Distributed Data Management Replication

Using the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver

<Insert Picture Here> MySQL Web Reference Architectures Building Massively Scalable Web Infrastructure

Cloud Computing & Visualization

MariaDB MaxScale 2.0, basis for a Two-speed IT architecture

Scott Meder Senior Regional Sales Manager

Using MySQL for Distributed Database Architectures

Introduction Storage Processing Monitoring Review. Scaling at Showyou. Operations. September 26, 2011

MySQL Group Replication. Bogdan Kecman MySQL Principal Technical Engineer

MySQL & NoSQL: The Best of Both Worlds

BIG DATA TECHNOLOGIES: WHAT EVERY MANAGER NEEDS TO KNOW ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29,

Conceptual Modeling on Tencent s Distributed Database Systems. Pan Anqun, Wang Xiaoyu, Li Haixiang Tencent Inc.

Scaling for Humongous amounts of data with MongoDB

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Distributed Systems Exam 1 Review Paul Krzyzanowski. Rutgers University. Fall 2016

DATABASE SCALE WITHOUT LIMITS ON AWS

ebay s Architectural Principles

MySQL High Availability

BERLIN. 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved

Architecture of a Real-Time Operational DBMS

State of the Dolphin Developing new Apps in MySQL 8

VOLTDB + HP VERTICA. page

Making Non-Distributed Databases, Distributed. Ioannis Papapanagiotou, PhD Shailesh Birari

CSE Traditional Operating Systems deal with typical system software designed to be:

Couchbase Architecture Couchbase Inc. 1

8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara

CISC 7610 Lecture 5 Distributed multimedia databases. Topics: Scaling up vs out Replication Partitioning CAP Theorem NoSQL NewSQL

Project Midterms: March 22 nd : No Extensions

Caching patterns and extending mobile applications with elastic caching (With Demonstration)

Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch

Atlassian s Journey Into Splunk

Microsoft Azure Databricks for data engineering. Building production data pipelines with Apache Spark in the cloud

Design Patterns for Large- Scale Data Management. Robert Hodges OSCON 2013

Document Sub Title. Yotpo. Technical Overview 07/18/ Yotpo

Introduction to Distributed Data Systems

Presented by Sunnie S Chung CIS 612

Case Study Ecommerce Store For Selling Home Fabrics Online

Session 7: Configuration Manager

Amazon ElastiCache 8/1/17. Why Amazon ElastiCache is important? Introduction:

MongoDB. David Murphy MongoDB Practice Manager, Percona

Cloud Analytics and Business Intelligence on AWS


Distributed Systems. 29. Distributed Caching Paul Krzyzanowski. Rutgers University. Fall 2014

Apache Ignite TM - In- Memory Data Fabric Fast Data Meets Open Source

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

High-Performance Distributed DBMS for Analytics

IBM Db2 Event Store Simplifying and Accelerating Storage and Analysis of Fast Data. IBM Db2 Event Store

NOSQL OPERATIONAL CHECKLIST

Case Study. Performance Optimization & OMS Brainvire Infotech Pvt. Ltd Page 1 of 1

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Distributed Filesystem

PNUTS and Weighted Voting. Vijay Chidambaram CS 380 D (Feb 8)

Introduction to Oracle NoSQL Database

Amazon Aurora Relational databases reimagined.

Introduction Aggregate data model Distribution Models Consistency Map-Reduce Types of NoSQL Databases

Battle of the Giants Apache Solr 4.0 vs ElasticSearch 0.20 Rafał Kuć sematext.com

Managing IoT and Time Series Data with Amazon ElastiCache for Redis

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2015 Lecture 14 NoSQL

Pregel: A System for Large- Scale Graph Processing. Written by G. Malewicz et al. at SIGMOD 2010 Presented by Chris Bunch Tuesday, October 12, 2010

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks

Time Series Live 2017

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

MySQL HA Solutions Selecting the best approach to protect access to your data

OPEN SOURCE DB SYSTEMS TYPES OF DBMS

Hands-On with Mendix 7

DISTRIBUTED SYSTEMS. Second Edition. Andrew S. Tanenbaum Maarten Van Steen. Vrije Universiteit Amsterdam, 7'he Netherlands PEARSON.

TrafficDB: HERE s High Performance Shared-Memory Data Store Ricardo Fernandes, Piotr Zaczkowski, Bernd Göttler, Conor Ettinoffe, and Anis Moussa

Distributed Architectures & Microservices. CS 475, Spring 2018 Concurrent & Distributed Systems

Kafka Connect the Dots

Eventually Consistent HTTP with Statebox and Riak

TECHED USER CONFERENCE MAY 3-4, 2016

NPTEL Course Jan K. Gopinath Indian Institute of Science

Datacenter replication solution with quasardb

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Distributed Systems Principles and Paradigms. Chapter 01: Introduction. Contents. Distributed System: Definition.

Building Highly Available and Scalable Real- Time Services with MySQL Cluster

Transcription:

Putting together the platform: Riak, Redis, Solr and Spark Bryan Hunt 1

$ whoami Bryan Hunt Client Services Engineer @binarytemple 2

Minimum viable product - the ideologically correct doctrine 1. Start with what you have 2. Solve problems as they arise 3. Get to the point where you can t scale 4. Use a more scalable solution 3

4

Small ecommerce site Sessions stored in browser cookie As the cart is mutated it is writen to the database Abandoned carts timeout and items are returned to inventory Web Client Product Pipeline RDBMS 5

We re getting slow that means we ve succeeded 6

Billions of Data Points Millions of Users, 10 Million Meters in EU Only 4 people devops team 2000 4 DATA POINTS PER YEAR (traditional gas meter) 2013 35,000+ DATA POINTS PER YEAR (digital meter) CONFIDENTIAL 7

More users == slow read performance if!cache $item db.fetch($item) Latency increases when not cached Caching layer added with above logic Need for cache invalidation logic Cache warming. Oh my! Web Client Cache Product Catalogue Pipeline RDBMS 8

Web store - now support mobile clients Mobile Client Web Client Product Catalogue Pipeline Mobile client added RDBMS write locks increase Latency still increases Caching layer added capacity but invalidation is a pain Write locks are still an issue Cache Cache RDBMS 9

More and more demands make it searchable How can it take 3 weeks to add a simple feature? 10

Web store - growing pains Mobile Client Web Client Product Pipeline SQL profiling/ optimization Search is separated Don t forget to populate the index Never, never run *that* query Cache Cache Search Services RDBMS 11

Product Catalogue The Feature Backlog Continues Performance Mobile Scale and Availability Analytics Search 12

Enter the Riak 13

Scalable, Available, Fault Tolerant Riak has a masterless architecture in which every node in a cluster is capable of serving read and write requests. 14

Write Availability and Consistency Are they mutually exclusive? 15

Conflicts? Riak Data Types (Convergent Replicated Data Types or CRDTs) are a developer-friendly way to keep track of updates in an eventually consistent environment: Counter Keeps tracks of increments and decrements on an integer Flag Values limited to enable or disable Set A collection of unique binary values that supports add and remove operations on one or more values Map Supports the nesting of Riak Data Types Register A named binary field that can only be used as part of a Map 16

Multi-cluster Replication CONFIDENTIAL 17 Basho Technologies 16

Dealing with Search at scale Distributed Full-Text Search Standard full-text Solr queries automatically expand into distributed search queries for a complete result set across instances. Index Synchronization Data is automatically synchronized between Riak KV and Solr using intelligent monitoring to detect changes, and propagates those to Solr indexes. Auto-Restart Monitor Solr OS processes continuously and automatically start or restart them whenever failures are detected. 18

Make Search Better we need to get a competitive advantage 19

Web store: One step closer to success Legacy Infrastructure Product Catalogue Relational DB Middleware Python/Ruby/Node JSON Web / Mobile Product Catalogue Pipeline Riak KV Store JSON Index & Search via integrated Solr Cache Solr Results 20

Basho Data Platform 21

What it provides.. Easy to use caching layer Automatically indexed search Scalable, highly available persistence layer Simple pipelining to data analysis 22

Client Application Write Read Write Read/Solr Riak Client Read Basho Redis Proxy Write Read Redis Redis Riak Riak Read Query Spark Connector Redis Services Service Manager Leader Election Spark Spark Basho Data Platform Riak Ensemble Spark Services Spark Client CONFIDENTIAL 23

Caching 24

Redis - why? 25

Why Use Redis With The Data Platform? Redis, by itself, is approximately 6x faster than Riak for reads Redis is not built to scale With the Basho Data Platform: Approximately 5x faster read performance compared to Riak for buckets at an 80% cache hit ratio. 26

Basho Data Proxy & Redis 27

Resilience CONFIDENTIAL 28

Basho Data Proxy & Redis Redis v2.8.21 Redis Cluster (3.0+) need not apply Cache Proxy using only GET and SET SET uses a TTL, configured as CACHE_TTL Client may use other commands such as: PEXPIRE to invalidate a cached value INFO to integrate with monitoring solutions MONITOR to watch traffic to Redis hosts Written in C Basho Data Proxy Read-through cache in BDP 1.0 Write-through cache in a future BDP release EE only, until write-through cache is complete and released Pre-sharding & connection aggregation built-in 29

Analytics 30

Connecting to BDP with Spark Spark Client Spark Connector 31

Data Platform Spark Cluster 32

Spark & Spark Cluster Manager V1.4.1 Spark Dependencies Java 8 Scala 2.10.5 Spark Cluster Manager Shipped as a JAR in the BDP extras package Enables running a HA Spark cluster Spark Master Spark Stand-by Master(s) Implemented as a custom Spark recovery mode Stores its metadata in a strongly consistent Riak bucket 33

Why Basho Data Platform? 34

35

Product Catalogue with Basho Data Platform Mobile Client Web Client Product Catalogue Pipeline Analytics Services Cache Proxy 36

Faster Analysis More IOPS/TBs Faster Reads 37

QUESTIONS? 38

Spend Time getting to know us OPEN SOURCE Basho Data Platform (code) Riak KV with parallel extract Basho Data Platform Add-on s (code) Spark + Spark Connector Download a build ENTERPRISE Basho Data Platform, Enterprise Riak EE with multi-cluster replication Spark Leader Election Service Basho Data Platform Add-on s Redis + Cache Proxy Spark Workers + Spark Master Contact us to get started @basho @binarytemple 39

40

Thanks! 41