Invitation to a New Kind of Database. Sheer El Showk, Cofounder, Lore Ai. We're Hiring!

Invitation to a New Kind of Database
Sheer El Showk, Cofounder, Lore Ai
www.lore.ai
We're Hiring!

Overview
1. Problem statement (~2 minutes)
2. (Proprietary) Solution: Datomic (~10 minutes)
3. Proposed Open Source Solution (~8 minutes)
   a. Easy Version(s)
   b. Hard Version(s)
4. Discussion & More Brainstorming (after talk & all of tomorrow)

The hard thing about hard things...
- NoSQL vs SQL
- Data model vs database type
- Column vs Row vs Document
- Backups: consistent snapshots across different (types of) DBs
- Versioning: changes over time
- Trade-offs: scaling, consistency
[Architecture diagram: Load Balancer; Django Web; Django API; Celery Workers; Cron-jobs; Logging; Web scrapers; KG Service; Entity Generator; Input Processor; Service Broker; Pipeline; NLP Services; MySQL (soon ElasticSearch); Mongo Cluster; Minio (distributed blob storage); Redis (distributed caching)]

Deconstructing the Database
Rich Hickey (inventor of Clojure)
Outdated, Bad Ideas:
- Changing data in place (mutability)
- Fixed schema structure
- Disk locality is an outdated idea (networks are faster than disks!)
- Database must be a server (why not db as a library?)
- Reads & writes done by same system
https://youtu.be/cym4tzwtcnu
Even modern SQL/NoSQL DBs make outdated trade-offs (e.g. consistency vs scalability)
Can we come up with more

Disclaimers
1. I have no affiliation with Datomic
2. I've never used Datomic
3. I'm compressing Hickey's very nice talk into 10 min
   Missing lots of important ideas/details
4. I've only been thinking about this for a few weeks (on & off)

Hickey's Solution I: Fact-Based Schema
Atomic unit of data (datom) is a "fact":

Transaction Time   Entity (subject)   Property/Relation   Value
12234              334                firstname           sheer
12235              334                cofounderof         445 (Lore)

Equivalent JSON:

    { _id: 334, first_name: "sheer", ..., company: { name: "Lore", ... } }

- Fact values can be literals, lists, types or IDs of other entities (i.e. relations).
- Encodes arbitrary schema but additionally includes a time dimension.
- Don't confound data with encoding.
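
The relationship between facts and the equivalent document can be sketched in a few lines of Python. This is only an illustration of the idea, not Datomic's API: the `Datom` tuple, the `entity_view` helper and the third fact (a name for entity 445) are assumptions added for the example.

```python
from collections import namedtuple

# A datom as described on the slide: (transaction time, entity, attribute, value).
Datom = namedtuple("Datom", ["tx", "entity", "attribute", "value"])

facts = [
    Datom(12234, 334, "firstname", "sheer"),
    Datom(12235, 334, "cofounderof", 445),  # 445 is the id of another entity (Lore)
    Datom(12236, 445, "name", "Lore"),      # hypothetical extra fact for illustration
]

def entity_view(facts, entity_id):
    """Collect all facts about one entity into a dict (latest value wins)."""
    doc = {"_id": entity_id}
    for fact in sorted(facts, key=lambda f: f.tx):
        if fact.entity == entity_id:
            doc[fact.attribute] = fact.value
    return doc

person = entity_view(facts, 334)
company = entity_view(facts, person["cofounderof"])  # follow the relation by id
```

The document is just one encoding of the facts; the facts themselves also carry the transaction time, which the document form drops.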

Hickey's Solution II: Immutable Data
- All writes add/retract a fact from the DB
- Writes are appended to a log: an append-only system
- Deletions are just a series of retractions
- Operations (add/del) are time-stamped

Time   Op    Ent     Rel       Value
123    add   sheer   email     sheer@gmail.com
124    add   sheer   email     sheer@yahoo.com
125    del   sheer   email     sheer@gmail.com
126    add   sheer   livesin   Paris

Data is never deleted!
The database is just a log of very granular updates of facts.
Consistency is trivial: select a timestamp and read the log up to that timestamp.
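
"Select a timestamp and read the log up to that timestamp" can be shown directly with the rows from the slide. The `snapshot` function below is a hypothetical helper written for this example; it assumes the log is already ordered by time, which an append-only log gives for free.

```python
log = [
    (123, "add", "sheer", "email", "sheer@gmail.com"),
    (124, "add", "sheer", "email", "sheer@yahoo.com"),
    (125, "del", "sheer", "email", "sheer@gmail.com"),
    (126, "add", "sheer", "livesin", "Paris"),
]

def snapshot(log, as_of):
    """Return the set of (entity, relation, value) facts that hold at time `as_of`."""
    facts = set()
    for time, op, ent, rel, value in log:
        if time > as_of:
            break  # append-only log, so later entries cannot affect this snapshot
        if op == "add":
            facts.add((ent, rel, value))
        elif op == "del":
            facts.discard((ent, rel, value))  # a retraction hides the fact, the log keeps it
    return facts

before = snapshot(log, 124)
after = snapshot(log, 126)
```

Both snapshots are computed from the same unchanged log, so a reader at time 124 still sees the gmail address even after it was retracted at time 125.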

Hickey's Solution III: Structural Sharing
Consistent reads require versioning (timestamps), so what about an index?
- Hickey (+ Phil Bagwell) invented/improved the HAMT: Hash Array Mapped Trie
- Structural sharing for efficiently versioning indices
- High (32-way) branching ratio for shallow trees
- Querying an element at a fixed timestamp only requires accessing a single path in the tree
- Cache/access a subset of the index as required
https://hypirion.com/musings/understanding-persistent-vector-pt-1
Universal fact-based schema: only need a fixed number (6) of indices
- Composite index on ent-rel-value, rel-ent-value, etc.
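
Structural sharing is easiest to see in a toy persistent trie. The sketch below is not a full HAMT (no hashing, no bitmap compression) and not the Clojure or pyrsistent implementation; it is a fixed-depth, index-addressed trie with 4-way branching, chosen only to keep the example small. All names are made up for the illustration.

```python
BITS = 2                 # 4-way branching for readability; the slide cites 32-way
WIDTH = 1 << BITS
MASK = WIDTH - 1
DEPTH = 3                # fixed depth: capacity is WIDTH ** DEPTH = 64 slots

def empty(level=DEPTH):
    """Build an empty trie of nested tuples; the leaves hold None."""
    if level == 0:
        return None
    child = empty(level - 1)
    return (child,) * WIDTH

def get(root, index):
    """Walk one root-to-leaf path, consuming BITS bits of the index per level."""
    node = root
    for level in range(DEPTH - 1, -1, -1):
        node = node[(index >> (level * BITS)) & MASK]
    return node

def assoc(root, index, value, level=DEPTH - 1):
    """Return a new root with `index` set to `value`, copying only one path."""
    slot = (index >> (level * BITS)) & MASK
    if level == 0:
        new_child = value
    else:
        new_child = assoc(root[slot], index, value, level - 1)
    return root[:slot] + (new_child,) + root[slot + 1:]

v0 = empty()
v1 = assoc(v0, 5, "a")
v2 = assoc(v1, 42, "b")
```

Each `assoc` allocates only `DEPTH` new nodes. Here `v2[0] is v1[0]` holds because the subtree containing index 5 was not touched by the second write, while `v1` remains a valid, unchanged older version.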

Hickey's Solution IV: Separate Reads & Writes
- Reads, queries and index lookups happen in the client process (here "client" means app server, not user/browser).
- Client-side library provides in-process indexing, queries, aggregations, etc.
- Granular (fact-based) data model and incremental index allow efficient caching of the working set (facts & sub-trees).
- Consistency is trivially assured: each timestamp is a consistent snapshot of the DB.
- Trivial scalability: every app server is a processing peer.
- Only bottleneck is the Transactor, but it just appends updates to a log (can be made very fast).

How much of this can we replicate? Improve?

Proposal
1. Versioned Document Store ("easy")
   a. Update log via JSON-patch events
   b. Instantiate documents in a JSON store (Mongo) for near-real-time indexing
2. Versioned Relational Document Store ("hardish")
   a. Add real-time indexing to the above and restrict to shallow JSON linked by relations
3. Fully Versioned Indices ("hard")
   a. Implement HAMT or similar (see https://github.com/tobgu/pyrsistent)
4. Make it all efficient ("very hard")
   a. Cythonize
   b. Tweak caching, disk layout, etc.

Versioned JSON Store
Two ideas for the write server ("Transactor"):
- Append-only Mongo collection
- Redis queue persisted to S3/Minio flat file
Format:
- JSON-patch files ("transactions")
- JSON-schema library to validate
Read/Index servers:
- Cluster of Mongo/ES servers instantiating docs using the jsonpatch library
- Indexing almost real-time, but only for the current timestamp
[Diagram: writer options (Mongo; Redis; Minio distributed blob storage) feeding a read cluster of Mongo/ES clusters, with ETL via JSONPATCH]
Only capturing some benefits (little contention, versioning, some schema independence).
Only server-side processing. No versioned index.
Essentially: just versioned replication.
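
The write log and the "instantiate docs" step can be sketched without any external service. The `apply_patch` function below is a deliberately simplified stand-in for the jsonpatch library mentioned on the slide: it handles only `add`, `replace` and `remove` on single-level paths. The log layout and the `city` field are assumptions made for the example.

```python
import copy

def apply_patch(doc, patch):
    """Apply add/replace/remove operations with single-level paths like '/city'."""
    new_doc = copy.deepcopy(doc)  # never mutate the previous version
    for op in patch:
        key = op["path"].lstrip("/")
        if op["op"] in ("add", "replace"):
            new_doc[key] = op["value"]
        elif op["op"] == "remove":
            del new_doc[key]
        else:
            raise ValueError("unsupported op: %r" % op["op"])
    return new_doc

# Append-only transaction log: (timestamp, document id, patch).
log = [
    (1, 334, [{"op": "add", "path": "/first_name", "value": "sheer"}]),
    (2, 334, [{"op": "add", "path": "/city", "value": "Paris"}]),
    (3, 334, [{"op": "remove", "path": "/city"}]),
]

def instantiate(log, doc_id, as_of):
    """Rebuild one document by replaying its patches up to `as_of`."""
    doc = {"_id": doc_id}
    for timestamp, target, patch in log:
        if target == doc_id and timestamp <= as_of:
            doc = apply_patch(doc, patch)
    return doc

current = instantiate(log, 334, as_of=3)
earlier = instantiate(log, 334, as_of=2)
```

A read server in this design would keep only `current` in Mongo/ES for indexing, which is why the slide notes that indexing covers just the current timestamp; older versions need a replay like `earlier`.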

Versioned Relational JSON Store
Enforce shallow JSON:
- Only strings, ints, lists, etc. as values
- Dicts/subdocs replaced by IDs
Index in real time:
- Directly index the append-log collection
- Manual index in Redis
- Index individual facts, not just docs
Transactor manually validates patched JSON:
- Enforce existence of IDs (foreign keys)
- Ensure patch respects schema
Sample patch:

    {
      _id: 334,
      prev_checksum: "6d96617b37f4f662783c957",
      patches: [
        { "op": "add", "path": "/first_name", "value": "sheer" },
        { "op": "add", "path": "/company", "value": Id(443) },
      ]
    }

Notes:
- JSON schemas are JSON docs, so they can be versioned as a special collection in the DB.
- Subtle issue: do we enforce FK constraints if a doc/entity is retracted?
- Indexing/reads are still largely centralized (Redis/Mongo index on append-log).
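
A rough sketch of the checks the Transactor would run on such a patch is given below. It covers three of the ideas from the slide (previous-version checksum, shallow values, existence of referenced IDs) and leaves out JSON-schema validation. The `Id` class, the SHA-1 checksum over canonical JSON, and the in-memory `store` dict are assumptions for illustration, not part of the proposal.

```python
import hashlib
import json

class Id:
    """Marks a value as a reference to another entity (the slide's Id(443))."""
    def __init__(self, target):
        self.target = target

def checksum(doc):
    """Hash a canonical JSON encoding of the document (sorted keys)."""
    encoded = json.dumps(doc, sort_keys=True, default=lambda ref: {"$id": ref.target})
    return hashlib.sha1(encoded.encode("utf-8")).hexdigest()

def validate(store, transaction):
    """Return a list of problems; an empty list means the patch may be applied."""
    errors = []
    current = store.get(transaction["_id"], {})
    if transaction["prev_checksum"] != checksum(current):
        errors.append("stale checksum")  # the writer patched an outdated version
    for op in transaction["patches"]:
        value = op.get("value")
        if isinstance(value, dict):
            errors.append("nested document at %s" % op["path"])
        elif isinstance(value, Id) and value.target not in store:
            errors.append("unknown id %s at %s" % (value.target, op["path"]))
    return errors

store = {443: {"name": "Lore"}, 334: {}}
good = {
    "_id": 334,
    "prev_checksum": checksum(store[334]),
    "patches": [
        {"op": "add", "path": "/first_name", "value": "sheer"},
        {"op": "add", "path": "/company", "value": Id(443)},
    ],
}
bad = dict(good, patches=[
    {"op": "add", "path": "/company", "value": {"name": "Lore"}},
    {"op": "add", "path": "/friend", "value": Id(999)},
])
```

`validate(store, good)` returns an empty list, while `bad` is rejected for embedding a sub-document and for pointing at an entity that does not exist.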

Fully Versioned Index
Core ideas: can we move in-process?
- HAMT or other persistent index? (see https://github.com/tobgu/pyrsistent)
- Cache/access individual facts or index subtrees directly in process (with a two-tiered external cache like Datomic).
- Need to implement an in-memory query engine using a lazy-access cached persistent index. Pandas (like?) See https://github.com/tobgu/qcache
- Transactor or Indexer has to be able to merge the live index (recent facts) with the persistent index.
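
The last point, merging a live index with a persistent one, can be illustrated with two sorted sequences. This is a sketch under simple assumptions: the persistent index is an immutable sorted tuple rather than a trie, the facts are invented, and ordering uses only (entity, relation).

```python
import heapq

# Persistent index: immutable, sorted (entity, relation, value) facts.
persistent_index = (
    (334, "cofounderof", 445),
    (334, "firstname", "sheer"),
    (445, "name", "Lore"),
)

def key(fact):
    return fact[:2]  # order by (entity, relation)

# Live index: recent facts that have not been folded into the persistent index yet.
live_index = sorted([
    (500, "name", "Acme"),
    (334, "livesin", "Paris"),
], key=key)

def lookup(entity):
    """Query both indices as if they were one sorted ent-rel-value index."""
    merged = heapq.merge(persistent_index, live_index, key=key)
    return [fact for fact in merged if fact[0] == entity]

def compact():
    """Fold the live index into a new persistent index; the old tuple is untouched."""
    return tuple(heapq.merge(persistent_index, live_index, key=key))

paris_included = lookup(334)
new_index = compact()
```

Readers holding the old `persistent_index` keep a consistent view while `compact()` builds the next version.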

Make It All Very Efficient
- Cythonize (what parts?)
  - Jsonpatch
  - Schema validation
- Use dicts/native types instead of JSON
- Fast serialization: msgpack, pickle, etc.
- Can we borrow existing OSS technologies for indexing, etc.?

Some Comments
- An index is a tree: represent it in JSON/dict and version/store it like the log?
- Schemas are also just documents: can version and manage them as a special collection.
- Transactions are also first class, so they can carry additional metadata.
- More complex ops (+=1) require the Transactor to translate the op into atomic updates.
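
One possible reading of the last comment: a relative operation such as `+= 1` is expanded by the Transactor into absolute JSON-patch operations before it is logged. The helper name and the `login_count` field are hypothetical; `test` and `replace` are standard JSON Patch operations.

```python
def translate_increment(current_doc, path, amount=1):
    """Expand a relative update into absolute JSON-patch style operations."""
    key = path.lstrip("/")
    old_value = current_doc[key]
    return [
        # Guard: the patch only applies if the value is still what was read.
        {"op": "test", "path": path, "value": old_value},
        {"op": "replace", "path": path, "value": old_value + amount},
    ]

doc = {"_id": 334, "login_count": 7}
ops = translate_increment(doc, "/login_count")
```

Logging the absolute value keeps replay deterministic: applying the same transaction twice cannot increment the counter twice.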

Interested in working on this? Contact me! Slides will go on blog.lore.ai

Bye!