Application development with relational and non-relational databases

Similar documents
Relational databases

CIB Session 12th NoSQL Databases Structures

CompSci 516 Database Systems

NoSQL systems. Lecture 21 (optional) Instructor: Sudeepa Roy. CompSci 516 Data Intensive Computing Systems

Introduction to NoSQL Databases

NoSQL Databases. Amir H. Payberah. Swedish Institute of Computer Science. April 10, 2014

1

Advanced Database Technologies NoSQL: Not only SQL

Database Availability and Integrity in NoSQL. Fahri Firdausillah [M ]

Scaling Up HBase. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

NOSQL EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY

CSE 530A. Non-Relational Databases. Washington University Fall 2013

Perspectives on NoSQL

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

Topics. History. Architecture. MongoDB, Mongoose - RDBMS - SQL. - NoSQL

Study of NoSQL Database Along With Security Comparison

Goal of the presentation is to give an introduction of NoSQL databases, why they are there.

Overview. * Some History. * What is NoSQL? * Why NoSQL? * RDBMS vs NoSQL. * NoSQL Taxonomy. *TowardsNewSQL

CIT 668: System Architecture. Distributed Databases

Introduction to Graph Databases

Distributed Data Store

Non-Relational Databases. Pelle Jakovits

The NoSQL Ecosystem. Adam Marcus MIT CSAIL

Module - 17 Lecture - 23 SQL and NoSQL systems. (Refer Slide Time: 00:04)

Introduction to Big Data. NoSQL Databases. Instituto Politécnico de Tomar. Ricardo Campos

CSE 344 Final Review. August 16 th

Data Informatics. Seon Ho Kim, Ph.D.

The NoSQL movement. CouchDB as an example

Database Architectures

Introduction Aggregate data model Distribution Models Consistency Map-Reduce Types of NoSQL Databases

CMU SCS CMU SCS Who: What: When: Where: Why: CMU SCS

A NoSQL Introduction for Relational Database Developers. Andrew Karcher Las Vegas SQL Saturday September 12th, 2015

BIS Database Management Systems.

MIS Database Systems.

PROFESSIONAL. NoSQL. Shashank Tiwari WILEY. John Wiley & Sons, Inc.

/ Cloud Computing. Recitation 6 October 2 nd, 2018

AN introduction to nosql databases

CSE 530A ACID. Washington University Fall 2013

Introduction to Databases, Fall 2005 IT University of Copenhagen. Lecture 10: Transaction processing. November 14, Lecturer: Rasmus Pagh

A Study of NoSQL Database

Motivation Overview of NoSQL space Comparing technologies used Getting hands dirty tutorial section

DATABASE DESIGN II - 1DL400

Cassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent

DATABASE SYSTEMS. Introduction to MySQL. Database System Course, 2016

COSC 416 NoSQL Databases. NoSQL Databases Overview. Dr. Ramon Lawrence University of British Columbia Okanagan

Jargons, Concepts, Scope and Systems. Key Value Stores, Document Stores, Extensible Record Stores. Overview of different scalable relational systems

Oral Questions and Answers (DBMS LAB) Questions & Answers- DBMS

Using space-filling curves for multidimensional

Database Solution in Cloud Computing

SCALABLE CONSISTENCY AND TRANSACTION MODELS

Introduction to Data Management. Lecture #25 (Transactions II)

Shen PingCAP 2017

CISC 7610 Lecture 2b The beginnings of NoSQL

DATABASE SYSTEMS. Introduction to MySQL. Database System Course, 2016

HyperDex. A Distributed, Searchable Key-Value Store. Robert Escriva. Department of Computer Science Cornell University

CS-580K/480K Advanced Topics in Cloud Computing. NoSQL Database

Accessing other data fdw, dblink, pglogical, plproxy,...

Databases : Lectures 11 and 12: Beyond ACID/Relational databases Timothy G. Griffin Lent Term 2013

CS 245: Principles of Data-Intensive Systems. Instructor: Matei Zaharia cs245.stanford.edu

Hands-on immersion on Big Data tools

NoSQL Databases Analysis

Haridimos Kondylakis Computer Science Department, University of Crete

Databases : Lecture 1 2: Beyond ACID/Relational databases Timothy G. Griffin Lent Term Apologies to Martin Fowler ( NoSQL Distilled )

NoSQL systems: introduction and data models. Riccardo Torlone Università Roma Tre

Migrating Oracle Databases To Cassandra

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2015 Lecture 14 NoSQL

A Survey Paper on NoSQL Databases: Key-Value Data Stores and Document Stores

Administration Naive DBMS CMPT 454 Topics. John Edgar 2

MySQL Performance Optimization and Troubleshooting with PMM. Peter Zaitsev, CEO, Percona Percona Technical Webinars 9 May 2018

CISC 7610 Lecture 5 Distributed multimedia databases. Topics: Scaling up vs out Replication Partitioning CAP Theorem NoSQL NewSQL

MongoDB Schema Design for. David Murphy MongoDB Practice Manager - Percona

Rule 14 Use Databases Appropriately

Beyond Relational Databases: MongoDB, Redis & ClickHouse. Marcos Albe - Principal Support Percona

Sources. P. J. Sadalage, M Fowler, NoSQL Distilled, Addison Wesley

Chapter 8: Working With Databases & Tables

Introduction to Computer Science. William Hsu Department of Computer Science and Engineering National Taiwan Ocean University

DIVING IN: INSIDE THE DATA CENTER

Understanding NoSQL Database Implementations

Final Exam Review 2. Kathleen Durant CS 3200 Northeastern University Lecture 23

Advanced Data Management Technologies

Challenges for Data Driven Systems

10 Million Smart Meter Data with Apache HBase

NoSQL : A Panorama for Scalable Databases in Web

MySQL Performance Optimization and Troubleshooting with PMM. Peter Zaitsev, CEO, Percona

NewSQL Database for New Real-time Applications

Engineering Robust Server Software

A Review to the Approach for Transformation of Data from MySQL to NoSQL

Database Evolution. DB NoSQL Linked Open Data. L. Vigliano

Why NoSQL? Why Riak?

SQL: Transactions. Announcements (October 2) Transactions. CPS 116 Introduction to Database Systems. Project milestone #1 due in 1½ weeks

NoSQL Databases. an overview

Developing SQL Databases (762)

NoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu

CSC 261/461 Database Systems Lecture 20. Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101

Introduction to NoSQL

Presented by Sunnie S Chung CIS 612

COMP9321 Web Application Engineering

DATABASE SYSTEMS. Introduction to MySQL. Database System Course, 2018

COMP9321 Web Application Engineering

Introduction to Column Oriented Databases in PHP

Transcription:

Application development with relational and non-relational databases Mario Lassnig European Organization for Nuclear Research (CERN) mario.lassnig@cern.ch

About me Software Engineer Data Management for the ATLAS Experiment, CERN, 2006-ongoing Automotive navigation, AIT Vienna, 2004-2006 Avionics for autonomous robots, Austrian Space Forum, 2008-ongoing Education Cryptography (Undergrad) Graph theory (Master s) Multivariate statistics and machine learning (PhD) Largest 24/7 database built yet 2 billion rows 25 000 IOPS 2015-09-08 GridKa School 2015 Relational and Non-relational databases 2

About this course For every topic 1. We will do some theory 2. We will do a hands-on session Please don t blindly copy and paste the session codes from the wiki during the hands-on sessions; there ll be exercises later where you ll have to use what you ve learned! Please interrupt me whenever necessary! 2015-09-08 GridKa School 2015 Relational and Non-relational databases 3

Part I Introduction Relational primer Non-relational primer Data models

CAP Theorem It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees [Brewer, 2000] All clients always see the same data Consistency All clients can always read and write File systems, single-instance databases, Availability Distributed databases Choose two. Web caching, DNS, Partition tolerance The data can be split across the system 2015-09-08 GridKa School 2015 Relational and Non-relational databases 5

ACID and BASE ACID Atomicity all or nothing operations Consistency always valid state Isolation operations can be serialised Durability data is never lost BASE Basically available more often than not Soft state data might be lost Eventually consistent might return old data 2015-09-08 GridKa School 2015 Relational and Non-relational databases 6

So what is this NoSQL thing? Carlo Strozzi, 1998 Term invented for a relational database without a SQL interface Term re-coined 2009 by last.fm At an open-source distributed databases workshop Deal with the exponential increase in storage requirements Improve programmer productivity relational model might not map well to application native data structures use non-relational stores instead as application backend Improve performance for web-scale applications remember the CAP theorem there is no free lunch 2015-09-08 GridKa School 2015 Relational and Non-relational databases 7

Types of databases Row Stores Oracle, PostgreSQL, MySQL, SQLite, Column Stores Hbase, Cassandra, Hypertable, MonetDB Document Stores / Data Structure Stores ElasticSearch, MongoDB, CouchDB, Redis, PostgreSQL Key/Value Stores Dynamo, Riak, LevelDB, BerkeleyDB, Kyoto, Graph Stores Neo4j, Titan, Hypergraph, Multimodel Stores ArangoDB, CortexDB, Object Stores Versant, Relational Non-relational Many actually have overlapping concepts Get confused here: http://nosql-database.org/ 2015-09-08 GridKa School 2015 Relational and Non-relational databases 8

Relational model Proposed by Edgar F Codd, 1969 Concept: Relations Tuples Attributes DBMS: Table Row Column Relation Attribute Tuple http://www.ibm.com/developerworks/library/x-matters8/relat.gif 2015-09-08 GridKa School 2015 Relational and Non-relational databases 9

Structured Query Language Proposed by Edgar F Codd, 1970 Interaction with DBMS using declarative programming language ANSI/ISO Standard since 1986 Ess Que Ell? Sequel? CREATE TABLE table_name; SELECT column_name FROM table_name; INSERT INTO table_name(column_name) VALUES (value); UPDATE table_name SET column_name = value; DELETE FROM table_name; DROP TABLE table_name; 2015-09-08 GridKa School 2015 Relational and Non-relational databases 10

Row Stores Your classic RDBMS Physically stores data row-by-row Easy joining of data between tables one-to-one one-to-many many-to-many Normalization procedures to reduce duplicate data and complexity Not so good for aggregation (RDBMS vendors compete here) http://www.tomjewett.com/dbdesign/dbdesign.php?page=manymany.php 2015-09-08 GridKa School 2015 Relational and Non-relational databases 11

Column Stores (the most confusing of all) Many applications do not need relations, think analytics Row-based systems like traditional relational databases are ill-suited for aggregation queries Things like SUM/AVG of a column? Needs to read full row unnecessarily Physical layout of data column-wise instead saves IO and improves compression, facilitates parallel IO makes joins between columns harder Organize columns in column-groups/families to save joins Most column stores have native support for column-families http://www.tutorialspoint.com/hbase/images/table.jpg 2015-09-08 GridKa School 2015 Relational and Non-relational databases 12

Key/Value Stores Hashmap for efficient insert and retrieval of data You might know this as associative array, or dictionary, or hashtable Keys and value usually are bytestreams, but practically just strings Usually there are some performance guarantees, via options like sorted keys length restrictions hash functions Simple and easy to use Either as compile-time library Or as server, usually via wrapped native protocols, or via REST Key 1 Key 2 Key 3 Value 1 Value 2 Value 3 First one: dbm, 1979 Key 4 Value 4 2015-09-08 GridKa School 2015 Relational and Non-relational databases 13

Document Stores / Data Structure Stores Basically key/value stores, with the added twist that the store knows something about the internal structure of the value Very easy to use as backend for application When people think NoSQL, this is usually what they mean This flexibility comes at a price though we ll discuss this later http://docs.mongodb.org/v3.0/_images/data-model-denormalized.png 2015-09-08 GridKa School 2015 Relational and Non-relational databases 14

Graph Stores In the relational model actual n-to-n relations are cumbersome (that name though!) http://blog.octo.com/wp-content/uploads/2012/07/requestinsql.png 2015-09-08 GridKa School 2015 Relational and Non-relational databases 15

Graph Stores Make relations first-class citizens Physical layout optimised for distance between data points leads to easy & fast traversal for graph database engine http://blog.octo.com/wp-content/uploads/2012/07/requestingraph.png 2015-09-08 GridKa School 2015 Relational and Non-relational databases 16

Hands-on session 1 Create, read, update, delete data Using C/C++ and Python On PostgreSQL MonetDB LevelDB Redis MongoDB Neo4j (relational row-based) (relational column-based) (nonrelational key/value) (nonrelational data structure) (nonrelational document) (nonrelational graph) https://wiki.scc.kit.edu/gridkaschool/index.php/relational_and_non-relational_databases 2015-09-08 GridKa School 2015 Relational and Non-relational databases 17

Part II Fun and profit Query plans and performance tuning Transactional safety in multi-threaded environments Sparse metadata Competitive locking and selection

Query plans The single most important thing you learn today You want to avoid going to disk, to reduce number of IOPS and CPU In order of excessiveness FULL TABLE SCAN PARTITION SCAN INDEX RANGE SCAN PARTITION INDEX RANGE SCAN INDEX UNIQUE SCAN PARTITION INDEX UNIQUE SCAN Not all FULL TABLE SCANs are bad If you need to retrieve a lot of data, and it is indexed, you will get random IO on the disk prefer serial scan (FULL, PARTITION) in such cases If your data is of low cardinality (few values, lots of rows), then indexes will not help 2015-09-08 GridKa School 2015 Relational and Non-relational databases 19

Query plans How to optimize? Understand EXPLAIN PLAN statement, then decide Partitions Physical separation of data Costly to introduce afterwards (usually requires schema migration) Indexes Either global or partition local Log-n access to data https://docs.oracle.com/cd/b19306_01/server.102/b14220/img/cncpt158.gif http://www.mattfleming.com/files/images/example.gif 2015-09-08 GridKa School 2015 Relational and Non-relational databases 20

Transactional safety PostgreSQL prohibits this by design In multi-threaded environments concurrent access to the same data is likely this can cause serious problems Dirty Read Read data by uncommitted transaction Non-repeatable Read Reads previously read data again, but it has changed in the meantime by another transaction Phantom Read Repeated query of the same conditions yields different results due to intermediate other transaction Different transaction isolation levels provide safeguards By locking of rows and thus making other transactions wait The more you lock, the slower you are Can lead to deadlocks if careless always lock rows in the same order! 2015-09-08 GridKa School 2015 Relational and Non-relational databases 21

Sparse metadata Think tagging/labelling datasets Difficult problem in relational model Keep extra columns or implement a relational key-value store Extra columns are bad for physical disk layout Relational key-value store requires lots of joins not good for CPU Will you ever search on metadata? No - Store the metadata as a JSON encoded string in a single column Yes - Many different kinds of metadata? No - Maintain a separate metadata table, with pre-created columns Yes - Use the built-in JSON support of Oracle or PostgreSQL, or as a last resort: use a non-relational database There is some really bad advice on StackOverflow promoting a generic approach to metadata please don t do this 2015-09-08 GridKa School 2015 Relational and Non-relational databases 22

Competitive locking Many applications have the following use case Many processes write something in a queue the backlog of things to do Many processes read from that queue process them in parallel Scheduling problem Do things in order? Prioritise certain things? How to avoid that multiple workers process the same things? Repeat after me: a database is not a queuing system There are two potential solutions each with their own drawback Database-level (row read lock, easy): BEGIN; SELECT row FOR UPDATE ; COMMIT/ROLLBACK Application-level (no lock, complex) When selecting work, compute row-hash, convert, modulo #workers Only work on rows that match worker-id 2015-09-08 GridKa School 2015 Relational and Non-relational databases 23

Part III Survival Distributed transactions SQL injection Application-level woes

Distributed transactions Sometimes you really need two different data stores Sometimes you need to be consistent between both Consensus protocols are needed Two-phase commit Paxos Needs operational support by database (pending writes) But you still have to code it in the application http://gemsres.com/photos/story/res/43755/fig1.jpg 2015-09-08 GridKa School 2015 Relational and Non-relational databases 25

SQL Injection https://xkcd.com/327/ 2015-09-08 GridKa School 2015 Relational and Non-relational databases 26

SQL Injection https://hackadaycom.files.wordpress.com/2014/04/18mpenleoksq8jpg.jpg?w=636 2015-09-08 GridKa School 2015 Relational and Non-relational databases 27

Application level woes Handling sessions this will be your major source of pain Connection Session Transaction Most databases only have a limited number of available connections Some pay with CPU for logons, others with RAM for sessions, etc E.g, have to channel 100 concurrent transactions across 10 connections SessionPooling/QueuePool every language/database has it s own idea Use an abstraction, don t code this yourself SQLAlchemy, Django (Python) Also come with an Object-Relational Mapper Makes relations into transparent Python objects CodeSynthesisODB (C++) Also supports BOOST datatypes(!) 2015-09-08 GridKa School 2015 Relational and Non-relational databases 28

Part IV The challenge

Challenge description Write a Twitter clone before time runs out 18:00 Choose any database you like (after you thought about the design!) Stick to the following UX you will write four programs Inserter: periodically inserts new random tweet into the database Latest: periodically prints the latest 10 tweets Random: periodically prints random 5 tweets Stats: periodically displays statistics How many tweets were added in the last minute by each user and overall (insertion rate) How often a given tweet was displayed (popularity of a particular tweet) Start 10 inserter, 10 latest, 10 random, 1 stats Use this random sentence generator (pip install loremipsum) import loremipsum loremipsum.generate_sentence() When stuck, ask me when done, show me! Good luck and have fun! 2015-09-08 GridKa School 2015 Relational and Non-relational databases 30

Application development with relational and non-relational databases Mario Lassnig European Organization for Nuclear Research (CERN) mario.lassnig@cern.ch