NoSQL Databases Vincent Leroy 1
Database Large-scale data processing First 2 classes: Hadoop, Spark Perform some computacon/transformacon over a full dataset Process all data SelecCve query Access a specific part of the dataset Manipulate only data needed (1 record among millions) à Database system 2
Processing / Database Link Batch Job (Hadoop, Spark) Stream Job (Spark, Storm) e.g. sencment analysis Serve queries Load data Write results Write results Database Insert records ApplicaCon 1 ApplicaCon 2 ApplicaCon 3 e.g. TwiSer trends page 3
Different types of databases So far we used HDFS A file system can be seen as a very basic database Directories / files to organize data Very simple queries (file system path) Very good scalability, fault tolerance Other end of the spectrum: RelaConal Databases SQL query language, very expressive Limited scalability (generally 1 server) 4
Size / Complexity Graph DB Complexity RelaConal DB Document DB Column DB Key/Value DB Filesystem Size 5
The NoSQL Jungle 6
Goal of these slides Present an overview of the NoSQL landscape Trade-off in choosing a solucon Theorems and principles Not a manual to learn specific DBs Too many of them Not that complicated (especially K/V stores) Focus on Neo4j graph DB in lab work 7
RelaConal Databases: SQL SQL language born 1974 SCll used by most data processing systems (including Spark) à Learn it! Don t be a viccm of the NoSQL hype! 8
RelaConal Databases model Data organized as tables Row = record Column = asribute RelaCons between tables Integrity constraints Select Ctle from courses natural join takes_courses group by ClassID having count(*) > 10 9
ACID properces Atomicity TransacCon are all or nothing (e.g. when adding a bi-direcconal friendship relacon, it s added both ways or not at all) Consistency Only valid data wrisen (e.g. cannot say a student takes a course that is not in the courses table) IsolaCon When mulcple transaccons execute simultaneously, they appear as if they were executed sequencally (aka serializability) Durability When data has been wrisen and validated, it is permanent (i.e. no data loss, even in the case of some failures) à Easy life for the developer 10
Why NoSQL then? What does NoSQL mean? No SQL New SQL Not only SQL SQL strong properces limit its ability to scale to very large datasets Relax some properces to deal with larger datasets (ACID) But at what cost? SQL is very structured (each record has the same columns ), Web data ooen is not Semi-structured data Unstructured data Graph data 11
CAP Consistency When mulcple operacons execute simultaneously, it appears as if they were executed one aoer the other (A of ACID) Availability Every request received by a non failed node must be answered ParCCon tolerance System must respond correctly even if network fails 12
CAP theorem Impossible to have 3 simultaneously Choose CA, CP, or AP In a centralized system, no need for P RelaConal databases have CA In a distributed system, you cannot ignore P Distributed databases choose CP or AP 13
CAP intuicon 2 solucons: Refuse to answer in case of parccon Answer but risk inconsistencies A: 3 A: 2 A: 3 B: 5 B: 6 Client 1 ParCCon Client 2 14
NoSQL and CAP 15
Weaker consistency models Eventual consistency When there is no parccon, DB is consistent In case of parccon, DB can return stale data Once parccon is gone, there is a Cme limit on how long it takes for consistency to return Different levels of consistency (consistency / cost tradeoff) Causal consistency Read-your-writes consistency Session consistency Monotonic read consistency Monotonic write consistency à Again, many choices, so many different systems 16
Vector clocks & conflict deteccon 17
Vector clocks & conflict deteccon 18
Vector clocks & conflict deteccon 19
Vector clocks & conflict deteccon 20
Vector clocks & conflict deteccon 21
Vector clocks & conflict deteccon 22
Vector clocks & conflict deteccon 23
Vector clocks & conflict deteccon 24
Vector clocks & conflict deteccon 25
Key/Value store 2 basic operacons, similar to the HashMap data structure Put(K,V) Get(K) Ooen used for caching informacon in memory Facebook uses them a lot 26
Column/Tabular DB Data organized as rows with a primary key Flexible format, each row can have different fields in a column family Relies on HDFS for fault tolerance 27
Document DB Data stored as documents (ooen JSON) Richer than K/V stores Insert Delete Update Find AggregaCon funccons (Map, Reduce ) Indexes 28
Document DB 29
Document DB 30
Graph DB Represent data as graphs Nodes / relaconships with properces as K/V pairs 31
Graph DB: Neo4j Rich data format Query language as pasern matching Limited scalability ReplicaCon to scale reads, writes need to be done to every replica 32