Data Informatics. Seon Ho Kim, Ph.D.


Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu

NoSQL and Big Data Processing

Database
- Relational databases: the mainstay of business
- Web-based applications caused load spikes
  - Especially true for public-facing e-commerce sites
- Issues with scaling up when the dataset is just too big
- RDBMSs were not designed to be distributed
- Began to look at multi-node database solutions
  - Known as scaling out or horizontal scaling

Short Review of SQL
- Standard language for querying and manipulating data: Structured Query Language
- Data Definition Language (DDL)
  - Create/alter/delete tables and their attributes
- Data Manipulation Language (DML)
  - Query one or more tables (discussed next)
  - Insert/delete/modify tuples in tables

Tables in SQL
Table name: Product
Attribute names: PName, Price, Category, Manufacturer
Tuples or rows:
  PName        Price    Category     Manufacturer
  Gizmo        $19.99   Gadgets      GizmoWorks
  Powergizmo   $29.99   Gadgets      GizmoWorks
  SingleTouch  $149.99  Photography  Canon
  MultiTouch   $203.99  Household    Hitachi

Tables Explained
- The schema of a table is the table name and its attributes: Product(PName, Price, Category, Manufacturer)
- A key is an attribute whose values are unique; the slide underlines the key. Here the key is PName: Product(PName, Price, Category, Manufacturer)

SQL Query
Basic form (plus many, many more bells and whistles):
  SELECT <attributes>
  FROM <one or more relations>
  WHERE <conditions>

Simple SQL Query (selection)
Product:
  PName        Price    Category     Manufacturer
  Gizmo        $19.99   Gadgets      GizmoWorks
  Powergizmo   $29.99   Gadgets      GizmoWorks
  SingleTouch  $149.99  Photography  Canon
  MultiTouch   $203.99  Household    Hitachi
Query:
  SELECT *
  FROM Product
  WHERE Category = 'Gadgets'
Result:
  PName        Price    Category     Manufacturer
  Gizmo        $19.99   Gadgets      GizmoWorks
  Powergizmo   $29.99   Gadgets      GizmoWorks

Simple SQL Query (selection and projection)
Product:
  PName        Price    Category     Manufacturer
  Gizmo        $19.99   Gadgets      GizmoWorks
  Powergizmo   $29.99   Gadgets      GizmoWorks
  SingleTouch  $149.99  Photography  Canon
  MultiTouch   $203.99  Household    Hitachi
Query:
  SELECT PName, Price, Manufacturer
  FROM Product
  WHERE Price > 100
Result:
  PName        Price    Manufacturer
  SingleTouch  $149.99  Canon
  MultiTouch   $203.99  Hitachi
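The two queries above can be tried with a small sketch that uses Python's built-in sqlite3 module. The in-memory database, the column types, and the ORDER BY clauses (added only to make the row order predictable) are assumptions made for illustration; prices are stored as plain numbers without the dollar sign.

```python
import sqlite3

# In-memory database holding the Product table from the slides.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Product ("
    "PName TEXT PRIMARY KEY, Price REAL, Category TEXT, Manufacturer TEXT)"
)
conn.executemany(
    "INSERT INTO Product VALUES (?, ?, ?, ?)",
    [
        ("Gizmo", 19.99, "Gadgets", "GizmoWorks"),
        ("Powergizmo", 29.99, "Gadgets", "GizmoWorks"),
        ("SingleTouch", 149.99, "Photography", "Canon"),
        ("MultiTouch", 203.99, "Household", "Hitachi"),
    ],
)

# Selection: keep only the rows that satisfy the WHERE condition.
gadgets = conn.execute(
    "SELECT * FROM Product WHERE Category = 'Gadgets' ORDER BY PName"
).fetchall()

# Selection and projection: filter the rows, then keep three columns.
expensive = conn.execute(
    "SELECT PName, Price, Manufacturer FROM Product "
    "WHERE Price > 100 ORDER BY Price"
).fetchall()

print(gadgets)
print(expensive)
```

Note that the application states only what it wants; how the rows are located is left to the database engine.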

Joins
Product:
  PName        Price    Category     Manufacturer
  Gizmo        $19.99   Gadgets      GizmoWorks
  Powergizmo   $29.99   Gadgets      GizmoWorks
  SingleTouch  $149.99  Photography  Canon
  MultiTouch   $203.99  Household    Hitachi
Company:
  CName       StockPrice  Country
  GizmoWorks  25          USA
  Canon       65          Japan
  Hitachi     15          Japan
Query:
  SELECT PName, Price
  FROM Product, Company
  WHERE Manufacturer = CName AND Country = 'Japan' AND Price <= 200
Result:
  PName        Price
  SingleTouch  $149.99
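A minimal sketch of the join, again with Python's sqlite3 module, may help to see why only one row survives. The table definitions and column types are assumptions; the data and the query follow the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Company (CName TEXT PRIMARY KEY, StockPrice INTEGER, Country TEXT);
    CREATE TABLE Product (PName TEXT PRIMARY KEY, Price REAL, Category TEXT,
                          Manufacturer TEXT REFERENCES Company(CName));
    INSERT INTO Company VALUES
        ('GizmoWorks', 25, 'USA'), ('Canon', 65, 'Japan'), ('Hitachi', 15, 'Japan');
    INSERT INTO Product VALUES
        ('Gizmo', 19.99, 'Gadgets', 'GizmoWorks'),
        ('Powergizmo', 29.99, 'Gadgets', 'GizmoWorks'),
        ('SingleTouch', 149.99, 'Photography', 'Canon'),
        ('MultiTouch', 203.99, 'Household', 'Hitachi');
    """
)

# Pair each product with its manufacturer's company row, then filter:
# Japanese companies only, and products costing at most 200.
japanese_under_200 = conn.execute(
    "SELECT PName, Price FROM Product, Company "
    "WHERE Manufacturer = CName AND Country = 'Japan' AND Price <= 200"
).fetchall()

print(japanese_under_200)
```

MultiTouch is made by a Japanese company but costs more than 200, and the GizmoWorks products fail the country condition, so SingleTouch is the only match.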

SQL Physical Layer Abstraction
- Applications specify what, not how
- Query optimization engine
- Physical layer can change without modifying applications
  - Create indexes to support queries
  - In-memory databases

Transactions: ACID Properties
- Atomic: All of the work in a transaction completes (commit) or none of it completes.
- Consistent: A transaction transforms the database from one consistent state to another consistent state. Consistency is defined in terms of constraints.
- Isolated: The results of any changes made during a transaction are not visible until the transaction has committed.
- Durable: The results of a committed transaction survive failures.
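Atomicity and constraint-based consistency can be illustrated with a short sqlite3 sketch. The Account table, the names, and the CHECK constraint are hypothetical and are not part of the slides; the point is only that a failed transfer leaves no partial update behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the CHECK constraint defines what "consistent" means here.
conn.execute(
    "CREATE TABLE Account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO Account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE Account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE Account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM Account"))


ok = transfer(conn, "alice", "bob", 30)        # fits within alice's balance
after_ok = balances(conn)
failed = transfer(conn, "alice", "bob", 500)   # would make alice's balance negative
after_failed = balances(conn)

print(ok, after_ok)
print(failed, after_failed)
```

In the second transfer the credit to bob is executed first, yet it does not survive: the violated constraint aborts the transaction and the rollback undoes both updates.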

Trends

OLTP vs. OLAP
We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general, OLTP systems provide the source data for data warehouses, whereas OLAP systems help to analyze it.

Challenges of Scale Differ

Motives Behind NoSQL Big data. Scalability. Data format. Manageability.

Big Data Collect. Store. Organize. Analyze. Share. Data growth outruns the ability to manage it so we need scalable solutions.

Scalability
- Scale up (vertical scalability)
  - Increasing server capacity
  - Adding more CPU, RAM
  - Managing is hard
  - Possible downtime

Scalability
- Scale out (horizontal scalability)
  - Adding servers to an existing system with little effort, aka elastically scalable
  - Bugs, hardware errors, things fail all the time
  - It should become cheaper: cost efficiency
- Shared nothing
  - Use of commodity/cheap hardware
  - Heterogeneous systems
  - Service-oriented architecture, local states
  - Decentralized to reduce bottlenecks
  - Avoid single points of failure
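One common way to spread data over shared-nothing servers is to hash each key to a node. The sketch below is an illustration under stated assumptions, not a technique taken from the slides: the node names, the key format, and the choice of SHA-256 with modulo placement are all hypothetical.

```python
import hashlib


def node_for(key, nodes):
    """Pick the node responsible for key from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


nodes = ["node-a", "node-b", "node-c"]  # hypothetical server names
keys = ["driver:2732", "driver:2946", "driver:3650"]  # hypothetical key format

# Every client computes the same placement without asking a central coordinator.
placement = {key: node_for(key, nodes) for key in keys}
print(placement)
```

Plain modulo placement moves many keys when the number of nodes changes, which is one reason production systems often use consistent hashing or range partitioning instead.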

What is Wrong With RDBMS?
- Nothing. One size fits all? Not really.
- Rigid schema design -> efficiency issues at large scale.
- Harder to scale.
- Replication -> expensive.
- Joins across multiple nodes? Hard.
- How does an RDBMS handle data growth? Hard.
- Need for a DBA.
- Many programmers are already familiar with it -> good.
- Transactions/ACID make development easy -> good.
- Lots of tools to use -> good.

ACID Semantics Atomicity, Consistency, Isolation, Durability Any data store can achieve Atomicity, Isolation and Durability but do you always need consistency? No. By giving up ACID properties, one can achieve higher performance and scalability.

The Perfect Storm
- Large datasets, acceptance of alternatives, and dynamically typed data have come together in a perfect storm
- Not a backlash/rebellion against RDBMS
- SQL is a rich query language that cannot be rivaled by the current list of NoSQL offerings

CAP Theorem
- Three properties of a system: consistency, availability and partition tolerance
- You can have at most two of these three properties for any shared-data system
- To scale out, you have to partition. That leaves either consistency or availability to choose from
  - In almost all cases, you would choose availability over consistency

CAP Theorem (cont.)
Consistency, Availability, Partition tolerance
- C: Once a writer has written, all readers will see that write.
- A: The system is available during software and hardware upgrades and node failures.
- P: The system can continue to operate in the presence of network partitions.
Theorem: You can have at most two of these properties for any shared-data system.

BASE, an ACID Alternative
Almost the opposite of ACID.
- Basically Available: Nodes in a distributed environment can go down, but the whole system shouldn't be affected.
- Soft State (scalable): The state of the system and data changes over time.
- Eventual Consistency: Given enough time, data will be consistent across the distributed system.

A Clash of Cultures
- ACID:
  - Strong consistency
  - Less availability
  - Pessimistic concurrency
  - Complex
- BASE:
  - Availability is the most important thing; willing to sacrifice for this (CAP)
  - Weaker consistency (eventual)
  - Best effort
  - Simple and fast
  - Optimistic

NoSQL Definition
From www.nosql-database.org: Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable. The original intention has been modern web-scale databases. The movement began early 2009 and is growing rapidly. Often more characteristics apply, such as: schema-free, easy replication support, simple API, eventually consistent / BASE (not ACID), a huge data amount, and more.

What is NoSQL?
- Stands for Not Only SQL
- Class of non-relational data storage systems
- Usually do not require a fixed table schema, nor do they use the concept of joins
- All NoSQL offerings relax one or more of the ACID properties (following the CAP theorem)

Why NoSQL?
- For data storage, an RDBMS cannot be the be-all/end-all
- Just as there are different programming languages, we need other data storage tools in the toolbox
- A NoSQL solution is more acceptable to a client now than even a year ago

NoSQL Distinguishing Characteristics
- Large data volumes
  - Google's big data
- Scalable replication and distribution
  - Potentially thousands of machines
  - Potentially distributed around the world
- Queries need to return answers quickly
- Mostly query, few updates
- Asynchronous inserts and updates
- Schema-less
- ACID transaction properties are not needed: BASE
- CAP theorem
- Open source development

What Kinds of NoSQL
NoSQL solutions fall into two major areas:
- Key/value, or the big hash table
  - Amazon S3 (Dynamo)
  - Voldemort
  - Scalaris
  - Memcached (in-memory key/value store)
  - Redis
- Schema-less, which comes in multiple flavors: column-based, document-based or graph-based
  - Cassandra (column-based)
  - CouchDB (document-based)
  - MongoDB (document-based)
  - Neo4J (graph-based)
  - HBase (column-based)

NoSQL Databases The key-value data model is based on a structure composed of two data elements: a key and a value, in which every key has a corresponding value or set of values. The key value model is also referred to as the attribute-value or associative data model. Consider the example shown on the next slide which illustrates a small truck-driving company called Trucks-R-Us. Each of its three drivers has one or more certifications and other general information. Based on this example we can define the following important points:

NoSQL Databases
Data stored using the traditional relational model (- marks a missing value):
  DID   CERT1  CERT2  CERT3  DOB        LICTYPE
  2732  80     -      90     1/24/1962  P
  2946  -      92     90     4/11/1970  -
  3650  86     -      -      6/27/1968  C
In the relational model:
- Each row represents one entity instance
- Each column represents one attribute of the entity
- The values in a column are of the same data type
Data stored using the key-value model (the first four rows describe driver 2732):
  DID   KEY      VALUE
  2732  CERT1    80
  2732  CERT3    90
  2732  DOB      1/24/1962
  2732  LICTYPE  P
  2946  CERT2    92
  2946  CERT3    90
  2946  DOB      4/11/1970
  3650  CERT1    86
  3650  DOB      6/27/1968
  3650  LICTYPE  C
In the key-value model:
- Each row represents one attribute/value of one entity instance
- The key column could represent any of the entity's attributes
- The values in the value column could be of any data type, so the column is generally assigned a long string data type

NoSQL Databases In the key-value model, each row represents one attribute of one entity instance. The key column points to an attribute and the value column contains the actual value for the attribute. The data type of the value column is generally a long string to accommodate the variety of actual data types of the values placed in the column. To add a new entity attribute in the relational model, you need to modify the table definition (schema modification). To add a new attribute in the key-value model, you add a row to the key-value store, which is why it is said to be schema-less.
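The schema-less property can be sketched in a few lines of Python. The rows follow the Trucks-R-Us example; the helper name `entity` and the EMAIL attribute with its value are hypothetical additions for illustration.

```python
# Key-value rows as (DID, KEY, VALUE); every value is kept as a string.
rows = [
    ("2732", "CERT1", "80"),
    ("2732", "CERT3", "90"),
    ("2732", "DOB", "1/24/1962"),
    ("2732", "LICTYPE", "P"),
    ("2946", "CERT2", "92"),
    ("2946", "CERT3", "90"),
    ("2946", "DOB", "4/11/1970"),
    ("3650", "CERT1", "86"),
    ("3650", "DOB", "6/27/1968"),
    ("3650", "LICTYPE", "C"),
]


def entity(rows, did):
    """Collect all key-value rows of one driver into a dict."""
    return {key: value for row_did, key, value in rows if row_did == did}


before = entity(rows, "2946")

# A new attribute is just one more row; no table definition has to change,
# and other drivers are unaffected.
rows.append(("2946", "EMAIL", "driver2946@example.com"))  # hypothetical attribute

after = entity(rows, "2946")
print(before)
print(after)
```

In the relational layout the same change would require altering the table so that every driver row gains an EMAIL column.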

NoSQL Databases NoSQL databases do not store or enforce relationships among entities. The programmer is required to manage the relationships in the code. Also, all data and integrity validations must be done in the code. NoSQL databases use their own native application programming interface (API) with simple data access commands, such as put, read, and delete. Because there is no declarative SQL-like syntax to retrieve data, the program code must take care of retrieving related data in the correct way. Indexing and searches can be difficult. Because the value column in the key-value model could contain many different data types, it is often difficult to create indices on the data. At the same time, searches can become very complex.
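What "the programmer manages the relationships" means in practice can be sketched with a toy store. The class `KeyValueStore`, the key naming scheme, and the truck record are hypothetical; real products each have their own API, and only the put/read/delete shape is taken from the text.

```python
class KeyValueStore:
    """Hypothetical in-memory store exposing only put, read and delete."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def read(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

    def keys(self):
        return list(self._data)


store = KeyValueStore()
store.put("driver:2732", {"dob": "1/24/1962", "truck": "truck:17"})
store.put("driver:2946", {"dob": "4/11/1970"})
store.put("truck:17", {"plate": "ABC-123"})


def truck_of(store, driver_key):
    """Follow the driver -> truck reference by hand; the store knows nothing about it."""
    driver = store.read(driver_key)
    if driver is None:
        return None
    truck_key = driver.get("truck")
    return None if truck_key is None else store.read(truck_key)


truck_before = truck_of(store, "driver:2732")

# Nothing stops the application from deleting a record that is still referenced.
store.delete("truck:17")
truck_after = truck_of(store, "driver:2732")

# Searching by value means scanning every key, since there is no index on values.
born_1962 = [
    key
    for key in store.keys()
    if key.startswith("driver:") and store.read(key)["dob"].endswith("1962")
]

print(truck_before, truck_after, born_1962)
```

The dangling reference after the delete is exactly the kind of integrity problem that a relational database would reject through a foreign key constraint.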

NoSQL Databases
One of the big advantages of NoSQL databases is that they generally use a distributed architecture. NoSQL databases can handle very large volumes of data. In particular, they are suited for sparse data. Sparse data occurs when the number of attributes is very large, but the number of actual data instances is low. Using the truck-driving company as an example, drivers can take any certification exam, but they are not required to take them all. In this case there were three drivers and three possible certificates for each driver, so there are nine possible data points (we illustrated only 5). Extrapolate this to a hospital with 50,000 patients and more than 2,500 possible medical tests that could be performed. We would not expect to see 50,000 x 2,500 data points; in reality we would see far fewer.
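The sparsity argument is simple arithmetic, sketched below. The certification scores are the ones in the driver table shown earlier, with None marking an exam that was not taken; the variable names are chosen for illustration.

```python
# Certification scores per driver; None marks an exam that was not taken.
certs = {
    "2732": {"CERT1": 80, "CERT2": None, "CERT3": 90},
    "2946": {"CERT1": None, "CERT2": 92, "CERT3": 90},
    "3650": {"CERT1": 86, "CERT2": None, "CERT3": None},
}

# Cells a relational table must reserve versus values that actually exist.
possible = sum(len(scores) for scores in certs.values())
recorded = sum(
    score is not None for scores in certs.values() for score in scores.values()
)
fill_ratio = recorded / possible

# Upper bound for the hospital example: every patient taking every test.
hospital_possible = 50_000 * 2_500

print(possible, recorded, round(fill_ratio, 2), hospital_possible)
```

A key-value layout stores only the recorded values, so its size grows with the data that exists rather than with the number of possible attributes.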

NoSQL Databases
Most NoSQL databases are geared toward performance rather than transaction consistency. Enforcing data consistency on a distributed database is a very difficult problem. Distributed databases make copies of data elements at multiple nodes to ensure high availability and fault tolerance. Updating values and requiring consistency among all copies is a very tough problem to solve. NoSQL databases sacrifice this consistency to attain high levels of performance. Some NoSQL databases provide a feature called eventual consistency, which means that updates to data will propagate through the system and eventually all copies will be consistent. With eventual consistency, data are not guaranteed to be consistent across all copies immediately after an update.
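Eventual consistency can be made concrete with a toy simulation. This is a sketch under strong assumptions, not how any particular product works: all writes go to replica 0, propagation is an explicit step, and the class name, version counter, and key format are invented for illustration.

```python
from collections import deque


class EventuallyConsistentStore:
    """Toy model: a write lands on one replica and reaches the others later."""

    def __init__(self, replica_count):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = deque()  # updates not yet delivered to a replica
        self.version = 0

    def write(self, key, value):
        # Acknowledge the write after updating replica 0 only.
        self.version += 1
        self.replicas[0][key] = (self.version, value)
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, self.version, value))

    def read(self, key, replica):
        entry = self.replicas[replica].get(key)
        return None if entry is None else entry[1]

    def propagate(self):
        # Deliver queued updates; a replica keeps the newest version it has seen.
        while self.pending:
            index, key, version, value = self.pending.popleft()
            current = self.replicas[index].get(key)
            if current is None or current[0] < version:
                self.replicas[index][key] = (version, value)


store = EventuallyConsistentStore(replica_count=3)
store.write("2732:CERT1", "80")
store.propagate()

store.write("2732:CERT1", "85")
fresh = store.read("2732:CERT1", replica=0)  # the replica that took the write
stale = store.read("2732:CERT1", replica=2)  # a replica not yet updated

store.propagate()
converged = [store.read("2732:CERT1", replica=i) for i in range(3)]

print(fresh, stale, converged)
```

Between the second write and the next propagation step, two readers can see different values for the same key; after propagation all copies agree again.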

Key/Value Data Model
- Pros:
  - very fast
  - very scalable
  - simple model
  - able to distribute horizontally
- Cons:
  - many data structures (objects) can't be easily modeled as key/value pairs

Schema-Less Data Model
- Pros:
  - Schema-less data model is richer than key/value pairs
  - eventual consistency
  - many are distributed
  - still provide excellent performance and scalability
- Cons:
  - typically no ACID transactions or joins

Common Advantages
- Cheap, easy to implement (open source)
- Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned
  - Down nodes easily replaced
  - No single point of failure
- Easy to distribute
- Don't require a schema
- Can scale up and down
- Relax the data consistency requirement (CAP)

What am I giving up?
- Features:
  - joins
  - group by
  - order by
- ACID transactions
- SQL is sometimes frustrating but still a powerful query language
- Easy integration with other applications that support SQL

Storing and Modifying Data
- Syntax varies
  - HTML
  - JavaScript
  - Etc.
- Asynchronous: inserts and updates do not wait for confirmation
- Versioned
- Optimistic concurrency
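Versioning and optimistic concurrency usually go together: a writer states which version it read, and the store rejects the write if someone else got there first. The sketch below is a minimal illustration; the class `VersionedStore`, its method signatures, and the key format are assumptions rather than any product's API.

```python
class VersionedStore:
    """Toy store where every key carries a version number."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        # Unknown keys report version 0 and no value.
        return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        """Apply the write only if the key is still at expected_version."""
        current_version, _ = self.read(key)
        if current_version != expected_version:
            return False  # someone else wrote first; caller must re-read and retry
        self._data[key] = (current_version + 1, value)
        return True


store = VersionedStore()
store.write("2732:LICTYPE", "P", expected_version=0)

# Two clients read the same version without taking any lock.
version_a, _ = store.read("2732:LICTYPE")
version_b, _ = store.read("2732:LICTYPE")

first = store.write("2732:LICTYPE", "C", expected_version=version_a)
second = store.write("2732:LICTYPE", "P", expected_version=version_b)

# The losing client re-reads the current version and tries again.
version_b, _ = store.read("2732:LICTYPE")
retry = store.write("2732:LICTYPE", "P", expected_version=version_b)
final = store.read("2732:LICTYPE")

print(first, second, retry, final)
```

No client ever waits on a lock; conflicts are detected at write time and resolved by retrying, which is the optimistic counterpart to the pessimistic locking used by many ACID systems.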

Retrieving Data
- Syntax varies
  - No set-based query language
  - Procedural programming languages such as Java, C, etc.
- Application specifies the retrieval path
- No query optimizer
- Quick answer is important
- May not be a single right answer

Open Source Small upfront software costs Suitable for large scale distribution on commodity hardware

NoSQL Summary
- NoSQL databases reject:
  - Overhead of ACID transactions
  - Complexity of SQL
  - Burden of up-front schema design
  - Declarative query expression
  - Yesterday's technology
- Programmer is responsible for:
  - Step-by-step procedural language
  - Navigating the access path

Summary
- SQL databases
  - Predefined schema
  - Standard definition and interface language
  - Tight consistency
  - Well-defined semantics
- NoSQL databases
  - No predefined schema
  - Per-product definition and interface language
  - Getting an answer quickly is more important than getting a correct answer

Web References
- NoSQL -- Your Ultimate Guide to the Non-Relational Universe! http://nosql-database.org/links.html
- NoSQL (RDBMS) http://en.wikipedia.org/wiki/nosql
- Scalable SQL, ACM Queue, Michael Rys, April 19, 2011. http://queue.acm.org/detail.cfm?id=1971597
- A Practical Guide to NoSQL, posted by Denise Miura on March 17, 2011 at http://blogs.marklogic.com/2011/03/17/a-practicalguide-to-nosql/