DATABASE DESIGN II - 1DL400

Similar documents
DATA MINING II - 1DL460

CSE 530A. Non-Relational Databases. Washington University Fall 2013

CISC 7610 Lecture 2b The beginnings of NoSQL

CIB Session 12th NoSQL Databases Structures

COSC 416 NoSQL Databases. NoSQL Databases Overview. Dr. Ramon Lawrence University of British Columbia Okanagan

Introduction to NoSQL Databases

NoSQL systems. Lecture 21 (optional) Instructor: Sudeepa Roy. CompSci 516 Data Intensive Computing Systems

Challenges for Data Driven Systems

Intro To Big Data. John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center. Copyright 2017

COSC 304 Introduction to Database Systems. NoSQL Databases. Dr. Ramon Lawrence University of British Columbia Okanagan

Non-Relational Databases. Pelle Jakovits

Hadoop An Overview. - Socrates CCDH

Presented by Sunnie S Chung CIS 612

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Stages of Data Processing

CompSci 516 Database Systems

Introduction to Big Data. NoSQL Databases. Instituto Politécnico de Tomar. Ricardo Campos

Hadoop Development Introduction

NoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu

Rule 14 Use Databases Appropriately

Why NoSQL? Why Riak?

Understanding NoSQL Database Implementations

PROFESSIONAL. NoSQL. Shashank Tiwari WILEY. John Wiley & Sons, Inc.

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

NOSQL EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY

The NoSQL Ecosystem. Adam Marcus MIT CSAIL

CSE 344 JULY 9 TH NOSQL

Jargons, Concepts, Scope and Systems. Key Value Stores, Document Stores, Extensible Record Stores. Overview of different scalable relational systems

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

Cassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent

Introduction to Computer Science. William Hsu Department of Computer Science and Engineering National Taiwan Ocean University

Big Data Analytics using Apache Hadoop and Spark with Scala

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

MapReduce and Friends

Big Data Architect.

Getting to know. by Michelle Darling August 2013

MI-PDB, MIE-PDB: Advanced Database Systems

A BigData Tour HDFS, Ceph and MapReduce

Big Data Hadoop Stack

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Advanced Database Technologies NoSQL: Not only SQL

Sources. P. J. Sadalage, M Fowler, NoSQL Distilled, Addison Wesley

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2015 Lecture 14 NoSQL

Overview. * Some History. * What is NoSQL? * Why NoSQL? * RDBMS vs NoSQL. * NoSQL Taxonomy. *TowardsNewSQL

Webinar Series TMIP VISION

NoSQL Databases. Amir H. Payberah. Swedish Institute of Computer Science. April 10, 2014

A Survey Paper on NoSQL Databases: Key-Value Data Stores and Document Stores

Databases and Big Data Today. CS634 Class 22

A NoSQL Introduction for Relational Database Developers. Andrew Karcher Las Vegas SQL Saturday September 12th, 2015

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Extend NonStop Applications with Cloud-based Services. Phil Ly, TIC Software John Russell, Canam Software

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Distributed Data Store

Introduction to NoSQL by William McKnight

Databases : Lecture 1 2: Beyond ACID/Relational databases Timothy G. Griffin Lent Term Apologies to Martin Fowler ( NoSQL Distilled )

Distributed Databases: SQL vs NoSQL

A MapReduce Relational-Database Index-Selection Tool

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

High Performance NoSQL with MongoDB

Big Data Management and NoSQL Databases

BIG DATA COURSE CONTENT

MapReduce, Hadoop and Spark. Bompotas Agorakis

5/2/16. Announcements. NoSQL Motivation. The New Hipster: NoSQL. Serverless. What is the Problem? Database Systems CSE 414

Database Systems CSE 414

Processing big data with modern applications: Hadoop as DWH backend at Pro7. Dr. Kathrin Spreyer Big data engineer

NoSQL systems: introduction and data models. Riccardo Torlone Università Roma Tre

10/18/2017. Announcements. NoSQL Motivation. NoSQL. Serverless Architecture. What is the Problem? Database Systems CSE 414

Advances in Data Management - NoSQL, NewSQL and Big Data A.Poulovassilis

8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture

Large-Scale GPU programming

NOSQL DATABASE SYSTEMS: DECISION GUIDANCE AND TRENDS. Big Data Technologies: NoSQL DBMS (Decision Guidance) - SoSe

New Approaches to Big Data Processing and Analytics

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism

Introduction to BigData, Hadoop:-

HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation)

Big Data Hadoop Course Content

Tutorial Outline. Map/Reduce vs. DBMS. MR vs. DBMS [DeWitt and Stonebraker 2008] Acknowledgements. MR is a step backwards in database access

Cassandra- A Distributed Database

Data Partitioning and MapReduce

Chapter 24 NOSQL Databases and Big Data Storage Systems

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

Microsoft Big Data and Hadoop

Goal of the presentation is to give an introduction of NoSQL databases, why they are there.

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data

Prototyping Data Intensive Apps: TrendingTopics.org

Tour of Database Platforms as a Service. June 2016 Warner Chaves Christo Kutrovsky Solutions Architect

Safe Harbor Statement

Introduction to Data Management CSE 344

MIS Database Systems.

BIS Database Management Systems.

Intro Cassandra. Adelaide Big Data Meetup.

Oracle NoSQL Database Enterprise Edition, Version 18.1

Parallel Techniques for Big Data. Patrick Valduriez

Database Architectures

Shark: Hive (SQL) on Spark

This is a brief tutorial that explains how to make use of Sqoop in Hadoop ecosystem.

/ Cloud Computing. Recitation 8 October 18, 2016

Big Data with Hadoop Ecosystem

Transcription:

DATABASE DESIGN II - 1DL400 Fall 2016 A second course in database systems http://www.it.uu.se/research/group/udbl/kurser/dbii_ht16 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology, Uppsala University, Uppsala, Sweden 13/12/16 1

Introduction to NoSQL Databases Kjell Orsborn and Tore Risch Department of Information Technology Uppsala University, Uppsala, Sweden 13/12/16 2

DataBase Management Systems, DBMS and Data managers, key/value stores DataBase Management Systems e.g. Oracle, DB2, SQL Server, MySQL Data managers, key/value stores e.g. BerkeleyDB, Hadoop, Riak, Hbase, Cassandra, DynamoDB Dynamic SQL Queries Java/C++/JavaScript programs DBMS Query Processor Data manager Data Manager Meta data Stored Data Stored Data 13/12/16 3

Evolution of DBMS technology Distributed databases MapReduce HIVE Google F1 SQL NoSQL SQL- 1960 1970 1980 1990 2000 2010 Files IMS CODASYL RDB Object Stores (key/value) 0011001.. ORDB Federated Streaming data Cloud databases Highly Distributed databases DSMS BIG table F1 Databases Web sources NoSQL SQL NoSQL SQL+ SQL+ New QL SQL- SQL+ 13/12/16 4

Kinds of DBMS support Query (SQL) Relational DBMSs Object-Relational DBMSs No Query (NoSQL) File systems (scalable) Storage managers Object Stores Simple Data Complex data 13/12/16 5

Classification of Modern Database applications Complex Queries SQL, New QL Business operations Business analytics Personal db Multimedia search Custom Datatypes Data streams Web analytics Simple Queries SQL- Web shop E-business Web search No Queries NoSQL Text, data logs Simple computations Web documents CAD system Complex computations Web surfing Simple Data Complex data 13/12/16 6

Kind of database support Complex Queries SQL, New QL Relational DBMS Object-Relational DBMS Data Stream Mgmt. Syst. Google F1 Simple Queries SQL- No Queries NoSQL Google App Engine (GQL) Amazon SimpleDB Microsoft Azure File systems Content Mgmt. Syst. Google search engine MongoDB Object Stores Google BigTable Yahoo Hadoop CouchDB Simple Data Complex data 13/12/16 7

What is a NoSQL Database? A key/value store Basic index manager, no complete query language E.g. Google BigTable, Amazon Dynamo An extensible column-based database E.g. Cassandra, Hbase A web document database For web documents, not for small business transactions E.g. MongoDB, CouchDB A DBMS with a limited query language Provides for high volume small business transactions Sometimes called cloud databases E.g. Google App Engine, Microsoft Azure, Amazon SimpleDB; MongoDB 13/12/16 8

Who is using them? 13/12/16 9

What is a NoSQL Database? A DBMS where MapReduce is used instead of queries Manual programs to iterate over entire data sets E.g. Hadoop, CouchDB, Dynamo A MapReduce engine with a query language on top: HIVE on top of Hadoop provides HiveQL Provides non-procedural data analytics (select from groupby) without detailed programming Executed in batch as parallel Hadoop jobs A DBMS with a new query language for new applications Streambase, Virtuoso, Neo4J, Amos II, Google F1 Other non-relational databases Including Object Stores, array databases, etc. 13/12/16 10

NoSQL Characteristics Highly distributed and parallel architectures Typically runs on data centers This is similar to parallel databases! Highly scalable systems by compromised consistency No 2-phase commit as in distributed databases Eventual consistency Or perhaps never consistency Similar options available in modern DBMSs too Puts burden on programmer to handle consistency! Race conditions Recovery Customized implementation of transactions WARNING: Consistency best handled in kernel! Mainly suitable for applications not needing consistency 13/12/16 11

NoSQL Characteristics New query languages for new applications SQL- for cloud databases Many simple web applications do not need full SQL Simple SQL permits high scalability Familiar model Full SQL too complex for new systems Graph query languages SPARQL for RDF RDF data model For searching linked data (http://linkeddata.org/) Stream query languages CQL Variant of SQL for streams AmosQL (UU) Functional parallel data stream query language 13/12/16 12

MapReduce Parallel batch processing using MapReduce Many NoSQL databases uses MapReduce for parallel batch processing of data stored in data centers Highly scalable implementation of parallel batch processing of same (e.g. Java) program over large amounts of data stored in different files based on a scalable file system (e.g. HDFS) The MapReduce function: Applies a (costly) user function mapper producing key/value pairs in parallel on many nodes accessing files in a cluster Applies a user aggregate function on the key/value pairs produced by the mapper Very similar to GROUP BY in SQL Read reference article on MapReduce 13/12/16 13

MapReduce File I/O Data file Map Partition Data file Data file Data file Map Map Map Partition Partition Partition Reduce Reduce Output Writer Result file Data file Map Partition Reduce Data file Map Partition 13/12/16 14

Hadoop MapReduce manager architecture 13/12/16 15

Hive system architecture and components. 13/12/16 16

The Hadoop v1 vs. Hadoop v2 schematic. 13/12/16 17

MapReduce code (pseudocode) function map(string name, String document): // name: document name, i.e. HDFS file contents // document: document contents, parsed HDFS file tokens // Can make own parser as preprocessor for each word w in document: emit (w, 1) function reduce(string word, Iterator partialcounts): // word: a word // partialcounts pc: a list of aggregated partial counts (word, cnt) sum = 0; for each pc in partialcounts: sum += ParseInt(pc); emit (word, sum) 13/12/16 18

Input reader MapReduce stages System component that reads files from scalable file system (e.g. HDFS) and sends to map functions applied in parallel Map function Applied in parallel on many different files Parses input file data from HDFS Does some (expensive) computation Emits key value pairs as result Result stored by MapReduce system as file Partition function (optional) Partitions output key/value pairs from map function into groups of key/value pairs to be reduced in parallel Usually hash partitioning Reduce function Iterates over set of key/value pairs to produce a reduced set of key value pairs stored in the file system Compare with: aggregate functions 13/12/16 19

Wordcount in HiveQL FROM ( MAP docs.doctext USING 'python wc_mapper.py' AS (word, cnt) FROM docs CLUSTER BY word ) REDUCE word, cnt USING 'pythonwc_reduce.py'; 13/12/16 20

Wordcount in SQL Alt 1, assume words on documents stored in table: CREATE TABLE docs(id INTEGER, word VARCHAR(20), PRIMARY KEY(id)) The query becomes: SELECT d.word, COUNT(d.word) FROM docs d GROUP BY d.word Problem: DOCS is table, not stored in file Alt 2, use user-defined table function to access documents in file: SELECT d.word, COUNT(d.word) FROM mydocuments( C:/mydocuments ) AS d GROUP BY d.word 13/12/16 21

HIVEQL vs raw mapreduce Raw MapReduce: Java (Python, C++, etc.) program does map and reduce Very common use of MapReduce: Statistics collection over files (count, sum, stdev, etc) HIVEQL handles basic statistics 80% of applications When advanced statistics not supported in HIVEQL (or SQL): Alt 1: User defined aggregate functions in HIVE (UDAF) Can be generally used in other queries too Alt 2: Raw MapReduce Code may be complicated Code cannot be used in queries 13/12/16 22

Reading raw data files to RDBMS Major point with MapReduce: Save time to load database and build index Accessing CSV (Comma Separated Values) file Can treat CSV file as relational table Modern DBMS have bulk load facilities Never use insert to bulk load At least do not do commit after each insert (MySQL autocommit default) Bulk load speed approaches file copy time Orders of magnitude faster than naïve inserts Automatic parallel bulk loading Binary and formatted bulk loading supported Indexes are voluntary and can be built afterwards 13/12/16 23

Relational Databases vs MapReduce Modern RDBMSs have user defined table functions and aggregate functions Can read data from files using UDFs User defined aggregate functions are incremental and parallelized RDBMSs have indexing and high parallelism to provide scalability Notice that it takes time to build index during database loading => May slow down database loading considerable When is HIVE with MapReduce better than RDBs? One shot queries when no indexing is needed Massively parallel very expensive brute force computations Google F1: embarrassingly parallel computations Use highly distributed DBMS to select input to brute force MapReduce jobs 13/12/16 24

10 performance rules for simple databases (CACM June 2011) 1. Look for shared nothing Distributed database design 2. High level languages are good and need not hurt performance Query language with query optimization needed 3. Plan to carefully leverage main memory databases Just running on RAM disk only marginally faster than disk DBMS Need DBMS designed for MM (TimesTen, MySQL Cluster, SAP HANA) 4. High availability and automatic recovery essential for scalability Many nodes => Likelihood of node failure increases drastically Should be possible to recover failed nodes without bringing down DB (e.g. MySQL Cluster,) 5. On-line everything Schema changes (data independence), software upgrades without bringing down database. 13/12/16 25

10 performance rules for simple databases (CACM June 2011) 6. Avoid multi-node operations Expensive having operations over data from many nodes Distributed database design, logical fragmentation, sharding 7. Don t try to build ACID consistency yourself Don t build transactions yourself Eventual consistency may cause very severe headaches! E.g. updates of multiple-nodes may cause severe problems if not supported by transactions 8. Look for administrative simplicity Often very difficult to install DBMS 9. Pay attention to node performance Parallel brute force waists energy and other resources Index! 10. Open source gives you more control over your future 13/12/16 26