Advanced Database Technologies NoSQL: Not only SQL

Similar documents
TDDD43 HT2014: Advanced databases and data models Theme 4: NoSQL, Distributed File System, Map-Reduce

Modern Database Concepts

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

TDDD43 HT2015: Advanced databases and data models Theme 4: NoSQL, Distributed File System, Map-Reduce

CIB Session 12th NoSQL Databases Structures

The NoSQL Movement. FlockDB. CSCI 470: Web Science Keith Vertanen

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Overview. Why MapReduce? What is MapReduce? The Hadoop Distributed File System Cloudera, Inc.

CS 61C: Great Ideas in Computer Architecture. MapReduce

MapReduce: Simplified Data Processing on Large Clusters

The MapReduce Abstraction

Practice and Applications of Data Management CMPSCI 345. Lecture 18: Big Data, Hadoop, and MapReduce

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

CISC 7610 Lecture 2b The beginnings of NoSQL

Distributed Computations MapReduce. adapted from Jeff Dean s slides

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Database System Architectures Parallel DBs, MapReduce, ColumnStores

The MapReduce Framework

Introduction to Data Management CSE 344

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

MapReduce, Hadoop and Spark. Bompotas Agorakis

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases

Parallel Computing: MapReduce Jin, Hai

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab

PROFESSIONAL. NoSQL. Shashank Tiwari WILEY. John Wiley & Sons, Inc.

Cassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

CSE 344 MAY 2 ND MAP/REDUCE

Introduction to Data Management CSE 344

Modern Database Concepts

NoSQL Databases. Amir H. Payberah. Swedish Institute of Computer Science. April 10, 2014

Agenda. Request- Level Parallelism. Agenda. Anatomy of a Web Search. Google Query- Serving Architecture 9/20/10

Parallel Programming Concepts

MI-PDB, MIE-PDB: Advanced Database Systems

CS 655 Advanced Topics in Distributed Systems

Map Reduce. Yerevan.

Hadoop محبوبه دادخواه کارگاه ساالنه آزمایشگاه فناوری وب زمستان 1391

NoSQL Concepts, Techniques & Systems Part 1. Valentina Ivanova IDA, Linköping University

MapReduce: A Programming Model for Large-Scale Distributed Computation

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

Bioinforma)cs Resources - NoSQL -

Database Systems CSE 414

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

L22: SC Report, Map Reduce

Clustering Lecture 8: MapReduce

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

Distributed Data Store

Large-Scale GPU programming

Apache Hadoop Goes Realtime at Facebook. Himanshu Sharma

Jargons, Concepts, Scope and Systems. Key Value Stores, Document Stores, Extensible Record Stores. Overview of different scalable relational systems

L22: NoSQL. CS3200 Database design (sp18 s2) 4/5/2018 Several slides courtesy of Benny Kimelfeld

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

CS-580K/480K Advanced Topics in Cloud Computing. NoSQL Database

MapReduce-style data processing

Next-Generation Cloud Platform

Typical size of data you deal with on a daily basis

The amount of data increases every day Some numbers ( 2012):

2/26/2017. The amount of data increases every day Some numbers ( 2012):

Data Storage Infrastructure at Facebook

In the news. Request- Level Parallelism (RLP) Agenda 10/7/11

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Hadoop/MapReduce Computing Paradigm

Introduction to Data Management CSE 344

MapReduce. Simplified Data Processing on Large Clusters (Without the Agonizing Pain) Presented by Aaron Nathan

Parallel Nested Loops

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

Introduction to Map Reduce

CA485 Ray Walshe NoSQL

Introduction to Hadoop and MapReduce

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

CS 102. Big Data. Spring Big Data Platforms

Data Informatics. Seon Ho Kim, Ph.D.

CompSci 516 Database Systems

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

6.830 Lecture Spark 11/15/2017

Introduction to MapReduce

Distributed File Systems II

A Review Paper on Big data & Hadoop

CSE 530A. Non-Relational Databases. Washington University Fall 2013

CISC 7610 Lecture 5 Distributed multimedia databases. Topics: Scaling up vs out Replication Partitioning CAP Theorem NoSQL NewSQL

10 Million Smart Meter Data with Apache HBase

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

Goal of the presentation is to give an introduction of NoSQL databases, why they are there.

MapReduce. Kiril Valev LMU Kiril Valev (LMU) MapReduce / 35

Distributed Databases: SQL vs NoSQL

1. Introduction to MapReduce

Scalability of web applications

Scalable Web Programming. CS193S - Jan Jannink - 2/25/10

Study of NoSQL Database Along With Security Comparison

CSE-E5430 Scalable Cloud Computing Lecture 9

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing

HBase Solutions at Facebook

NOSQL EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY

Massive Online Analysis - Storm,Spark

A Fast and High Throughput SQL Query System for Big Data

Comparing SQL and NOSQL databases

Transcription:

Advanced Database Technologies NoSQL: Not only SQL Christian Grün Database & Information Systems Group

NoSQL Introduction 30, 40 years history of well-established database technology all in vain? Not at all! But both setups and demands have drastically changed: main memory and CPU speed have exploded, compared to the time when System R (the mother of all RDBMS) was developed at the same time, huge amounts of data are now handled in real-time both data and use cases are getting more and more dynamic social networks (relying on graph data) have gained impressive momentum full-texts have always been treated shabbily by relational DBMS 2

NoSQL: Facebook Statistics royal.pingdom.com/2010/06/18/the-software-behind-facebook 500 million users 570 billion page views per month 3 billion photos uploaded per month 1.2 million photos served per second 25 billion pieces of content (updates, comments) shared every month 50 million server-side operations per second 2008: 10,000 servers; 2009: 30,000, One RDBMS may not be enough to keep this going on! 3

NoSQL: Facebook Architecture Memcached distributed memory caching system caching layer between web and database servers based on a distributed hash table (DHT) HipHop for PHP developed by Facebook to improve scalability compiles PHP to C++ code, which can be better optimized PHP runtime system was reimplemented 4

NoSQL: Facebook Architecture Cassandra developed by Facebook for inbox searching data is automatically replicated to multiple nodes no single point of failure (all nodes are equal) Hadoop/Hive implementation of Google s MapReduce framework performs data-intensive calculations (initially) used by Facebook for data analysis 5

NoSQL: Facebook Architecture Varnish HTTP accelerator, speeds up dynamic web sites Haystack object store, used for storing and retrieving photos BigPipe web page serving system; serves parts of the page (chat, news feed, ) Scribe aggregates log data from different servers 6

NoSQL: Facebook Architecture: Hadoop Cluster hadoopblog.blogspot.com/2010/05/facebook-has-worlds-largest-hadoop.html 21 PB in Data Warehouse cluster, spread across 2000 machines: 1200 machines with 8 cores, 800 machines with 16 cores 12 TB disk space per machine, 32 GB RAM per machine 15 map-reduce tasks per machine Workload daily: 12 TB of compressed data, and 800 TB of scanned data 25,000 map-reduce jobs and 65 million files per day 7

NoSQL: Facebook Architecture www.facebook.com/note.php?note_id=454991608919 Conclusion HBase Hadoop database, used for e-mails, IM and SMS has recently replaced MySQL, Cassandra and few others built on Google s BigTable model...to be continued classical database solutions have turned out to be completely insufficient heterogeneous software architecture is needed to match all requirements 8

NoSQL: Not only SQL do we need a Facebook-like architecture for conventional tasks? Thoughts Depends on problems to be solved: RDBMS are still a great solution for centralized, tabular data sets NoSQL gets interesting if data is heterogeneous and/or too large most NoSQL projects are open source and have open communities code bases are up-to-date (no 30 years old, closed legacy code) they are subject to rapid development and change cannot offer general-purpose solutions yet, as claimed by RDBMS 9

NoSQL: Not only SQL 10 Things: Five Advantages www.techrepublic.com/blog/10things/10-things-you-should-know-about-nosql-databases/1772 Elastic Scaling Economics scaling out: distributing data based on cheap commodity servers instead of buying bigger servers less costs per transaction/second Big Data Flexible Data Models opens new dimensions that non-existing/relaxed data schema cannot be handled with RDBMS structural changes cause no Goodbye DBAs (see you later?) overhead automatic repair, distribution, tuning, 10

NoSQL: Not only SQL 10 Things: Five Challengees www.techrepublic.com/blog/10things/10-things-you-should-know-about-nosql-databases/1772 Maturity Analytics and Business Intelligence still in pre-production phase, key focused on web apps scenarios features yet to be implemented limited ad-hoc querying Support Expertise mostly open source, start-ups, few number of NoSQL experts limited resources or credibility available in the market Administration require lot of skill to install and effort to maintain 11

Definition nosql-database.org Next Generation Databases mostly addressing some of the points: being non-relational, distributed, opensource and horizontally scalable. The original intention has been modern web-scale databases. The movement began early 2009 and is growing rapidly. Often more characteristics apply as: schema-free, easy replication support, simple API, eventually consistent/base (not ACID), a huge data amount, and more. History Stefan Edlich 1998 term was coined by NoSQL RDBMS, developed by Carlo Strozzi 2009 reintroduced on a meetup for open source, distributed, non relational databases, organized by last.fm developer Johan Oskarsson 12

Scalability System can handle growing amounts of data without losing performance. Vertical Scalability (scale up) add resources (more CPUs, more memory) to a single node using more threads to handle a local problem Horizontal Scalability (scale out) add nodes (more computers, servers) to a distributed system gets more and more popular due to low costs for commodity hardware often surpasses scalability of vertical approach 13

CAP Theorem: Consistency, Availability, Partition Tolerance Brewer [2000]: Towards Robust Distributed Systems Consistency after an update, all readers in a distributed system see the same data all nodes are supposed to contain the same data at all times Example single database instance will always be consistent if multiple instances exist, all writes must be duplicated before write operation is completed 14

CAP Theorem: Consistency, Availability, Partition Tolerance Brewer [2000]: Towards Robust Distributed Systems Availability all requests will be answered, regardless of crashes or downtimes Example a single instance has an availability of 100% or 0% two servers may be available 100%, 50%, or 0% Partition Tolerance system continues to operate, even if two sets of servers get isolated Example system gets partitioned if connection between server clusters fails failed connection won t cause troubles if system is tolerant 15

CAP Theorem: Consistency, Availability, Partition Tolerance Brewer [2000]: Towards Robust Distributed Systems Theorem (proven 2002): only 2 of the 3 guarantees can be given in a shared-data system (Positive) consequence: we can concentrate on two challenges ACID properties needed to guarantee consistency and availability BASE properties come into play if availability and partition tolerance is favored 16

ACID: Atomicity, Consistency, Isolation, Durability Härder & Reuter [1983]: Principles of Transaction-Oriented Database Recovery Atomicity all operations in the transaction will complete, or none will Consistency before and after transactions, database will be in a consistent state Isolation operations cannot access data that is currently modified Durability data will not be lost upon completion of a transaction 17

BASE: Basically Available, Soft State, Eventual Consistency Fox et al. [1997]: Cluster-Based Scalable Network Services Basically Available an application works basically all the time (despite partial failures) Soft State is in flux and non-deterministic (changes all the time) Eventual Consistency will be in some consistent state (at some time in future) Why is BASE a reasonable paradigm for e.g. web search? 18

MapReduce Framework Dean & Ghemawat [2004]: MapReduce: Simplified Data Processing on Large Clusters developed by Google to replace old, centralized index structure supports distributed, parallel computing on large data sets inspired by ( not equal to ) the map and fold functions in functional programming: no side effects, deadlocks, race conditions divide-and-conquer paradigm: Map recursively breaks down a problem into sub-problems and distributes it to worker nodes; Reduce receives and combines the sub-answers to solve the problem 19

Map/Fold in XQuery 3.0 map() function: let $func:=function($a){$a*$a} return map($func,1 to 4) Result: 1 4 9 16 fold-left() function: let $func:=function($a,$b){$a*$b} return fold-left($func,1,1 to 4) Same as: ((((1*1)*2)*3)*4) Result: 24 applies a function to all items of a sequence iterates a function over a sequence and builds up a single item 20

MapReduce: Functions Input & Output: each a set of key/value pairs. Programmer specifies two functions: map(in_key, in_value) list(out_key, intermed_value): processes input key/value pair, produces set of intermediate pairs reduce(out_key, list(intermed_value)) list(out_value): combines all intermediate values for a particular key, produces a set of merged output values (usually just one) 21

MapReduce: Example All-time favorite: count word occurrence map(string input_key, String input_value): # input_key: document name # input_value: document contents for each word w in input_value: EmitIntermediate(w, "1") reduce(string output_key, Iterator intermed_values): # output_key: a word # output_values: a list of counts int result = 0 for each v in intermed_values: result += ParseInt(v) Emit(AsString(result)) programmer can focus on map/reduce code framework cares about parallelization, fault tolerance, scheduling, 22

23 MapReduce: Example blog.jteam.nl/2009/08/04/introduction-to-hadoop

MapReduce: Execution Dean & Ghemawat [2004] input files are stored on map worker disks master assigns tasks to workers map workers write on local disk reduce workers retrieve data and write result ( may be recursively repeated) 24

MapReduce: File System often based on a custom file system, which is optimized for distributed access (Google: GFS, Hadoop: HDFS) uses few large files, as data throughput >> access time Summary MapReduce framework can be written for different environments: shared memory, distributed networks, used in a wide range of scenarios: distributed search (grep), creation of word indexes, sorting distributed data, count page links, 25