A Brief History of Big Data

A Brief History of Big Data. John Urbanic, Parallel Computing Scientist, Pittsburgh Supercomputing Center. Copyright 2018

"Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate." (Wikipedia)

Once there was only small data... A classic amount of small data. "Find a tasty appetizer": Easy! "Find something to use up these oranges": grumble. What if.

Less sophisticated is sometimes better. "Get all articles from 2007": chronologically or geologically organized. Familiar to some of you at tax time. "Get all papers on fault tolerance": grumble and cough. Indexing will determine your individual performance. Teamwork can scale that up.

The culmination of centuries... "Find books on Modern Physics" (DD# 539). "Find books by Wheeler where he isn't the first author": grumble. Your only hope.

Then data started to grow. 1956 IBM Model 350: 5 MB of data! But still pricey. Better think about what you want to save.

And finally got BIG. 8TB for $130.
Whys: Storage got cheap, so why not keep it all? Today data is a hot commodity. And we got better at generating it: facebook = 10 TB*, IoT, Science...
*Actually, a silly estimate. The original reference actually mentions a more accurate 208TB, and in 2013 the digital collection alone was 3PB.
Science: Pan-STARRS telescope, http://panstarrs.ifa.hawaii.edu/public/ (Wikipedia Commons). Genome sequencers (Wikipedia Commons). Collections, Horniman museum: http://www.horniman.ac.uk/get_involved/blog/bioblitzinsects-reviewed. Legacy documents. Environmental sensors: water temperature profiles from tagged hooded seals, http://www.arctic.noaa.gov/report11/biodiv_whales_walrus.html

A better sense of biggish
Size: 1000 Genomes Project (AWS hosted), 260TB. Common Crawl (hosted on Bridges), 300-800TB+.
Throughput: Square Kilometer Array (building now), an exabyte of raw data/day compressed to 10PB. Internet of Things (IoT) / motes, endless streaming.
Records: GDELT (Global Database of Events, Language, and Tone) (also soon to be hosted on Bridges), 250M rows and 59 fields (BigTable).
during periods with relatively little content, maximal translation accuracy can be achieved, with accuracy linearly degraded as needed to cope with increases in volume in order to ensure that translation always finishes within the 15 minute window.
and prioritizes the highest quality material, accepting that lower-quality material may have a lower-quality translation to stay within the available time window.

Good Ol' SQL couldn't keep up.
Oracle
SELECT NAME, NUMBER FROM PHONEBOOK
Why it wasn't fashionable: Schemas set in stone: need to define before we can add data. Not a fit for agile development ("What do you mean we didn't plan to keep logs of everyone's heartbeat?"). Queries often require accessing multiple indexes and joining and sorting multiple tables. Sharding isn't trivial. Caching is tough. ACID (Atomicity, Consistency, Isolation, Durability) in a transaction is costly.
[Figure: example tables Payroll, Inventory, and Phonebook, with columns such as Name, Number, Address, and Product.]
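A minimal sketch of the "schemas set in stone" point, using Python's built-in sqlite3 module rather than Oracle. The PHONEBOOK table and its NAME and NUMBER columns come from the slide; the sample row and the HEARTBEAT column are hypothetical, chosen only to echo the heartbeat joke.

```python
import sqlite3

# In-memory database; the schema must exist before any row can be stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PHONEBOOK (NAME TEXT, NUMBER TEXT)")
conn.execute("INSERT INTO PHONEBOOK VALUES (?, ?)", ("Joe", "412-555-0100"))  # hypothetical row

# The phonebook query from the slide.
rows = conn.execute("SELECT NAME, NUMBER FROM PHONEBOOK").fetchall()

# Storing a field nobody planned for fails until the schema is changed.
try:
    conn.execute("INSERT INTO PHONEBOOK (NAME, HEARTBEAT) VALUES (?, ?)", ("Joe", 72))
    unplanned_ok = True
except sqlite3.OperationalError:
    unplanned_ok = False

# Only after altering the schema can the new field be stored.
conn.execute("ALTER TABLE PHONEBOOK ADD COLUMN HEARTBEAT INTEGER")
conn.execute("UPDATE PHONEBOOK SET HEARTBEAT = 72 WHERE NAME = 'Joe'")
after = conn.execute("SELECT NAME, NUMBER, HEARTBEAT FROM PHONEBOOK").fetchall()
```

The failed insert is the rigidity the slide complains about: every new kind of data needs a schema change first.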

So we gave up: Key-Value
Redis, Memcached, Amazon DynamoDB, Riak, Ehcache
GET foo
Certainly agile (no schema). Certainly scalable (linear in most ways: hardware, storage, cost). Good hash might deliver fast lookup. Sharding, backup, etc. could be simple. Often used for session information: online games, shopping carts.
GET cart:joe:15~4~7~0723
[Figure: a table of keys (such as foo) and opaque values (bar, 2, fast, 6, 0, 9, text, pic, 1055, stuff).]
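The ideas on this slide (no schema, hash-based lookup, simple sharding) can be sketched as a toy in-memory store. This is an illustration only, not how Redis or DynamoDB are implemented; the class name ShardedKV, the shard count, and the stored session value are assumptions.

```python
import hashlib

class ShardedKV:
    """Toy key-value store: values are opaque, keys are hashed to pick a shard."""

    def __init__(self, num_shards=4):
        # Each shard stands in for a separate machine.
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash (not Python's per-process hash()) keeps placement repeatable.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def set(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key, default=None):
        return self._shard_for(key).get(key, default)

store = ShardedKV()
store.set("foo", "bar")
# Key taken from the slide; the value is a hypothetical session payload.
store.set("cart:joe:15~4~7~0723", {"items": 3})
```

Because the store never looks inside a value, adding a new kind of value needs no schema change, and adding shards spreads keys across more machines.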

How does a pile of unorganized data solve our problems? Sure, giving up ACID buys us a lot of performance, but doesn't our crude organization cost us something? Yes, but remember these guys? This is what they look like today.

NoSQL. It turns out this NoSQL approach can benefit from some old-fashioned organization. Let's look at what types of structure take us beyond simple key-value: Document, Column, Graph, DataFrames, Analysis / Machine Learning.

Document
GET foo
Value must be an object the DB can understand. Common are: XML, JSON, Binary JSON, and nestings thereof. This allows server-side operations on the data:
GET plant=daisy
Can be quite complex: Linq query, JavaScript function. Different DBs have different update/staleness paradigms.
[Figure: key-value diagram for GET foo, with values labeled JSON, XML, Binary JSON, and OBJECT.]
<CATALOG>
  <PLANT>
    <COMMON>Bloodroot</COMMON>
    <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
    <ZONE>4</ZONE>
    <LIGHT>Mostly Shady</LIGHT>
    <PRICE>$2.44</PRICE>
    <AVAILABILITY>031599</AVAILABILITY>
  </PLANT>
  <PLANT>
    <COMMON>Columbine</COMMON>
    <BOTANICAL>Aquilegia canadensis</BOTANICAL>
    <ZONE>3</ZONE>
    <LIGHT>Mostly Shady</LIGHT>
    <PRICE>$9.37</PRICE>
    <AVAILABILITY>030699</AVAILABILITY>
  </PLANT>
  ..
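What "an object the DB can understand" buys can be sketched as follows: because the store parses each value, it can answer queries on fields inside the value. This is a toy illustration, not any real database's API. The DocumentStore class, the plant:1 and plant:2 keys, and the JSON rendering of the slide's plant catalog are assumptions.

```python
import json

class DocumentStore:
    """Toy document store: values are parsed JSON, so the store can query inside them."""

    def __init__(self):
        self.docs = {}

    def put(self, key, json_text):
        # Parsing on write is what separates this from a plain key-value store:
        # the value must be something the store understands.
        self.docs[key] = json.loads(json_text)

    def get(self, key):
        return self.docs[key]

    def find(self, **criteria):
        # "Server-side" query: return the keys of documents whose fields match.
        return sorted(
            key for key, doc in self.docs.items()
            if all(doc.get(field) == value for field, value in criteria.items())
        )

store = DocumentStore()
store.put("plant:1", '{"common": "Bloodroot", "zone": 4, "light": "Mostly Shady"}')
store.put("plant:2", '{"common": "Columbine", "zone": 3, "light": "Mostly Shady"}')
shady = store.find(light="Mostly Shady")
zone3 = store.find(zone=3)
```

A query such as find(zone=3) plays the role of the slide's GET plant=daisy: the lookup is by content, not only by key.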

Wide Column Stores
Google BigTable
SELECT Name, Occupation FROM People WHERE key IN (199, 200, 207);
No predefined schema. Can think of this as a 2-D key-value store: the value may be a key-value store itself. Different databases aggregate data differently on disk, with different optimizations.
Key -> columns
Joe -> Email: joe@gmail; Web: www.joe.com
Fred -> Phone: 412-555-3412; Email: fred@yahoo.com; Address: 200 S. Main Street
Julia -> Email: julia@apple.com
Mac -> Phone: 214-555-5847
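The "2-D key-value store" description can be sketched with a nested dictionary: a row key maps to its own small key-value store of columns, and rows need not share columns. This is a simplified model, not BigTable's actual storage layout. The WideColumnTable class and its method names are assumptions; the sample rows are taken from the slide.

```python
from collections import defaultdict

class WideColumnTable:
    """Toy wide-column table: row key -> {column name -> value}, no fixed schema."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        # Any row may gain any column at any time.
        self.rows[row_key][column] = value

    def select(self, columns, keys):
        # Return only the requested columns that each requested row actually has.
        result = {}
        for key in keys:
            row = self.rows.get(key, {})
            result[key] = {col: row[col] for col in columns if col in row}
        return result

people = WideColumnTable()
people.put("Joe", "Email", "joe@gmail")
people.put("Joe", "Web", "www.joe.com")
people.put("Fred", "Phone", "412-555-3412")
people.put("Fred", "Email", "fred@yahoo.com")
people.put("Fred", "Address", "200 S. Main Street")

contacts = people.select(["Email", "Phone"], ["Joe", "Fred"])
```

Joe has no Phone column, so it is simply absent from his result instead of being stored as a NULL.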

Graph
Titan, GEMS
Great for semantic web. Great for graphs. Can be hard to visualize. Serialization can be difficult. Queries more complicated.
(From PDX Graph Meetup)

Queries: SPARQL, Cypher
SPARQL (W3C Standard): Uses Resource Description Framework (RDF) format (triple store).
RDF limitations: No named graphs. No quantifiers or general statements ("Every page was created by some author", "Cats meow"). Requires a schema (RDFS) to define rules ("The object of homepage must be a Document.").
SELECT ?name ?email
WHERE {
  ?person a foaf:Person .
  ?person foaf:name ?name .
  ?person foaf:mbox ?email .
}
Cypher (Neo4J only): No longer proprietary. Stores whole graph, not just triples. Allows for named graphs and general Property Graphs (edges and nodes may have values).
MATCH (Jack:Person { name: 'Jack Nicholson' })-[:ACTED_IN]-(movie:Movie) RETURN movie
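What a triple store does with the SPARQL query above can be sketched in plain Python: every fact is a (subject, predicate, object) triple, and a query is a set of patterns that must all match for the same subject. This is a hand-rolled illustration, not a SPARQL engine. The sample people and the match and names_and_emails helpers are hypothetical.

```python
# Each fact is a (subject, predicate, object) triple; the data is hypothetical.
triples = {
    ("alice", "rdf:type", "foaf:Person"),
    ("alice", "foaf:name", "Alice"),
    ("alice", "foaf:mbox", "mailto:alice@example.org"),
    ("bob", "rdf:type", "foaf:Person"),
    ("bob", "foaf:name", "Bob"),  # no mbox, so bob will not match the full pattern
}

def match(pattern, facts):
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for triple in facts:
        if all(want is None or want == have for want, have in zip(pattern, triple)):
            yield triple

def names_and_emails(facts):
    """Mimic the slide's query: people that have both a name and a mailbox."""
    results = []
    for person, _, _ in match((None, "rdf:type", "foaf:Person"), facts):
        for _, _, name in match((person, "foaf:name", None), facts):
            for _, _, email in match((person, "foaf:mbox", None), facts):
                results.append((name, email))
    return sorted(results)

found = names_and_emails(triples)
```

The nested loops are the joins: each additional pattern narrows the candidates, which is one reason graph queries get more complicated than key lookups.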

Hadoop & Spark What kind of databases are they?

Frameworks for Data. These are both frameworks for distributing and retrieving data. Hadoop is focused on disk-based data and a basic map-reduce scheme, and Spark evolves that in several directions that we will get into. Both can accommodate multiple types of databases and achieve their performance gains by using parallel workers. We have repurposed many of these blocks to build a better framework. [Figure labels: SQL, DataFrame.] The mother of Hadoop was necessity. It is trendy to ridicule its primitive design, but it was a first step.
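The basic map-reduce scheme mentioned above can be sketched as a word count in three phases: map, shuffle, reduce. This is a single-machine illustration, not Hadoop's or Spark's API. The function names and sample text are assumptions, and the thread pool only stands in for parallel workers (Python threads will not speed up this CPU-bound work).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_phase(chunk):
    # Map: each worker turns its own chunk of text into (word, 1) pairs.
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_chunks):
    # Shuffle: group all values by key so each key can be reduced on its own.
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: collapse the values for one key into a single result.
    return key, sum(values)

def word_count(chunks, workers=2):
    # The map calls are independent, so they can run on separate workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())

counts = word_count(["big data big", "data got BIG"])
```

In a real cluster each chunk would live on a different node's disk and the shuffle would move data across the network, which is where much of the cost lies.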