Intro To Big Data. John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center. Copyright 2017

Similar documents
A Brief History of Big Data

DATABASE DESIGN II - 1DL400

Introduction to Big Data. NoSQL Databases. Instituto Politécnico de Tomar. Ricardo Campos

Introduction to NoSQL Databases

A NoSQL Introduction for Relational Database Developers. Andrew Karcher Las Vegas SQL Saturday September 12th, 2015

Databases and Big Data Today. CS634 Class 22

CISC 7610 Lecture 2b The beginnings of NoSQL

Stages of Data Processing

CIB Session 12th NoSQL Databases Structures

Presented by Sunnie S Chung CIS 612

Introduction to Graph Databases

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Non-Relational Databases. Pelle Jakovits

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2015 Lecture 14 NoSQL

COSC 416 NoSQL Databases. NoSQL Databases Overview. Dr. Ramon Lawrence University of British Columbia Okanagan

CSE 344 JULY 9 TH NOSQL

Overview. * Some History. * What is NoSQL? * Why NoSQL? * RDBMS vs NoSQL. * NoSQL Taxonomy. *TowardsNewSQL

10 Million Smart Meter Data with Apache HBase

5/2/16. Announcements. NoSQL Motivation. The New Hipster: NoSQL. Serverless. What is the Problem? Database Systems CSE 414

Database Systems CSE 414

CISC 7610 Lecture 4 Approaches to multimedia databases. Topics: Document databases Graph databases Metadata Column databases

High Performance NoSQL with MongoDB

Sources. P. J. Sadalage, M Fowler, NoSQL Distilled, Addison Wesley

Relational databases

Introduction to NoSQL by William McKnight

Research Works to Cope with Big Data Volume and Variety. Jiaheng Lu University of Helsinki, Finland

Study of NoSQL Database Along With Security Comparison

PROFESSIONAL. NoSQL. Shashank Tiwari WILEY. John Wiley & Sons, Inc.

Challenges for Data Driven Systems

Distributed Databases: SQL vs NoSQL

10/18/2017. Announcements. NoSQL Motivation. NoSQL. Serverless Architecture. What is the Problem? Database Systems CSE 414

The age of Big Data Big Data for Oracle Database Professionals

Jargons, Concepts, Scope and Systems. Key Value Stores, Document Stores, Extensible Record Stores. Overview of different scalable relational systems

Distributed Non-Relational Databases. Pelle Jakovits

YeSQL: Battling the NoSQL Hype Cycle with Postgres

Cassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent

SQL in the Hybrid World

CISC 7610 Lecture 4 Approaches to multimedia databases. Topics: Graph databases Neo4j syntax and examples Document databases

Understanding NoSQL Database Implementations

Introduction Aggregate data model Distribution Models Consistency Map-Reduce Types of NoSQL Databases

DEMYSTIFYING BIG DATA WITH RIAK USE CASES. Martin Schneider Basho Technologies!

Databases : Lecture 1 2: Beyond ACID/Relational databases Timothy G. Griffin Lent Term Apologies to Martin Fowler ( NoSQL Distilled )

COSC 304 Introduction to Database Systems. NoSQL Databases. Dr. Ramon Lawrence University of British Columbia Okanagan

Accelerate MySQL for Demanding OLAP and OLTP Use Case with Apache Ignite December 7, 2016

HBase vs Neo4j. Technical overview. Name: Vladan Jovičić CR09 Advanced Scalable Data (Fall, 2017) Ecolé Normale Superiuere de Lyon

NOSQL EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY

Accelerate MySQL for Demanding OLAP and OLTP Use Cases with Apache Ignite. Peter Zaitsev, Denis Magda Santa Clara, California April 25th, 2017

Goal of the presentation is to give an introduction of NoSQL databases, why they are there.

Databases : Lectures 11 and 12: Beyond ACID/Relational databases Timothy G. Griffin Lent Term 2013

NoSQL systems. Lecture 21 (optional) Instructor: Sudeepa Roy. CompSci 516 Data Intensive Computing Systems

Distributed Systems. 29. Distributed Caching Paul Krzyzanowski. Rutgers University. Fall 2014

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

CompSci 516 Database Systems

Road to a Multi-model Database -- making PostgreSQL the most popular and versatile database

NoSQL systems: introduction and data models. Riccardo Torlone Università Roma Tre

Polyglot Persistence in Today s Data World

BIG DATA TECHNOLOGIES: WHAT EVERY MANAGER NEEDS TO KNOW ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29,

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Big Data Architect.

DATA MINING II - 1DL460

Rule 14 Use Databases Appropriately

Using space-filling curves for multidimensional

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns

Distributed Data Store

Database Solution in Cloud Computing

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Accessing other data fdw, dblink, pglogical, plproxy,...

1

A Review Of Non Relational Databases, Their Types, Advantages And Disadvantages

/ Cloud Computing. Recitation 8 October 18, 2016

CS-580K/480K Advanced Topics in Cloud Computing. NoSQL Database

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Creating a Recommender System. An Elasticsearch & Apache Spark approach

Part I What are Databases?

Agenda. AWS Database Services Traditional vs AWS Data services model Amazon RDS Redshift DynamoDB ElastiCache

Motivation Overview of NoSQL space Comparing technologies used Getting hands dirty tutorial section

Facebook, 14 Fast projection index, 84 First database revolution data handling code, 6 DBMS, 6 network and hierarchical model, 6 7

A Journey to DynamoDB

DIVING IN: INSIDE THE DATA CENTER

Introduction to Computer Science. William Hsu Department of Computer Science and Engineering National Taiwan Ocean University

Why NoSQL? Why Riak?

Highly Scalable, Ultra-Fast and Lots of Choices

Big Data Hadoop Course Content

Big Data Analytics using Apache Hadoop and Spark with Scala

Building High Performance Apps using NoSQL. Swami Sivasubramanian General Manager, AWS NoSQL

Data Science and Open Source Software. Iraklis Varlamis Assistant Professor Harokopio University of Athens

MapReduce and Friends

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam

Apache Hadoop Goes Realtime at Facebook. Himanshu Sharma

Hands-on immersion on Big Data tools

CISC 7610 Lecture 5 Distributed multimedia databases. Topics: Scaling up vs out Replication Partitioning CAP Theorem NoSQL NewSQL

OPEN SOURCE DB SYSTEMS TYPES OF DBMS

5/1/17. Announcements. NoSQL Motivation. NoSQL. Serverless Architecture. What is the Problem? Database Systems CSE 414

Database Evolution. DB NoSQL Linked Open Data. L. Vigliano

Chapter 24 NOSQL Databases and Big Data Storage Systems

Lecture 25 Overview. Last Lecture Query optimisation/query execution strategies

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism

A Survey Paper on NoSQL Databases: Key-Value Data Stores and Document Stores

Topics. History. Architecture. MongoDB, Mongoose - RDBMS - SQL. - NoSQL

Class Overview. Two Classes of Database Applications. NoSQL Motivation. RDBMS Review: Client-Server. RDBMS Review: Serverless

Transcription:

Intro To Big Data John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center Copyright 2017

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Wikipedia

Implications of Big Find a tasty appetizer Easy! Find something to use up these oranges grumble A classic amount of small data What if.

Just need a little more sophistication Find all client names starting with A Get all cases from 2007 - grumble Get all cases from 2007 from Clinton County. grumble, grumble. Faster

Even sophistication has its limits. Find books on Modern Physics (DD# 539) Find books by Wheeler where he isn t the first author grumble Your only hope

Less sophisticated, but sometimes better Get all articles from 2007. Chronologically or geologically organized. Familiar to some of you at tax time. Get all papers on fault tolerance grumble and cough Indexing will determine your individual performance. Teamwork can scale that up.

How much is Big Data? 1 LOC = 10 TB * Library of Congress *Actually, a silly estimate. The original reference actually mentions a more accurate 208TB, and in 2013 the digital collection alone was 3PB.

Size 1000 Genomes Project AWS hosted 260TB Common Crawl Soon to be hosted on Bridges 300-800TB+ A better sense of biggish Throughput Square Kilometer Array Building now Exabyte of raw data/day compressed to 10PB Internet of Things (IoT) / motes Endless streaming Records GDELT (Global Database of Events, Language, and Tone) 250M rows and 59 fields (BigTable) during periods with relatively little content, maximal translation accuracy can be achieved, with accuracy linearly degraded as needed to cope with increases in volume in order to ensure that translation always finishes within the 15 minute window. and prioritizes the highest quality material, accepting that lower-quality material may have a lower-quality translation to stay within the available time window.

If it is all about the queries vs. the data, what options do we have? Let s map out a few provinces: SQL No SQL Key/Value Document Column Graph Analysis / Machine Learning

Good Ol SQL MySQL, Postgres, Oracle, etc. SELECT NAME, NUMBER, FROM PHONEBOOK Why it wasn t fashionable: Payroll Inventory Name Number Address Phonebook Product Number Address Name Number Address Schemas set in stone: Need to define before we can add data Not a fit for agile development Queries often require accessing multiple indexes and joining and sorting multiple tables Sharding isn t trivial Caching is tough ACID (Atomicity,Consistency,Isolation,Durability) in a Transaction is costly.

NoSQL Everything Else? Not only SQL Was effectively NoACID Now maybe NoRelational

Key-Value Redis, Memcached, Amazon DynamoDB, Riak, Ehcache GET foo Certainly agile (no schema) Certainly scalable (linear in most ways: hardware, storage, cost) Good hash might deliver fast lookup Sharding, backup, etc. could be simple foo bar 2 fast 6 0 9 0 0 9 text pic 1055 stuff bar foo Often used for session information: online games, shopping carts GET cart:joe:15~4~7~0723

GET foo Document Cassandra, CouchDB, MongoDB Value must be an object the DB can understand Common are: XML, JSON, Binary JSON and nested thereof This allows server side operations on the data GET plant=daisy Can be quite complex: Linq query, JavaScript function Different DB s have different update/staleness paradigms foo 2 6 JSON 9 XML 0 Binary JSON bar JSON XML OBJECT 12 XML XML <CATALOG> <PLANT> <COMMON>Bloodroot</COMMON> <BOTANICAL>Sanguinaria canadensis</botanical> <ZONE>4</ZONE> <LIGHT>Mostly Shady</LIGHT> <PRICE>$2.44</PRICE> <AVAILABILITY>031599</AVAILABILITY> </PLANT> <PLANT> <COMMON>Columbine</COMMON> <BOTANICAL>Aquilegia canadensis</botanical> <ZONE>3</ZONE> <LIGHT>Mostly Shady</LIGHT> <PRICE>$9.37</PRICE> <AVAILABILITY>030699</AVAILABILITY> </PLANT>..

Wide Column Stores Google BigTable, Cassandra, HBase SELECT Name, Occupation FROM People WHERE key IN (199, 200, 207); No predefined schema Can think of this as a 2-D key-value store: the value may be a key-value store itself Different databases aggregate data differently on disk with different optimizations Key Joe Email: joe@gmail Web: www.joe.com Fred Phone: 412-555-3412 Email: fred@yahoo.com Address: 200 S. Main Street Julia Email: julia@apple.com Mac Phone: 214-555-5847

Graph Neo4J, Titan, GEMS Great for semantic web Great for graphs Can be hard to visualize Serialization can be difficult Queries more complicated From PDX Graph Meetup

SPARQL (W3C Standard) Uses Resource Description Framework format (triple store) RDF Limitations No named graphs No quantifiers or general statements Every page was created by some author Cats meow Requires a schema (RDFS) to define rules "The object of homepage must be a Document. SELECT?name?email WHERE {?person a foaf:person.?person foaf:name?name.?person foaf:mbox?email. } Queries SPARQL, Cypher Cypher (Neo4J only) No longer proprietary Stores whole graph, not just triples Allows for named graphs and general Property Graphs (edges and nodes may have values) SMATCH (Jack:Person { name: Jack Nicolson })-[:ACTED_IN]-(movie:Movie) RETURN movie

Hadoop & Spark What kind of databases are they?

Frameworks for data These are both frameworks for distributing and retrieving data. Hadoop is focused on disk based data and a basic map-reduce scheme, and Spark attempts evolves that in several directions that we will get in to. Both can accommodate multiple types of databases and achieve their performance gains by using parallel workers. You are about to learn a lot more, but here are a few concrete examples: Hadoop HBASE: modeled after BigTable and a natural fit HIVE: SQL-like HiveQL converts to the underlying map/reduce (often slow) Now very passe Spark Spark SQL (moved to star billing as DataFrames!) GraphX MLlib: stats, clustering, optimization, regression, etc.