SQL++ Query Language 9/9/2015. Semistructured Databases have Arrived. in the form of JSON. and its use in the FORWARD framework

Size: px

Start display at page:

Download "SQL++ Query Language 9/9/2015. Semistructured Databases have Arrived. in the form of JSON. and its use in the FORWARD framework"

Lora Jenkins
6 years ago
Views:

1 Query Language and its use in the FORWARD framework with the support of National Science Foundation, Google, Informatica, Couchbase Semistructured Databases have Arrived SQL-on-Hadoop NoSQL PIG CQL SQL-on-Hadoop Jaql Others N1QL AQL in the form of JSON location: 'Alpine', readings: time: timestamp(' t20:00:00'), ozone: 0.035, no2: , time: timestamp(' t22:00:00'), ozone: 'm', co: 0.4 1

2 Semistructured Query Languages come along SQL: Expressive Power and Adoption JSON: Semistructured Data but in a Tower of Babel SELECT AVG(temp) AS tavg FROM readings GROUP BY sid SQL db.readings.aggregate( $group: _id: "$sid", tavg: $avg:"$temp") MongoDB readings -> group by sid = $.sid into tavg: avg($.temp) ; Jaql for $r in collection("readings") group by $r.sid return tavg: avg($r.temp) JSONiq a = LOAD 'readings' AS (sid:int, temp:float); b = GROUP a BY sid; c = FOREACH b GENERATE AVG(temp); DUMP c; Pig with limited query abilities Growing out of key-value databases or SQL with special type UDFs Far from the full-fledged power of prior non-relational query languages OQL, Xquery *academic projects being the exception (eg, ASTERIX, FORWARD) 2

3 in lieu of formal semantics Query languages were not meant for Ninjas Outline What is? How we use in UCSD s FORWARD? Introduction to the formal survey of NoSQL, NewSQL and SQL-on-Hadoop : Data model + query language for querying both structured data (SQL) and semistructured data (native JSON) Formal Schemaless Semantics: adjusting tuple calculus to the richness of JSON Configurable : Formal configuration options towards choosing semantics, understanding repercussions Full fledged, declarative, aggressively backwardscompatible with SQL Unifying: 11 popular databases/qls captured by appropriate settings of Configurable 3

4 How is it used in the Big Data industry ASTERIXDB CouchDB heading towards How do we use it in UCSD Automatic Incremental Maintenance Middleware Query Optimization & Execution: FORWARD virtual database query plans based on algebra based on a schemaless algebra Data Model (also JSON++ data model) A superset of SQL and JSON Extend SQL with arrays + nesting + heterogeneity location: 'Alpine', readings: time: timestamp(' t20:00:00'), ozone: 0.035, no2: , time: timestamp(' t22:00:00'), ozone: 'm', co: 0.4 Array nested inside a tuple Heterogeneous tuples in collections Arbitrary compositions of array, bag, tuple 4

5 Data Model (also JSON++ data model) A superset of SQL and JSON Extend SQL with arrays + nesting + heterogeneity + permissive structures 5, 10, 3, 21, 2, 15, 6, 4, lo: 3, exp: 4, hi: 7, 2, 13, 6 Arbitrary compositions of array, bag, tuple Data Model (also JSON++ data model) A superset of SQL and JSON Extend SQL with arrays + nesting + heterogeneity + permissive structures Extend JSON with bags and enriched types location: 'Alpine', readings: time: timestamp(' t20:00:00'), ozone: 0.035, no2: , time: timestamp(' t22:00:00'), ozone: 'm', co: 0.4 Bags Enriched types Exercise: Why the data model is a superset of the SQL data model? a SQL table corresponds to a JSON bag that has homogeneous tuples 5

6 Data Model (also JSON++ data model) A superset of SQL and JSON Extend SQL with arrays + nesting + heterogeneity + permissive structures + permissive schema Extend JSON with bags and enriched types location: 'Alpine', readings: time: timestamp(' t20:00:00'), time: timestamp, ozone: 0.035, ozone: any, no2: *, time: timestamp(' t22:00:00'), ozone: 'm', co: 0.4 location: string, readings: With schema Or, schemaless Or, partial schema From SQL to Goal: Queries that input/output any JSON type, yet backwards compatible with SQL Backwards Compatibility with SQL readings : sid: 2, temp: 70.1, sid: 2, temp: 49.2, sid: 1, temp: null SELECT DISTINCT r.sid FROM readings AS r WHERE r.temp < 50 sid : 2 sid temp null Find sensors that recorded a temperature below 50 sid 2 6

7 From SQL to Goal: Queries that input/output any JSON type, yet backwards compatible with SQL Adjust SQL syntax/semantics for heterogeneity, complex input/output values Achieved with few extensions and mostly by SQL limitation removals From Tuple Variables to Element Variables readings : 1.3, 0.7, 0.3, 0.8 SELECT r AS co FROM readings AS r WHERE r < 1.0 ORDER BY r DESC LIMIT 2 Find the highest two sensor readings that are below 1.0 co: 0.8, co: 0.7 Element Variables readings : 1.3, 0.7, 0.3, 0.8 FROM readings AS r B out FROM = B in WHERE = r : 1.3, r : 0.7, r : 0.3, r : 0.8 WHERE r < 1.0 B out WHERE = B in ORDERBY = r : 0.7, r : 0.3, r : 0.8 ORDER BY r DESC LIMIT 2 B out ORDERBY = B in LIMIT = r : 0.8, r : 0.7, r : 0.3 B out LIMIT = B in SELECT = r : 0.8 r : 0.7 SELECT r AS co co: 0.8, co: 0.7 7

8 Correlated Subqueries Everywhere sensors : 1.3, 2 0.7, 0.7, , 0.8, , 1.4 SELECT r AS co FROM sensors AS s, s AS r WHERE r < 1.0 ORDER BY r DESC LIMIT 2 Find the highest two sensor readings that are below 1.0 co: 0.9, co: 0.8 Correlated Subqueries Everywhere FROM sensors AS s, s AS r sensors : 1.3, 2 0.7, 0.7, , 0.8, , 1.4 B out FROM = B in WHERE = s : 1.3, 2, r : 1.3, s : 1.3, 2, r : 2, s : 0.7, 0.7, 0.9, r : 0.7, s : 0.7, 0.7, 0.9, r : 0.7, s : 0.7, 0.7, 0.9, r : 0.9, s : 0.3, 0.8, 1.1, r : 0.3, s : 0.3, 0.8, 1.1, r : 0.8, s : 0.3, 0.8, 1.1, r : 1.1, s : 0.7, 1.4, r : 0.7, s : 0.7, 1.4, r : 1.4 WHERE r < 1.0 ORDER BY r DESC LIMIT 2 SELECT r AS co Correlated Subqueries Everywhere sensors : 1.3, 2 0.7, 0.7, , 0.8, , 1.4 FROM sensors AS s, s AS r WHERE r < 1.0 B out WHERE = B in ORDERBY = s : 0.7, 0.7, 0.9, r : 0.7, s : 0.7, 0.7, 0.9, r : 0.7, s : 0.7, 0.7, 0.9, r : 0.9, s : 0.3, 0.8, 1.1, r : 0.3, s : 0.3, 0.8, 1.1, r : 0.8, s : 0.7, 1.4, r : 0.7 ORDER BY r DESC LIMIT 2 SELECT r AS co 8

9 Correlated Subqueries Everywhere sensors : 1.3, 2 0.7, 0.7, , 0.8, , 1.4 FROM sensors AS s, s AS r WHERE r < 1.0 ORDER BY r DESC LIMIT 2 B out ORDERBY = B in LIMIT = s : 0.7, 0.7, 0.9, r : 0.9, s : 0.3, 0.8, 1.1, r : 0.8, s : 0.7, 0.7, 0.9, r : 0.7, s : 0.7, 0.7, 0.9, r : 0.7, s : 0.7, 1.4, r : 0.7, s : 0.3, 0.8, 1.1, r : 0.3 B out LIMIT = B in SELECT = s : 0.7, 0.7, 0.9, r : 0.9, s : 0.3, 0.8, 1.1, r : 0.8 SELECT r AS co co: 0.9, co: 0.8 Utilization of Input Order y x sensors : 1.3, 2 0.7, 0.7, , 0.8, , 1.4 SELECT r AS co, rp AS x, sp AS y FROM sensors AS s AT sp, s AS r AT rp WHERE r < 1.0 ORDER BY r DESC LIMIT 2 Find the highest two sensor readings that are below 1.0 co: 0.9, x: 3, y: 2, co: 0.8, x: 2, y: 3 Constructing Nested Results Subqueries Again Γ 0 = sensors : sensor: 1, sensor: 2, logs: sensor: 1, co: 0.4, sensor: 1, co: 0.2, sensor: 2, co: 0.3, FROM sensors AS s B out FROM = B in SELECT = s : sensor: 1, s : sensor: 2 SELECT s.sensor AS sensor, ( SELECT l.co AS co FROM logs AS l WHERE l.sensor = s.sensor ) AS readings sensor : 1, readings: co:0.4, co:0.2, sensor : 2, readings: co:0.3 9

10 Constructing Nested Results Γ 1 = s : sensor : 1, sensors : sensor: 1, sensor: 2, logs: sensor: 1, co: 0.4, sensor: 1, co: 0.2, sensor: 2, co: 0.3, FROM logs AS l B out FROM = B in SELECT = l : sensor: 1, co: 0.4, l : sensor: 1, co: 0.2, l : sensor: 2, co: 0.3 WHERE l.sensor = s.sensor B out FROM = B in SELECT = l : sensor: 1, co: 0.4, l : sensor: 1, co: 0.2 SELECT l.co AS co co:0.4, co:0.2 Iterating over Attribute-Value pairs INNER Vs OUTER Correlation: Capturing Outerjoins, OUTER FLATTEN Γ = readings : co :, no2: 0.7, so2: 0.5, 2 FROM readings AS g:v, v AS n g : "no2", v : 0.7, n : 0.7, g : "so2", v : 0.5, 2, n : 0.5, g : "co", v : 0.5, 2, n : 2 SELECT ELEMENT gas: g, val: v gas: "co", num: null, gas: "no2", num: 0.7, gas: "so2", num: 0.5,, gas: "so2", num: 2 10

11 INNER Vs OUTER Correlation: Capturing Outerjoins, OUTER FLATTEN Γ = readings : co :, no2: 0.7, so2: 0.5, 2 FROM readings AS g:v INNER CORRELATE v AS n g : "no2", v : 0.7, n : 0.7, g : "so2", v : 0.5, 2, n : 0.5, g : "co", v : 0.5, 2, n : 2 SELECT ELEMENT gas: g, val: v gas: "co", num: null, gas: "no2", num: 0.7, gas: "so2", num: 0.5,, gas: "so2", num: 2 INNER Vs OUTER Correlation: Capturing Outerjoins, OUTER FLATTEN Γ = readings : co :, no2: 0.7, so2: 0.5, 2 FROM readings AS g:v OUTER CORRELATE v AS n g : "co", v :, n : missing, g : "no2", v : 0.7, n : 0.7, g : "so2", v : 0.5, 2, n : 0.5, g : "co", v : 0.5, 2, n : 2 SELECT ELEMENT gas: g, val: v gas: "co", num: null, gas: "no2", num: 0.7, gas: "so2", num: 0.5,, gas: "so2", num: 2 The Extensions to SQL Goal: Queries that input/output any JSON type, yet backwards compatible with SQL Adjust SQL syntax/semantics for heterogeneity, complex values Achieved with few extensions and SQL limitation removals Element Variables Correlated Subqueries (in FROM and SELECT clauses) Optional utilization of Input Order Grouping indendent of Aggregation Functions... and a few more details Configurability: making a unifying and configurable language 11

12 Algebras do not need schemaful inputs Γ 1 = s : sensor : 1, sensors : sensor: 1, sensor: sensor 2 co, logs: sensor: 1, co: 2 0.4, 0.3 sensor: 1, co: 0.2, sensor: 2, co: 0.3, FROM logs AS l B out FROM = B in SELECT = l : sensor: 1, co: 0.4, l : sensor: 1, co: 0.2, l : sensor: 2, co: 0.3 WHERE l.sensor = s.sensor B out FROM = B in SELECT = l : sensor: 1, co: 0.4, l : sensor: 1, co: 0.2 SELECT l.co AS co co:0.4, co:0.2 Outline What is? How we use in UCSD s FORWARD? uses in integration Introduction to the formal survey of NoSQL, NewSQL and SQL-on-Hadoop based Virtual Database Client Federated Queries Results FORWARD Virtual Database Query Processor Virtual s of Sources Virtual SQL Wrapper Virtual NewSQL Wrapper Native queries Virtual NoSQL Wrapper Native results Virtual SQL-on- Hadoop Wrapper Virtual Java Objects Wrapper SQL Database NewSQL Database NoSQL Database SQL-on- Hadoop Database Java In-Memory Objects 12

13 Virtual and/or Materialized Client s Queries Results FORWARD Integration Database Definitions Integrated Query Processor Integrated s Integrated Added Value Virtual Virtual Virtual Virtual Virtual SQL Database NewSQL Database NoSQL Database SQL-on- Hadoop Database Java In-Memory Objects Use Case 1: Hide the ++ source features behind SQL views BI Tool SELECT / Report Writer m.sid, t AS temp FROM measurements AS m, SQL query SQL result m.temperatures AS t FORWARD Virtual Database measurements: SQL sid temp or its equivalent measurements: sid: 2, temp: 70.1, sid: 2, temp: 70.2, sid: 1, temp: 71.0 Couchbase query Couchbase results measurements: sid: 2, temperatures: 70.1, 70.2, sid: 1, temperatures: 71.0 Couchbase Use Case 2: Semantics-Aware Pushdown "Sensors that recorded a temperature below 50" SELECT DISTINCT m.sid FROM measurements AS m WHERE m.temp < 50 semantics for m.temp < $match: temp: $lt: 50, $match: temp: $not: null... MongoDB semantics for temp $lt 50 query MongoDB query result MongoDB Virtual Database Virtual (Identical) MongoDB Wrapper MongoDB results MongoDB measurements: sid: 2, temp: 70.1, sid: 2, temp: 49.2, sid: 1, temp: null 13

14 Push-Down Challenges: Limited capabilities & semantic variations How to simulate incompatible semantics/missing features? Sources like SQL and MongoDB cannot execute the entirety of How to efficiently push down computation? Issues automatically handled by the query processor. Plenty of query rewriting problems, including novel ones on semi-structured operators captures Semantics Variations Semantics of "less-than" comparison are different across sources: < sql, < mongodb, etc. Config Parameters to capture these MongoDB ( x < y complex : deep, type_mismatch: false, null_lt_null : false, null_lt_value: false,... ( x < y ) Semantics Variations In NoSQL, NewSQL, SQL-on-Hadoop variation of semantics for: Paths Equality Comparisons And all the operators that use them: Selections Grouping Ordering Set operations Each of these features has a set of config parameters in 14

15 Don t give a standard Teach how to reach a standard Fast evolving space Standard would be premature Configuration options itemize and explain the space options Rationalize discussion Advise on consistent settings of the configuration options Not every configuration option setting is sensible Configuration options have to subscribe to common sense iff x = y then x = y if x = y and y = z then x = z which enables rewritings Have to be consistent with each other Eg, if true AND missing is missing, then false OR missing is missing FORWARD: Incremental Maintenance and Application Visualization layer The Incremental Maintenance functionality: (Materialized) Definition Eg, Couchbase has a JSON web log showing user, list of displayed products and FORWARD produces materialized view product category, count, products: product, count Stream of inserts, deletes, updates on base data Before DB Before IVM Stream Stream The Incremental Maintenance module Automatically and efficiently updates the materialized view to reflect the stream of changes can also enable automatic Incremental Maintenance! With attention to replication of data in views Opportunities by keys After DB After 15

16 FORWARD: Incremental Maintenance and Application Visualization layer Custom dashboards, interactive pages & apps The data models of visualization components (e.g. Google Maps) can be nicely captured with JSON models The pages are (JSON) views! Mashups of the components views feeds and incrementally updates the page views Use case From data to visualization with just & markup Ajax/Javascript visuals with no Ajax/Javascript mess How to easily connect to today s JS libraries Custom Ajax visualizations & interfaces for IT personnel FORWARD: Incremental Maintenance and Application Visualization layer (part of) the Google Map model <% unit google.maps.maps %> markers: position: latitude : number, longitude: number... <% end unit %> Outline What is? How we use in UCSD s FORWARD? uses in integration and (live) analytics applications Introduction to the formal survey of NoSQL, NewSQL and SQLon-Hadoop 16

17 Surveyed Databases SQL-on-Hadoop NoSQL Hive Pig Jaql SQL AQL MongoJDBC CQL MongoDB JSONiq N1QL BigQuery (IBM) (AsterixDB) (Cassandra) (Couchbase) (Google) Algebraic FLWOR SQL-like Created JSON-like Based Previously standard on syntax at XQuery "Dremel" Facebook Hadoop Also Research MongoDB Most 28msec No SQL-like production comprises used clusters cluster(s) syntax database via NoSQL JDBC release NewSQL database (UCI) driver PIG CQL SQL-on-Hadoop Jaql Others N1QL SQL & NewSQL AQL MongoDB driver Removes Superficial Differences covers SQL, N1QL and QL research prototypes (e.g., UCI s ASTERIX) Removing the current Tower of Babel effect Providing formal syntax and semantics SELECT AVG(temp) AS tavg FROM readings GROUP BY sid SQL readings -> group by sid = $.sid into tavg: avg($.temp) ; Jaql for $r in collection("readings") group by $r.sid return tavg: avg($r.temp) JSONiq db.readings.aggregate( $group: _id: "$sid", tavg: $avg:"$temp") MongoDB a = LOAD 'readings' AS (sid:int, temp:float); b = GROUP a BY sid; c = FOREACH b GENERATE AVG(temp); DUMP c; Pig Surveyed features 15 feature matrices (1-11 dimensions each) classifying: Data values Schemas Access and construct nested data Missing information Equality semantics Ordering semantics Aggregation Joins Set operators Extensibility 17

18 Outline What is? How we use in UCSD s FORWARD? Introduction to the formal survey of NoSQL, NewSQL and SQLon-Hadoop Methodology Example 1: data model (data values) Example 2: query language (SELECT clause) Example 3: semantics (path) Example 4: semantics (equality function) Methodology For each feature: 1. A formal definition of the feature in 2. A example 3. A feature matrix that classifies each dimension of the feature 4. A discussion of the results, partial support and unexpected behaviors All the results are empirically validated Example: Data values 1. example: location: 'Alpine', readings: time: timestamp(' t20:00:00'), ozone: 0.035, no2: , time: timestamp(' t22:00:00'), ozone: 'm', co:

19 Example: Data values 2. BNF for values: Example: Data values 3. Feature matrix: Composability (top-level values) Heterogeneity Arrays Bags Sets Maps Tuples Primitives Bag of tuples No Yes No No Partial Yes Yes Hive Jaql Any Value Yes Yes No No No Yes Yes Pig Bag of tuples Partial No Partial No Partial Yes Yes CQL Bag of tuples No Partial No Partial Partial No Yes JSONiq Any Value Yes Yes No No No Yes Yes MongoDB Bag of tuples Yes Yes No No No Yes Yes N1QL Bag of tuples Yes Yes No No No Yes Yes SQL Bag of tuples No No No No No No Yes AQL Any Value Yes Yes Yes No No Yes Yes BigQuery Bag of tuples No No No No No Yes Yes MongoJDBC Bag of tuples Yes Yes No No No Yes Yes Any Value Yes Yes Yes Partial Yes Yes Yes Example: Data values 4. Discussion of the results: Column-by-column comparison Partial support (65k scalar elements) Identify clusters (who supports JSON?) Composability (top-level values) Heterogeneity Arrays Bags Sets Maps Tuples Primitives Bag of tuples No Yes No No Partial Yes Yes Hive Jaql Any Value Yes Yes No No No Yes Yes Pig Bag of tuples Partial No Partial No Partial Yes Yes CQL Bag of tuples No Partial No Partial Partial No Yes JSONiq Any Value Yes Yes No No No Yes Yes MongoDB Bag of tuples Yes Yes No No No Yes Yes N1QL Bag of tuples Yes Yes No No No Yes Yes SQL Bag of tuples No No No No No No Yes AQL Any Value Yes Yes Yes No No Yes Yes BigQuery Bag of tuples No No No No No Yes Yes MongoJDBC Bag of tuples Yes Yes No No No Yes Yes Any Value Yes Yes Yes Partial Yes Yes Yes 19

20 Example: SELECT clause 1. example: Projecting nested collections: SELECT TUPLE s.lat, s.long, (SELECT r.ozone FROM readings AS r WHERE r.location = s.location) FROM sensors AS s "Position and (nested) ozone readings of each sensor?" Projecting non-tuples: SELECT ELEMENT ozone FROM readings "Bag of all the (scalar) ozone readings?" Example: SELECT clause 3. Feature matrix: Projecting tuples containing nested collections Projecting non-tuples Hive Partial No Jaql Yes Yes Pig Partial No CQL No No JSONiq Yes Yes MongoDB Partial Partial N1QL Partial Partial SQL No No AQL Yes Yes BigQuery No No MongoJDBC No No Yes Yes Example: SELECT clause 4. Discussion of the results: Not well supported features 3 languages support them entirely (same cluster as for data values) Projecting tuples containing nested collections Projecting non-tuples Hive Partial No Jaql Yes Yes Pig Partial No CQL No No JSONiq Yes Yes MongoDB Partial Partial N1QL Partial Partial SQL No No AQL Yes Yes BigQuery No No MongoJDBC No No Yes Yes 20

21 Config Parameters We use config parameters to encompass and compare various semantics of a feature: Minimal number of independent dimensions 1 dimension = 1 config parameter formalism parametrized by the config parameters Feature matrix classifies the values of each config parameter Example: Paths Config parameters for tuple absent : missing, type_mismatch: error, ( x.y ) Example: Paths The feature matrix classifies, for each language, the value of each config parameter: Missing Type mismatch Hive Error Error Jaql Null Error# Pig Error Error CQL Error Error JSONiq Missing# Missing# MongoDB Missing Missing N1QL Missing Missing SQL Error Error AQL Null Error BigQuery Error Error MongoJDBC 21

22 Example: Equality Function Config parameters for complex : error, type_mismatch : false, null_eq_null : null, null_eq_value : null, missing_eq_missing: missing, missing_eq_value : missing, missing_eq_null : missing ( x = y ) Example: Equality Function The feature matrix classifies, for each language, the value of each config parameter: No real cluster Some languages have multiple (incompatible) equality functions Some edge cases cannot happen due to other limitations (SQL has no complex values) Complex Type mismatch Null = Null Null = Value Missing = Missing Missing = Value Missing = Null Hive (=, <=>) Err, Err Err, Err Null, True Null, False N/A N/A N/A Jaql Boolean Null Null Null N/A N/A N/A Pig Boolean partial Err Null Null N/A N/A N/A CQL Err Err Err False/Null N/A N/A N/A JSONiq Boolean Err, False True, True False, False Missing, True Missing, False Missing, False (=, deep-equal()) Err, MongoDB Boolean False True False True False False N1QL Boolean False Null Null Missing Missing Missing SQL N/A Err Null Null N/A N/A N/A AQL Err Err Null Null N/A N/A N/A BigQuery Err Err Null Null N/A N/A N/A MongoJDBC Boolean False True False N/A @equal How to use this survey? As a database user: Understand the semantics of a (often underspecified) query language / be aware of the limitation of a database As a designer/architect of a database Produce formal specification of your query language Align semantics with SQL's As a database researcher The results might change, but the survey methodology stays As a designer/architect of database middleware Understand what capability variations need to be encapsulated and simulated 22

23 The survey shows: The marketing clusters do not correspond to real capabilities Limited capabilities: matrices are sparse and fragmented (more pressure on source-specific rewriters and distributor) The Future is Semi-Structured and Declarative Scalability Flexibility of Semistructured Data Power of Declarative: Automatic Optimization, Maintenance The primary operational and the secondary analytics application out of semistructured, declarative platforms 23

The SQL++ Unifying Semi-structured Query Language, and an Expressiveness Benchmark of SQL-on-Hadoop, NoSQL and NewSQL Databases

Noname manuscript No. (will be inserted by the editor) The SQL++ Unifying Semi-structured Query Language, and an Expressiveness Benchmark of SQL-on-Hadoop, NoSQL and NewSQL Databases Kian Win Ong Yannis