Extending XQuery with Window Functions

Similar documents
Extending XQuery with Window Functions

Overview: XQuery 3.0

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Stream Schema: Providing and Exploiting Static Metadata for Data Stream Processing

Lecture 21 11/27/2017 Next Lecture: Quiz review & project meetings Streaming & Apache Kafka

MICS Cluster 4 In-network information management

AMAχOS Abstract Machine for Xcerpt

Partition and Compose: Parallel Complex Event Processing. Martin Hirzel, IBM Research Tuesday, 17 July 2012 DEBS

Transactional Stream Processing

1 Dulcian, Inc., 2001 All rights reserved. Oracle9i Data Warehouse Review. Agenda

Building a Scalable Architecture for Web Apps - Part I (Lessons Directi)

XQuery. Leonidas Fegaras University of Texas at Arlington. Web Databases and XML L7: XQuery 1

Compilers and computer architecture From strings to ASTs (2): context free grammars

Pathfinder/MonetDB: A High-Performance Relational Runtime for XQuery

Module 4. Implementation of XQuery. Part 0: Background on relational query processing

Pre-Discussion. XQuery: An XML Query Language. Outline. 1. The story, in brief is. Other query languages. XML vs. Relational Data

Azure-persistence MARTIN MUDRA

Overview. Cayuga. Autometa. Stream Processing. Stream Definition Operators. Discussion. Applications. Stock Markets. Internet of Things

Advanced Database Technologies XQuery

Introduction to Parsing. Lecture 5

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

Recap: Functions as first-class values

XML Query Languages. Yanlei Diao UMass Amherst April 22, Slide content courtesy of Ramakrishnan & Gehrke, Donald Kossmann, and Gerome Miklau

Query Processing and Optimization using Compiler Tools

Crescando: Predictable Performance for Unpredictable Workloads

CD Assignment I. 1. Explain the various phases of the compiler with a simple example.

Course Modules for MCSA: SQL Server 2016 Database Development Training & Certification Course:

Module 4. Implementation of XQuery. Part 2: Data Storage

SQL++ Query Language 9/9/2015. Semistructured Databases have Arrived. in the form of JSON. and its use in the FORWARD framework

XPath. by Klaus Lüthje Lauri Pitkänen

Comprehensive Guide to Evaluating Event Stream Processing Engines

Motivation and basic concepts Storage Principle Query Principle Index Principle Implementation and Results Conclusion

High-Performance Event Processing Bridging the Gap between Low Latency and High Throughput Bernhard Seeger University of Marburg

DSMS Benchmarking. Morten Lindeberg University of Oslo

Data Analytics using MapReduce framework for DB2's Large Scale XML Data Processing

Event Stores (I) [Source: DB-Engines.com, accessed on August 28, 2016]

XQuery Reloaded Contact:

Independent consultant. Oracle ACE Director. Member of OakTable Network. Available for consulting In-house workshops. Performance Troubleshooting

CSE Database Management Systems. York University. Parke Godfrey. Winter CSE-4411M Database Management Systems Godfrey p.

Introduction to Parsing. Lecture 5

Compiler construction

Unifying Big Data Workloads in Apache Spark

Independent consultant. Oracle ACE Director. Member of OakTable Network. Available for consulting In-house workshops. Performance Troubleshooting

XML and Semantic Web Technologies. II. XML / 6. XML Query Language (XQuery)

Webinar Series TMIP VISION

Compilers Project Proposals

Microsoft Azure Stream Analytics

1. true / false By a compiler we mean a program that translates to code that will run natively on some machine.

Table of Contents Chapter 1 - Introduction Chapter 2 - Designing XML Data and Applications Chapter 3 - Designing and Managing XML Storage Objects

Part XII. Mapping XML to Databases. Torsten Grust (WSI) Database-Supported XML Processors Winter 2008/09 321

Database Languages and their Compilers

FOUNDATIONS OF DATABASES AND QUERY LANGUAGES

Faculty of Electrical Engineering, Mathematics, and Computer Science Delft University of Technology

Data Modeling and Databases Ch 10: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

COPYRIGHTED MATERIAL. Contents. Part I: Introduction 1. Chapter 1: What Is XML? 3. Chapter 2: Well-Formed XML 23. Acknowledgments

Data Modeling and Databases Ch 9: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

Towards Expressive Publish/Subscribe Systems

Context-Free Grammar (CFG)

Parallel Patterns for Window-based Stateful Operators on Data Streams: an Algorithmic Skeleton Approach

Relational Databases

R13 SET Discuss how producer-consumer problem and Dining philosopher s problem are solved using concurrency in ADA.

A Distributed Query Engine for XML-QL

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

CS 4240: Compilers and Interpreters Project Phase 1: Scanner and Parser Due Date: October 4 th 2015 (11:59 pm) (via T-square)

Prof. Mohamed Hamada Software Engineering Lab. The University of Aizu Japan

Column Stores vs. Row Stores How Different Are They Really?

Introduction to Semistructured Data and XML. Management of XML and Semistructured Data. Path Expressions

Java Basics. Object Orientated Programming in Java. Benjamin Kenwright

Simplified and fast Fraud Detection with just SQL. developer.oracle.com/c ode

DRAFT A Survey of Event Processing Languages (EPLs)

Yandex.Classifieds. Vadim Tsesko

Compiler construction

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

What s New in MySQL 5.7 Geir Høydalsvik, Sr. Director, MySQL Engineering. Copyright 2015, Oracle and/or its affiliates. All rights reserved.

745: Advanced Database Systems

Column-Stores vs. Row-Stores. How Different are they Really? Arul Bharathi

HANA Performance. Efficient Speed and Scale-out for Real-time BI

Relational Model and Algebra. Introduction to Databases CompSci 316 Fall 2018

20762B: DEVELOPING SQL DATABASES

Advanced Programming Techniques. Database Systems. Christopher Moretti

Database Driven Web 2.0 for the Enterprise

Next Generation Query and Transformation Standards. Priscilla Walmsley Managing Director, Datypic

Data Stream Management and Complex Event Processing in Esper. INF5100, Autumn 2010 Jarle Søberg

XML Systems & Benchmarks

Jennifer Widom. Stanford University

Semantic Analysis. Lecture 9. February 7, 2018

Microsoft. [MS20762]: Developing SQL Databases

R-Trees. Accessing Spatial Data

INF5110 Compiler Construction

Type Inference Systems. Type Judgments. Deriving a Type Judgment. Deriving a Judgment. Hypothetical Type Judgments CS412/CS413

XQuery Optimization Based on Rewriting

Research Collection. XQuery (scripting) debugging IDE and engine support. Master Thesis. ETH Library. Author(s): Petrovay, Gabriel

Indexing Bi-temporal Windows. Chang Ge 1, Martin Kaufmann 2, Lukasz Golab 1, Peter M. Fischer 3, Anil K. Goel 4

Wrap-Up: Data Sharing and the Web

Column-Stores vs. Row-Stores: How Different Are They Really?

Processing Data Streams: An (Incomplete) Tutorial

Developing SQL Databases

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

Compiler Design Spring 2018

XML Projection and Streaming Compared and Contrasted. Michael Kay, Saxonica XML Prague 2017

Transcription:

Extending XQuery with Window Functions Irina Botan, Peter M. Fischer, Dana Florescu*, Donald Kossmann, Tim Kraska, Rokas Tamosevicius ETH Zurich, Oracle*

Elevator Pitch Version of this Talk XQuery can do stream processing now, too! It is easy Single new clause for window bindings Simple extension of data model It is fast Linear Road compliance L=2.0 2

Motivation XML is the data format for communication data (RSS, Atom, Web Services) meta data, logs (XMI, schemas, config files,...) documents (Office, XHTML, ) XQuery is the way to process XML data even if it is not perfect, it is has many nice abilities works well for non-xml: CSV, binary XML,... XQuery Data Model is a good match to streams sequences of items XQuery has HUGE potential, BUT... poor current support for streams/continous queries 3

Example: RSS Feed Filtering Blog postings <item>... <author>john</author>... </item><item>... <author>tom</author>... </item><item>... <author>tom</author>... </item><item>... <author>tom</author>... </item><item>... <author>peter</author>... </item> Not very elegant Return annoying authors: 3 consecutive postings for $first at $i in $blog let $second := $blog[i+1], let $third := $blog[i+2] where $first/author eq $second/author and $first/author eq $third/author return $first/author three-way self-join: bad performance + hard to maintain Very annoying authors : n postings = n-way join 4

Overcoming the Limitations of XQuery 1.0 No (good) way to define a window need to implement windows with self-joins No way to work on infinite sequences infinite sequences are not in XQuery DM no way to run continuous queries => Goal of this work: Extend XQuery new clause to express windows allow infinite sequences in XDM implement extensions in XQuery engine optimizations 5

Overview Motivation Windows for XQuery Continuous XQuery Implementation and Optimization Linear Road Benchmark Summary + Future Work 6

New Window Clause: FORSEQ Extends FLWOR expression of XQuery Generalizes LET and FOR clauses LET $x := $seq - Binds $x once to the whole $seq FOR $x in $seq... - Binds $x iteratively to each item of $seq FORSEQ $x in $seq - Binds $x iteratively to sub-sequences of $seq - Several variants for different types of sub-sequences FOR, LET, FORSEQ can be nested FLOWRExpr ::= (Forseq For Let)+ Where? OrderBy? RETURN Expr 7

Four Variants of FORSEQ WINDOW = contiguous sub-seq. of items 1. TUMBLING WINDOW An item is in zero or one windows (no overlap) 2. SLIDING WINDOW An item is at most the start of a single window (but different windows may overlap) 3. LANDMARK WINDOW Any window (contiguous sub-seq) allowed # windows quadratic with size of input 4. General FORSEQ Any sub-seq allowed # sequences exponential with size of input! Not a window! Cost, Expressiveness 8

RSS Example Revisited - Syntax Annoying authors (3 consecutive postings) in RSS stream: forseq $window in $blog tumbling window start curitem $first when fn:true() end nextitem $lookahead when $first/author ne $lookahead/author where count($window) ge 3 return $first/author START, END specify window boundaries WHEN clauses can take any XQuery expression curitem, nextitem, clauses bind variables for whole FLOWR Complete grammar in paper! 9

RSS Example Revisited - Semantics <item><author>john</author></item> <item><author>tom</author></item> <item><author>tom</author></item> <item><author>tom</author></item> <item><author>peter</author></item> Go through sequence item by item If window is not open, bind variables in start, check start If window open, bind end variables, check end If end true, close window, + window variables Conditions relaxed for sliding, landmark forseq $window in $blog tumbling window start curitem $first when fn:true() end nextitem $lookahead when $first/author ne $lookahead/author where count($window) ge 3 return $first/author Simplified version; refinements for efficiency + corner cases => Predicate-based windows, full generality Open window Closed +bound window 10

Application Areas Overall about 60 use cases specified and implemented Domains ranging over RSS Financial Social networks/sequence operations Stream Toolbox Document formatting/positional grouping => Many use cases go beyond the abilities of relational streaming proposals 11

Overview Motivation Windows for XQuery Continuous XQuery Implementation and Optimization Linear Road Benchmark Summary + Future Work 12

Continuous XQuery Streams are (possibly) infinite e.g., a stream of sensor data, stock ticker,... not allowed in XQuery 1.0: infinite sequences are not part of XDM => Proposed extension allow infinite sequences, new occurrence indicator: ** much less disruptive than SQL stream extensions Example: inform me when temperature > 0 C declare variable $stream as (int)**; for $temp in $stream where $temp ge 0 return <alarm/> 13

XQuery Semantics on Infinite Sequences Blocking expressions (e.g., ORDER BY) not allowed, raise error Non-blocking expressions infinite input -> infinite output (e.g., If-then-else) infinite input -> finite output (e.g., [5]) Some expressions undecidable at compile time (e.g., Quantified expression) We developed derivation rules for all expressions, similar to formalism of updating expressions Short version in the paper, extended version in a tech report (go to mxquery.org) 14

Overview Motivation Windows for XQuery Continuous XQuery Implementation and Optimization Linear Road Benchmark Summary + Future Work 15

Implementation Overview FORSEQ clause parser: add new clause compiler: some clever optimizations runtime system: new iterators + indexing Continuous XQuery parser: add ** occurrence indicator context: annotate functions & operators compiler: data flow analysis (infinite input) optimizations at store, scheduler level possible! Easy to integrate extended existing Java-based, open source engine 16

Optimization: Cheaper Window Remember: cost(tumbling) << cost(sliding) << cost(landmark) forseq $w in $seq sliding tumbling window start curitem $s when $s eq a end curitem $e when $e eq c... Assume (stream) schema knowledge: a, b, c, a, b, c,... Only one open window possible at a time Rewrite to tumbling 17

Additional Optimizations I Predicate Movearound move predicates from where to start/end reduce number of open/bound windows need schema knowledge Indexing Windows needed to handle large number of predicates and/or complex value predicates speed up evaluation of END condition index windows just like any other collection keys on start variable values 18

Additional Optimizations II Improved Pipelining start evaluating WHERE and RETURN clauses even though last item has not been read Hopeless Windows detect windows that can never be closed i.e., END condition is not satisfiable Aggressive Garbage Collection Materialize only items needed for WHERE and RETURN clauses 19

Overview Motivation Windows for XQuery Continuous XQuery Implementation and Optimization Linear Road Benchmark Summary + Future Work 20

Linear Road Benchmark The only established streaming benchmark Models dynamic road pricing scenario toll information, accidents, accounts, as streams historic queries on large database Complex workload streams: window-based aggregation, correlation,... involves response time guarantees (< 5 sec) load factor (L) determines number of inputs/second Compliant Implementations with Results Aurora (L=2.5) IBM Stream Processing Core (L=2.5 on single machine) RDBMS (L=0.5): reference implementation by Aurora 21

Linear Road on MXQuery First attempt: One big XQuery expression bad performance we are not there, yet! Second attempt: 8 XQuery expressions explicitly specify where to materialize Hardware (comparable to Aurora, IBM): Linux box: 1 AMD Opteron 248 processor, 2.2 GHz 2 GB main memory Sun JVM Version 1.5.0.09 Software: MXQuery engine (Java, open source) stream data in main memory historic data in MySQL database no transactions, recoverability, security, etc. 22

Results L up to 2.0 fully compliant we improved a bit since the paper was accepted at L = 2.5: maximum response time 116 sec (5 sec allowed) Aurora, IBM compliant up to L= 2.5 Why are we slower? general-purpose XQuery engine vs. hand-written + hand-tuned query plan No additional DSMS infrastructure (scheduler, ) Java vs. C++ engine XQuery is not the problem!!! 23

Related Work Saxon s item grouping (XIME-P 06) Designed for text document handling (XSLT UC) No nested FORSEQ (multiple streams) No sliding windows (only non-overlapping windows) SQL Extensions (on-going IBM/Oracle work) Semantics of Windows and Continuous Queries STREAM, several surveys in DB + other communities Some recent research projects (SASE, Cayuga) Invent language for smaller problem + fancy algorithm By far not powerful enough 24

Summary and Future Work Extending XQuery for streams/cq is important as important as the corresponding SQL extensions XQuery for programming streams is easy need only small extensions (Windows, DM) XQuery on streams is efficient no performance penalty for XQuery XQuery as efficient as SQL! Future Work rigorous study of optimization techniques stream schema more use cases and examples 25

Thank you for your interest Questions? Contact: peter.fischer@inf.ethz.ch Please visit http://mxquery.org for - the MXQuery engine with FORSEQ - the complete use case document - tech report of infinite XDM semantics 26

Backup Slides 27

Time in XDM What about built-in time (system-time)? XQuery DM is not temporal Predicate-based approach allows time-based windows on application-defined timestamps If needed, also possible to provide user-/systemdefined function which generates timestamps for each incoming item 28

Sliding Window: Moving Average forseq $window in $seq sliding window start position $s when fn:true() end position $e when $e - $s eq 2 return fn:avg($w/rating) Input: (infinite) sequence of posting with rating Output: (infinite) sequence of doubles Average rating of the last three postings Alternative way to express Count-Based Window using position 29

Time-Based Window: Web Log Analysis declare variable $weblog as (entry)** forseq $w in $weblog tumbling window start curitem $s when fn:true() end nextitem $e when $e/tstamp - $s/tstamp gt PT1H return fn:count($w[op eq "login"]) Logins per hour Express time constraints by (normal) predicates on time elements/attributes As efficient as specific data model-based time If system time needed, inject it via function 30

FORSEQ Post-Relational: Positional Grouping <q/> <bullet>one</bullet> <bullet>two</bullet> <x/> => <q/> <list> <bullet>one</bullet> <bullet>two</bullet> </list> <x/> declare variable $seq; forseq $w in $seq tumbling window start curitem $x when true() end nextitem $y when node-name($x) ne node-name($y) return if (string(node-name($x)) eq "bullet") then <list> {$w} </list> else $w 31

General FORSEQ Compute all 2 n -1 possible sub-sequences Input: <a, b, c> Bindings: { <a>, <a,b>, <b>, <a,b,c>, <a,c>,... } Example: match regular expressions Inform me if sensors A (B C) D have reported forseq $w in $sensors where $w[1]/id eq A and ($w[2]/id eq B OR $w[2]/id eq C ) and $w[3]/id eq D return... (N.B.: XQuery Library has powerful reg. expr.) 32

Why not use SQL? SQL does not work on XML data even CSV might be a stretch SQL is too broken extensions do not compose well (fresh start needed) data model: relations to streams mapping is a mess -> careful (limited) SQL extensions for streams SQL works well for niche (maybe Wall Street), but not for masses (Web logs, text/media,...) But, SQL extensions good foundation (thanks!) use cases, window types, techniques very useful 33

Continuous Queries Part Linear Road Benchmark Implementation (data flow) Accident Segments Accidents prev minute I N P U T Car positions Segment Statistics for every minute Historical Queries Part Car Position Car positions to Respond Toll Calculation Segment Tolls Accident Events Toll Events prev minute Historical Tolls Tolls for Balance Balance Balance Query Daily Expenditure Query Result Output Result Output Result Output Result Output Storage XQuery Result to the benchmark validator Input Stream 34

Compatibility with XQuery + Extensions Proposals Groupby: orthogonal Forseq partitions input into contiguous sequences using position Groupby partitions input according to groups having certain value Forseq creates bindings, GroupBy uses bindings Maybe common proposal, but unusual behavior for both interest groups (SQL-GroupBy, Streaming) But: We are open for alternative proposals 35

Partioning Windows? Aka parallel tumbling / splitting / predicate windows Split stream into several streams using a predicate Proposed in relational streaming system Not orthogonal to GROUP BY Wait for Group By, see how or if FORSEQ + Group By can be combined to achieve same effect 36

MXQuery vs. RDBMS, (Stream) SQL vs. XQuery? Speed difference: MXQuery Storage optimized for streaming data, not static data Optimizations for streams available XQuery vs SQL: Conceptually, neither is better in the areas which both support Currently, implementations make the difference 37

FOR and LET with FORSEQ FORSEQ generalizes FOR and LET for $x in $seq... forseq $x in $seq tumbling window start when fn:true() end when fn:true()... let $x := $seq... forseq $x in $seq tumbling window start when fn:true() end when fn:false()... (Nevertheless: FOR and LET still needed.) 38