Estimating Quantiles from the Union of Historical and Streaming Data
|
|
- Brendan Watts
- 5 years ago
- Views:
Transcription
1 Estimating Quantiles from the Union of Historical and Streaming Data Sneha Aman Singh, Iowa State University Divesh Srivastava, AT&T Labs - Research Srikanta Tirthapura, Iowa State University
2 Quantiles Quantile A statistical measure used to characterize distribution of data sets Indicates a point in a data distribution at or below which a given fraction of the entire data set falls 50% data < median Normal distribution Fundamental database query. An operator in many big data tools, such as R and Apache Spark 2
3 φ Quantile of a Set Rank of an element x in a set S is the number of elements with value less than or equal to x S: N = For instance, rank of elements in above set S S (sorted): Ranks φ-quantile is an element of rank φn, for 0<φ<1 Find 0.5-Quantile: Rank of 0.5 quantile = φn = 0.5*11 = 5 0-quantile is the minimum and 1-quantile is the maximum of S. 3
4 Application Web Latency Web Latency delay experienced by a user between sending a request to the web server and getting back a response. Latency Experienced by All the Users Indicator of user experience 0.95 Quantiles 0.95 quantile > Threshold latency -- Track poor network performance 4
5 Outline 1. Description of quantile 2. Historical and Streaming Data 3. Contribution 4. High Level Algorithm Description 5. Experimental Evaluation 5
6 Historical Data Historical Data Data collected and stored for later analysis. Web user requests collected at the server end: to analyze user trends / behavior and web site performance User Requests Web Server User Requests Stored for later analysis Batch Processing System Storage 6
7 Streaming Data Data Stream: Continuous sequence of data observed in the form of events, messages, tuples, in a single pass or limited number of pass. User Requests User Requests Sent for storage Web Server Data observed on the fly Stream Processing System Memory 7
8 Stream Processing System Memory (stores stream state) Update state Retrieve state e Retrieve information 43 e 23 e 12 from e 1 2 e 1 e 2 Data stream Stream Processing Engine Query ex: 0.5 quantile (Approximate) Answer to query Continuous update of the state of the data, with every incoming element 8
9 Integrated Processing of Historical and Streaming Data Long term data analysis with higher accuracy in real time Archived Historical data Live Data stream Timestep 1 Timestep 2 Timestep 3 Timestep 4 Timestep 5 D1 D2 D3 D4 D5 Correlation of historical data with live streaming data Predictive analysis 9
10 Historical vs Stream Processing System Batch Processing System Stream Processing System Data stored and can be traversed multiple times Single or limited pass over data Accurate results of user query J Approximate results of user query L Latency due to disk I/O L Fast processing time J Can we have our cake and eat it too? 10
11 Our Focus: (Approximate) Quantile N : size of data stream Computation ε-approximate quantiles Find an element of rank ρ from a dataset of size N, such that, (φ-ε)n ρ (φ+ε)n, 0 < ε << φ < 1 ε : approximation parameter rank of element e in D : {x D x<e} 11
12 Prior Work on Approximate Quantile Computation Space efficient ε-approximate streaming algorithms: Manku et al [MRL98] with space cost O(1/ε log 2 (εn)) Greenwald-Khanna [GK01] with space cost O(1/ε log(εn)) Qdigest [SBAS04] with space cost O(1/ε log(u)) N: size of data stream, U: universe size, ε: approximation parameter, 0<ε<1 12
13 Outline 1. Description of quantile 2. Historical and Streaming Data 3. Contribution 4. High Level Algorithm Description 5. Experimental Evaluation 13
14 Processing Historical Data + Streaming Data Server Live Data Stream R Stream Processing System continuously maintains a stream state for current timestep, using state-of-the-art streaming algorithm, which is used to generate a summary S(R) upon query Batching Processing System maintains batch summary S(H) (a set of index for faster lookup) Stream summary S(R) Coordinator Batch Data loaded in batch at every timestep, ex: 1 day Time Step Batch summary S(H) Answer Query H R Data Warehouse Historical Data H 14
15 Contribution An algorithm that returns an element of rank ρ as a φ- quantile from a union of historical and streaming data, such that φn-εm ρ φn+εm, m << N compared to φn-εn ρ φn+εn by streaming algorithms The algorithm is memory efficient requiring O memory log(εn) ε m : size of data stream, rank of an element e in S : {x S x<e} ε : approximation parameter, 0<ε<1 T : no of time steps 15
16 Contribution Small processing time for updating historical data in the warehouse at every timestep, using amortized O((m / B)log(n / B)) disk accesses per timestep, Real time querying using amortized O (log 2 (εm / B)log(T )) disk accesses n : size of historical data B : block size of the data warehouse T : no of time steps ε : approximation parameter, 0<ε<1 16
17 3-way Tradeoff between Accuracy, Disk Accesses & Memory Accuracy Batch + External Memory Pure Streaming Batch + Streaming Disk Accesses Memory
18 Comparison of Our Algorithm with Pure Streaming Algorithms Relative Error GK (streaming) Qdigest (streaming) Our Algorithm ε Memory O! log(εn) $ O! log(u) $ O log(εn) # & # & " ε % " ε % ε ε εm N N: total size of historical and streaming data m: data stream size, m << N T: no of time steps U: universe size 18
19 Outline 1. Description of quantile 2. Historical and Streaming Data 3. Contribution 4. High Level Algorithm Description 5. Experimental Evaluation 19
20 High Level Algorithm Steps Live Data Stream R Update SS Memory Retrieve SS Stream Processing System For each element observed in R, in current time step, update stream state SS (reset SS for new time step) On query, generate summary S(R) using SS S(R) S(H) Update S(H) Memory Retrieve S(H) Batch Processing System Collect data during the time step At the end of each time step, load the collected data in H Update batch summary S(H) upon each load Coordinator When a query is posed over ζ = H R 1.Retrieve S(H) & S(R) and use these to generate approximate answer Approximate Answer 2.Retrieve Access data warehouse to improve accuracy of answer Improved Accuracy Additional disk accesses Data Warehouse Historical Data H 20
21 High Level Algorithm Steps Live Data Stream R Update SS Memory Retrieve SS Stream Processing System For each element observed in R, in current time step, update stream state SS (reset SS for each time step) On query, generate summary S(R) using SS S(R) S(H) Update S(H) Memory Retrieve S(H) Batch Processing System Collect data during the time step At the end of each time step, load the collected data in H Update batch summary S(H) upon each load Coordinator When a query is posed over ζ = H R 1.Retrieve S(H) & S(R) and use these to generate approximate answer Approximate Answer 2.Retrieve Access data warehouse to improve accuracy of answer Improved Accuracy Additional disk accesses Data Warehouse Historical Data H 21
22 Historical Data Update at Each Time Step: Approach 1 Merge the newly arrived dataset with older historical data at every time step Timestep 1 Timestep 2 Timestep 3 Timestep 4 Timestep 5 D 1 D 2 D 3 D 4 D 1 sorted D 1 U D 2 sorted D sorted 2 sorted D 1 U D 2 U D 3 D sorted 3 Sort D 1 Merge D 1 and D 2 Sort in sorted D order 2 Merge D 3 with DSort 1 U D 23 in sorted order Disadvantages : Large disk access at each timestep i : Sort D i of size n i : O((n i /B)log(n i /B)) Merge with historical data of size n : O(n/B), B- block size Data Warehouse 22
23 Historical Data Update at Each Time Step: Approach 2 Add the new arrived dataset at each time step as a new partition in data warehouse Timestep 1 Timestep 2 Timestep 3 Timestep 4 Timestep 5 D 1 D 2 D 3 D 4 D 5 D 1 sorted D 2 sorted D 3 sorted D sorted 4 Data Warehouse Sort D 1 Sort D 2 Sort D 3 Sort D 4 Advantage : Small update cost at each timestep i, Sort and store D i of size n i : O((n i /B)log(n i /B)) Disadvantage : Maintainence of several partitions Expensive to answer query - O(d*T) d: no of disk access / timestep to answer a query T: no of timesteps 23
24 Historical Data Update: Our Approach Dataset D i arrives at timestep i Data stored in sorted format in data partitions using levels Merge Threshold = Merge 2 Threshold = Merge 2 Threshold = 2 Level 0 D 2 D 1 D 3 D 4 D 5 D 6 D 7 Level 1 Level 2 D 1 U D 2 D 3 U D 4 D 5 U D 6 Violation of merge threshold! D 1 U D 2 U D 3 U D 4 Data Warehouse Partitions Each level has at most κ data partitions. In the figure, κ=2, κ merge threshold At most log κ T data partitions at the end of T time steps. 24
25 Historical Data Update at Each Time Step: Our Approach D 7 D 5 U D 6 Advantages: Relatively small update cost since the entire historical data is not accessed at each time step Small query time D 1 U D 2 U D 3 U D 4 25
26 High Level Algorithm Steps Live Data Stream R Update SS Memory Retrieve SS Stream Processing System For each element observed in R, in current time step, update stream state SS (reset SS for new time step) On query, generate summary S(R) using SS S(R) S(H) Update S(H) Memory Retrieve S(H) Batch Processing System Collect data during the time step At the end of each time step, load the collected data in H Update batch summary S(H) upon each load Coordinator When a query is posed over ζ = H R 1.Retrieve S(H) & S(R) and use these to generate approximate answer Approximate Answer 2.Retrieve Access data warehouse to improve accuracy of answer Improved Accuracy Additional disk accesses Data Warehouse Historical Data H 26
27 Batch and Stream Summary Batch Summary S(H i ) ((1/ε 1 )+1) elements and their database index, for each disk partition H i of size n i, updated at the end of each time step H i - sorted ε 1 n i 2ε 1 n i n-ε 1 n i n i -3 n i -2 n i -1 S(H i ) - sorted 0 ε 1 n i 2ε 1 n i 3ε 1 n i n-ε 1 n i n i -1 Stream Summary S(R) use stream state SS to get ((1/ε 2 )+1) elements of approximate ranks ε 2 m, 2ε 2 m, 3ε 2 m and so on, with a max error of ε 2 m, created when a query is made S(R) - sorted r 0 r 1 r 2 r m-2 r m-1 rank(r 0, R)=0 ε 2 m rank(r 2, R) 3ε 2 m rank(r m-1, R) = m-1 0<rank(r1, R) 2ε 2 m ε 2 (m-3) rank(rm-2, R) ε 2 (m-1) 27
28 High Level Algorithm Steps Live Data Stream R Update SS Memory Retrieve SS Stream Processing System For each element observed in R, in current time step, update stream state SS (reset SS for new time step) On query, generate summary S(R) using SS S(R) S(H) Update S(H) Memory Retrieve S(H) Batch Processing System Collect data during the time step At the end of each time step, load the collected data in H Update batch summary S(H) upon each load Coordinator When a query is posed over ζ = H R 1.Retrieve S(H) & S(R) and use these to generate approximate answer Approximate Answer 2.Retrieve Access data warehouse to improve accuracy of answer Improved Accuracy Additional disk accesses Data Warehouse Historical Data H 28
29 Answer a Query : Find a Range within which Quantile is Guaranteed to lie H : historical data of size n; R : live data stream of size m; N=(n+m); Find elements u and v from S(H) U S(R), such that the quantile is guaranteed to lie between [u, v] in the union of historical and streaming data Filters S(H) U S(R) in sorted order s 1 s 2 s 3 s 4 s 5 u v s 2(1/ε+1) H U R in sorted order e 1 e 2 e 3 e 4 u v e N-1 e N 2εN elements φ quantile guaranteed to lie here 29
30 High Level Algorithm Steps Live Data Stream R Update SS Memory Retrieve SS Stream Processing System For each element observed in R, in current time step, update stream state SS (reset SS for new time step) On query, generate summary S(R) using SS S(R) S(H) Update S(H) Memory Retrieve S(H) Batch Processing System Collect data during the time step At the end of each time step, load the collected data in H Update batch summary S(H) upon each load Coordinator When a query is posed over ζ = H R 1.Retrieve S(H) & S(R) and use these to generate approximate answer Approximate Answer 2.Retrieve Access data warehouse to improve accuracy of answer Improved Accuracy Additional disk accesses Data Warehouse Historical Data H 30
31 Additional Disk Accesses to Improve Accuracy The filter range [u, v] is narrowed down recursively to zero down on error due to batch. At each recursion, Compute rank of z = (u+v)/2 in all data partitions and stream For each partition H i of size n i S(H i ) : (sorted) Step a. Find p and q, such that & q v e 1 e 2 p q e β 4εn i elements β=(1/ε 1 ) + 1 H i : (sorted) Step b: Access p and q in H i using index from S(H i ) e 1 e 2 p q e ni Step c: Find rank of z (using binary search), p z Rank of z = Σ(rank of z in each partition) + rank of z in data stream Choose [u, z] or [z, v] as the new filter for next recursion, depending on which contains the quantile 31
32 Outline 1. Description of quantile 2. Historical and Streaming Data 3. Problem Statement and Motivation 4. Contribution 5. High Level Algorithm Description 6. Experimental Evaluation 32
33 Experimental Setup Datasets Wikipedia Network traces Uniform Random Normal Data Size (GB) Tuples / time step (million) Algorithms Greenwald-Khanna Qdigest Our Performance Metric Accuracy Processing time Querytime Description Measure relative error : φn - r / φn Runtime and no. of disk accesses Runtime and no. of disk accesses 33
34 Dependence On Memory Update time (seconds) Update time vs Memory Load Sort Relative Error Merge Summary Our Algorithm GK Memory (MB) e-05 1e-06 1e-07 1e-08 1e-09 Our Algorithm Greenwald-Khanna Q-Digest Quick Response 1e Memory (MB) (Merge Threshold apple = 10) QD Query RunTime (seconds) Accuracy vs Memory Query time vs Memory Our Algorithm - Runtime Greenwald-Khanna - Runtime Q-Digest - Runtime Our Algorithm - Disk accesses Query Disk Access Memory (MB) (Merge Threshold apple = 10) Normal Dataset (merge threshold: 10) 34
35 Update time vs Merge Threshold Dependence On Merge Threshold Relative Error e-05 1e-06 1e-07 1e-08 1e-09 1e-10 Relative Error in Practice Relative Error in Theory Merge Threshold apple (Memory fixed at 250MB) Accuracy vs Merge Threshold Query time vs Merge Threshold Update time (seconds) Avg. disk access (overall) Avg. disk access for Merge Sort Load Merge Summary Avg no. of disk accesses / time step Query RunTime (seconds) Our Algorithm - Runtime Our Algorithm - Disk accesses Memory (MB) (Merge Threshold apple = 10) Query Disk Access Normal Dataset (memory fixed: 250 MB) 35
36 Dependence On Historical Data Size Update time vs Historical Data Size Relative Error 1e-07 8e-08 6e-08 4e-08 2e Historical Data Size (GB) Accuracy vs Historical Data Size Query time vs Historical Data Size Average Update RunTime (sec) Runtime Avg Disk accesses (overall) Avg Disk accesses (merge) Normal Dataset (streamind data size: 1GB, memory: 250 MB, merge threshold: 10) Historical Data Size (GB) Average Update Disk Accesses Query RunTime (seconds) Runtime Disk accesses Query Disk Accesses Historical Data Size (GB) 36
37 Dependence On Streaming Data Size Update time vs Data Stream Size Relative Error 1e-08 8e-09 6e-09 4e-09 2e Stream Size (MB) Accuracy vs Data Stream Size Query time vs Data Stream Size Average Update RunTime (sec) Runtime Avg Disk accesses (overall) 5000 Avg Disk accesses (merge) Normal Data (historical data size: 100GB, memory: 250 MB, merge threshold: 10) Stream Size (MB) Average Update Disk Accesses Query RunTime (seconds) Runtime Disk accesses Stream Size (MB) 37 Query Disk Accesses
38 Conclusion We present an algorithm that identifies quantile from a union of historical and streaming data in real time with accuracy significantly better than pure streaming algorithms using memory similar to pure streaming algorithms using few additional disk accesses Future Work Analysis on the tradeoff between accuracy and disk accesses Improve the window adaptation of algorithm Parallel methods for processing historical data Other aggregates over a union of historical and streaming data 38
39 Thank You 39
Estimating quantiles from the union of historical and streaming data
Electrical and Computer Engineering Conference Papers, Posters and Presentations Electrical and Computer Engineering 11-216 Estimating quantiles from the union of historical and streaming data Sneha Aman
More informationarxiv: v1 [cs.db] 24 Aug 2015
UNIVERSITY OF CALIFORNIA, MERCED arxiv:1508.05710v1 [cs.db] 24 Aug 2015 Committee in charge: An Experimental Study of Distributed Quantile Estimation Professor Florin Rusu, Chair Professor Mukesh Singhal
More informationarxiv: v3 [cs.db] 20 Feb 2018
Variance-Optimal Offline and Streaming Stratified Random Sampling Trong Duc Nguyen 1 Ming-Hung Shih 1 Divesh Srivastava 2 Srikanta Tirthapura 1 Bojian Xu 3 arxiv:1801.09039v3 [cs.db] 20 Feb 2018 1 Iowa
More informationEfficient Approximation of Correlated Sums on Data Streams
Efficient Approximation of Correlated Sums on Data Streams Rohit Ananthakrishna Cornell University rohit@cs.cornell.edu Flip Korn AT&T Labs Research flip@research.att.com Abhinandan Das Cornell University
More informationTechniques for online analysis of large distributed data
Graduate Theses and Dissertations Iowa State University Capstones, Theses and Dissertations 2015 Techniques for online analysis of large distributed data Sneha Aman Singh Iowa State University Follow this
More informationEfficient Lists Intersection by CPU- GPU Cooperative Computing
Efficient Lists Intersection by CPU- GPU Cooperative Computing Di Wu, Fan Zhang, Naiyong Ao, Gang Wang, Xiaoguang Liu, Jing Liu Nankai-Baidu Joint Lab, Nankai University Outline Introduction Cooperative
More informationMultimedia Streaming. Mike Zink
Multimedia Streaming Mike Zink Technical Challenges Servers (and proxy caches) storage continuous media streams, e.g.: 4000 movies * 90 minutes * 10 Mbps (DVD) = 27.0 TB 15 Mbps = 40.5 TB 36 Mbps (BluRay)=
More informationFGDEFRAG: A Fine-Grained Defragmentation Approach to Improve Restore Performance
FGDEFRAG: A Fine-Grained Defragmentation Approach to Improve Restore Performance Yujuan Tan, Jian Wen, Zhichao Yan, Hong Jiang, Witawas Srisa-an, Baiping Wang, Hao Luo Outline Background and Motivation
More informationMining Data Streams. From Data-Streams Management System Queries to Knowledge Discovery from continuous and fast-evolving Data Records.
DATA STREAMS MINING Mining Data Streams From Data-Streams Management System Queries to Knowledge Discovery from continuous and fast-evolving Data Records. Hammad Haleem Xavier Plantaz APPLICATIONS Sensors
More informationAdvanced Database Systems
Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed
More informationSketching Probabilistic Data Streams
Sketching Probabilistic Data Streams Graham Cormode AT&T Labs - Research graham@research.att.com Minos Garofalakis Yahoo! Research minos@acm.org Challenge of Uncertain Data Many applications generate data
More informationHow to Win a Hot Dog Eating Contest: Incremental View Maintenance with Batch Updates
How to Win a Hot Dog Eating Contest: Incremental View Maintenance with Batch Updates Milos Nikolic, Mohammad Dashti, Christoph Koch DATA lab, EPFL SIGMOD, 28 th June 2016 REALTIME APPLICATIONS Web Analytics
More informationAn Efficient Transformation for Klee s Measure Problem in the Streaming Model Abstract Given a stream of rectangles over a discrete space, we consider the problem of computing the total number of distinct
More informationStateful Detection in High Throughput Distributed Systems
Stateful Detection in High Throughput Distributed Systems Gunjan Khanna, Ignacio Laguna, Fahad A. Arshad, Saurabh Bagchi Dependable Computing Systems Lab School of Electrical and Computer Engineering Purdue
More informationAnalytics in Spark. Yanlei Diao Tim Hunter. Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig
Analytics in Spark Yanlei Diao Tim Hunter Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig Outline 1. A brief history of Big Data and Spark 2. Technical summary of Spark 3. Unified analytics
More informationPrivApprox. Privacy- Preserving Stream Analytics.
PrivApprox Privacy- Preserving Stream Analytics https://privapprox.github.io Do Le Quoc, Martin Beck, Pramod Bhatotia, Ruichuan Chen, Christof Fetzer, Thorsten Strufe July 2017 Motivation Clients Analysts
More informationYCSB++ Benchmarking Tool Performance Debugging Advanced Features of Scalable Table Stores
YCSB++ Benchmarking Tool Performance Debugging Advanced Features of Scalable Table Stores Swapnil Patil Milo Polte, Wittawat Tantisiriroj, Kai Ren, Lin Xiao, Julio Lopez, Garth Gibson, Adam Fuchs *, Billie
More informationOne-Pass Streaming Algorithms
One-Pass Streaming Algorithms Theory and Practice Complaints and Grievances about theory in practice Disclaimer Experiences with Gigascope. A practitioner s perspective. Will be using my own implementations,
More informationBIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane
BIG DATA Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management Author: Sandesh Deshmane Executive Summary Growing data volumes and real time decision making requirements
More information( , *) one-to-many: e.g., scanning (*, ) many-to-one: e.g., DDoS ( /24, /28) subnet-to-subnet
(1.2.3.4, *) one-to-many: e.g., scanning (*, 5.6.7.8) many-to-one: e.g., DDoS (1.2.3.0/24, 4.5.6.0/28) subnet-to-subnet 2 c φn φ: N: 0.0.0.0/0 0.0.0.0/0 10.1/16 192.168/16 10.1/16 192.168/16 10.1.1/24
More informationPrincipal Software Engineer Red Hat Emerging Technology June 24, 2015
USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton Principal Software Engineer Red Hat Emerging Technology June 24, 2015 ABOUT ME Distributed systems and data science in Red Hat's Emerging
More informationSummarizing and mining inverse distributions on data streams via dynamic inverse sampling
Summarizing and mining inverse distributions on data streams via dynamic inverse sampling Presented by Graham Cormode cormode@bell-labs.com S. Muthukrishnan muthu@cs.rutgers.edu Irina Rozenbaum rozenbau@paul.rutgers.edu
More informationGPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC
GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC MIKE GOWANLOCK NORTHERN ARIZONA UNIVERSITY SCHOOL OF INFORMATICS, COMPUTING & CYBER SYSTEMS BEN KARSIN UNIVERSITY OF HAWAII AT MANOA DEPARTMENT
More informationBuffered Count-Min Sketch on SSD: Theory and Experiments
Buffered Count-Min Sketch on SSD: Theory and Experiments Mayank Goswami, Dzejla Medjedovic, Emina Mekic, and Prashant Pandey ESA 2018, Helsinki, Finland The heavy hitters problem (HH(k)) Given stream of
More informationMining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams
Mining Data Streams Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction Summarization Methods Clustering Data Streams Data Stream Classification Temporal Models CMPT 843, SFU, Martin Ester, 1-06
More informationEnergy Consumption in Mobile Phones: A Measurement Study and Implications for Network Applications (IMC09)
Energy Consumption in Mobile Phones: A Measurement Study and Implications for Network Applications (IMC09) Niranjan Balasubramanian Aruna Balasubramanian Arun Venkataramani University of Massachusetts
More informationApproximate String Joins
Approximate String Joins Divesh Srivastava AT&T Labs-Research The Need for String Joins Substantial amounts of data in existing RDBMSs are strings There is a need to correlate data stored in different
More informationChapter 8 Main Memory
COP 4610: Introduction to Operating Systems (Spring 2014) Chapter 8 Main Memory Zhi Wang Florida State University Contents Background Swapping Contiguous memory allocation Paging Segmentation OS examples
More informationColumn Stores vs. Row Stores How Different Are They Really?
Column Stores vs. Row Stores How Different Are They Really? Daniel J. Abadi (Yale) Samuel R. Madden (MIT) Nabil Hachem (AvantGarde) Presented By : Kanika Nagpal OUTLINE Introduction Motivation Background
More informationChapter 12: Indexing and Hashing. Basic Concepts
Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition
More informationChapter 12: Query Processing. Chapter 12: Query Processing
Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Sorting Join
More informationStreaming k-means Clustering with Fast Queries
Streaming k-means Clustering with Fast Queries Yu Zhang [1] Dept. of Elec. and Computer Eng. Iowa State U., USA yuz1988@iastate.edu Kanat Tangwongsan [] Computer Science Program Mahidol U. International
More informationChapter 12: Indexing and Hashing
Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL
More informationFlash Storage Complementing a Data Lake for Real-Time Insight
Flash Storage Complementing a Data Lake for Real-Time Insight Dr. Sanhita Sarkar Global Director, Analytics Software Development August 7, 2018 Agenda 1 2 3 4 5 Delivering insight along the entire spectrum
More informationAn Efficient Transformation for Klee s Measure Problem in the Streaming Model
An Efficient Transformation for Klee s Measure Problem in the Streaming Model Gokarna Sharma, Costas Busch, Ramachandran Vaidyanathan, Suresh Rai, and Jerry Trahan Louisiana State University Outline of
More informationPutting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21
Big Processing -Parallel Computation COS 418: Distributed Systems Lecture 21 Michael Freedman 2 Ex: Word count using partial aggregation Putting it together 1. Compute word counts from individual files
More informationTools for Social Networking Infrastructures
Tools for Social Networking Infrastructures 1 Cassandra - a decentralised structured storage system Problem : Facebook Inbox Search hundreds of millions of users distributed infrastructure inbox changes
More informationSQL Server 2014 Column Store Indexes. Vivek Sanil Microsoft Sr. Premier Field Engineer
SQL Server 2014 Column Store Indexes Vivek Sanil Microsoft Vivek.sanil@microsoft.com Sr. Premier Field Engineer Trends in the Data Warehousing Space Approximate data volume managed by DW Less than 1TB
More informationPredictive Indexing for Fast Search
Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive
More informationSketching Asynchronous Streams Over a Sliding Window
Sketching Asynchronous Streams Over a Sliding Window Srikanta Tirthapura (Iowa State University) Bojian Xu (Iowa State University) Costas Busch (Rensselaer Polytechnic Institute) 1/32 Data Stream Processing
More informationMining for Patterns and Anomalies in Data Streams. Sampath Kannan University of Pennsylvania
Mining for Patterns and Anomalies in Data Streams Sampath Kannan University of Pennsylvania The Problem Data sizes too large to fit in primary memory Devices with small memory Access times to secondary
More informationPrototyping Data Intensive Apps: TrendingTopics.org
Prototyping Data Intensive Apps: TrendingTopics.org Pete Skomoroch Research Scientist at LinkedIn Consultant at Data Wrangling @peteskomoroch 09/29/09 1 Talk Outline TrendingTopics Overview Wikipedia Page
More informationStaying FIT: Efficient Load Shedding Techniques for Distributed Stream Processing
Staying FIT: Efficient Load Shedding Techniques for Distributed Stream Processing Nesime Tatbul Uğur Çetintemel Stan Zdonik Talk Outline Problem Introduction Approach Overview Advance Planning with an
More informationSandor Heman, Niels Nes, Peter Boncz. Dynamic Bandwidth Sharing. Cooperative Scans: Marcin Zukowski. CWI, Amsterdam VLDB 2007.
Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS Marcin Zukowski Sandor Heman, Niels Nes, Peter Boncz CWI, Amsterdam VLDB 2007 Outline Scans in a DBMS Cooperative Scans Benchmarks DSM version VLDB,
More informationApache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context
1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes
More informationNesnelerin İnternetinde Veri Analizi
Bölüm 4. Frequent Patterns in Data Streams w3.gazi.edu.tr/~suatozdemir What Is Pattern Discovery? What are patterns? Patterns: A set of items, subsequences, or substructures that occur frequently together
More informationOverview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::
Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized
More informationSpace-efficient Online Approximation of Time Series Data: Streams, Amnesia, and Out-of-order
Space-efficient Online Approximation of Time Series Data: Streams, Amnesia, and Out-of-order Luca Foschini joint work with Sorabh Gandhi and Subhash Suri University of California Santa Barbara ICDE 2010
More informationLecture 21 11/27/2017 Next Lecture: Quiz review & project meetings Streaming & Apache Kafka
Lecture 21 11/27/2017 Next Lecture: Quiz review & project meetings Streaming & Apache Kafka What problem does Kafka solve? Provides a way to deliver updates about changes in state from one service to another
More informationData Mining. 3.5 Lazy Learners (Instance-Based Learners) Fall Instructor: Dr. Masoud Yaghini. Lazy Learners
Data Mining 3.5 (Instance-Based Learners) Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction k-nearest-neighbor Classifiers References Introduction Introduction Lazy vs. eager learning Eager
More informationChapter 13: Query Processing
Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing
More informationData Analytics with HPC. Data Streaming
Data Analytics with HPC Data Streaming Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationColumnStore Indexes. מה חדש ב- 2014?SQL Server.
ColumnStore Indexes מה חדש ב- 2014?SQL Server דודאי מאיר meir@valinor.co.il 3 Column vs. row store Row Store (Heap / B-Tree) Column Store (values compressed) ProductID OrderDate Cost ProductID OrderDate
More informationTrack Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross
Track Join Distributed Joins with Minimal Network Traffic Orestis Polychroniou Rajkumar Sen Kenneth A. Ross Local Joins Algorithms Hash Join Sort Merge Join Index Join Nested Loop Join Spilling to disk
More informationParallel Merge Sort Using MPI
Parallel Merge Sort Using MPI CSE 702: Seminar on Programming Massively Parallel Systems Course Instructor: Dr. Russ Miller UB Distinguished Professor Department of Computer Science & Engineering State
More informationChapter 12: Query Processing
Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join
More informationProcessing of big data with Apache Spark
Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT
More informationImproving the Performance of OLAP Queries Using Families of Statistics Trees
Improving the Performance of OLAP Queries Using Families of Statistics Trees Joachim Hammer Dept. of Computer and Information Science University of Florida Lixin Fu Dept. of Mathematical Sciences University
More informationLab - Manage Virtual Memory in Windows 7 and Vista
Lab - Manage Virtual Memory in Windows 7 and Vista Introduction In this lab, you will customize virtual memory settings. Recommended Equipment A computer with Windows 7 or Vista installed The hard drive
More informationCACHE-OBLIVIOUS MAPS. Edward Kmett McGraw Hill Financial. Saturday, October 26, 13
CACHE-OBLIVIOUS MAPS Edward Kmett McGraw Hill Financial CACHE-OBLIVIOUS MAPS Edward Kmett McGraw Hill Financial CACHE-OBLIVIOUS MAPS Indexing and Machine Models Cache-Oblivious Lookahead Arrays Amortization
More informationMultiresolution Motif Discovery in Time Series
Tenth SIAM International Conference on Data Mining Columbus, Ohio, USA Multiresolution Motif Discovery in Time Series NUNO CASTRO PAULO AZEVEDO Department of Informatics University of Minho Portugal April
More informationQuery Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016
Query Processing Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Slides re-used with some modification from www.db-book.com Reference: Database System Concepts, 6 th Ed. By Silberschatz,
More informationInverted Indexes. Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5
Inverted Indexes Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5 Basic Concepts Inverted index: a word-oriented mechanism for indexing a text collection to speed up the
More informationVisualization-supported Analysis of System Data for Controlled VMI-based Intrusion Detection
Visualization-supported Analysis of System Data for Controlled VMI-based Intrusion Detection Noëlle Rakotondravony, Prof. Hans P. Reiser Juniorprofessur für Sicherheit in Informationssystemen Universität
More informationQuery Processing. Introduction to Databases CompSci 316 Fall 2017
Query Processing Introduction to Databases CompSci 316 Fall 2017 2 Announcements (Tue., Nov. 14) Homework #3 sample solution posted in Sakai Homework #4 assigned today; due on 12/05 Project milestone #2
More information! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for
Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and
More informationChapter 13: Query Processing Basic Steps in Query Processing
Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and
More informationFast Approximations for Analyzing Ten Trillion Cells. Filip Buruiana Reimar Hofmann
Fast Approximations for Analyzing Ten Trillion Cells Filip Buruiana (filipb@google.com) Reimar Hofmann (reimar.hofmann@hs-karlsruhe.de) Outline of the Talk Interactive analysis at AdSpam @ Google Trade
More informationHANA Performance. Efficient Speed and Scale-out for Real-time BI
HANA Performance Efficient Speed and Scale-out for Real-time BI 1 HANA Performance: Efficient Speed and Scale-out for Real-time BI Introduction SAP HANA enables organizations to optimize their business
More informationDatabase System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use
Chapter 11: Indexing and Hashing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files Static
More informationChapter 7: Main Memory. Operating System Concepts Essentials 8 th Edition
Chapter 7: Main Memory Operating System Concepts Essentials 8 th Edition Silberschatz, Galvin and Gagne 2011 Chapter 7: Memory Management Background Swapping Contiguous Memory Allocation Paging Structure
More informationEvaluation of Relational Operations
Evaluation of Relational Operations Yanlei Diao UMass Amherst March 13 and 15, 2006 Slides Courtesy of R. Ramakrishnan and J. Gehrke 1 Relational Operations We will consider how to implement: Selection
More informationCharacterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory
Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash Sharma, Mark Santaniello, Justin Meza, Aman Kansal,
More informationOptimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink
Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink Rajesh Bordawekar IBM T. J. Watson Research Center bordaw@us.ibm.com Pidad D Souza IBM Systems pidsouza@in.ibm.com 1 Outline
More informationKey aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling
Key aspects of cloud computing Cluster Scheduling 1. Illusion of infinite computing resources available on demand, eliminating need for up-front provisioning. The elimination of an up-front commitment
More informationShark: Hive (SQL) on Spark
Shark: Hive (SQL) on Spark Reynold Xin UC Berkeley AMP Camp Aug 21, 2012 UC BERKELEY SELECT page_name, SUM(page_views) views FROM wikistats GROUP BY page_name ORDER BY views DESC LIMIT 10; Stage 0: Map-Shuffle-Reduce
More informationUsing the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver
Using the SDACK Architecture to Build a Big Data Product Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver Outline A Threat Analytic Big Data product The SDACK Architecture Akka Streams and data
More informationChunkStash: Speeding Up Storage Deduplication using Flash Memory
ChunkStash: Speeding Up Storage Deduplication using Flash Memory Biplob Debnath +, Sudipta Sengupta *, Jin Li * * Microsoft Research, Redmond (USA) + Univ. of Minnesota, Twin Cities (USA) Deduplication
More informationLearning Kernels with Random Features
Learning Kernels with Random Features Aman Sinha John Duchi Stanford University NIPS, 2016 Presenter: Ritambhara Singh Outline 1 Introduction Motivation Background State-of-the-art 2 Proposed Approach
More informationSpace- and Time-Efficient Deterministic Algorithms for Biased Quantiles over Data Streams
Space- and Time-Efficient Deterministic Algorithms for Biased Quantiles over Data Streams Graham Cormode Bell Laboratories cormode@lucent.com Flip Korn AT&T Labs Research flip@research.att.com S. Muthukrishnan
More informationDistributed computing: index building and use
Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput
More informationCrescando: Predictable Performance for Unpredictable Workloads
Crescando: Predictable Performance for Unpredictable Workloads G. Alonso, D. Fauser, G. Giannikis, D. Kossmann, J. Meyer, P. Unterbrunner Amadeus S.A. ETH Zurich, Systems Group (Funded by Enterprise Computing
More informationDatabases 2 (VU) ( / )
Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:
More informationJinho Hwang (IBM Research) Wei Zhang, Timothy Wood, H. Howie Huang (George Washington Univ.) K.K. Ramakrishnan (Rutgers University)
Jinho Hwang (IBM Research) Wei Zhang, Timothy Wood, H. Howie Huang (George Washington Univ.) K.K. Ramakrishnan (Rutgers University) Background: Memory Caching Two orders of magnitude more reads than writes
More informationShark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker
Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha
More informationAnnouncement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17
Announcement CompSci 516 Database Systems Lecture 10 Query Evaluation and Join Algorithms Project proposal pdf due on sakai by 5 pm, tomorrow, Thursday 09/27 One per group by any member Instructor: Sudeepa
More informationDynamic Flow Regulation for IP Integration on Network-on-Chip
Dynamic Flow Regulation for IP Integration on Network-on-Chip Zhonghai Lu and Yi Wang Dept. of Electronic Systems KTH Royal Institute of Technology Stockholm, Sweden Agenda The IP integration problem Why
More informationDesign Tradeoffs for Data Deduplication Performance in Backup Workloads
Design Tradeoffs for Data Deduplication Performance in Backup Workloads Min Fu,DanFeng,YuHua,XubinHe, Zuoning Chen *, Wen Xia,YuchengZhang,YujuanTan Huazhong University of Science and Technology Virginia
More informationCS 655 Advanced Topics in Distributed Systems
Presented by : Walid Budgaga CS 655 Advanced Topics in Distributed Systems Computer Science Department Colorado State University 1 Outline Problem Solution Approaches Comparison Conclusion 2 Problem 3
More informationHow TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. Guest Lecture in MIT Performance Engineering, 18 November 2010.
6.172 How Fractal Trees Work 1 How TokuDB Fractal TreeTM Indexes Work Bradley C. Kuszmaul Guest Lecture in MIT 6.172 Performance Engineering, 18 November 2010. 6.172 How Fractal Trees Work 2 I m an MIT
More informationUsing MPI One-sided Communication to Accelerate Bioinformatics Applications
Using MPI One-sided Communication to Accelerate Bioinformatics Applications Hao Wang (hwang121@vt.edu) Department of Computer Science, Virginia Tech Next-Generation Sequencing (NGS) Data Analysis NGS Data
More informationI/O Characterization of Commercial Workloads
I/O Characterization of Commercial Workloads Kimberly Keeton, Alistair Veitch, Doug Obal, and John Wilkes Storage Systems Program Hewlett-Packard Laboratories www.hpl.hp.com/research/itc/csl/ssp kkeeton@hpl.hp.com
More informationEfficiency vs. Effectiveness in Terabyte-Scale IR
Efficiency vs. Effectiveness in Terabyte-Scale Information Retrieval Stefan Büttcher Charles L. A. Clarke University of Waterloo, Canada November 17, 2005 1 2 3 4 5 6 What is Wumpus? Multi-user file system
More informationIntroduction to Linked List: Review. Source:
Introduction to Linked List: Review Source: http://www.geeksforgeeks.org/data-structures/linked-list/ Linked List Fundamental data structures in C Like arrays, linked list is a linear data structure Unlike
More informationLarge Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System
Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System Seunghwa Kang David A. Bader 1 A Challenge Problem Extracting a subgraph from
More informationData Platforms and Pattern Mining
Morteza Zihayat Data Platforms and Pattern Mining IBM Corporation About Myself IBM Software Group Big Data Scientist 4Platform Computing, IBM (2014 Now) PhD Candidate (2011 Now) 4Lassonde School of Engineering,
More informationOverview of Query Processing. Evaluation of Relational Operations. Why Sort? Outline. Two-Way External Merge Sort. 2-Way Sort: Requires 3 Buffer Pages
Overview of Query Processing Query Parser Query Processor Evaluation of Relational Operations Query Rewriter Query Optimizer Query Executor Yanlei Diao UMass Amherst Lock Manager Access Methods (Buffer
More informationDATA CUBE : A RELATIONAL AGGREGATION OPERATOR GENERALIZING GROUP-BY, CROSS-TAB AND SUB-TOTALS SNEHA REDDY BEZAWADA CMPT 843
DATA CUBE : A RELATIONAL AGGREGATION OPERATOR GENERALIZING GROUP-BY, CROSS-TAB AND SUB-TOTALS SNEHA REDDY BEZAWADA CMPT 843 WHAT IS A DATA CUBE? The Data Cube or Cube operator produces N-dimensional answers
More informationMassive Data Algorithmics. Lecture 12: Cache-Oblivious Model
Typical Computer Hierarchical Memory Basics Data moved between adjacent memory level in blocks A Trivial Program A Trivial Program: d = 1 A Trivial Program: d = 1 A Trivial Program: n = 2 24 A Trivial
More informationChapter 8: Memory-Management Strategies
Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and
More information