Estimating Quantiles from the Union of Historical and Streaming Data

1 Estimating Quantiles from the Union of Historical and Streaming Data
Sneha Aman Singh, Iowa State University
Divesh Srivastava, AT&T Labs - Research
Srikanta Tirthapura, Iowa State University

2 Quantiles
Quantile: a statistical measure used to characterize the distribution of a data set. It indicates a point in a data distribution at or below which a given fraction of the entire data set falls.
[Figure: normal distribution, with 50% of the data below the median.]
Quantile computation is a fundamental database query and an operator in many big data tools, such as R and Apache Spark.

3 φ-Quantile of a Set
The rank of an element x in a set S is the number of elements of S with value less than or equal to x.
[Example: a set S of N = 11 elements, shown unsorted and sorted, with the rank of each element.]
The φ-quantile is an element of rank φN, for 0 < φ < 1.
Find the 0.5-quantile: rank of the 0.5-quantile = φN = 0.5 × 11 = 5.5, rounded down to 5.
The 0-quantile is the minimum and the 1-quantile is the maximum of S.
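The rank and quantile definitions on this slide can be tried out with a minimal sketch. The 11-element set below is hypothetical (the set used on the slide is not available), the function names are illustrative, and rounding φN down is an assumption taken from the slide's 0.5 × 11 = 5 calculation.

```python
import math


def rank(x, values):
    """Number of elements in `values` that are less than or equal to x."""
    return sum(1 for v in values if v <= x)


def quantile(values, phi):
    """Element of rank floor(phi * n); phi = 0 is mapped to the minimum."""
    ordered = sorted(values)
    target_rank = max(1, math.floor(phi * len(ordered)))
    return ordered[target_rank - 1]


# Hypothetical 11-element set with distinct values (not the set from the slide).
S = [13, 4, 25, 8, 19, 1, 30, 16, 22, 11, 27]
median = quantile(S, 0.5)
print(median, rank(median, S))
```

With distinct values, the element returned for rank r has exactly rank r; with duplicates its rank can be larger.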

4 Application: Web Latency
Web latency: the delay experienced by a user between sending a request to the web server and getting back a response.
The latency experienced by all users is an indicator of user experience.
Track the 0.95-quantile of latency: a 0.95-quantile above the threshold latency signals poor network performance.

5 Outline
1. Description of quantile
2. Historical and Streaming Data
3. Contribution
4. High Level Algorithm Description
5. Experimental Evaluation

6 Historical Data
Historical data: data collected and stored for later analysis.
Example: web user requests collected at the server end, to analyze user trends, user behavior, and web site performance.
[Figure: user requests reach the web server and are stored for later analysis by a batch processing system (storage).]

7 Streaming Data
Data stream: a continuous sequence of data observed in the form of events, messages, or tuples, in a single pass or a limited number of passes.
[Figure: user requests reach the web server and are sent for storage; the same data is observed on the fly by a stream processing system (memory).]

8 Stream Processing System
[Figure: elements e1, e2, ... of the data stream enter the stream processing engine, which updates and retrieves the stream state held in memory. A query (for example, the 0.5-quantile) retrieves information from the state and receives an (approximate) answer.]
The state of the data is updated continuously, with every incoming element.

9 Integrated Processing of Historical and Streaming Data
Long-term data analysis with higher accuracy in real time.
Correlation of historical data with live streaming data.
Predictive analysis.
[Figure: timeline of time steps 1 to 5 with data sets D1 to D5, showing archived historical data and the live data stream.]

10 Historical vs Stream Processing System
Batch processing system:
  Data is stored and can be traversed multiple times.
  Accurate results of user query (advantage).
  Latency due to disk I/O (drawback).
Stream processing system:
  Single or limited pass over data.
  Approximate results of user query (drawback).
  Fast processing time (advantage).
Can we have our cake and eat it too?

11 Our Focus: (Approximate) Quantile Computation
ε-approximate quantiles: find an element of rank ρ from a dataset of size N, such that
  $(\phi - \epsilon) N \le \rho \le (\phi + \epsilon) N$, with $0 < \epsilon \ll \phi < 1$
N: size of data stream
ε: approximation parameter
Rank of element e in D: $|\{x \in D : x < e\}|$
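As a small illustration of the ε-approximate guarantee, the sketch below checks whether a candidate answer satisfies the rank bound. The function name and the data are hypothetical, and the rank counts elements less than or equal to the candidate (the definition from the earlier slide), which is an assumption here.

```python
def is_eps_approx_quantile(candidate, data, phi, eps):
    """True if rank(candidate) lies within [(phi - eps) * N, (phi + eps) * N]."""
    n = len(data)
    rho = sum(1 for x in data if x <= candidate)  # rank of the candidate
    return (phi - eps) * n <= rho <= (phi + eps) * n


# Hypothetical data: the integers 1..100, so rank(x) == x.
data = list(range(1, 101))
print(is_eps_approx_quantile(53, data, phi=0.5, eps=0.05))
print(is_eps_approx_quantile(60, data, phi=0.5, eps=0.05))
```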

12 Prior Work on Approximate Quantile Computation
Space-efficient ε-approximate streaming algorithms:
  Manku et al. [MRL98], with space cost $O\left(\frac{1}{\epsilon} \log^2(\epsilon N)\right)$
  Greenwald-Khanna [GK01], with space cost $O\left(\frac{1}{\epsilon} \log(\epsilon N)\right)$
  Q-digest [SBAS04], with space cost $O\left(\frac{1}{\epsilon} \log U\right)$
N: size of data stream; U: universe size; ε: approximation parameter, 0 < ε < 1

14 Processing Historical Data + Streaming Data
Stream processing system: continuously maintains a stream state for the current time step, using a state-of-the-art streaming algorithm. The state is used to generate a stream summary S(R) upon query.
Batch processing system: maintains a batch summary S(H) (a set of indexes for faster lookup). Data is loaded into the data warehouse in batches at every time step (for example, one time step = 1 day).
Coordinator: answers a query over H ∪ R using the stream summary S(R) and the batch summary S(H).
[Figure: the server sends the live data stream R to the stream processing system and batched data to the data warehouse holding the historical data H; the coordinator combines S(R) and S(H) to answer the query.]

15 Contribution
An algorithm that returns an element of rank ρ as a φ-quantile from a union of historical and streaming data, such that
  $\phi N - \epsilon m \le \rho \le \phi N + \epsilon m$, with $m \ll N$,
compared to
  $\phi N - \epsilon N \le \rho \le \phi N + \epsilon N$
for streaming algorithms.
The algorithm is memory efficient, requiring $O\left(\frac{\log(\epsilon N)}{\epsilon}\right)$ memory.
N: total size of historical and streaming data; m: size of data stream; ε: approximation parameter, 0 < ε < 1; T: number of time steps
Rank of an element e in S: $|\{x \in S : x < e\}|$

16 Contribution
Small processing time for updating historical data in the warehouse at every time step, using amortized $O\left(\frac{m}{B} \log \frac{n}{B}\right)$ disk accesses per time step.
Real-time querying, using amortized $O\left(\log^2\left(\frac{\epsilon m}{B}\right) \log T\right)$ disk accesses.
n: size of historical data; m: size of data stream; B: block size of the data warehouse; T: number of time steps; ε: approximation parameter, 0 < ε < 1

17 3-way Tradeoff between Accuracy, Disk Accesses & Memory
[Figure: three axes (accuracy, disk accesses, memory) positioning three approaches: batch + external memory, pure streaming, and batch + streaming.]

18 Comparison of Our Algorithm with Pure Streaming Algorithms
                 GK (streaming)                  Q-digest (streaming)     Our algorithm
Relative error   $\epsilon$                      $\epsilon$               $\epsilon m / N$
Memory           $O(\log(\epsilon N)/\epsilon)$  $O(\log(U)/\epsilon)$    $O(\log(\epsilon N)/\epsilon)$
N: total size of historical and streaming data; m: data stream size, m << N; T: number of time steps; U: universe size
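To make the gap between the two error bounds concrete, the sketch below plugs numbers into the expressions from the comparison: a rank error of εN for a pure streaming algorithm against εm for the combined approach. The sizes and ε are hypothetical values chosen only for illustration, and the function name is not from the slides.

```python
def rank_error_bounds(eps, total_size, stream_size):
    """Worst-case rank and relative errors: pure streaming vs historical + streaming."""
    streaming = eps * total_size   # eps * N
    combined = eps * stream_size   # eps * m
    return {
        "streaming_rank_error": streaming,
        "combined_rank_error": combined,
        "streaming_relative_error": streaming / total_size,  # eps
        "combined_relative_error": combined / total_size,    # eps * m / N
    }


# Hypothetical sizes: N = 10^9 elements in total, of which m = 10^7 are in the stream.
bounds = rank_error_bounds(eps=0.001, total_size=1_000_000_000, stream_size=10_000_000)
for name, value in bounds.items():
    print(name, value)
```

In this setting the combined bound is smaller by the factor m/N, which is why the benefit grows as the historical data outweighs the stream.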

20 High Level Algorithm Steps
Stream processing system:
  For each element observed in R in the current time step, update the stream state SS (SS is reset for each new time step).
  On query, generate the stream summary S(R) using SS.
Batch processing system:
  Collect data during the time step.
  At the end of each time step, load the collected data into H.
  Update the batch summary S(H) upon each load.
Coordinator, when a query is posed over ζ = H ∪ R:
  1. Retrieve S(H) and S(R) and use them to generate an approximate answer.
  2. Access the data warehouse to improve the accuracy of the answer, at the cost of additional disk accesses.
[Figure: the live data stream R feeds the stream processing system, which updates and retrieves SS in memory; the batch processing system updates and retrieves S(H) in memory and loads data into the data warehouse holding the historical data H; the coordinator receives S(R) and S(H).]

22 Historical Data Update at Each Time Step: Approach 1
Merge the newly arrived data set with the older historical data at every time step.
[Figure: time steps 1 to 5 with data sets D1 to D4 arriving at the data warehouse. Sort D1. Sort D2 and merge it with D1 to obtain sorted D1 ∪ D2. Sort D3 and merge it with D1 ∪ D2 to obtain sorted D1 ∪ D2 ∪ D3.]
Disadvantage: a large number of disk accesses at each time step i:
  Sort D_i of size n_i: $O\left(\frac{n_i}{B} \log \frac{n_i}{B}\right)$
  Merge with historical data of size n: $O(n/B)$
B: block size

23 Historical Data Update at Each Time Step: Approach 2
Add the newly arrived data set at each time step as a new partition in the data warehouse.
[Figure: time steps 1 to 5 with data sets D1 to D5. Each of D1, D2, D3, D4 is sorted and stored as its own sorted partition in the data warehouse.]
Advantage: small update cost at each time step i. Sort and store D_i of size n_i: $O\left(\frac{n_i}{B} \log \frac{n_i}{B}\right)$
Disadvantages:
  Maintenance of several partitions.
  Expensive to answer a query: $O(d \cdot T)$
d: number of disk accesses per time step to answer a query; T: number of time steps

24 Historical Data Update: Our Approach
Data set D_i arrives at time step i.
Data is stored in sorted format in data partitions organized into levels.
[Figure: merge threshold = 2. Level 0 receives D1, D2, ..., D7 as they arrive. Pairs are merged into level 1 as D1 ∪ D2, D3 ∪ D4, and D5 ∪ D6. Two partitions at level 1 violate the merge threshold, so D1 ∪ D2 and D3 ∪ D4 are merged into D1 ∪ D2 ∪ D3 ∪ D4 at level 2.]
Each level has at most κ data partitions, where κ is the merge threshold. In the figure, κ = 2.
The number of data partitions at the end of T time steps is on the order of $\log_\kappa T$ for a fixed κ.
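The leveled layout on this slide can be simulated in memory with a short sketch. It assumes that a level is merged into the next one as soon as it holds κ partitions, which is how the figure with κ = 2 behaves; the function name, the list-of-lists representation, and the toy data sets are hypothetical, and a real warehouse would perform the merges on disk.

```python
import heapq


def add_partition(levels, new_data, kappa):
    """Insert a new data set at level 0, cascading merges when a level holds kappa partitions."""
    carry = sorted(new_data)
    level = 0
    while True:
        if level == len(levels):
            levels.append([])
        levels[level].append(carry)
        if len(levels[level]) < kappa:
            return
        # Merge threshold reached: merge this level's sorted partitions into one
        # sorted partition and push it down to the next level.
        carry = list(heapq.merge(*levels[level]))
        levels[level] = []
        level += 1


# Seven time steps with toy data sets D_i = [i, i + 100] and kappa = 2.
levels = []
for i in range(1, 8):
    add_partition(levels, [i, i + 100], kappa=2)

for depth, partitions in enumerate(levels):
    print("level", depth, partitions)
```

After seven time steps the simulation holds three partitions, corresponding to D7, D5 ∪ D6, and D1 ∪ D2 ∪ D3 ∪ D4.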

25 Historical Data Update at Each Time Step: Our Approach
[Figure: partitions in the data warehouse after seven time steps: D7, D5 ∪ D6, and D1 ∪ D2 ∪ D3 ∪ D4.]
Advantages:
  Relatively small update cost, since the entire historical data is not accessed at each time step.
  Small query time.

27 Batch and Stream Summary
Batch summary S(H_i): $(1/\epsilon_1) + 1$ elements and their database indexes, for each disk partition H_i of size n_i. It is updated at the end of each time step.
  H_i (sorted) is sampled at positions $0, \epsilon_1 n_i, 2\epsilon_1 n_i, 3\epsilon_1 n_i, \ldots, n_i - \epsilon_1 n_i, n_i - 1$, and the sampled elements form S(H_i) (sorted).
Stream summary S(R): uses the stream state SS to get $(1/\epsilon_2) + 1$ elements of approximate ranks $\epsilon_2 m, 2\epsilon_2 m, 3\epsilon_2 m$, and so on, with a maximum error of $\epsilon_2 m$. It is created when a query is made.
  S(R) (sorted): $r_0, r_1, r_2, \ldots, r_{m-2}, r_{m-1}$, with
    $\mathrm{rank}(r_0, R) = 0$
    $0 < \mathrm{rank}(r_1, R) \le 2\epsilon_2 m$
    $\epsilon_2 m \le \mathrm{rank}(r_2, R) \le 3\epsilon_2 m$
    $\epsilon_2 (m-3) \le \mathrm{rank}(r_{m-2}, R) \le \epsilon_2 (m-1)$
    $\mathrm{rank}(r_{m-1}, R) = m - 1$
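A batch summary of this shape can be sketched as follows: pick the elements at evenly spaced positions of a sorted partition and keep each element together with its position, which plays the role of the database index. The function name and the toy partition are hypothetical, and the sketch assumes that 1/ε₁ is an integer.

```python
def batch_summary(sorted_partition, eps1):
    """Return (element, index) pairs at positions 0, eps1*n, 2*eps1*n, ..., n - 1."""
    n = len(sorted_partition)
    steps = round(1 / eps1)  # assumes 1/eps1 is an integer
    positions = sorted({min(n - 1, (k * n) // steps) for k in range(steps + 1)})
    return [(sorted_partition[p], p) for p in positions]


# Toy sorted partition of 100 even numbers; eps1 = 0.25 gives (1/eps1) + 1 = 5 entries.
partition = list(range(0, 200, 2))
summary = batch_summary(partition, eps1=0.25)
print(summary)
```

Keeping the index next to each element is what later lets a query jump straight to the relevant region of the partition instead of scanning it.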

29 Answer a Query: Find a Range within which the Quantile is Guaranteed to Lie
H: historical data of size n; R: live data stream of size m; N = n + m.
Find elements u and v (the filters) from S(H) ∪ S(R) such that the quantile is guaranteed to lie in [u, v] in the union of historical and streaming data.
[Figure: S(H) ∪ S(R) in sorted order, $s_1, s_2, \ldots, s_{2(1/\epsilon + 1)}$, with u and v marked. Below it, H ∪ R in sorted order, $e_1, e_2, \ldots, e_N$, with u and v marked; the span between u and v holds $2\epsilon N$ elements, and the φ-quantile is guaranteed to lie there.]

31 Additional Disk Accesses to Improve Accuracy
The filter range [u, v] is narrowed down recursively to zero in on the quantile and remove the error due to the batch summary. At each recursion:
  Compute the rank of z = (u + v)/2 in all data partitions and in the stream. For each partition H_i of size n_i:
    Step a: In S(H_i) (sorted, $e_1, e_2, \ldots, e_\beta$ with $\beta = (1/\epsilon_1) + 1$), find p and q such that $p \le u$ and $q \ge v$. The span between p and q covers $4\epsilon n_i$ elements of H_i.
    Step b: Access p and q in H_i (sorted, $e_1, e_2, \ldots, e_{n_i}$) using the index from S(H_i).
    Step c: Find the rank of z by binary search between p and q, where $p \le z \le q$.
  Rank of z = Σ (rank of z in each partition) + rank of z in the data stream.
  Choose [u, z] or [z, v] as the new filter for the next recursion, depending on which contains the quantile.
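The narrowing loop can be illustrated with a simplified in-memory sketch. It makes several assumptions that differ from the slide: values are integers, the stream is available as a sorted list so its rank is exact rather than approximate, and each binary search runs over a whole partition instead of only the span between p and q found through the summary. The function names and data are hypothetical.

```python
from bisect import bisect_right


def rank_in_union(z, partitions, stream_sorted):
    """Rank of z = sum of its ranks in each sorted partition + its rank in the stream."""
    return sum(bisect_right(p, z) for p in partitions) + bisect_right(stream_sorted, z)


def narrow_filter(u, v, target_rank, partitions, stream_sorted):
    """Halve the integer range [u, v] until it pins down the element of the target rank.

    Assumes the element of rank `target_rank` lies within [u, v].
    """
    while u < v:
        z = (u + v) // 2
        if rank_in_union(z, partitions, stream_sorted) >= target_rank:
            v = z          # the quantile is in [u, z]
        else:
            u = z + 1      # the quantile is in (z, v]
    return u


# Toy data: two sorted historical partitions and a sorted stream, N = 11 elements.
partitions = [[1, 4, 9, 12], [2, 3, 15, 20]]
stream_sorted = [5, 7, 30]
answer = narrow_filter(1, 30, target_rank=5, partitions=partitions, stream_sorted=stream_sorted)
print(answer)
```

Each iteration costs one binary search per partition, which is why keeping the number of partitions small matters for query time.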

32 Outline
1. Description of quantile
2. Historical and Streaming Data
3. Problem Statement and Motivation
4. Contribution
5. High Level Algorithm Description
6. Experimental Evaluation

33 Experimental Setup
Datasets: Wikipedia, Network traces, Uniform Random, Normal
[Table columns: data size (GB) and tuples per time step (million) for each dataset; values not shown.]
Algorithms: Greenwald-Khanna, Q-digest, our algorithm
Performance metric   Description
Accuracy             Relative error: $|\phi N - r| / (\phi N)$
Processing time      Runtime and number of disk accesses
Query time           Runtime and number of disk accesses
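The accuracy metric can be computed directly when the full data set is available for checking. In this minimal sketch, r is taken to be the true rank of the element an algorithm returned, counted as the number of elements less than or equal to it; that reading of r, the function name, and the data are assumptions for illustration.

```python
def relative_error(data, phi, answer):
    """Relative error |phi * N - r| / (phi * N), where r is the true rank of `answer`."""
    n = len(data)
    r = sum(1 for x in data if x <= answer)
    return abs(phi * n - r) / (phi * n)


# Hypothetical data: the integers 1..1000, so the true rank of x is x.
data = list(range(1, 1001))
print(relative_error(data, phi=0.5, answer=510))
print(relative_error(data, phi=0.5, answer=500))
```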

34 Dependence On Memory
[Figure, three panels on the Normal dataset (merge threshold κ = 10). Update time vs memory: update time in seconds against memory in MB, broken down into load, sort, merge, and summary. Accuracy vs memory: relative error against memory in MB for our algorithm, Greenwald-Khanna, Q-digest, and quick response. Query time vs memory: query runtime in seconds and query disk accesses against memory in MB for our algorithm, Greenwald-Khanna, and Q-digest.]

35 Dependence On Merge Threshold
[Figure, three panels on the Normal dataset (memory fixed at 250 MB). Update time vs merge threshold: update time in seconds for load, sort, merge, and summary, with the average number of disk accesses per time step (overall and for merge). Accuracy vs merge threshold: relative error in practice and in theory against merge threshold κ. Query time vs merge threshold: query runtime in seconds and query disk accesses for our algorithm.]

36 Dependence On Historical Data Size
[Figure, three panels on the Normal dataset (streaming data size: 1 GB, memory: 250 MB, merge threshold: 10). Update time vs historical data size: average update runtime in seconds and average update disk accesses (overall and for merge) against historical data size in GB. Accuracy vs historical data size: relative error against historical data size in GB. Query time vs historical data size: query runtime in seconds and query disk accesses against historical data size in GB.]

37 Dependence On Streaming Data Size
[Figure, three panels on the Normal dataset (historical data size: 100 GB, memory: 250 MB, merge threshold: 10). Update time vs data stream size: average update runtime in seconds and average update disk accesses (overall and for merge) against stream size in MB. Accuracy vs data stream size: relative error against stream size in MB. Query time vs data stream size: query runtime in seconds and query disk accesses against stream size in MB.]

38 Conclusion
We present an algorithm that identifies quantiles from a union of historical and streaming data:
  in real time
  with accuracy significantly better than pure streaming algorithms
  using memory similar to pure streaming algorithms
  using few additional disk accesses
Future Work
  Analysis of the tradeoff between accuracy and disk accesses
  Improve the window adaptation of the algorithm
  Parallel methods for processing historical data
  Other aggregates over a union of historical and streaming data

39 Thank You


Chapter 13: Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

More information

Data Analytics with HPC. Data Streaming

Data Analytics with HPC. Data Streaming Data Analytics with HPC Data Streaming Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

ColumnStore Indexes. מה חדש ב- 2014?SQL Server.

ColumnStore Indexes. מה חדש ב- 2014?SQL Server. ColumnStore Indexes מה חדש ב- 2014?SQL Server דודאי מאיר meir@valinor.co.il 3 Column vs. row store Row Store (Heap / B-Tree) Column Store (values compressed) ProductID OrderDate Cost ProductID OrderDate

More information

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross Track Join Distributed Joins with Minimal Network Traffic Orestis Polychroniou Rajkumar Sen Kenneth A. Ross Local Joins Algorithms Hash Join Sort Merge Join Index Join Nested Loop Join Spilling to disk

More information

Parallel Merge Sort Using MPI

Parallel Merge Sort Using MPI Parallel Merge Sort Using MPI CSE 702: Seminar on Programming Massively Parallel Systems Course Instructor: Dr. Russ Miller UB Distinguished Professor Department of Computer Science & Engineering State

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join

More information

Processing of big data with Apache Spark

Processing of big data with Apache Spark Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT

More information

Improving the Performance of OLAP Queries Using Families of Statistics Trees

Improving the Performance of OLAP Queries Using Families of Statistics Trees Improving the Performance of OLAP Queries Using Families of Statistics Trees Joachim Hammer Dept. of Computer and Information Science University of Florida Lixin Fu Dept. of Mathematical Sciences University

More information

Lab - Manage Virtual Memory in Windows 7 and Vista

Lab - Manage Virtual Memory in Windows 7 and Vista Lab - Manage Virtual Memory in Windows 7 and Vista Introduction In this lab, you will customize virtual memory settings. Recommended Equipment A computer with Windows 7 or Vista installed The hard drive

More information

CACHE-OBLIVIOUS MAPS. Edward Kmett McGraw Hill Financial. Saturday, October 26, 13

CACHE-OBLIVIOUS MAPS. Edward Kmett McGraw Hill Financial. Saturday, October 26, 13 CACHE-OBLIVIOUS MAPS Edward Kmett McGraw Hill Financial CACHE-OBLIVIOUS MAPS Edward Kmett McGraw Hill Financial CACHE-OBLIVIOUS MAPS Indexing and Machine Models Cache-Oblivious Lookahead Arrays Amortization

More information

Multiresolution Motif Discovery in Time Series

Multiresolution Motif Discovery in Time Series Tenth SIAM International Conference on Data Mining Columbus, Ohio, USA Multiresolution Motif Discovery in Time Series NUNO CASTRO PAULO AZEVEDO Department of Informatics University of Minho Portugal April

More information

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Query Processing Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Slides re-used with some modification from www.db-book.com Reference: Database System Concepts, 6 th Ed. By Silberschatz,

More information

Inverted Indexes. Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5

Inverted Indexes. Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5 Inverted Indexes Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5 Basic Concepts Inverted index: a word-oriented mechanism for indexing a text collection to speed up the

More information

Visualization-supported Analysis of System Data for Controlled VMI-based Intrusion Detection

Visualization-supported Analysis of System Data for Controlled VMI-based Intrusion Detection Visualization-supported Analysis of System Data for Controlled VMI-based Intrusion Detection Noëlle Rakotondravony, Prof. Hans P. Reiser Juniorprofessur für Sicherheit in Informationssystemen Universität

More information

Query Processing. Introduction to Databases CompSci 316 Fall 2017

Query Processing. Introduction to Databases CompSci 316 Fall 2017 Query Processing Introduction to Databases CompSci 316 Fall 2017 2 Announcements (Tue., Nov. 14) Homework #3 sample solution posted in Sakai Homework #4 assigned today; due on 12/05 Project milestone #2

More information

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

Chapter 13: Query Processing Basic Steps in Query Processing

Chapter 13: Query Processing Basic Steps in Query Processing Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

Fast Approximations for Analyzing Ten Trillion Cells. Filip Buruiana Reimar Hofmann

Fast Approximations for Analyzing Ten Trillion Cells. Filip Buruiana Reimar Hofmann Fast Approximations for Analyzing Ten Trillion Cells Filip Buruiana (filipb@google.com) Reimar Hofmann (reimar.hofmann@hs-karlsruhe.de) Outline of the Talk Interactive analysis at AdSpam @ Google Trade

More information

HANA Performance. Efficient Speed and Scale-out for Real-time BI

HANA Performance. Efficient Speed and Scale-out for Real-time BI HANA Performance Efficient Speed and Scale-out for Real-time BI 1 HANA Performance: Efficient Speed and Scale-out for Real-time BI Introduction SAP HANA enables organizations to optimize their business

More information

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See  for conditions on re-use Chapter 11: Indexing and Hashing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files Static

More information

Chapter 7: Main Memory. Operating System Concepts Essentials 8 th Edition

Chapter 7: Main Memory. Operating System Concepts Essentials 8 th Edition Chapter 7: Main Memory Operating System Concepts Essentials 8 th Edition Silberschatz, Galvin and Gagne 2011 Chapter 7: Memory Management Background Swapping Contiguous Memory Allocation Paging Structure

More information

Evaluation of Relational Operations

Evaluation of Relational Operations Evaluation of Relational Operations Yanlei Diao UMass Amherst March 13 and 15, 2006 Slides Courtesy of R. Ramakrishnan and J. Gehrke 1 Relational Operations We will consider how to implement: Selection

More information

Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory

Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash Sharma, Mark Santaniello, Justin Meza, Aman Kansal,

More information

Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink

Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink Rajesh Bordawekar IBM T. J. Watson Research Center bordaw@us.ibm.com Pidad D Souza IBM Systems pidsouza@in.ibm.com 1 Outline

More information

Key aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling

Key aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling Key aspects of cloud computing Cluster Scheduling 1. Illusion of infinite computing resources available on demand, eliminating need for up-front provisioning. The elimination of an up-front commitment

More information

Shark: Hive (SQL) on Spark

Shark: Hive (SQL) on Spark Shark: Hive (SQL) on Spark Reynold Xin UC Berkeley AMP Camp Aug 21, 2012 UC BERKELEY SELECT page_name, SUM(page_views) views FROM wikistats GROUP BY page_name ORDER BY views DESC LIMIT 10; Stage 0: Map-Shuffle-Reduce

More information

Using the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver

Using the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver Using the SDACK Architecture to Build a Big Data Product Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver Outline A Threat Analytic Big Data product The SDACK Architecture Akka Streams and data

More information

ChunkStash: Speeding Up Storage Deduplication using Flash Memory

ChunkStash: Speeding Up Storage Deduplication using Flash Memory ChunkStash: Speeding Up Storage Deduplication using Flash Memory Biplob Debnath +, Sudipta Sengupta *, Jin Li * * Microsoft Research, Redmond (USA) + Univ. of Minnesota, Twin Cities (USA) Deduplication

More information

Learning Kernels with Random Features

Learning Kernels with Random Features Learning Kernels with Random Features Aman Sinha John Duchi Stanford University NIPS, 2016 Presenter: Ritambhara Singh Outline 1 Introduction Motivation Background State-of-the-art 2 Proposed Approach

More information

Space- and Time-Efficient Deterministic Algorithms for Biased Quantiles over Data Streams

Space- and Time-Efficient Deterministic Algorithms for Biased Quantiles over Data Streams Space- and Time-Efficient Deterministic Algorithms for Biased Quantiles over Data Streams Graham Cormode Bell Laboratories cormode@lucent.com Flip Korn AT&T Labs Research flip@research.att.com S. Muthukrishnan

More information

Distributed computing: index building and use

Distributed computing: index building and use Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput

More information

Crescando: Predictable Performance for Unpredictable Workloads

Crescando: Predictable Performance for Unpredictable Workloads Crescando: Predictable Performance for Unpredictable Workloads G. Alonso, D. Fauser, G. Giannikis, D. Kossmann, J. Meyer, P. Unterbrunner Amadeus S.A. ETH Zurich, Systems Group (Funded by Enterprise Computing

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

Jinho Hwang (IBM Research) Wei Zhang, Timothy Wood, H. Howie Huang (George Washington Univ.) K.K. Ramakrishnan (Rutgers University)

Jinho Hwang (IBM Research) Wei Zhang, Timothy Wood, H. Howie Huang (George Washington Univ.) K.K. Ramakrishnan (Rutgers University) Jinho Hwang (IBM Research) Wei Zhang, Timothy Wood, H. Howie Huang (George Washington Univ.) K.K. Ramakrishnan (Rutgers University) Background: Memory Caching Two orders of magnitude more reads than writes

More information

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha

More information

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17 Announcement CompSci 516 Database Systems Lecture 10 Query Evaluation and Join Algorithms Project proposal pdf due on sakai by 5 pm, tomorrow, Thursday 09/27 One per group by any member Instructor: Sudeepa

More information

Dynamic Flow Regulation for IP Integration on Network-on-Chip

Dynamic Flow Regulation for IP Integration on Network-on-Chip Dynamic Flow Regulation for IP Integration on Network-on-Chip Zhonghai Lu and Yi Wang Dept. of Electronic Systems KTH Royal Institute of Technology Stockholm, Sweden Agenda The IP integration problem Why

More information

Design Tradeoffs for Data Deduplication Performance in Backup Workloads

Design Tradeoffs for Data Deduplication Performance in Backup Workloads Design Tradeoffs for Data Deduplication Performance in Backup Workloads Min Fu,DanFeng,YuHua,XubinHe, Zuoning Chen *, Wen Xia,YuchengZhang,YujuanTan Huazhong University of Science and Technology Virginia

More information

CS 655 Advanced Topics in Distributed Systems

CS 655 Advanced Topics in Distributed Systems Presented by : Walid Budgaga CS 655 Advanced Topics in Distributed Systems Computer Science Department Colorado State University 1 Outline Problem Solution Approaches Comparison Conclusion 2 Problem 3

More information

How TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. Guest Lecture in MIT Performance Engineering, 18 November 2010.

How TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. Guest Lecture in MIT Performance Engineering, 18 November 2010. 6.172 How Fractal Trees Work 1 How TokuDB Fractal TreeTM Indexes Work Bradley C. Kuszmaul Guest Lecture in MIT 6.172 Performance Engineering, 18 November 2010. 6.172 How Fractal Trees Work 2 I m an MIT

More information

Using MPI One-sided Communication to Accelerate Bioinformatics Applications

Using MPI One-sided Communication to Accelerate Bioinformatics Applications Using MPI One-sided Communication to Accelerate Bioinformatics Applications Hao Wang (hwang121@vt.edu) Department of Computer Science, Virginia Tech Next-Generation Sequencing (NGS) Data Analysis NGS Data

More information

I/O Characterization of Commercial Workloads

I/O Characterization of Commercial Workloads I/O Characterization of Commercial Workloads Kimberly Keeton, Alistair Veitch, Doug Obal, and John Wilkes Storage Systems Program Hewlett-Packard Laboratories www.hpl.hp.com/research/itc/csl/ssp kkeeton@hpl.hp.com

More information

Efficiency vs. Effectiveness in Terabyte-Scale IR

Efficiency vs. Effectiveness in Terabyte-Scale IR Efficiency vs. Effectiveness in Terabyte-Scale Information Retrieval Stefan Büttcher Charles L. A. Clarke University of Waterloo, Canada November 17, 2005 1 2 3 4 5 6 What is Wumpus? Multi-user file system

More information

Introduction to Linked List: Review. Source:

Introduction to Linked List: Review. Source: Introduction to Linked List: Review Source: http://www.geeksforgeeks.org/data-structures/linked-list/ Linked List Fundamental data structures in C Like arrays, linked list is a linear data structure Unlike

More information

Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System

Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System Seunghwa Kang David A. Bader 1 A Challenge Problem Extracting a subgraph from

More information

Data Platforms and Pattern Mining

Data Platforms and Pattern Mining Morteza Zihayat Data Platforms and Pattern Mining IBM Corporation About Myself IBM Software Group Big Data Scientist 4Platform Computing, IBM (2014 Now) PhD Candidate (2011 Now) 4Lassonde School of Engineering,

More information

Overview of Query Processing. Evaluation of Relational Operations. Why Sort? Outline. Two-Way External Merge Sort. 2-Way Sort: Requires 3 Buffer Pages

Overview of Query Processing. Evaluation of Relational Operations. Why Sort? Outline. Two-Way External Merge Sort. 2-Way Sort: Requires 3 Buffer Pages Overview of Query Processing Query Parser Query Processor Evaluation of Relational Operations Query Rewriter Query Optimizer Query Executor Yanlei Diao UMass Amherst Lock Manager Access Methods (Buffer

More information

DATA CUBE : A RELATIONAL AGGREGATION OPERATOR GENERALIZING GROUP-BY, CROSS-TAB AND SUB-TOTALS SNEHA REDDY BEZAWADA CMPT 843

DATA CUBE : A RELATIONAL AGGREGATION OPERATOR GENERALIZING GROUP-BY, CROSS-TAB AND SUB-TOTALS SNEHA REDDY BEZAWADA CMPT 843 DATA CUBE : A RELATIONAL AGGREGATION OPERATOR GENERALIZING GROUP-BY, CROSS-TAB AND SUB-TOTALS SNEHA REDDY BEZAWADA CMPT 843 WHAT IS A DATA CUBE? The Data Cube or Cube operator produces N-dimensional answers

More information

Massive Data Algorithmics. Lecture 12: Cache-Oblivious Model

Massive Data Algorithmics. Lecture 12: Cache-Oblivious Model Typical Computer Hierarchical Memory Basics Data moved between adjacent memory level in blocks A Trivial Program A Trivial Program: d = 1 A Trivial Program: d = 1 A Trivial Program: n = 2 24 A Trivial

More information

Chapter 8: Memory-Management Strategies

Chapter 8: Memory-Management Strategies Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and

More information