Computations with Bounded Errors and Response Times on Very Large Data
|
|
- Bryce Richard
- 5 years ago
- Views:
Transcription
1 Computations with Bounded Errors and Response Times on Very Large Data Ion Stoica UC Berkeley (joint work with: Sameer Agarwal, Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Purnamrita Sarkar, Michael Jordan, Sam Madden) Paris, May 15, 2013 UC BERKELEY
2 Problem Support interactive ad-hoc exploration queries over very large datasets
3 Why is This a Problem? 100 TB on 1000 cores/disks 1-2 Hours Minutes 1 second? Hard Disks Memory
4 Why is This a Problem? Even if no communication and all data in memory, query may take tens of sec» Just scanning GB RAM may take 10 sec Still slow for interactive queries
5 Why is This a Problem? 60 Projected Growth Increase over Moore's Law Overall Data Particle Accel. DNA Sequencers Data Grows faster than Moore s Law [IDC report, Kathy Yelick, LBNL]
6 Key Insight Computations don t always need exact answers Input often noisy: exact computations do not guarantee exact answers Error often acceptable if small and bounded Best scale ± 200g error Speedometers ± 2.5 % error (edmunds.com) OmniPod Insulin Pump ± 0.96 % error (
7 Approach: Sampling Compute results on samples instead of full data» Typically, error depends on sample size (n) not on original data size, i.e., error α 1/ n Can trade between answer s latency and accuracy Data rapid increase no longer a problem :» Error decreases with Moore s law: halves every 36 months
8 This Talk BlinkDB: approximate query engine for very large data sets using off- line sampling
9 BlinkDB Interface SELECT avg(sessiontime) FROM Table WHERE city= San Francisco AND dt= WITHIN 1 SECONDS ± 15.32
10 BlinkDB Interface SELECT avg(sessiontime) FROM Table WHERE city= San Francisco AND dt= WITHIN 2 SECONDS SELECT avg(sessiontime) FROM Table WHERE city= San Francisco AND dt= ERROR 0.1 CONFIDENCE 95.0% ± ± 4.96
11 BlinkDB Overview TABLE Original Data Sampling Module Offline- sampling: OpHmal set of samples across different dimensions (columns or sets of columns) to support ad- hoc exploratory queries
12 BlinkDB Overview TABLE Original Data Sampling Module Sample Placement: Samples striped over 100s or 1,000s of machines both on disks and in- memory. On- Disk Samples In- Memory Samples
13 BlinkDB Overview SELECT foo (*) FROM TABLE WITHIN 2 HiveQL/SQL Query Query Plan Sample Selection TABLE Original Data Sampling Module On- Disk Samples In- Memory Samples
14 BlinkDB Overview SELECT foo (*) FROM TABLE WITHIN 2 HiveQL/SQL Query TABLE Original Data Sampling Module Query Plan Sample Selection Online sample selection to pick best sample(s) based on query latency and accuracy requirements On- Disk Samples In- Memory Samples
15 BlinkDB Overview SELECT foo (*) FROM TABLE WITHIN 2 HiveQL/SQL Query New Query Plan Sample Selection Error Bars & Confidence Intervals Hadoop/Spark TABLE Original Data Sampling Module On- Disk Samples In- Memory Samples Result ± 5.56 (95% confidence) Parallel query execuhon on mulhple samples striped across mulhple machines.
16 Challenges Which set of samples to build given a storage budget? Which sample to run the query on? How to accurately estimate the error?
17 Challenges Which set of samples to build given a storage budget? Which sample to run the query on? How to accurately estimate the error?
18 Error Latency Profile (ELP) Relative Error Time Taken Relative Error Time Taken (in seconds) Sample Size (in MB)
19 Error Latency Profile (ELP) Relative Error Time Taken Relative Error Time Taken (in seconds) Sample Size (in MB)
20 Error Latency Profile (ELP) Relative Error Time Taken Relative Error Time Taken (in seconds) Sample Size (in MB)
21 Error Latency Profile (ELP) Relative Error Time Taken Relative Error Time Taken (in seconds) Sample Size (in MB)
22 Error Latency Profile (ELP) Relative Error Time Taken Relative Error Error 1/ n Time Taken (in seconds) Sample Size (in MB)
23 Error Latency Profile (ELP) Relative Error Time Taken Relative Error Error 1/ n Latency n Time Taken (in seconds) Sample Size (in MB)
24 Error Latency Profile (ELP) SELECT foo (*) FROM TABLE WITHIN 2
25 Error Latency Profile (ELP) SELECT foo (*) FROM TABLE WITHIN ms
26 Challenges Which set of samples to build given a storage budget? Which sample to run the query on? How to accurately estimate the error?
27 How to Accurately Estimate Error? Close formulas for limited number of operators» E.g., count, mean, percentiles What about user defined functions (UDFs)?
28 Experimental Workload Conviva: 30- day log of media accesses by Conviva users. Raw data 17 TB, partitioned this data across 100 nodes Log of 20,000 queries» 43.6% queries have one or more UDFs Storage budget: 50% of original data» 8 stratified samples
29 Experimental Setting 100 node cluster of EC2 extra large instances:» 800 cores» 6.8TB RAM» 75TB disk Two datasests: 2.5TB, and 7.5TB, respectively» Significantly larger when stored in memory
30 Sampling vs. No Sampling (close formula) Partially Fully Cached
31 Sampling vs. No Sampling (close formula)
32 Sampling vs. No Sampling (close formula) 12x Faster! x Faster!
33 How to Accurately Estimate Error? Close formulas for limited number of operators» E.g., count, mean, percentiles What about user defined functions (UDFs)?
34 Bootstrap Quantify accuracy of a query on a sample table Original Table T Q(T) Q(T) takes too long! sample S = N S Q(S) what is Q(S) s error? sampling with replacement S 1 S k Q (S 1 ) Q (S k ) use Q(S 1 ),, Q(S K ) to compute estimator quality assessment, e.g., stdev(q(s i )) S i = N
35 Also Useful for Close Formulas filter1 sum filter2 sum x, ε 1 y, ε 2 Every error propagation step may div introduce additional error r = x/y, ε = epr(ε 1, ε 2 )
36 Also Useful for Close Formulas S S 1 S k filter1 filter2 filter1 filter2 filter1 filter2 sum x, ε 1 div sum y, ε 2 sum sum x y sum Bootstrapping div only computes div errors on the final result x sum y r = x/y, ε = epr(ε 1, ε 2 ) r 1 r, ε r k
37 Also Useful for Close Formulas 6x
38 Bootstrap Challenges How do you know the bootstrap is working?» Depends on distribution, computation, sample size Overhead
39 How Do You Know Bootstrap is Working? Assumption: f() is Hadamard differentiable» How do you know an UDF is Hadamard differentiable?» Only asymptotic consistency» Sufficient, not necessary condition Developed data- driven diagnostic for Bootstrap» Compare bootstrapping with ground truth for small samples» Check whether error improves as sample size increases
40 Ground Truth (Approximation) T Original Table
41 Ground Truth (Approximation) T Original Table S 1 Q (S 1 ) ξ = stdev(q(s j )) S p Q (S p ) Estimator quality assessment
42 Ground Truth and Bootstrap T Original Table Bootstrap on Individual Samples S 11 Q (S 11 ) S 1 ξ * 1 = stdev(q(s 1 j )) S 1k Q (S 1k ) S p1 Q (S p1 ) S p ξ * p = stdev(q(s pj )) S pk Q (S pk )
43 Ground Truth vs. Bootstrap ξ i = mean(ξ ij ) ξ * i1 = stdev(q(s i1 j )) ξ * = stdev(q(s )) ip ipj RelaHve Error Bootstrap Ground Truth n 1 n 2 n 3 Sample Size
44 Ground Truth vs. Bootstrap ξ i = mean(ξ ij ) ξ * i1 = stdev(q(s i1 j )) ξ * = stdev(q(s )) ip ipj RelaHve Error Bootstrap Ground Truth n 1 n 2 n 3 Sample Size
45 Ground Truth vs. Bootstrap ξ i = mean(ξ ij ) ξ * i1 = stdev(q(s i1 j )) ξ * = stdev(q(s )) ip ipj RelaHve Error Bootstrap Ground Truth n 1 n 2 n 3 Sample Size
46 Bootstrap Diagnosis Expectation test:» Bootstrap results should not deviate too much from ground truth, on average Standard deviation test:» Bootstrap results should not vary too much Confidence interval test:» Most (e.g., 95%) of bootstrap results should be close to ground truth
47 Expectation Test Δ i mean(ξ * i1,...,ξ * ip) ξ i ξ i (Δ i+1 < Δ i ) (Δ i+1 c 1 ) i =1,..., s Δ c 1 n 1 n 2 n 3 Sample Size sample size
48 Standard Deviation Test σ i stddev(ξ * i1,...,ξ * ip ) (σ i+1 < σ i ) (σ i+1 c 2 ), ξ i i =1,..., s σ Test fails! c 2 n 1 n 2 n 3 Sample Size sample size
49 Confidence Interval Test $ & # j 1,..., p : ξ * ij ξ ( i & % c 3 ) '& ξ i *& p α FracHon α of bootstrap results have rel. error of at most c 3 from ground truth (e.g., α = 0.95, c 3 = 0.5)
50 How Well Does it Work in Practice? Evaluated on 268 real- world Conviva Queries of which 113 had custom User- Defined Functions Diagnostic predicted that 207 (77%) queries can be approximated» False Positives: 3 (conditional UDFs)» False Negatives: 18
51 Bootstrap Challenges How do you know the bootstrap is working?» Depends on distribution, computation, sample size Overhead
52 Very, Very Preliminary Results Setting:» 25 EC2 instances with 4 slots and 15GB RAM» Input: 365mil rows, 204GB on disk, > 600GB in memory (deserialized format)» Workload: query computing 95- th percentile Overheads:» Bootstrap to estimate result s error» Bootstrap diagnosis
53 Query Resp. Time & Overhead Operation Computation complexity I/O complexity Full data C(N) O(N) N data size
54 Query Resp. Time & Overhead Operation Full data Sample Computation complexity C(N) C(n) I/O complexity O(N) O(n) N data size n sample size
55 Query Resp. Time & Overhead Operation Full data Sample Bootstrap Computation complexity C(N) C(n) k C(n) I/O complexity O(N) O(n) k O(n) N data size n sample size k # of samples used by bootstrap
56 Query Resp. Time & Overhead Operation Full data Sample Bootstrap Diagnosis Computation complexity C(N) C(n) k C(n) s i=1 p k C(n i ) I/O complexity O(N) O(n) k O(n) s i=1 p k O(n i ) N data size n sample size n i sample size for iteration i of diagnostic k # of samples used by bootstrap p # of samples used by ground truth s # of iteration used by diagnostic
57 Query Response Time seconds 1020 (0.02%) 10x as response time is dominated by I/O Query Response Time 103 (0.07%) (1.1%) (3.4%) (11%) Fraction of full data
58 Query Resp. Time + Bootstrap seconds Bootstrap overhead Query Response Time Bootstrap overhead up to 10x higher than query itself Fraction of full data
59 Bootstrap Query Plan Query plan Table Scan Table 1 Scan sample Table Bootstrap plan sample Table k Scan Filter Filter Filter Perc. Perc. Perc.
60 Detailed Query Plan Query plan Table Scan Filter Perc. Query SELECT percentile (sesstiontimes, 0.95) FROM Table WHERE dt >= start_day AND dt <= end_day AND customerid = customer1 AND sessiontype= type1
61 Detailed Query Plan Typically, filters remove many rows from input table Table Scan Filter Perc. SELECT percentile (sesstiontimes, 0.95) FROM Table WHERE dt >= start_day AND dt <= end_day AND customerid = customer1 AND sessiontype= type1
62 Filter Pushdown Optimization Table Table Scan Scan Filter Typically, much smaller than Table Filter Table f 1 Table f Table f k Perc. Perc. Bootstrap Perc.
63 Query + Bootstrap seconds (with Filter Pushdown) 202 Up to 2x overall improvement Bootstrap overhead Query Response Time Fraction of full data
64 Query + Bootstrap + Diagnosis seconds (with Filter Pushdown) 532 Diagnosis overhead Bootstrap overhead Query Response Time Diagnosis overhead up to 8x higher than the rest Fraction of full data
65 Overhead Operation Full data Comp. complexity C(N) Comm. complexity O(N) # of tasks O(d) Sample Bootstrap Diagnosis C(n) k C(n) s i=1 p k C(n i ) O(n) k O(n) s i=1 p k O(n i ) O(d) Can generate millions k O(d) of tasks s p k O(d) N data size n sample size n i sample size for iteration i of diagnostic k # of samples used by bootstrap p # of samples used by ground truth s # of iteration used by diagnostic d degree of parallelism
66 Diagnosis Optimization Problem: too many tasks to launch & schedule» Fixed overhead dominates Solution: task consolidation» Consolidate the computation into d tasks, where d is the number of slots in the system» Caveat: currently a hack; not part of BlinkDB codebase Result:» Reduce diagnosis from 330s to 4.5s» Reduce bootstrap overhead by up to 10x
67 Query + Bootstrap + Diagnosis (with Filter Pushdown and Task Consolidation) sec Diagnosis overhead Bootstrap overhead Query Response Time Less than 70% overhead Fraction of full data
68 Summary Computing error bounds for approximate queries on massive data sets: a hard, important, and exciting problem At the intersection between:» ML: error computation» Databases: query plan optimization, sample creation and selection» Systems: improved parallelism, scheduling Preliminary results encouraging
69 Future Work Improve bootstrap diagnostic:» Provable properties for specific settings Improve coverage for error estimation» E.g., use static analysis to decompose programs in multiple Hadamard differentiable components Improve Bootstrap scalability with Bag of Little Bootstraps [Kleiner et al. 2012]:» No need to distribute query computation even for huge samples (e.g., 100 billion 1KB per record) Scheduling concurrent parallel, dependent jobs
Warehouse- Scale Computing and the BDAS Stack
Warehouse- Scale Computing and the BDAS Stack Ion Stoica UC Berkeley UC BERKELEY Overview Workloads Hardware trends and implications in modern datacenters BDAS stack What is Big Data used For? Reports,
More informationShark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker
Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha
More informationDatabase Learning: Toward a Database that Becomes Smarter Over Time
Database Learning: Toward a Database that Becomes Smarter Over Time Yongjoo Park Ahmad Shahab Tajik Michael Cafarella Barzan Mozafari University of Michigan, Ann Arbor Today s databases Database Users
More informationRecord Placement Based on Data Skew Using Solid State Drives
BPOE-5 Record Placement Based on Data Skew Using Solid State Drives Jun Suzuki 1,2, Shivaram Venkataraman 2, Sameer Agarwal 2, Michael Franklin 2, and Ion Stoica 2 1 Green Platform Research Laboratories,
More informationResilient Distributed Datasets
Resilient Distributed Datasets A Fault- Tolerant Abstraction for In- Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin,
More informationLinear Regression Optimization
Gradient Descent Linear Regression Optimization Goal: Find w that minimizes f(w) f(w) = Xw y 2 2 Closed form solution exists Gradient Descent is iterative (Intuition: go downhill!) n w * w Scalar objective:
More informationUniversalizing Approximate Query Processing
Universalizing Approximate Query Processing Yongjoo Park Joseph Sorenson Barzan Mozafari Junhao Wang Universal Approximate Query Processing Universal Approximate Query Processing What is Approximate Query
More informationFast, Interactive, Language-Integrated Cluster Computing
Spark Fast, Interactive, Language-Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica www.spark-project.org
More informationAn Introduction to Parallel Programming
An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe
More informationRecord Placement Based on Data Skew Using Solid State Drives
Record Placement Based on Data Skew Using Solid State Drives Jun Suzuki 1, Shivaram Venkataraman 2, Sameer Agarwal 2, Michael Franklin 2, and Ion Stoica 2 1 Green Platform Research Laboratories, NEC j-suzuki@ax.jp.nec.com
More informationMASSACHUSETTS INSTITUTE OF TECHNOLOGY Database Systems: Fall 2008 Quiz II
Department of Electrical Engineering and Computer Science MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.830 Database Systems: Fall 2008 Quiz II There are 14 questions and 11 pages in this quiz booklet. To receive
More informationIntroduction to Queueing Theory for Computer Scientists
Introduction to Queueing Theory for Computer Scientists Raj Jain Washington University in Saint Louis Jain@eecs.berkeley.edu or Jain@wustl.edu A Mini-Course offered at UC Berkeley, Sept-Oct 2012 These
More informationComputational and Statistical Tradeoffs in VoI-Driven Learning
ARO ARO MURI MURI on on Value-centered Theory for for Adaptive Learning, Inference, Tracking, and and Exploitation Computational and Statistical Tradefs in VoI-Driven Learning Co-PI Michael Jordan University
More informationSinbad. Leveraging Endpoint Flexibility in Data-Intensive Clusters. Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica. UC Berkeley
Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica UC Berkeley Communication is Crucial for Analytics at Scale Performance Facebook analytics
More informationPage 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24
Goals for Today" CS162 Operating Systems and Systems Programming Lecture 24 Capstone: Cloud Computing" Distributed systems Cloud Computing programming paradigms Cloud Computing OS December 2, 2013 Anthony
More informationA Comparative Study of Microsoft Exchange 2010 on Dell PowerEdge R720xd with Exchange 2007 on Dell PowerEdge R510
A Comparative Study of Microsoft Exchange 2010 on Dell PowerEdge R720xd with Exchange 2007 on Dell PowerEdge R510 Incentives for migrating to Exchange 2010 on Dell PowerEdge R720xd Global Solutions Engineering
More informationComputer Memory. Data Structures and Algorithms CSE 373 SP 18 - KASEY CHAMPION 1
Computer Memory Data Structures and Algorithms CSE 373 SP 18 - KASEY CHAMPION 1 Warm Up public int sum1(int n, int m, int[][] table) { int output = 0; for (int i = 0; i < n; i++) { for (int j = 0; j
More informationASAP: Fast, Approximate Graph Pattern Mining at Scale
ASAP: Fast, Approximate Graph Pattern Mining at Scale Anand Padmanabha Iyer, UC Berkeley; Zaoxing Liu and Xin Jin, Johns Hopkins University; Shivaram Venkataraman, Microsoft Research / University of Wisconsin;
More informationHIGH PERFORMANCE SANLESS CLUSTERING THE POWER OF FUSION-IO THE PROTECTION OF SIOS
HIGH PERFORMANCE SANLESS CLUSTERING THE POWER OF FUSION-IO THE PROTECTION OF SIOS Proven Companies and Products Fusion-io Leader in PCIe enterprise flash platforms Accelerates mission-critical applications
More informationAbstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight
ESG Lab Review InterSystems Data Platform: A Unified, Efficient Data Platform for Fast Business Insight Date: April 218 Author: Kerry Dolan, Senior IT Validation Analyst Abstract Enterprise Strategy Group
More informationBuilding High Performance Apps using NoSQL. Swami Sivasubramanian General Manager, AWS NoSQL
Building High Performance Apps using NoSQL Swami Sivasubramanian General Manager, AWS NoSQL Building high performance apps There is a lot to building high performance apps Scalability Performance at high
More informationHANA Performance. Efficient Speed and Scale-out for Real-time BI
HANA Performance Efficient Speed and Scale-out for Real-time BI 1 HANA Performance: Efficient Speed and Scale-out for Real-time BI Introduction SAP HANA enables organizations to optimize their business
More informationDatabase Services at CERN with Oracle 10g RAC and ASM on Commodity HW
Database Services at CERN with Oracle 10g RAC and ASM on Commodity HW UKOUG RAC SIG Meeting London, October 24 th, 2006 Luca Canali, CERN IT CH-1211 LCGenève 23 Outline Oracle at CERN Architecture of CERN
More informationBeyond Online Aggregation: Parallel and Incremental Data Mining with MapReduce Joos-Hendrik Böse*, Artur Andrzejak, Mikael Högqvist
Beyond Online Aggregation: Parallel and Incremental Data Mining with MapReduce Joos-Hendrik Böse*, Artur Andrzejak, Mikael Högqvist *ICSI Berkeley Zuse Institut Berlin 4/26/2010 Joos-Hendrik Boese Slide
More informationScaling Distributed Machine Learning
Scaling Distributed Machine Learning with System and Algorithm Co-design Mu Li Thesis Defense CSD, CMU Feb 2nd, 2017 nx min w f i (w) Distributed systems i=1 Large scale optimization methods Large-scale
More informationCS252 S05. CMSC 411 Computer Systems Architecture Lecture 18 Storage Systems 2. I/O performance measures. I/O performance measures
CMSC 411 Computer Systems Architecture Lecture 18 Storage Systems 2 I/O performance measures I/O performance measures diversity: which I/O devices can connect to the system? capacity: how many I/O devices
More informationOptimizing Parallel Access to the BaBar Database System Using CORBA Servers
SLAC-PUB-9176 September 2001 Optimizing Parallel Access to the BaBar Database System Using CORBA Servers Jacek Becla 1, Igor Gaponenko 2 1 Stanford Linear Accelerator Center Stanford University, Stanford,
More informationHow Eventual is Eventual Consistency?
Probabilistically Bounded Staleness How Eventual is Eventual Consistency? Peter Bailis, Shivaram Venkataraman, Michael J. Franklin, Joseph M. Hellerstein, Ion Stoica (UC Berkeley) BashoChats 002, 28 February
More informationHYRISE In-Memory Storage Engine
HYRISE In-Memory Storage Engine Martin Grund 1, Jens Krueger 1, Philippe Cudre-Mauroux 3, Samuel Madden 2 Alexander Zeier 1, Hasso Plattner 1 1 Hasso-Plattner-Institute, Germany 2 MIT CSAIL, USA 3 University
More informationMapReduce Algorithms
Large-scale data processing on the Cloud Lecture 3 MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation Commons Attribution 3.0 License) Outline
More informationDatabase in the Cloud Benchmark
Database in the Cloud Benchmark Product Profile and Evaluation: Vertica and Redshift By William McKnight and Jake Dolezal July 2016 Sponsored by Table of Contents EXECUTIVE OVERVIEW 3 BENCHMARK SETUP 4
More informationDeveloping MapReduce Programs
Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes
More informationOverview and Introduction to Scientific Visualization. Texas Advanced Computing Center The University of Texas at Austin
Overview and Introduction to Scientific Visualization Texas Advanced Computing Center The University of Texas at Austin Scientific Visualization The purpose of computing is insight not numbers. -- R. W.
More informationRapid growth of massive datasets
Overview Rapid growth of massive datasets E.g., Online activity, Science, Sensor networks Data Distributed Clusters are Pervasive Data Distributed Computing Mature Methods for Common Problems e.g., classification,
More informationVirtual SQL Servers. Actual Performance. 2016
@kleegeek davidklee.net heraflux.com linkedin.com/in/davidaklee Specialties / Focus Areas / Passions: Performance Tuning & Troubleshooting Virtualization Cloud Enablement Infrastructure Architecture Health
More informationECE 669 Parallel Computer Architecture
ECE 669 Parallel Computer Architecture Lecture 9 Workload Evaluation Outline Evaluation of applications is important Simulation of sample data sets provides important information Working sets indicate
More informationSuccinct Data Structures: Theory and Practice
Succinct Data Structures: Theory and Practice March 16, 2012 Succinct Data Structures: Theory and Practice 1/15 Contents 1 Motivation and Context Memory Hierarchy Succinct Data Structures Basics Succinct
More informationPrivApprox. Privacy- Preserving Stream Analytics.
PrivApprox Privacy- Preserving Stream Analytics https://privapprox.github.io Do Le Quoc, Martin Beck, Pramod Bhatotia, Ruichuan Chen, Christof Fetzer, Thorsten Strufe July 2017 Motivation Clients Analysts
More informationRe- op&mizing Data Parallel Compu&ng
Re- op&mizing Data Parallel Compu&ng Sameer Agarwal Srikanth Kandula, Nicolas Bruno, Ming- Chuan Wu, Ion Stoica, Jingren Zhou UC Berkeley A Data Parallel Job can be a collec/on of maps, A Data Parallel
More informationMISO: Souping Up Big Data Query Processing with a Multistore System
MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan, NEC Labs Hakan Hacıgümüş. NEC Labs Junichi Tatemura, NEC Labs Neoklis Polyzotis,
More informationLarge Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System
Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System Seunghwa Kang David A. Bader 1 A Challenge Problem Extracting a subgraph from
More informationActian Vector Benchmarks. Cloud Benchmarking Summary Report
Actian Vector Benchmarks Cloud Benchmarking Summary Report April 2018 The Cloud Database Performance Benchmark Executive Summary The table below shows Actian Vector as evaluated against Amazon Redshift,
More informationMATE-EC2: A Middleware for Processing Data with Amazon Web Services
MATE-EC2: A Middleware for Processing Data with Amazon Web Services Tekin Bicer David Chiu* and Gagan Agrawal Department of Compute Science and Engineering Ohio State University * School of Engineering
More informationSAP NetWeaver BW Performance on IBM i: Comparing SAP BW Aggregates, IBM i DB2 MQTs and SAP BW Accelerator
SAP NetWeaver BW Performance on IBM i: Comparing SAP BW Aggregates, IBM i DB2 MQTs and SAP BW Accelerator By Susan Bestgen IBM i OS Development, SAP on i Introduction The purpose of this paper is to demonstrate
More informationRandom Forests for Big Data
Random Forests for Big Data R. Genuer a, J.-M. Poggi b, C. Tuleau-Malot c, N. Villa-Vialaneix d a Bordeaux University c Nice University b Orsay University d INRA Toulouse October 27, 2017 CNAM, Paris Outline
More informationPredictive Analysis: Evaluation and Experimentation. Heejun Kim
Predictive Analysis: Evaluation and Experimentation Heejun Kim June 19, 2018 Evaluation and Experimentation Evaluation Metrics Cross-Validation Significance Tests Evaluation Predictive analysis: training
More informationAmbry: LinkedIn s Scalable Geo- Distributed Object Store
Ambry: LinkedIn s Scalable Geo- Distributed Object Store Shadi A. Noghabi *, Sriram Subramanian +, Priyesh Narayanan +, Sivabalan Narayanan +, Gopalakrishna Holla +, Mammad Zadeh +, Tianwei Li +, Indranil
More informationA scalable bootstrap for massive data
J. R. Statist. Soc. B (24) 76, Part 4, pp. 795 86 A scalable bootstrap for massive data Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar and Michael I. Jordan University of California, Berkeley, USA [Received
More informationColumn Stores vs. Row Stores How Different Are They Really?
Column Stores vs. Row Stores How Different Are They Really? Daniel J. Abadi (Yale) Samuel R. Madden (MIT) Nabil Hachem (AvantGarde) Presented By : Kanika Nagpal OUTLINE Introduction Motivation Background
More informationHadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here
Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here 2013-11-12 Copyright 2013 Cloudera
More informationApproxJoin: Approximate Distributed Joins
Approximate Distributed Joins Do Le Quoc, Istemi Ekin Akkus, Pramod Bhatotia, Spyros Blanas #, Ruichuan Chen, Christof Fetzer, Thorsten Strufe TU Dresden, Germany Nokia Bell Labs, Germany The University
More informationSTORING DATA: DISK AND FILES
STORING DATA: DISK AND FILES CS 564- Spring 2018 ACKs: Dan Suciu, Jignesh Patel, AnHai Doan WHAT IS THIS LECTURE ABOUT? How does a DBMS store data? disk, SSD, main memory The Buffer manager controls how
More informationChapter 7. GridStor Technology. Adding Data Paths. Data Paths for Global Deduplication. Data Path Properties
Chapter 7 GridStor Technology GridStor technology provides the ability to configure multiple data paths to storage within a storage policy copy. Having multiple data paths enables the administrator to
More informationThe Future of High Performance Computing
The Future of High Performance Computing Randal E. Bryant Carnegie Mellon University http://www.cs.cmu.edu/~bryant Comparing Two Large-Scale Systems Oakridge Titan Google Data Center 2 Monolithic supercomputer
More informationSystems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15
Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2014/15 Lecture X: Parallel Databases Topics Motivation and Goals Architectures Data placement Query processing Load balancing
More informationMicrosoft Exam
Volume: 42 Questions Case Study: 1 Relecloud General Overview Relecloud is a social media company that processes hundreds of millions of social media posts per day and sells advertisements to several hundred
More informationCS 425 / ECE 428 Distributed Systems Fall 2015
CS 425 / ECE 428 Distributed Systems Fall 2015 Indranil Gupta (Indy) Measurement Studies Lecture 23 Nov 10, 2015 Reading: See links on website All Slides IG 1 Motivation We design algorithms, implement
More informationCS Project Report
CS7960 - Project Report Kshitij Sudan kshitij@cs.utah.edu 1 Introduction With the growth in services provided over the Internet, the amount of data processing required has grown tremendously. To satisfy
More informationDAHA AKILLI BĐR DÜNYA ĐÇĐN BĐLGĐ ALTYAPILARIMIZI DEĞĐŞTĐRECEĞĐZ
Information Infrastructure Forum, Istanbul DAHA AKILLI BĐR DÜNYA ĐÇĐN BĐLGĐ ALTYAPILARIMIZI DEĞĐŞTĐRECEĞĐZ 2010 IBM Corporation Information Infrastructure Forum, Istanbul IBM XIV Veri Depolama Sistemleri
More informationGLADE: A Scalable Framework for Efficient Analytics. Florin Rusu University of California, Merced
GLADE: A Scalable Framework for Efficient Analytics Florin Rusu University of California, Merced Motivation and Objective Large scale data processing Map-Reduce is standard technique Targeted to distributed
More informationMetrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates?
Model Evaluation Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to
More informationShark: Hive (SQL) on Spark
Shark: Hive (SQL) on Spark Reynold Xin UC Berkeley AMP Camp Aug 21, 2012 UC BERKELEY SELECT page_name, SUM(page_views) views FROM wikistats GROUP BY page_name ORDER BY views DESC LIMIT 10; Stage 0: Map-Shuffle-Reduce
More informationIngo Brenckmann Jochen Kirsten Storage Technology Strategists SAS EMEA Copyright 2003, SAS Institute Inc. All rights reserved.
Intelligent Storage Results from real life testing Ingo Brenckmann Jochen Kirsten Storage Technology Strategists SAS EMEA SAS Intelligent Storage components! OLAP Server! Scalable Performance Data Server!
More informationDiscretized Streams. An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters
Discretized Streams An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica UC BERKELEY Motivation Many important
More informationLecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)!
Big Data Processing, 2014/15 Lecture 7: MapReduce design patterns!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm
More informationBe Fast, Cheap and in Control with SwitchKV Xiaozhou Li
Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li Raghav Sethi Michael Kaminsky David G. Andersen Michael J. Freedman Goal: fast and cost-effective key-value store Target: cluster-level storage for
More informationAn Introduction to Big Data Formats
Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION
More informationThe Google File System
The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system
More informationApproximate Query Engines
Approximate Query Engines Commercial Challenges and Research Opportunities Barzan Mozafari University of Michigan, Ann Arbor SnappyData Inc. SIGMOD 2017 SnappyData Inc. 2017 1 What Is Approximate Query
More informationThe Right Read Optimization is Actually Write Optimization. Leif Walsh
The Right Read Optimization is Actually Write Optimization Leif Walsh leif@tokutek.com The Right Read Optimization is Write Optimization Situation: I have some data. I want to learn things about the world,
More informationApproximately Uniform Random Sampling in Sensor Networks
Approximately Uniform Random Sampling in Sensor Networks Boulat A. Bash, John W. Byers and Jeffrey Considine Motivation Data aggregation Approximations to COUNT, SUM, AVG, MEDIAN Existing work does not
More informationWas ist dran an einer spezialisierten Data Warehousing platform?
Was ist dran an einer spezialisierten Data Warehousing platform? Hermann Bär Oracle USA Redwood Shores, CA Schlüsselworte Data warehousing, Exadata, specialized hardware proprietary hardware Introduction
More informationNearest Neighbor Classification
Nearest Neighbor Classification Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms January 11, 2017 1 / 48 Outline 1 Administration 2 First learning algorithm: Nearest
More informationImproved Visibility Computation on Massive Grid Terrains
Improved Visibility Computation on Massive Grid Terrains Jeremy Fishman Herman Haverkort Laura Toma Bowdoin College USA Eindhoven University The Netherlands Bowdoin College USA Laura Toma ACM GIS 2009
More informationA Thorough Introduction to 64-Bit Aggregates
Technical Report A Thorough Introduction to 64-Bit Aggregates Shree Reddy, NetApp September 2011 TR-3786 CREATING AND MANAGING LARGER-SIZED AGGREGATES The NetApp Data ONTAP 8.0 operating system operating
More informationImproving the Efficiency of Multi-site Web Search Engines
Improving the Efficiency of Multi-site Web Search Engines Xiao Bai ( xbai@yahoo-inc.com) Yahoo Labs Joint work with Guillem Francès Medina, B. Barla Cambazoglu and Ricardo Baeza-Yates July 15, 2014 Web
More informationMemory-Based Cloud Architectures
Memory-Based Cloud Architectures ( Or: Technical Challenges for OnDemand Business Software) Jan Schaffner Enterprise Platform and Integration Concepts Group Example: Enterprise Benchmarking -) *%'+,#$)
More informationField Update Expanded Deduplication Sizing Guidelines. Oct 2015
Field Update Expanded Deduplication Sizing Guidelines Oct 2015 As part of our regular service pack updates in version 10, we have been making incremental improvements to our media and storage management
More informationCS6453. Data-Intensive Systems: Rachit Agarwal. Technology trends, Emerging challenges & opportuni=es
CS6453 Data-Intensive Systems: Technology trends, Emerging challenges & opportuni=es Rachit Agarwal Slides based on: many many discussions with Ion Stoica, his class, and many industry folks Servers Typical
More informationTop Trends in DBMS & DW
Oracle Top Trends in DBMS & DW Noel Yuhanna Principal Analyst Forrester Research Trend #1: Proliferation of data Data doubles every 18-24 months for critical Apps, for some its every 6 months Terabyte
More informationCIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )
Guide: CIS 601 Graduate Seminar Presented By: Dr. Sunnie S. Chung Dhruv Patel (2652790) Kalpesh Sharma (2660576) Introduction Background Parallel Data Warehouse (PDW) Hive MongoDB Client-side Shared SQL
More informationApril Copyright 2013 Cloudera Inc. All rights reserved.
Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here April 2014 Analytic Workloads on
More informationBuffered Count-Min Sketch on SSD: Theory and Experiments
Buffered Count-Min Sketch on SSD: Theory and Experiments Mayank Goswami, Dzejla Medjedovic, Emina Mekic, and Prashant Pandey ESA 2018, Helsinki, Finland The heavy hitters problem (HH(k)) Given stream of
More informationTime Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix
Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix Carlos Ordonez, Yiqun Zhang Department of Computer Science, University of Houston, USA Abstract. We study the serial and parallel
More informationImplementation of Aggregation of Map and Reduce Function for Performance Improvisation
2016 IJSRSET Volume 2 Issue 5 Print ISSN: 2395-1990 Online ISSN : 2394-4099 Themed Section: Engineering and Technology Implementation of Aggregation of Map and Reduce Function for Performance Improvisation
More informationEfficiency vs. Effectiveness in Terabyte-Scale IR
Efficiency vs. Effectiveness in Terabyte-Scale Information Retrieval Stefan Büttcher Charles L. A. Clarke University of Waterloo, Canada November 17, 2005 1 2 3 4 5 6 What is Wumpus? Multi-user file system
More informationVarys. Efficient Coflow Scheduling. Mosharaf Chowdhury, Yuan Zhong, Ion Stoica. UC Berkeley
Varys Efficient Coflow Scheduling Mosharaf Chowdhury, Yuan Zhong, Ion Stoica UC Berkeley Communication is Crucial Performance Facebook analytics jobs spend 33% of their runtime in communication 1 As in-memory
More informationVOLTDB + HP VERTICA. page
VOLTDB + HP VERTICA ARCHITECTURE FOR FAST AND BIG DATA ARCHITECTURE FOR FAST + BIG DATA FAST DATA Fast Serve Analytics BIG DATA BI Reporting Fast Operational Database Streaming Analytics Columnar Analytics
More informationPF-OLA: A High-Performance Framework for Parallel Online Aggregation
Noname manuscript No. (will be inserted by the editor) PF-OLA: A High-Performance Framework for Parallel Online Aggregation Chengjie Qin Florin Rusu Received: date / Accepted: date Abstract Online aggregation
More informationGoogle File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google fall DIP Heerak lim, Donghun Koo
Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google 2017 fall DIP Heerak lim, Donghun Koo 1 Agenda Introduction Design overview Systems interactions Master operation Fault tolerance
More informationPARDA: Proportional Allocation of Resources for Distributed Storage Access
PARDA: Proportional Allocation of Resources for Distributed Storage Access Ajay Gulati, Irfan Ahmad, Carl Waldspurger Resource Management Team VMware Inc. USENIX FAST 09 Conference February 26, 2009 The
More informationParallel File Systems. John White Lawrence Berkeley National Lab
Parallel File Systems John White Lawrence Berkeley National Lab Topics Defining a File System Our Specific Case for File Systems Parallel File Systems A Survey of Current Parallel File Systems Implementation
More informationProblem 1: Complexity of Update Rules for Logistic Regression
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 16 th, 2014 1
More informationCSE 373: Data Structures and Algorithms. Memory and Locality. Autumn Shrirang (Shri) Mare
CSE 373: Data Structures and Algorithms Memory and Locality Autumn 2018 Shrirang (Shri) Mare shri@cs.washington.edu Thanks to Kasey Champion, Ben Jones, Adam Blank, Michael Lee, Evan McCarty, Robbie Weber,
More informationEfficient Aggregation for Graph Summarization
Efficient Aggregation for Graph Summarization Yuanyuan Tian (University of Michigan) Richard A. Hankins (Nokia Research Center) Jignesh M. Patel (University of Michigan) Motivation Graphs are everywhere
More informationA Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi, 1995)
A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi, 1995) Department of Information, Operations and Management Sciences Stern School of Business, NYU padamopo@stern.nyu.edu
More informationOPTIMAL SERVICE PRICING FOR A CLOUD CACHE
OPTIMAL SERVICE PRICING FOR A CLOUD CACHE ABSTRACT: Cloud applications that offer data management services are emerging. Such clouds support caching of data in order to provide quality query services.
More informationThe Oracle Database Appliance I/O and Performance Architecture
Simple Reliable Affordable The Oracle Database Appliance I/O and Performance Architecture Tammy Bednar, Sr. Principal Product Manager, ODA 1 Copyright 2012, Oracle and/or its affiliates. All rights reserved.
More informationGoogle File System. Arun Sundaram Operating Systems
Arun Sundaram Operating Systems 1 Assumptions GFS built with commodity hardware GFS stores a modest number of large files A few million files, each typically 100MB or larger (Multi-GB files are common)
More informationOne Factor Experiments
One Factor Experiments 20-1 Overview Computation of Effects Estimating Experimental Errors Allocation of Variation ANOVA Table and F-Test Visual Diagnostic Tests Confidence Intervals For Effects Unequal
More information