7 Things To Know When Buying Shoes for an Elephant! Alekh Jindal, Jorge Quiané, Jens Dittrich
Transcription
1 7 Things To Know When Buying Shoes for an Elephant! Alekh Jindal, Jorge Quiané, Jens Dittrich
2 1 What Shoes? Why Shoes?
3 3 Analyzing MR Jobs (HadoopToSQL, Manimal); Generating MR Jobs (PigLatin, Hive); Executing MR Jobs (Hadoop++, epiC); Data Layouts & Access Paths!!
4 2 Why Elephant Needs Different Shoes?
5 5 Very Large Scale Storage & Execution: DBMS vs. MapReduce
6 6 Large Data Block Sizes: DBMS 8 KB; MapReduce 1 GB
7 7 Block Level Data Replication: DBMS vs. MapReduce. The slide shows the same ten-row example relation twice:
001 alex bsc
002 tim msc
003 mat bsc
004 joel bsc
005 phil msc
006 ron msc
007 neo bsc
008 jack msc
009 jens bsc
010 tom msc
8 3 What's Wrong with Old Shoes?
9 Current Data Layouts in Hadoop: Row (default), Column*, PAX**. Example relation: 001 alex bsc; 002 tim msc; 003 mat bsc; 004 joel bsc; 005 phil msc; 006 ron msc; 007 neo bsc; 008 jack msc; 009 jens bsc; 010 tom msc
* A. Floratou et al. Column-Oriented Storage Techniques for MapReduce. PVLDB, April, 2011
** Y. He et al. RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems. ICDE,
10 10 Current Data Layouts in Hadoop. Row, Column and PAX compared on: Non-required Reads, Network Costs, Data Block Placement, Tuple Reconstruction
11 10 Current Data Layouts in Hadoop. Row, Column and PAX compared on: Non-required Reads, Network Costs, Data Block Placement, Tuple Reconstruction. [Figure: Data Access Cost [sec] vs. Number of Referenced Attributes (Out of 30), for Trojan Layout, Row Layout, Column Layout, PAX Layout and Optimal Layout]
12 4 What Shoes do We Propose?
13 12 Trojan Data Layouts: Replica 1, Replica 2, Replica 3
14 13 Trojan Data Layouts. Row, Column, PAX and Trojan compared on: Non-required Reads, Network Costs, Data Block Placement, Tuple Reconstruction
15 14 Challenges in Trojan Data Layouts: How do we design a shoe for one leg? How do we design shoes for all legs? How do we make the shoes from the design?
16 5 How Do We Design the Shoes?
17 16 Single Replica: Columns -> Column groups -> Filter (Novel Column Group Interestingness) -> Interesting column groups -> Pack (Column Group Packing as 0-1 Knapsack) -> Complete & disjoint column groups
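The packing step on this slide can be illustrated with a small sketch. It is not the authors' algorithm: it simply brute-forces the 0-1 choice (each interesting column group is either in the layout or not) and keeps the highest-scoring selection that is complete and disjoint. The function name `pack_column_groups`, the toy columns and the integer interestingness scores are all assumptions made for illustration.

```python
from itertools import combinations


def pack_column_groups(columns, scored_groups):
    """Pick disjoint column groups that cover all columns with maximal score.

    scored_groups maps a frozenset of column names to a (hypothetical)
    interestingness score. Exhaustive search, so only for tiny inputs.
    """
    all_columns = frozenset(columns)
    groups = list(scored_groups)
    best_pack, best_score = None, float("-inf")
    # 0-1 decision per group: try every subset of the candidate groups.
    for size in range(1, len(groups) + 1):
        for pack in combinations(groups, size):
            covered = frozenset().union(*pack)
            # Disjoint: no column is counted twice across the chosen groups.
            disjoint = sum(len(g) for g in pack) == len(covered)
            # Complete: every column of the relation is covered.
            if disjoint and covered == all_columns:
                score = sum(scored_groups[g] for g in pack)
                if score > best_score:
                    best_pack, best_score = pack, score
    return best_pack, best_score


# Toy relation with three columns and made-up scores.
columns = ["a", "b", "c"]
scored = {
    frozenset({"a", "b"}): 9,
    frozenset({"b", "c"}): 5,
    frozenset({"a"}): 3,
    frozenset({"b"}): 3,
    frozenset({"c"}): 2,
}
best_pack, best_score = pack_column_groups(columns, scored)
```

In this toy input the selection {a, b} plus {c} wins with a total score of 11. A real implementation would need a proper knapsack formulation rather than subset enumeration, which grows exponentially with the number of candidate groups.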
18 17 Multiple Replicas: Queries -> Query groups -> Filter -> Interesting query groups -> Pack -> Complete & disjoint query groups
19 18 Multiple Replicas: Filter and Pack the queries, then for each of Replica 1, Replica 2 and Replica 3: Columns -> Column groups -> Filter -> Interesting column groups -> Pack -> Complete & disjoint column groups
20 19 Multiple Replicas, example on TPC-H Customer. Queries Q1, Q2, Q3, Q4, Q5, Q6, Q7, Q8 -> Filter -> Pack -> query groups {Q2, Q3, Q4}, {Q5}, {Q1, Q6, Q7, Q8} for Replica 1, Replica 2 and Replica 3. For each replica: Columns -> Column groups -> Filter -> Interesting column groups -> Pack -> Complete & disjoint column groups. Resulting column groups, one layout per replica:
- {Custkey, Nationkey}, {Name, Address, Phone, AcctBal, Mktsegment, Comment}
- {Mktsegment}, {Custkey, Name, Address, Nationkey, Phone, AcctBal, Comment}
- {Name}, {Custkey}, {Mktsegment}, {Phone, AcctBal}, {Address, Nationkey, Comment}
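The "complete & disjoint" requirement on this slide can be checked mechanically. The sketch below uses the eight TPC-H Customer column names that appear on the slide; the three layouts are one reading of the flattened transcription, and the helper name `is_complete_and_disjoint` is an assumption for illustration.

```python
CUSTOMER_COLUMNS = {
    "Custkey", "Name", "Address", "Nationkey",
    "Phone", "AcctBal", "Mktsegment", "Comment",
}


def is_complete_and_disjoint(column_groups, columns):
    """True if every column appears in exactly one of the column groups."""
    seen = set()
    for group in column_groups:
        group = set(group)
        if seen & group:  # a column already placed in an earlier group
            return False
        seen |= group
    return seen == set(columns)


# Layouts as read from the slide (assumed grouping of the flattened text).
layouts = [
    [{"Custkey", "Nationkey"},
     {"Name", "Address", "Phone", "AcctBal", "Mktsegment", "Comment"}],
    [{"Mktsegment"},
     {"Custkey", "Name", "Address", "Nationkey", "Phone", "AcctBal", "Comment"}],
    [{"Name"}, {"Custkey"}, {"Mktsegment"}, {"Phone", "AcctBal"},
     {"Address", "Nationkey", "Comment"}],
]
results = [is_complete_and_disjoint(layout, CUSTOMER_COLUMNS) for layout in layouts]
```

Under this reading, each of the three layouts partitions all eight columns, which is what makes any single replica sufficient to answer a query on its own.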
21 20 Trojan Layout Advantages
- Multiple layouts for a given workload
- Default row layout still available
- Specialized replicas for different query sub-classes
- Divide and conquer layout computation
22 6 How do We Ride the Elephant?
23 22 Putting It All Together
- Load: Create trojan layout configuration file in HDFS (dataset: layout-1, layout-2, layout-3)
- Query: Supply referenced attributes in JobConf; itemize UDF to transparently read the referenced attributes
- Schedule? Three Optimization Options: data locality (default), best layout, best layout & locality
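The three scheduling options can be sketched as a replica-selection function. The slide does not spell out their semantics, so the behaviour below is one plausible interpretation, not the authors' scheduler: the `Replica` class, the per-replica `cost` estimate, the policy strings and `choose_replica` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    node: str    # node that stores this replica of the block
    cost: float  # assumed estimate of the read cost of its layout for the query


def choose_replica(replicas, free_nodes, policy="locality"):
    """Pick the replica a map task should read, under an assumed policy."""
    local = [r for r in replicas if r.node in free_nodes]
    best = min(replicas, key=lambda r: r.cost)
    if policy == "locality":
        # Default: prefer any replica on a free node, ignoring its layout.
        return local[0] if local else replicas[0]
    if policy == "best-layout":
        # Always read the cheapest layout, even if that means a remote read.
        return best
    if policy == "best-layout-and-locality":
        # Cheapest layout among local replicas; fall back to the global best.
        return min(local, key=lambda r: r.cost) if local else best
    raise ValueError(f"unknown policy: {policy}")


replicas = [Replica("n1", 5.0), Replica("n2", 1.0), Replica("n3", 3.0)]
free_nodes = {"n1", "n3"}
picked = {
    policy: choose_replica(replicas, free_nodes, policy).node
    for policy in ("locality", "best-layout", "best-layout-and-locality")
}
```

With these made-up costs, the locality policy reads from n1, the best-layout policy reads from n2 although it is not a free node, and the combined policy settles on n3.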
24 7 How were the Field Trials?
25 24 Setup
- Datasets: TPC-H Lineitem, TPC-H Customer, SSB LineOrder, SDSS PhotoObj
- Queries: First 8 queries from the respective benchmark for each table
- Methodology: focus on scan and projection operators, i.e. map-phase-only jobs; improvement: record reader time (I/O and tuple reconstruction)
- Hardware: 50 virtual nodes in a 10 node cluster
26 25 Per-replica Trojan Layout Performance. [Figure: TPC-H Lineitem; Improvement Factor over Hadoop-Row and over Hadoop-PAX for TPC-H Queries Q1 to Q8; panel (b) TPC-H]
27 Layout Quality: #Non-required Attributes Read and #Joins in Tuple Reconstruction, compared for HADOOP-ROW, HADOOP-PAX, HYRISE* Layout and Trojan Layout. >14% improvement over HYRISE
* M. Grund et al. HYRISE - A Main Memory Hybrid Storage Engine. PVLDB, November,
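The two quality measures named on this slide can be counted for any column-group layout and query. The slide does not define them precisely, so the sketch assumes a simple reading: a query reads every column group it touches in full, and reconstructing tuples from k touched groups needs k - 1 joins. The function name `layout_quality` and the example query are assumptions.

```python
def layout_quality(column_groups, referenced):
    """Return (#non-required attributes read, #joins in tuple reconstruction)."""
    referenced = set(referenced)
    # Only groups that hold at least one referenced attribute are read.
    touched = [set(g) for g in column_groups if referenced & set(g)]
    non_required = sum(len(g - referenced) for g in touched)
    joins = max(len(touched) - 1, 0)
    return non_required, joins


columns = ["Custkey", "Name", "Address", "Nationkey",
           "Phone", "AcctBal", "Mktsegment", "Comment"]
row_layout = [set(columns)]               # one group holding all columns
column_layout = [{c} for c in columns]    # one group per column
query = {"Custkey", "Mktsegment"}         # hypothetical referenced attributes

row_quality = layout_quality(row_layout, query)
column_quality = layout_quality(column_layout, query)
```

For this query the row layout reads six non-required attributes with no joins, while the column layout reads none but needs one join, which is the trade-off that per-replica column groups try to balance.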
28 27 Scheduling Decisions. [Figure: TPC-H Lineitem, TPC-H Queries Q1 to Q8; Scheduling Penalty for Best-Layout & Locality, Best-Layout and Locality (default)]
29 28 Summary
- Data layouts crucial to MR job performance
- Exploit default data block replication in MR
- Novel algorithm to compute per-replica layouts
- Improvement: 4.8x over Row, 3.5x over PAX
- Better than HYRISE; 14% improvement