SciHadoop: Array Based Query Processing in Hadoop
1 SciHadoop: Array Based Query Processing in Hadoop. Joe Buck, Noah Watkins, Jeff LeFevre, Kleoni Ioannidou, Carlos Maltzahn, Neoklis Polyzotis, Scott Brandt
2 Damasc: Data Management in Scientific Computing
3 SciHadoop: Logical query interface; data stored in original file format; MapReduce processing model
4 Background: MapReduce
5-9 MapReduce [Figure: MapReduce execution overview. Input files (split 0 through split 4) are (3) read by map tasks in the map phase; map output is (4) written locally as intermediate files (on local disks); reduce tasks (5) read it remotely in the reduce phase and (6) write the output files (output file 0, output file 1).]
10 Background: Scientific Libraries
11 Scientific Data
12 Scientific Data Access Library
13 MapReduce: One Task. Consider a single task processing data. Two data sets: text and climate data
14 MapReduce: One Task [Figure: MapReduce execution overview, repeated from slide 5.]
15 Processing Text Thou wast born of woman But swords I smile at, weapons laugh to scorn, Brandish'd by man that's of a woman born
16 Processing Text Thou wast born of woman But swords I smile at, weapons laugh to scorn, Brandish'd by man that's of a woman born. Output: Thou, 1 of, 2 scorn,
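The text case on slides 15 and 16 is the classic word count. A minimal sketch in Python follows, assuming a single task, a simple regular-expression tokenizer, and the hypothetical helper names `map_words` and `reduce_counts`; it is an illustration of the map and reduce steps, not the code used in the talk.

```python
import re
from collections import defaultdict

# The quotation shown on slide 15.
TEXT = ("Thou wast born of woman "
        "But swords I smile at, weapons laugh to scorn, "
        "Brandish'd by man that's of a woman born")

def map_words(chunk):
    """Map step: emit a (word, 1) pair for every word in the chunk."""
    for word in re.findall(r"[A-Za-z']+", chunk):
        yield word, 1

def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

counts = reduce_counts(map_words(TEXT))
print(counts["Thou"], counts["of"], counts["scorn"])  # prints: 1 2 1
```

The counts for "Thou" and "of" agree with the output fragment on slide 16. Note that a word means the same thing no matter which split it falls in, which is what makes text easy to cut at arbitrary byte boundaries.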
17 Processing Temps 4, 5, 6, 7, 8,
18-19 Scientific Data [Figure: data set with axes Latitude, Longitude, Time.]
20 Scientific Data Access Library
21 Scientific Data Access Library
22 Scientific Data X Access Library
23 An Issue Arises 4, 5, 6, 7, 8,
24 Solution: Propagate logical coordinates throughout the system
25 Solution [Figure: MapReduce execution overview, repeated from slide 5.]
26-27 Solution. Corner: 0, 1, 0; Shape: 1, 1, 3; Data: 4, 5, 6. Corner: 1, 0, 0; Shape: 1, 1, 3; Data: 7, 8, 9. Output: 6. Output:
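The corner and shape pairs on slides 26 and 27 can be read as a key that travels with the data. The sketch below is one way to model that, assuming a hypothetical `ArraySplit` record and assuming the per-split function is a maximum (this matches the first output of 6; the second output is cut off in the transcript, so the value printed for it comes from this code, not from the slide).

```python
from typing import List, NamedTuple, Tuple

class ArraySplit(NamedTuple):
    """Hypothetical record: a block of array data plus its logical position."""
    corner: Tuple[int, ...]  # logical coordinate of the first element
    shape: Tuple[int, ...]   # extent of the block along each dimension
    data: List[int]          # values in row-major order

def map_max(split):
    """Emit (corner, max) so the result stays tied to its logical position."""
    return split.corner, max(split.data)

splits = [
    ArraySplit(corner=(0, 1, 0), shape=(1, 1, 3), data=[4, 5, 6]),
    ArraySplit(corner=(1, 0, 0), shape=(1, 1, 3), data=[7, 8, 9]),
]
results = [map_max(s) for s in splits]
print(results)  # [((0, 1, 0), 6), ((1, 0, 0), 9)]
```

Because each result carries its corner, later stages can tell where in the array a value came from without re-reading the file.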
28 Recap: Mismatch between MapReduce and access libraries; logical coordinates are key
29-30 Experiments. Optimizations: explore three methods for creating splits; reduce unnecessary reads; Holistic Combiner; Query-Aware Partitioning
31-32 Experimental Data
33 Naive Partitioning: Round-robin placement over all the blocks that constitute the input
34-36 Naive Partitioning [Figure: blocks of the input file, shown alongside other file data, assigned to NODE 0 through NODE 7.]
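Round-robin placement as described on slide 33 can be sketched in a few lines. This assumes blocks are numbered from 0 and that block i simply goes to node i modulo the node count; the function name and the block and node counts are illustrative, not taken from the talk.

```python
def round_robin_placement(num_blocks, num_nodes):
    """Assign block i to node (i mod num_nodes), ignoring where the data lives."""
    placement = {node: [] for node in range(num_nodes)}
    for block in range(num_blocks):
        placement[block % num_nodes].append(block)
    return placement

# Illustrative sizes only: 16 input blocks spread over 8 nodes.
placement = round_robin_placement(num_blocks=16, num_nodes=8)
for node, blocks in placement.items():
    print(f"NODE {node}: blocks {blocks}")
```

The assignment balances block counts but takes no account of which node stores a block, which is why it serves as the baseline in the results table.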
37 Results
Columns: Test Name; Local Read (%); Temp Data (GB); CPU Util (%); Run Time (Min); Time (%). Most cell values did not survive transcription.
First 4 use no Holistic Combiner: Baseline 9.3 2, | Baseline +NoScan 9.2 2, | ChkGroup 80 2, | PhysToLog 88 2,
Next 4 use Holistic Combiner with Baseline: Baseline Baseline NoScan NoScan +HaPart HaPart
Next 3 use Holistic Combiner with Local-Read Optimizations: ChkGroup +HaPart NoScan ChkGroup +NoScan PhysToLog +NoScan
38-39 Chunking & Grouping [Figure: blocks of the input file, shown alongside other file data, assigned to NODE 0 through NODE 7, with runs of consecutive blocks placed on the same node (NODE0 NODE0 NODE0, NODE1 NODE1, ...).]
40 Results (same table as slide 37)
41 Physical to Logical [Figure: block-to-node assignment for NODE 0, NODE 1, ...]
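The slide gives only the title, so the following is a sketch of what translating a physical byte range into logical coordinates could look like, not the method used in SciHadoop. It assumes a row-major layout, fixed-size elements, and no file header; the array shape (2, 2, 3) and the 4-byte element size are assumptions chosen because they are consistent with the corners shown on slide 26.

```python
def offset_to_coord(offset, shape):
    """Convert a row-major element offset into a logical coordinate."""
    coord = []
    for extent in reversed(shape):
        offset, index = divmod(offset, extent)
        coord.append(index)
    return tuple(reversed(coord))

def byte_range_to_coords(start_byte, end_byte, shape, element_size):
    """Logical coordinates of the first and last elements that lie
    entirely inside the byte range [start_byte, end_byte)."""
    first = -(-start_byte // element_size)  # round up to a whole element
    last = end_byte // element_size - 1     # last element ending in range
    if last < first:
        return None  # the range contains no complete element
    return offset_to_coord(first, shape), offset_to_coord(last, shape)

# Assumed layout: a 2 x 2 x 3 array of 4-byte values.
print(byte_range_to_coords(12, 24, shape=(2, 2, 3), element_size=4))
# ((0, 1, 0), (0, 1, 2))
```

Under these assumptions the byte range 12 to 24 covers the three elements starting at corner (0, 1, 0), the same corner and shape (1, 1, 3) as the first block on slide 26.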
42 Results (same table as slide 37)
43 Experiments. Optimizations: explore three methods for creating splits; reduce unnecessary reads; Holistic Combiner; Query-Aware Partitioning
44-45 No Scan [Figure: block-to-node assignment for NODE 0, NODE 1, ...]
46 Results (same table as slide 37)
47 Experiments. Optimizations: explore three methods for creating splits; reduce unnecessary reads; Holistic Combiner; Query-Aware Partitioning
48-49 Combiner [Figure: each node runs Filter / Map, then Combine, then Reduce; function: Max. The map stages emit the values 3, 6 and 8; on slide 49 the combiners emit 3 and 8.]
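A combiner for Max can be sketched as below, using the values from slides 48 and 49. The assignment of 3 to node 1 and of 6 and 8 to node 2 is read off the flattened figure and is an assumption, as are the function names.

```python
def combine_max(values):
    """Combiner: collapse one node's map outputs into a single partial max."""
    return max(values)

def reduce_max(partials):
    """Reducer: the max of the partial maxima is the overall max."""
    return max(partials)

# Map outputs per node, as read from the slide (assumed grouping).
node_outputs = {"node 1": [3], "node 2": [6, 8]}

partials = {node: combine_max(vals) for node, vals in node_outputs.items()}
result = reduce_max(partials.values())
print(partials, result)  # {'node 1': 3, 'node 2': 8} 8
```

Max can be combined this way because the maximum of partial maxima equals the maximum of all values, so each node ships one number to the reducer instead of all of its map output.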
50-51 Holistic Combiner [Figure: node 1, node 2 and node 3 each run Filter / Map, followed by Reduce; function: Median. On slide 51 a Combiner stage is added after each Filter / Map, and one combiner emits the value 1.]
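Median is a holistic function: partial medians cannot be merged into the true median. The sketch below first shows that, then shows one way a combiner could still help, under the assumption (not stated on the slides) that the combiner can tell from the logical coordinates how many values a key should have. The `expected_count` parameter and the `holistic_combine` name are hypothetical.

```python
from statistics import median

# Illustrative values for one key, spread over three nodes.
groups = [[1, 9, 10], [2, 3, 4], [5, 6, 7]]
all_values = [v for g in groups for v in g]

median_of_medians = median(median(g) for g in groups)
true_median = median(all_values)
print(median_of_medians, true_median)  # 6 5

def holistic_combine(values, expected_count):
    """Apply the median only when every value for the key is present;
    otherwise forward the raw values to the reducer unchanged."""
    if len(values) == expected_count:
        return ("final", median(values))
    return ("partial", list(values))

print(holistic_combine([4, 5, 6], expected_count=3))  # ('final', 5)
print(holistic_combine([4, 5], expected_count=3))     # ('partial', [4, 5])
```

When a combiner holds the complete input for a key it can emit the final answer locally; otherwise the values must still travel to the reducer, so the benefit depends on how the input was split.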
52-55 Results (same table as slide 37)
56 Experiments. Optimizations: explore three methods for creating splits; reduce unnecessary reads; Holistic Combiner; Query-Aware Partitioning
57-58 Query-Aware Partitioning [Figure: Filter / Map, Combiner and Reduce stages; function: Median. Slide 57 shows node 1 through node 3 with one combiner output of 1; slide 58 shows node 1 and node 2 with combiner outputs of 1 and 6.]
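The slides do not spell out how the partitioning takes the query into account. One plausible reading, offered here only as an assumption, is that split boundaries are moved so that all input values for a given output key land in the same split, which lets a holistic combiner finish the key locally. The one-dimensional sketch below snaps proposed boundaries down to multiples of a group size; the function name, the group size, and the sample boundaries are all hypothetical.

```python
def aligned_boundaries(total, proposed, group):
    """Snap each proposed split boundary down to a multiple of `group`
    and return the resulting (start, end) splits over [0, total)."""
    snapped = sorted({(b // group) * group for b in proposed if 0 < b < total})
    snapped = [b for b in snapped if b > 0]  # drop boundaries that collapse to 0
    edges = [0] + snapped + [total]
    return list(zip(edges[:-1], edges[1:]))

# 12 elements, each output key covers 3 consecutive inputs, and the
# storage layer proposes cuts at elements 5 and 10.
splits = aligned_boundaries(total=12, proposed=[5, 10], group=3)
print(splits)  # [(0, 3), (3, 9), (9, 12)]
```

Every split now starts and ends on a group boundary, so no key's inputs straddle two splits; the cost is that splits become uneven and may no longer match the physical block boundaries.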
59-60 Results (same table as slide 37)
61 SciHadoop: Provides logical query interface; in-situ processing over native data; exploits convenient data parallelism
62 SC 11 Thursday, Nov 17th 1:30-2 pm Room TCC
63 Future Work: Integrate structural knowledge into Hadoop proper; produce partial, complete results early; alternative resiliency models; generalize existing niche performance enhancements for scientific data
64 Future Work: Come to my poster for details
65 Collaborators
66 Thank You
ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models RCFile: A Fast and Space-efficient Data
More informationHadoop Map Reduce 10/17/2018 1
Hadoop Map Reduce 10/17/2018 1 MapReduce 2-in-1 A programming paradigm A query execution engine A kind of functional programming We focus on the MapReduce execution engine of Hadoop through YARN 10/17/2018
More informationShark: SQL and Rich Analytics at Scale. Yash Thakkar ( ) Deeksha Singh ( )
Shark: SQL and Rich Analytics at Scale Yash Thakkar (2642764) Deeksha Singh (2641679) RDDs as foundation for relational processing in Shark: Resilient Distributed Datasets (RDDs): RDDs can be written at
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 2: MapReduce Algorithm Design (2/2) January 14, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo
More informationLarge Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System
Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System Seunghwa Kang David A. Bader 1 A Challenge Problem Extracting a subgraph from
More informationDeclStore: Layering is for the Faint of Heart
DeclStore: Layering is for the Faint of Heart [HotStorage, July 2017] Noah Watkins, Michael Sevilla, Ivo Jimenez, Kathryn Dahlgren, Peter Alvaro, Shel Finkelstein, Carlos Maltzahn Layers on layers on layers
More informationDatabase Applications (15-415)
Database Applications (15-415) DBMS Internals- Part V Lecture 15, March 15, 2015 Mohammad Hammoud Today Last Session: DBMS Internals- Part IV Tree-based (i.e., B+ Tree) and Hash-based (i.e., Extendible
More informationChapter 18: Parallel Databases
Chapter 18: Parallel Databases Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery
More informationChapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction
Chapter 18: Parallel Databases Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery Parallelism Intraoperation Parallelism Interoperation Parallelism Design of
More informationInformation Retrieval and Organisation
Information Retrieval and Organisation Dell Zhang Birkbeck, University of London 2015/16 IR Chapter 04 Index Construction Hardware In this chapter we will look at how to construct an inverted index Many
More informationTuning the Hive Engine for Big Data Management
Tuning the Hive Engine for Big Data Management Copyright Informatica LLC 2017. Informatica, the Informatica logo, Big Data Management, PowerCenter, and PowerExchange are trademarks or registered trademarks
More informationWhere We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344
Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)
More informationApril Copyright 2013 Cloudera Inc. All rights reserved.
Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here April 2014 Analytic Workloads on
More informationMemory Management. Memory Management Requirements
Memory Management Subdividing memory to accommodate multiple processes Memory needs to be allocated to ensure a reasonable supply of ready processes to consume available processor time 1 Memory Management
More informationarxiv: v1 [cs.dc] 8 May 2018
Parallel Computation of PDFs on Big Spatial Data Using Spark Ji Liu 1, Noel Moreno Lemus 2, Esther Pacitti 1, Fabio Porto 2, and Patrick Valduriez 1 1 Inria and LIRMM, Univ. of Montpelier, France 2 LNCC
More informationGLADE: A Scalable Framework for Efficient Analytics
DE: A Scalable Framework for Efficient Analytics Florin Rusu University of California, Merced 52 N Lake Road Merced, CA 95343 frusu@ucmerced.edu Alin Dobra University of Florida PO Box 11612 Gainesville,
More informationParallel, In Situ Indexing for Data-intensive Computing. Introduction
FastQuery - LDAV /24/ Parallel, In Situ Indexing for Data-intensive Computing October 24, 2 Jinoh Kim, Hasan Abbasi, Luis Chacon, Ciprian Docan, Scott Klasky, Qing Liu, Norbert Podhorszki, Arie Shoshani,
More information