Interruptible Tasks: Treating Memory Pressure as Interrupts for Highly Scalable Data-Parallel Programs

Size: px
Start display at page:

Download "Interruptible Tasks: Treating Memory Pressure as Interrupts for Highly Scalable Data-Parallel Programs"

Transcription

1 Interruptible s: Treating Pressure as Interrupts for Highly Scalable Data-Parallel Programs Lu Fang 1, Khanh Nguyen 1, Guoqing(Harry) Xu 1, Brian Demsky 1, Shan Lu 2 1 University of California, Irvine 2 University of Chicago SOSP 15, October 7, 2015, Monterey, California, USA Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

2 Motivation Data-parallel system Input data are divided into independent partitions Many popular big data systems Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

3 Motivation Data-parallel system Input data are divided into independent partitions Many popular big data systems pressure on single nodes Our study Search out of memory and data parallel in StackOverflow We have collected 126 related problems Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

4 Pressure in the Real World pressure on individual nodes Executions push heap limit (using managed language) Data-parallel systems struggle for memory consumption OutOfError point Long and useless GC Heap size Execution time Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

5 Pressure in the Real World pressure on individual nodes Executions push heap limit (using managed language) Data-parallel systems struggle for memory consumption OutOfError point Long and useless GC Heap size Execution time CRASH OutOf Error Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

6 Pressure in the Real World pressure on individual nodes Executions push heap limit (using managed language) Data-parallel systems struggle for memory consumption OutOfError point Long and useless GC Heap size Execution time CRASH OutOf Error SLOW Huge GC effort Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

7 Root Cause 1: Hot Keys Key-value pairs Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

8 Root Cause 1: Hot Keys Key-value pairs Popular keys have many associated values Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

9 Root Cause 1: Hot Keys Key-value pairs Popular keys have many associated values Case study (from StackOverflow) Process StackOverflow posts Long and popular posts Many tasks process long and popular posts Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

10 Root Cause 2: Large Intermediate Results Temporary data structures Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

11 Root Cause 2: Large Intermediate Results Temporary data structures Case study (from StackOverflow) Use NLP library to process customers review Some reviews are quite long NLP library creates giant temporary data structures for long reviews Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

12 Existing Solutions More memory? Not really! Data double in size every two years, [ double in size every three years, [ Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

13 Existing Solutions More memory? Not really! Data double in size every two years, [ double in size every three years, [ Application-level solutions Configuration tuning Skew fixing Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

14 Existing Solutions More memory? Not really! Data double in size every two years, [ double in size every three years, [ Application-level solutions Configuration tuning Skew fixing System-level solutions Cluster-wide resource manager, such as YARN Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

15 Existing Solutions More memory? Not really! Data double in size every two years, [ double in size every three years, [ Application-level solutions Configuration tuning Skew fixing System-level solutions Cluster-wide resource manager, such as YARN We need a systematic and effective solution! Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

16 Our Solution Interruptible : treat memory pressure as interrupt Dynamically change parallelism degree Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

17 Why Does Our Technique Help consumption Program starts with multiple tasks Heap size Execution time Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

18 Why Does Our Technique Help consumption Program pushes heap limit Heap size Execution time Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

19 Why Does Our Technique Help consumption Long and useless GC Heap size Execution time Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

20 Why Does Our Technique Help consumption OutOf Error Heap size Execution time Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

21 Why Does Our Technique Help consumption Long and useless GCs are detected Heap size Execution time Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

22 Why Does Our Technique Help consumption Long and useless GCs are detected, start interrupting tasks Heap size Execution time Killed Killed Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

23 Why Does Our Technique Help consumption Release the memory, memory pressure is gone Heap size Execution time Killed Killed Local Data Structures Processed Input Unprocessed Input Output Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

24 Why Does Our Technique Help consumption Release the memory, memory pressure is gone Heap size Execution time Killed Killed Local Data Structures Released Processed Input Unprocessed Input Output Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

25 Why Does Our Technique Help consumption Release the memory, memory pressure is gone Heap size Execution time Killed Killed Local Data Structures Released Processed Input Released Unprocessed Input Output Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

26 Why Does Our Technique Help consumption Release the memory, memory pressure is gone Heap size Execution time Killed Killed Local Data Structures Released Processed Input Released Unprocessed Input Kept in memory, can be serialized Output Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

27 Why Does Our Technique Help consumption Release the memory, memory pressure is gone Heap size Execution time Killed Killed Local Data Structures Released Processed Input Released Unprocessed Input Kept in memory, can be serialized Output Final result: push out and released Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

28 Why Does Our Technique Help consumption Release the memory, memory pressure is gone Heap size Execution time Killed Killed Local Data Structures Released Processed Input Released Unprocessed Input Kept in memory, can be serialized Output Final result: push out and released Intermediate result: kept in memory, can be serialized Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

29 Why Does Our Technique Help consumption Program executes without memory pressure Heap size Execution time Killed Killed Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

30 Why Does Our Technique Help consumption If there is enough memory, increase parallelism degree Heap size Execution time Killed Killed Newly created Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

31 Challenges How to expose semantics How to interrupt/reactivate tasks Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

32 Challenges How to expose semantics a programming model How to interrupt/reactivate tasks Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

33 Challenges How to expose semantics a programming model How to interrupt/reactivate tasks a runtime system Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

34 Challenges How to expose semantics a programming model How to interrupt/reactivate tasks a runtime system Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

35 The Programming Model A unified representation of input/output Separate processed and unprocessed input Specify how to serialize and deserialize Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

36 The Programming Model A unified representation of input/output Separate processed and unprocessed input Specify how to serialize and deserialize A definition of an interruptible task Safely interrupt tasks Specify the actions when interrupt happens Merge the intermediate results Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

37 Representing Input/Output as DataPartitions How to separate processed and unprocessed input How to serialize and deserialize the data DataPartition Abstract Class // The DataPartition abstract class abstract class DataPartition { // Some fields and methods... // A cursor points to the first // unprocessed tuple int cursor; // Serialize the DataPartition abstract void serialize(); // Deserialize the DataPartition abstract DataPartition deserialize(); } Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

38 Representing Input/Output as DataPartitions How to separate processed and unprocessed input How to serialize and deserialize the data DataPartition Abstract Class 1 A cursor points to the first unprocessed tuple // The DataPartition abstract class abstract class DataPartition { // Some fields and methods... // A cursor points to the first // unprocessed tuple int cursor; // Serialize the DataPartition abstract void serialize(); // Deserialize the DataPartition abstract DataPartition deserialize(); } Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

39 Representing Input/Output as DataPartitions How to separate processed and unprocessed input How to serialize and deserialize the data DataPartition Abstract Class 1 A cursor points to the first unprocessed tuple 2 Users implement serialize and deserialize methods // The DataPartition abstract class abstract class DataPartition { // Some fields and methods... // A cursor points to the first // unprocessed tuple int cursor; // Serialize the DataPartition abstract void serialize(); // Deserialize the DataPartition abstract DataPartition deserialize(); } Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

40 Defining an I What actions should be taken when interrupt happens How to safely interrupt a task I Abstract Class // The I interface in the library abstract class I { // Some methods... abstract void interrupt(); boolean scaleloop(datapartition dp) { // Iterate dp, and process each tuple while (dp.hasnext()) { // If pressure occurs, interrupt if (HasPressure()) { interrupt(); return false; } process(); } } } Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

41 Defining an I What actions should be taken when interrupt happens How to safely interrupt a task I Abstract Class 1 In interrupt, we define how to deal with partial results // The I interface in the library abstract class I { // Some methods... abstract void interrupt(); boolean scaleloop(datapartition dp) { // Iterate dp, and process each tuple while (dp.hasnext()) { // If pressure occurs, interrupt if (HasPressure()) { interrupt(); return false; } process(); } } } Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

42 Defining an I What actions should be taken when interrupt happens How to safely interrupt a task I Abstract Class 1 In interrupt, we define how to deal with partial results 2 s are always interrupted at the beginning in the scaleloop // The I interface in the library abstract class I { // Some methods... abstract void interrupt(); boolean scaleloop(datapartition dp) { // Iterate dp, and process each tuple while (dp.hasnext()) { // If pressure occurs, interrupt if (HasPressure()) { interrupt(); return false; } process(); } } } Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

43 Multiple Input for an I How to merge intermediate results MI Abstract Class // The MI interface in the library abstract class MI extends I{ // Most parts are the same as I... // Only difference boolean scaleloop( PartitionIterator<DataPartition> i) { // Iterate partitions through iterator while (i.hasnext()) { DataPartition dp = (DataPartition) i.next(); // Iterate all the data tuples in this partition... } return true; } } Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

44 Multiple Input for an I How to merge intermediate results 1 scaleloop takes a PartitionIterator as input MI Abstract Class // The MI interface in the library abstract class MI extends I{ // Most parts are the same as I... // Only difference boolean scaleloop( PartitionIterator<DataPartition> i) { // Iterate partitions through iterator while (i.hasnext()) { DataPartition dp = (DataPartition) i.next(); // Iterate all the data tuples in this partition... } return true; } } Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

45 I WordCount on Hyracks HDFS Map Operator Map Operator Map Operator... Final Reduce Operator Final Final Shuffling Reduce Operator... Reduce Operator MapOperator class MapOperator extends I implements HyracksOperator { void interrupt() { // Push out final // results to shuffling... } // Some other fields and methods... } n n Merge Operator Merge Operator HDFS Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

46 I WordCount on Hyracks HDFS 1 Map Operator Map Operator Map Operator Final Final Final Shuffling Reduce Operator Reduce Operator... Reduce Operator 1 1 n n... ReduceOperator class ReduceOperator extends I implements HyracksOperator { void interrupt() { // Tag the results; // Output as intermediate // results... } // Some other fields and methods... } Merge Operator Merge Operator HDFS Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

47 I WordCount on Hyracks HDFS Map Operator Map Operator Map Operator... Final Reduce Operator Final Final Shuffling Reduce Operator... Reduce Operator MergeOperator class Merge extends MI { void interrupt() { // Tag the results; // Output as intermediate // results } // Some other fields and methods... } n n Merge Operator Merge Operator HDFS Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

48 Challenges How to expose semantics a programming model How to interrupt/activate tasks a runtime system Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

49 I Runtime System Scheduler Partition Manager Data Partition Data Partition Monitor Data Partition Data Partition I Runtime System Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

50 I Runtime System Scheduler Partition Manager Data Partition Grow/Reduce Data Partition Check Monitor Reduce Data Partition Data Partition I Runtime System Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

51 I Runtime System Is Disk Input/Output Serialize/Deserialize Scheduler Partition Manager Data Partition Grow/Reduce Data Partition Check Monitor Reduce Data Partition Data Partition I Runtime System Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

52 I Runtime System Is Disk Interrupt/Create Input/Output Serialize/Deserialize Scheduler Partition Manager Data Partition Grow/Reduce Data Partition Check Monitor Reduce Data Partition Data Partition I Runtime System Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

53 Evaluation Environments We have implemented I on Hadoop Hyracks Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

54 Evaluation Environments We have implemented I on Hadoop Hyracks An 11-node Amazon EC2 cluster Each machine: 8 cores, 15GB, 80GB*2 SSD Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

55 Experiments on Hadoop Goal Show the effectiveness on real-world problems Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

56 Experiments on Hadoop Goal Show the effectiveness on real-world problems Benchmarks Original: five real-world programs collected from Stack Overflow RFix: apply the fixes recommended on websites I: apply I on original programs Name Map-Side Aggregation (MSA) In-Map Combiner (IMC) Inverted-Index Building (IIB) Word Cooccurrence Matrix (WCM) Customer Review Processing (CRP) Dataset Stack Overflow Full Dump Wikipedia Full Dump Wikipedia Full Dump Wikipedia Full Dump Wikipedia Sample Dump Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

57 Improvements Benchmark Original Time RFix Time I Time Speed Up MSA 1047 (crashed) % IMC 5200 (crashed) % IIB 1322 (crashed) % WCM 2643 (crashed) % CRP 567 (crashed) % With I, all programs survive memory pressure On average, I versions are 62.5% faster than RFix Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

58 Experiments on Hyracks Goal Show the improvements on performance Show the improvements on scalability Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

59 Experiments on Hyracks Goal Show the improvements on performance Show the improvements on scalability Benchmarks Original: five hand-optimized applications from repository I: apply I on original programs Name WordCount (WC) Heap Sort (HS) Inverted Index (II) Hash Join (HJ) Group By (GR) Dataset Yahoo Web Map and Its Subgraphs Yahoo Web Map and Its Subgraphs Yahoo Web Map and Its Subgraphs TPC-H Data TPC-H Data Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

60 Tuning Configurations for Original Programs Configurations for best performance Name Thread Number Granularity WordCount (WC) 2 32KB Heap Sort (HS) 6 32KB Inverted Index (II) 8 16KB Hash Join (HJ) 8 32KB Group By (GR) 6 16KB Configurations for best scalability Name Thread Number Granularity WordCount (WC) 1 4KB Heap Sort (HS) 1 4KB Inverted Index (II) 1 4KB Hash Join (HJ) 1 4KB Group By (GR) 1 4KB Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

61 Improvements on Performance Normalized Speed Up Original Best I WC HS II HJ GR On average, I is 34.4% faster Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

62 Improvements on Scalability Normalized Dataset Size WC HS II HJ GR 24 Original Best I On average, I scales to larger datasets Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

63 Conclusions A programming model + a runtime system Non-intrusive Easy to use Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

64 Conclusions A programming model + a runtime system Non-intrusive Easy to use First systematic approach Help data-parallel tasks survive memory pressure I improves performance and scalability On Hadoop, I is 62.5% faster On Hyracks, I is 34.4% faster I helps programs scale to 6.3 larger datasets Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

65 Thank You Q & A Lu Fang et al. Interruptible s SOSP 15, October 7, / 25

Detecting and Fixing Memory-Related Performance Problems in Managed Languages

Detecting and Fixing Memory-Related Performance Problems in Managed Languages Detecting and Fixing -Related Performance Problems in Managed Languages Lu Fang Committee: Prof. Guoqing Xu (Chair), Prof. Alex Nicolau, Prof. Brian Demsky University of California, Irvine May 26, 2017,

More information

Yak: A High-Performance Big-Data-Friendly Garbage Collector. Khanh Nguyen, Lu Fang, Guoqing Xu, Brian Demsky, Sanazsadat Alamian Shan Lu

Yak: A High-Performance Big-Data-Friendly Garbage Collector. Khanh Nguyen, Lu Fang, Guoqing Xu, Brian Demsky, Sanazsadat Alamian Shan Lu Yak: A High-Performance Big-Data-Friendly Garbage Collector Khanh Nguyen, Lu Fang, Guoqing Xu, Brian Demsky, Sanazsadat Alamian Shan Lu University of California, Irvine University of Chicago Onur Mutlu

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha

More information

MapReduce Design Patterns

MapReduce Design Patterns MapReduce Design Patterns MapReduce Restrictions Any algorithm that needs to be implemented using MapReduce must be expressed in terms of a small number of rigidly defined components that must fit together

More information

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu University of California, Merced

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu University of California, Merced GLADE: A Scalable Framework for Efficient Analytics Florin Rusu University of California, Merced Motivation and Objective Large scale data processing Map-Reduce is standard technique Targeted to distributed

More information

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014 CS15-319 / 15-619 Cloud Computing Recitation 3 September 9 th & 11 th, 2014 Overview Last Week s Reflection --Project 1.1, Quiz 1, Unit 1 This Week s Schedule --Unit2 (module 3 & 4), Project 1.2 Questions

More information

Homework 3: Wikipedia Clustering Cliff Engle & Antonio Lupher CS 294-1

Homework 3: Wikipedia Clustering Cliff Engle & Antonio Lupher CS 294-1 Introduction: Homework 3: Wikipedia Clustering Cliff Engle & Antonio Lupher CS 294-1 Clustering is an important machine learning task that tackles the problem of classifying data into distinct groups based

More information

Improving the MapReduce Big Data Processing Framework

Improving the MapReduce Big Data Processing Framework Improving the MapReduce Big Data Processing Framework Gistau, Reza Akbarinia, Patrick Valduriez INRIA & LIRMM, Montpellier, France In collaboration with Divyakant Agrawal, UCSB Esther Pacitti, UM2, LIRMM

More information

Big data systems 12/8/17

Big data systems 12/8/17 Big data systems 12/8/17 Today Basic architecture Two levels of scheduling Spark overview Basic architecture Cluster Manager Cluster Cluster Manager 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores

More information

Data Partitioning and MapReduce

Data Partitioning and MapReduce Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Hadoop Map Reduce 10/17/2018 1

Hadoop Map Reduce 10/17/2018 1 Hadoop Map Reduce 10/17/2018 1 MapReduce 2-in-1 A programming paradigm A query execution engine A kind of functional programming We focus on the MapReduce execution engine of Hadoop through YARN 10/17/2018

More information

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu (University of California, Merced) Alin Dobra (University of Florida)

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu (University of California, Merced) Alin Dobra (University of Florida) DE: A Scalable Framework for Efficient Analytics Florin Rusu (University of California, Merced) Alin Dobra (University of Florida) Big Data Analytics Big Data Storage is cheap ($100 for 1TB disk) Everything

More information

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud HiTune Dataflow-Based Performance Analysis for Big Data Cloud Jinquan (Jason) Dai, Jie Huang, Shengsheng Huang, Bo Huang, Yan Liu Intel Asia-Pacific Research and Development Ltd Shanghai, China, 200241

More information

Spark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM

Spark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM Spark 2 Alexey Zinovyev, Java/BigData Trainer in EPAM With IT since 2007 With Java since 2009 With Hadoop since 2012 With EPAM since 2015 About Secret Word from EPAM itsubbotnik Big Data Training 3 Contacts

More information

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)!

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)! Big Data Processing, 2014/15 Lecture 7: MapReduce design patterns!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm

More information

TI2736-B Big Data Processing. Claudia Hauff

TI2736-B Big Data Processing. Claudia Hauff TI2736-B Big Data Processing Claudia Hauff ti2736b-ewi@tudelft.nl Intro Streams Streams Map Reduce HDFS Pig Pig Design Patterns Hadoop Ctd. Graphs Giraph Spark Zoo Keeper Spark Learning objectives Implement

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

Resource and Performance Distribution Prediction for Large Scale Analytics Queries

Resource and Performance Distribution Prediction for Large Scale Analytics Queries Resource and Performance Distribution Prediction for Large Scale Analytics Queries Prof. Rajiv Ranjan, SMIEEE School of Computing Science, Newcastle University, UK Visiting Scientist, Data61, CSIRO, Australia

More information

Clustering Documents. Case Study 2: Document Retrieval

Clustering Documents. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 21 th, 2015 Sham Kakade 2016 1 Document Retrieval Goal: Retrieve

More information

Jure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah

Jure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah Jure Leskovec (@jure) Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah 2 My research group at Stanford: Mining and modeling large social and information networks

More information

CSE 444: Database Internals. Lecture 23 Spark

CSE 444: Database Internals. Lecture 23 Spark CSE 444: Database Internals Lecture 23 Spark References Spark is an open source system from Berkeley Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Fall 2016 1 HW8 is out Last assignment! Get Amazon credits now (see instructions) Spark with Hadoop Due next wed CSE 344 - Fall 2016

More information

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 2: MapReduce Algorithm Design (2/2) January 14, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

Low memory Map-Reduce

Low memory Map-Reduce Low memory -Reduce Hrishikesh Amur, Karsten Schwan, Georgia Tech. Wolf Richter, Athula Balachandran, Erik Zawadzki, Dave Andersen, CMU Michael Kaminsky, Intel 1 Datasets are growing People see value (even

More information

Introduction to MapReduce Algorithms and Analysis

Introduction to MapReduce Algorithms and Analysis Introduction to MapReduce Algorithms and Analysis Jeff M. Phillips October 25, 2013 Trade-Offs Massive parallelism that is very easy to program. Cheaper than HPC style (uses top of the line everything)

More information

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April, 2017 Sham Kakade 2017 1 Document Retrieval n Goal: Retrieve

More information

Chisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique

Chisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique Chisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique Prateek Dhawalia Sriram Kailasam D. Janakiram Distributed and Object Systems Lab Dept. of Comp.

More information

Huge market -- essentially all high performance databases work this way

Huge market -- essentially all high performance databases work this way 11/5/2017 Lecture 16 -- Parallel & Distributed Databases Parallel/distributed databases: goal provide exactly the same API (SQL) and abstractions (relational tables), but partition data across a bunch

More information

Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse

Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse Thanh-Chung Dao 1 Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse Thanh-Chung Dao and Shigeru Chiba The University of Tokyo Thanh-Chung Dao 2 Supercomputers Expensive clusters Multi-core

More information

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

CLoud computing is a service through which a

CLoud computing is a service through which a 1 MAP-JOIN-REDUCE: Towards Scalable and Efficient Data Analysis on Large Clusters Dawei Jiang, Anthony K. H. TUNG, and Gang Chen Abstract Data analysis is an important functionality in cloud computing

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Don t Get Caught In the Cold, Warm-up Your JVM Understand and Eliminate JVM Warm-up Overhead in Data-parallel Systems

Don t Get Caught In the Cold, Warm-up Your JVM Understand and Eliminate JVM Warm-up Overhead in Data-parallel Systems Don t Get Caught In the Cold, Warm-up Your JVM Understand and Eliminate JVM Warm-up Overhead in Data-parallel Systems David Lion, Adrian Chiu, Hailong Sun*, Xin Zhuang, Nikola Grcevski, Ding Yuan University

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 2: MapReduce Algorithm Design (2/2) January 12, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

HadoopDB: An open source hybrid of MapReduce

HadoopDB: An open source hybrid of MapReduce HadoopDB: An open source hybrid of MapReduce and DBMS technologies Azza Abouzeid, Kamil Bajda-Pawlikowski Daniel J. Abadi, Avi Silberschatz Yale University http://hadoopdb.sourceforge.net October 2, 2009

More information

Hadoop/MapReduce Computing Paradigm

Hadoop/MapReduce Computing Paradigm Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications

More information

Camdoop Exploiting In-network Aggregation for Big Data Applications Paolo Costa

Camdoop Exploiting In-network Aggregation for Big Data Applications Paolo Costa Camdoop Exploiting In-network Aggregation for Big Data Applications costa@imperial.ac.uk joint work with Austin Donnelly, Antony Rowstron, and Greg O Shea (MSR Cambridge) MapReduce Overview Input file

More information

CS427 Multicore Architecture and Parallel Computing

CS427 Multicore Architecture and Parallel Computing CS427 Multicore Architecture and Parallel Computing Lecture 9 MapReduce Prof. Li Jiang 2014/11/19 1 What is MapReduce Origin from Google, [OSDI 04] A simple programming model Functional model For large-scale

More information

Anti-Caching: A New Approach to Database Management System Architecture. Guide: Helly Patel ( ) Dr. Sunnie Chung Kush Patel ( )

Anti-Caching: A New Approach to Database Management System Architecture. Guide: Helly Patel ( ) Dr. Sunnie Chung Kush Patel ( ) Anti-Caching: A New Approach to Database Management System Architecture Guide: Helly Patel (2655077) Dr. Sunnie Chung Kush Patel (2641883) Abstract Earlier DBMS blocks stored on disk, with a main memory

More information

MapReduce: Recap. Juliana Freire & Cláudio Silva. Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec

MapReduce: Recap. Juliana Freire & Cláudio Silva. Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec MapReduce: Recap Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec MapReduce: Recap Sequentially read a lot of data Why? Map: extract something we care about map (k, v)

More information

Research challenges in data-intensive computing The Stratosphere Project Apache Flink

Research challenges in data-intensive computing The Stratosphere Project Apache Flink Research challenges in data-intensive computing The Stratosphere Project Apache Flink Seif Haridi KTH/SICS haridi@kth.se e2e-clouds.org Presented by: Seif Haridi May 2014 Research Areas Data-intensive

More information

Accelerate Big Data Insights

Accelerate Big Data Insights Accelerate Big Data Insights Executive Summary An abundance of information isn t always helpful when time is of the essence. In the world of big data, the ability to accelerate time-to-insight can not

More information

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Programming Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Computing Only required amount of CPU and storage can be used anytime from anywhere via network Availability, throughput, reliability

More information

Shark: Hive (SQL) on Spark

Shark: Hive (SQL) on Spark Shark: Hive (SQL) on Spark Reynold Xin UC Berkeley AMP Camp Aug 21, 2012 UC BERKELEY SELECT page_name, SUM(page_views) views FROM wikistats GROUP BY page_name ORDER BY views DESC LIMIT 10; Stage 0: Map-Shuffle-Reduce

More information

Shark: SQL and Rich Analytics at Scale. Reynold Xin UC Berkeley

Shark: SQL and Rich Analytics at Scale. Reynold Xin UC Berkeley Shark: SQL and Rich Analytics at Scale Reynold Xin UC Berkeley Challenges in Modern Data Analysis Data volumes expanding. Faults and stragglers complicate parallel database design. Complexity of analysis:

More information

Map Reduce. Yerevan.

Map Reduce. Yerevan. Map Reduce Erasmus+ @ Yerevan dacosta@irit.fr Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate

More information

Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights

Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights Feng Zhang, Jidong Zhai, Xipeng Shen #, Onur Mutlu, Wenguang Chen Renmin University of China Tsinghua University

More information

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ. Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop

More information

Introduction to MapReduce (cont.)

Introduction to MapReduce (cont.) Introduction to MapReduce (cont.) Rafael Ferreira da Silva rafsilva@isi.edu http://rafaelsilva.com USC INF 553 Foundations and Applications of Data Mining (Fall 2018) 2 MapReduce: Summary USC INF 553 Foundations

More information

Impala Intro. MingLi xunzhang

Impala Intro. MingLi xunzhang Impala Intro MingLi xunzhang Overview MPP SQL Query Engine for Hadoop Environment Designed for great performance BI Connected(ODBC/JDBC, Kerberos, LDAP, ANSI SQL) Hadoop Components HDFS, HBase, Metastore,

More information

Batch Processing Basic architecture

Batch Processing Basic architecture Batch Processing Basic architecture in big data systems COS 518: Distributed Systems Lecture 10 Andrew Or, Mike Freedman 2 1 2 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 3

More information

The Stratosphere Platform for Big Data Analytics

The Stratosphere Platform for Big Data Analytics The Stratosphere Platform for Big Data Analytics Hongyao Ma Franco Solleza April 20, 2015 Stratosphere Stratosphere Stratosphere Big Data Analytics BIG Data Heterogeneous datasets: structured / unstructured

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) Hadoop Lecture 24, April 23, 2014 Mohammad Hammoud Today Last Session: NoSQL databases Today s Session: Hadoop = HDFS + MapReduce Announcements: Final Exam is on Sunday April

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

CSE 344 MAY 2 ND MAP/REDUCE

CSE 344 MAY 2 ND MAP/REDUCE CSE 344 MAY 2 ND MAP/REDUCE ADMINISTRIVIA HW5 Due Tonight Practice midterm Section tomorrow Exam review PERFORMANCE METRICS FOR PARALLEL DBMSS Nodes = processors, computers Speedup: More nodes, same data

More information

SSS: An Implementation of Key-value Store based MapReduce Framework. Hirotaka Ogawa (AIST, Japan) Hidemoto Nakada Ryousei Takano Tomohiro Kudoh

SSS: An Implementation of Key-value Store based MapReduce Framework. Hirotaka Ogawa (AIST, Japan) Hidemoto Nakada Ryousei Takano Tomohiro Kudoh SSS: An Implementation of Key-value Store based MapReduce Framework Hirotaka Ogawa (AIST, Japan) Hidemoto Nakada Ryousei Takano Tomohiro Kudoh MapReduce A promising programming tool for implementing largescale

More information

CSE 190D Spring 2017 Final Exam Answers

CSE 190D Spring 2017 Final Exam Answers CSE 190D Spring 2017 Final Exam Answers Q 1. [20pts] For the following questions, clearly circle True or False. 1. The hash join algorithm always has fewer page I/Os compared to the block nested loop join

More information

MapReduce: A Programming Model for Large-Scale Distributed Computation

MapReduce: A Programming Model for Large-Scale Distributed Computation CSC 258/458 MapReduce: A Programming Model for Large-Scale Distributed Computation University of Rochester Department of Computer Science Shantonu Hossain April 18, 2011 Outline Motivation MapReduce Overview

More information

Data Blocks: Hybrid OLTP and OLAP on compressed storage

Data Blocks: Hybrid OLTP and OLAP on compressed storage Data Blocks: Hybrid OLTP and OLAP on compressed storage Ben Brümmer Technische Universität München Fürstenfeldbruck, 26. November 208 Ben Brümmer 26..8 Lehrstuhl für Datenbanksysteme Problem HDD/Archive/Tape-Storage

More information

Distributed computing: index building and use

Distributed computing: index building and use Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 26: Parallel Databases and MapReduce CSE 344 - Winter 2013 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Cluster will run in Amazon s cloud (AWS)

More information

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab CS6030 Cloud Computing Ajay Gupta B239, CEAS Computer Science Department Western Michigan University ajay.gupta@wmich.edu 276-3104 1 Acknowledgements I have liberally borrowed these slides and material

More information

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414 Introduction to Database Systems CSE 414 Lecture 17: MapReduce and Spark Announcements Midterm this Friday in class! Review session tonight See course website for OHs Includes everything up to Monday s

More information

Practical Big Data Processing An Overview of Apache Flink

Practical Big Data Processing An Overview of Apache Flink Practical Big Data Processing An Overview of Apache Flink Tilmann Rabl Berlin Big Data Center www.dima.tu-berlin.de bbdc.berlin rabl@tu-berlin.de With slides from Volker Markl and data artisans 1 2013

More information

MapReduce Algorithm Design

MapReduce Algorithm Design MapReduce Algorithm Design Bu eğitim sunumları İstanbul Kalkınma Ajansı nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul Mali Destek Programı kapsamında yürütülmekte olan TR10/16/YNY/0036 no lu İstanbul Big

More information

Harp-DAAL for High Performance Big Data Computing

Harp-DAAL for High Performance Big Data Computing Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big

More information

/ Cloud Computing. Recitation 3 Sep 13 & 15, 2016

/ Cloud Computing. Recitation 3 Sep 13 & 15, 2016 15-319 / 15-619 Cloud Computing Recitation 3 Sep 13 & 15, 2016 1 Overview Administrative Issues Last Week s Reflection Project 1.1, OLI Unit 1, Quiz 1 This Week s Schedule Project1.2, OLI Unit 2, Module

More information

6.830 Lecture Spark 11/15/2017

6.830 Lecture Spark 11/15/2017 6.830 Lecture 19 -- Spark 11/15/2017 Recap / finish dynamo Sloppy Quorum (healthy N) Dynamo authors don't think quorums are sufficient, for 2 reasons: - Decreased durability (want to write all data at

More information

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second

More information

CSE 344 FEBRUARY 14 TH INDEXING

CSE 344 FEBRUARY 14 TH INDEXING CSE 344 FEBRUARY 14 TH INDEXING EXAM Grades posted to Canvas Exams handed back in section tomorrow Regrades: Friday office hours EXAM Overall, you did well Average: 79 Remember: lowest between midterm/final

More information

Fall 2018: Introduction to Data Science GIRI NARASIMHAN, SCIS, FIU

Fall 2018: Introduction to Data Science GIRI NARASIMHAN, SCIS, FIU Fall 2018: Introduction to Data Science GIRI NARASIMHAN, SCIS, FIU !2 MapReduce Overview! Sometimes a single computer cannot process data or takes too long traditional serial programming is not always

More information

Programming Models MapReduce

Programming Models MapReduce Programming Models MapReduce Majd Sakr, Garth Gibson, Greg Ganger, Raja Sambasivan 15-719/18-847b Advanced Cloud Computing Fall 2013 Sep 23, 2013 1 MapReduce In a Nutshell MapReduce incorporates two phases

More information

Map Reduce. MCSN - N. Tonellotto - Distributed Enabling Platforms

Map Reduce. MCSN - N. Tonellotto - Distributed Enabling Platforms Map Reduce 1 MapReduce inside Google Googlers' hammer for 80% of our data crunching Large-scale web search indexing Clustering problems for Google News Produce reports for popular queries, e.g. Google

More information

Map-Reduce. Marco Mura 2010 March, 31th

Map-Reduce. Marco Mura 2010 March, 31th Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 7: Parallel Computing Cho-Jui Hsieh UC Davis May 3, 2018 Outline Multi-core computing, distributed computing Multi-core computing tools

More information

MongoDB DI Dr. Angelika Kusel

MongoDB DI Dr. Angelika Kusel MongoDB DI Dr. Angelika Kusel 1 Motivation Problem Data is partitioned over large scale clusters Clusters change the rules for processing Good news Lots of machines to spread the computation over Bad news

More information

RStream:Marrying Relational Algebra with Streaming for Efficient Graph Mining on A Single Machine

RStream:Marrying Relational Algebra with Streaming for Efficient Graph Mining on A Single Machine RStream:Marrying Relational Algebra with Streaming for Efficient Graph Mining on A Single Machine Guoqing Harry Xu Kai Wang, Zhiqiang Zuo, John Thorpe, Tien Quang Nguyen, UCLA Nanjing University Facebook

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

Examples of Physical Query Plan Alternatives. Selected Material from Chapters 12, 14 and 15

Examples of Physical Query Plan Alternatives. Selected Material from Chapters 12, 14 and 15 Examples of Physical Query Plan Alternatives Selected Material from Chapters 12, 14 and 15 1 Query Optimization NOTE: SQL provides many ways to express a query. HENCE: System has many options for evaluating

More information

Red Fox: An Execution Environment for Relational Query Processing on GPUs

Red Fox: An Execution Environment for Relational Query Processing on GPUs Red Fox: An Execution Environment for Relational Query Processing on GPUs Haicheng Wu 1, Gregory Diamos 2, Tim Sheard 3, Molham Aref 4, Sean Baxter 2, Michael Garland 2, Sudhakar Yalamanchili 1 1. Georgia

More information

Databases 2 (VU) ( )

Databases 2 (VU) ( ) Databases 2 (VU) (707.030) Map-Reduce Denis Helic KMI, TU Graz Nov 4, 2013 Denis Helic (KMI, TU Graz) Map-Reduce Nov 4, 2013 1 / 90 Outline 1 Motivation 2 Large Scale Computation 3 Map-Reduce 4 Environment

More information

LEEN: Locality/Fairness- Aware Key Partitioning for MapReduce in the Cloud

LEEN: Locality/Fairness- Aware Key Partitioning for MapReduce in the Cloud LEEN: Locality/Fairness- Aware Key Partitioning for MapReduce in the Cloud Shadi Ibrahim, Hai Jin, Lu Lu, Song Wu, Bingsheng He*, Qi Li # Huazhong University of Science and Technology *Nanyang Technological

More information

Caching and Buffering in HDF5

Caching and Buffering in HDF5 Caching and Buffering in HDF5 September 9, 2008 SPEEDUP Workshop - HDF5 Tutorial 1 Software stack Life cycle: What happens to data when it is transferred from application buffer to HDF5 file and from HDF5

More information

Programming Systems for Big Data

Programming Systems for Big Data Programming Systems for Big Data CS315B Lecture 17 Including material from Kunle Olukotun Prof. Aiken CS 315B Lecture 17 1 Big Data We ve focused on parallel programming for computational science There

More information

Parallel Nested Loops

Parallel Nested Loops Parallel Nested Loops For each tuple s i in S For each tuple t j in T If s i =t j, then add (s i,t j ) to output Create partitions S 1, S 2, T 1, and T 2 Have processors work on (S 1,T 1 ), (S 1,T 2 ),

More information

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011 Parallel Nested Loops Parallel Partition-Based For each tuple s i in S For each tuple t j in T If s i =t j, then add (s i,t j ) to output Create partitions S 1, S 2, T 1, and T 2 Have processors work on

More information

Scaling Up 1 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. Hadoop, Pig

Scaling Up 1 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. Hadoop, Pig CSE 6242 / CX 4242 Scaling Up 1 Hadoop, Pig Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Le

More information

1 Overview and Research Impact

1 Overview and Research Impact 1 Overview and Research Impact The past decade has witnessed the explosion of Big Data, a phenomenon where massive amounts of information are created at an unprecedented scale and speed. The availability

More information

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344 Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)

More information

Basics of Data Management

Basics of Data Management Basics of Data Management Chaitan Baru 2 2 Objectives of this Module Introduce concepts and technologies for managing structured, semistructured, unstructured data Obtain a grounding in traditional data

More information

A Parallel R Framework

A Parallel R Framework A Parallel R Framework for Processing Large Dataset on Distributed Systems Nov. 17, 2013 This work is initiated and supported by Huawei Technologies Rise of Data-Intensive Analytics Data Sources Personal

More information

Research Works to Cope with Big Data Volume and Variety. Jiaheng Lu University of Helsinki, Finland

Research Works to Cope with Big Data Volume and Variety. Jiaheng Lu University of Helsinki, Finland Research Works to Cope with Big Data Volume and Variety Jiaheng Lu University of Helsinki, Finland Big Data: 4Vs Photo downloaded from: https://blog.infodiagram.com/2014/04/visualizing-big-data-concepts-strong.html

More information

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

Database Systems CSE 414

Database Systems CSE 414 Database Systems CSE 414 Lecture 10: Basics of Data Storage and Indexes 1 Reminder HW3 is due next Tuesday 2 Motivation My database application is too slow why? One of the queries is very slow why? To

More information

Chapter 4: Apache Spark

Chapter 4: Apache Spark Chapter 4: Apache Spark Lecture Notes Winter semester 2016 / 2017 Ludwig-Maximilians-University Munich PD Dr. Matthias Renz 2015, Based on lectures by Donald Kossmann (ETH Zürich), as well as Jure Leskovec,

More information

Next-Generation Cloud Platform

Next-Generation Cloud Platform Next-Generation Cloud Platform Jangwoo Kim Jun 24, 2013 E-mail: jangwoo@postech.ac.kr High Performance Computing Lab Department of Computer Science & Engineering Pohang University of Science and Technology

More information