Camdoop: Exploiting In-network Aggregation for Big Data Applications (Paolo Costa)


1 Camdoop: Exploiting In-network Aggregation for Big Data Applications. Joint work with Austin Donnelly, Antony Rowstron, and Greg O'Shea (MSR Cambridge)

2 MapReduce Overview
Input file (Chunk 0, Chunk 1, Chunk 2) -> Map Tasks -> intermediate results -> Reduce Tasks -> final results
Map: processes the input data and generates (key, value) pairs
Shuffle: distributes the intermediate pairs to the reduce tasks
Reduce: aggregates all values associated with each key
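As a concrete illustration of the three phases above, here is a minimal in-memory sketch (word count, a standard illustrative job; the names `map_fn`, `shuffle` and `reduce_fn` are hypothetical, not part of Camdoop):

```python
from collections import defaultdict

def map_fn(chunk):
    # Map: emit one (key, value) pair per word in the input chunk.
    for word in chunk.split():
        yield word, 1

def shuffle(pairs, num_reducers):
    # Shuffle: route every pair to a reduce task by hashing its key,
    # so all values for a given key end up at the same reducer.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions

def reduce_fn(key, values):
    # Reduce: aggregate all values associated with a key.
    return key, sum(values)

chunks = ["a b a", "b c", "a c c"]
pairs = [kv for chunk in chunks for kv in map_fn(chunk)]
results = dict(reduce_fn(k, vs)
               for partition in shuffle(pairs, num_reducers=2)
               for k, vs in partition.items())
# results == {'a': 3, 'b': 2, 'c': 3}
```

Note that every intermediate pair crosses the shuffle; that traffic is what the following slides target.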

3 Problem
Input file (Split 0, Split 1, Split 2) -> Map Tasks -> Reduce Tasks
The shuffle phase is challenging for data center networks: an all-to-all traffic pattern with O(N^2) flows.
This has led to proposals for full-bisection bandwidth.

4-5 Data Reduction
The final results are typically much smaller than the intermediate results: in most Facebook jobs the final size is 5.4% of the intermediate size; in most Yahoo jobs the ratio is 8.2%.
How can we exploit this to reduce the traffic and improve the performance of the shuffle phase?

6-7 Background: Combiners
Input file (Split 0, Split 1, Split 2) -> Map Tasks -> Combiners -> Reduce Tasks
To reduce the data transferred in the shuffle, users can specify a combiner function, which aggregates the local intermediate pairs.
Server-side only => limited aggregation
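A sum-style combiner can be sketched as follows (hypothetical names; the point is that it only sees one server's pairs, which is why the aggregation it achieves is limited):

```python
from collections import Counter

def combiner(intermediate_pairs):
    # Aggregate the local (key, value) pairs before the shuffle:
    # at most one pair per distinct key leaves this server.
    combined = Counter()
    for key, value in intermediate_pairs:
        combined[key] += value
    return sorted(combined.items())

local_pairs = [("cat", 1), ("dog", 1), ("cat", 1), ("cat", 1)]
print(len(local_pairs))            # 4 pairs produced by the local map task
print(len(combiner(local_pairs)))  # 2 pairs actually sent over the network
```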

8 Distributed Combiners
It has been proposed to use aggregation trees in MapReduce to perform multiple steps of combiners, e.g., rack-level aggregation [Yu et al., SOSP '09].

9-11 Logical and Physical Topology
What happens when we map the tree to a typical data center topology (servers behind a ToR switch)?
There is a mismatch between the physical and the logical topology: two logical links are mapped onto the same physical link, so the server link is the bottleneck (only 500 Mbps per child). Full-bisection bandwidth does not help here.
Camdoop goal: perform the combiner functions within the network, as opposed to application-level solutions, and reduce shuffle time by aggregating packets on path.

12-15 How Can We Perform In-network Processing?
We exploit CamCube:
A direct-connect topology (3D torus) that uses no switches / routers for internal traffic.
Servers intercept, forward and process packets.
Nodes have (x, y, z) coordinates, e.g., (1,2,1) and (1,2,2). This defines a key-space (=> key-based routing).
Coordinates are locally re-mapped in case of failures.
Key property: no distinction between network and computation devices; servers can perform arbitrary packet processing on-path.
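The coordinate space can be sketched with a small hypothetical helper that returns the six one-hop neighbors of a node on a 3D torus with wraparound (`side` is the number of servers per axis, 3 on the testbed below):

```python
def torus_neighbors(node, side):
    # The six one-hop neighbors of (x, y, z): +/-1 along each axis,
    # with wraparound, since a torus has no edges.
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % side
            neighbors.append(tuple(coord))
    return neighbors

# On a 3 x 3 x 3 CamCube, (1,2,1) and (1,2,2) are physical neighbors:
print(torus_neighbors((1, 2, 1), side=3))
```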

16 Mapping a Tree: Switch vs. CamCube
On the switched topology, the 1 Gbps server link becomes the bottleneck: each child gets only 1/in-degree Gbps.
On CamCube, packets are aggregated on path (=> less traffic), and the 1:1 mapping between the logical and the physical topology gives each logical link the full 1 Gbps.

17 Camdoop Design Goals
1. No change in the programming model
2. Exploit network locality
3. Good server and link load distribution
4. Fault-tolerance

18 Design Goal #1: Programming Model
Camdoop adopts the same MapReduce model: a GFS-like distributed file system; each server runs map tasks on local chunks.
We use a spanning tree: combiners aggregate the local map task outputs and the children's results (if any) and stream the result to the parent.
The root runs the reduce task and generates the final output.

19-22 Design Goal #2: Network Locality
How do we map the tree nodes to servers?
Map task outputs are always read from the local disk.
Parent and children are mapped onto physical neighbors, e.g., servers (1,2,1), (1,2,2) and (1,1,1).
This ensures maximum locality and optimizes the network transfer.

23 Network Locality
Logical view vs. physical view (3D torus): one physical link is used by one and only one logical link.

24-28 Design Goal #3: Load Distribution
With a single tree, nodes have different in-degrees (poor server load distribution) and each server uses only 1 Gbps instead of 6 (poor bandwidth utilization).
Solution: stripe the data across 6 disjoint trees.
Different links are used in the different trees, which improves the load distribution; all links are used, so each server gets (up to) 6 Gbps.
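The striping idea can be sketched as follows; assigning keys to trees by hash is an illustrative policy for the sketch, not necessarily the exact one Camdoop uses:

```python
NUM_TREES = 6  # one disjoint tree per server link on the 3D torus

def tree_for_key(key, num_trees=NUM_TREES):
    # Stripe the intermediate data: each key travels up exactly one tree.
    return hash(key) % num_trees

streams = {t: [] for t in range(NUM_TREES)}
for key in ["apple", "pear", "plum", "kiwi", "fig", "lime", "date"]:
    streams[tree_for_key(key)].append(key)
# Each tree carries roughly 1/6 of the traffic, so all six 1 Gbps
# links of a server can be used in parallel.
```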

29 Design Goal #4: Fault-tolerance
The tree is built in the coordinate space, and CamCube remaps coordinates in case of failures. Details in the paper.

30 Evaluation
Testbed: 27-server CamCube (3 x 3 x 3); quad-core Intel Xeon, 12 GB RAM, and 6 Intel PRO/1000 PT 1 Gbps ports per server; runtime & services implemented in user space.
Simulator: packet-level simulator (CPU overhead not modelled); 512-server (8 x 8 x 8) CamCube.

31 Evaluation: Design and Implementation Recap
Camdoop: the reduce phase is parallelized with the shuffle phase.
Since all streams are ordered, as soon as the root receives at least one packet from every child, it can start the reduce function.
There is no need to store intermediate results to disk on the reduce servers.
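Because the streams are key-ordered, this pipelined reduce can be sketched with a stdlib merge (hypothetical names; a simplification of what the paper describes):

```python
import heapq
from itertools import groupby
from operator import itemgetter

def streaming_reduce(child_streams, reduce_fn):
    # Merge the key-ordered child streams and reduce each key as soon
    # as all of its pairs have arrived; nothing is staged on disk.
    merged = heapq.merge(*child_streams, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, reduce_fn(v for _, v in group)

children = [[("a", 1), ("b", 2)], [("a", 3), ("c", 1)], [("b", 1)]]
print(list(streaming_reduce(children, sum)))  # [('a', 4), ('b', 3), ('c', 1)]
```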

32-34 Evaluation: Compared Systems
Camdoop: shuffle & reduce parallelized; runs on CamCube; six disjoint trees; in-network aggregation.
TCP Camdoop (switch): the 27 CamCube servers attached to a ToR switch; TCP is used to transfer data in the shuffle phase.
Camdoop (no agg): like Camdoop but without in-network aggregation; shows the impact of just running on CamCube.

35-36 Validation against Hadoop & Dryad (Sort and WordCount)
We consider these as our baselines. (Chart: time on a log scale, lower is better; bars for Hadoop, Dryad/DryadLINQ, TCP Camdoop (switch) and Camdoop (no agg).)
The Camdoop baselines are competitive against Hadoop and Dryad, for several reasons: shuffle and reduce are parallelized, and the implementation is fine-tuned.

37 Parameter Sweep
Output size / intermediate size (S):
S = 1 (no aggregation): every key is unique.
S = 1/N ≈ 0 (full aggregation): every key appears in all map task outputs.
We use synthetic workloads to explore different values of S; the intermediate data size is 22.2 GB (843 MB/server).
Number of reduce tasks (R):
R = 1 (all-to-one): e.g., interactive queries, top-k jobs.
R = N (all-to-all): the common setup in MapReduce jobs; N output files are generated.

38-40 All-to-one (R = 1)
(Chart: time on a log scale vs. S, from full aggregation to no aggregation; lines for TCP Camdoop (switch), Camdoop (no agg) and Camdoop; Facebook's reported aggregation ratio is marked on the S axis.)
The performance of the baselines is independent of S. The gap between TCP Camdoop (switch) and Camdoop (no agg) shows the impact of running on CamCube; the further gap to Camdoop shows the impact of in-network aggregation.

41-42 All-to-all (R = 27)
(Chart: time on a log scale vs. S; lines for TCP Camdoop (switch), Camdoop (no agg), Camdoop and the switch 1 Gbps bound, i.e., the maximum theoretical performance over the switch.)
The gaps again show the impact of running on CamCube and the impact of in-network aggregation.

43-45 Number of Reduce Tasks (S = 0)
(Chart: time on a log scale vs. R, from all-to-one to all-to-all; speedup labels of 4.31x, 1.91x and 1.13x; part of the residual gap is an implementation bottleneck.)
For the baselines, performance depends on R; for Camdoop, R does not (significantly) impact performance: all resources are used even when R = 1.
Camdoop decouples the job execution time from the number of output files generated.

46-47 Behavior at Scale (simulated)
N = 512, S = 0. (Chart: time on a log scale vs. R from 1 to 512, log scale; lines for the switch 1 Gbps bound and Camdoop.)
Note that the switch bound assumes full-bisection bandwidth.

48 Beyond MapReduce
More experiments (failures, multiple jobs, ...) in the paper.

49 Beyond MapReduce
The partition-aggregate model (a web server sends requests to parent servers, which query leaf servers and caches) is also common in interactive services, e.g., Bing Search and Google Dremel.
Small-scale data: 10s to 100s of KB returned per server.
Typically these services use one reduce task (R = 1), since a single result must be returned to the user.
Full aggregation is common (S ≈ 0): e.g., N servers generate their best k responses each and the final result contains the overall best k responses.
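The best-k-of-best-k aggregation can be sketched as follows (hypothetical document names and scores; each inner list stands for one leaf server's local top-k):

```python
import heapq

def aggregate_top_k(per_server_results, k):
    # Merge every server's best-k (doc, score) responses into the
    # global best-k; the output size is independent of N, i.e., S -> 0.
    all_responses = (r for server in per_server_results for r in server)
    return heapq.nlargest(k, all_responses, key=lambda r: r[1])

leaves = [
    [("doc3", 0.9), ("doc7", 0.4)],
    [("doc1", 0.8), ("doc9", 0.7)],
    [("doc5", 0.6), ("doc2", 0.3)],
]
print(aggregate_top_k(leaves, k=2))  # [('doc3', 0.9), ('doc1', 0.8)]
```

Performed on path, each hop forwards only k responses instead of its subtree's full set.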

50-51 Small-scale Data (R = 1, S = 0)
(Chart: time in ms for 20 KB and 200 KB of input data per server; bars for TCP Camdoop (switch), Camdoop (no agg) and Camdoop; lower is better.)
In-network aggregation can be beneficial also for (small-scale data) interactive services.

52 Conclusions
Camdoop explores the benefits of in-network processing by running combiners within the network.
It requires no change in the programming model, achieves lower shuffle and reduce times, and decouples performance from the number of output files.
A final thought: how would Camdoop run on this? AMD SeaMicro, a 512-core cluster for data centers using a 3D torus with a fast interconnect (5 Gbps / link).


More information

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu University of California, Merced

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu University of California, Merced GLADE: A Scalable Framework for Efficient Analytics Florin Rusu University of California, Merced Motivation and Objective Large scale data processing Map-Reduce is standard technique Targeted to distributed

More information

Distributed Computations MapReduce. adapted from Jeff Dean s slides

Distributed Computations MapReduce. adapted from Jeff Dean s slides Distributed Computations MapReduce adapted from Jeff Dean s slides What we ve learnt so far Basic distributed systems concepts Consistency (sequential, eventual) Fault tolerance (recoverability, availability)

More information

Spark Over RDMA: Accelerate Big Data SC Asia 2018 Ido Shamay Mellanox Technologies

Spark Over RDMA: Accelerate Big Data SC Asia 2018 Ido Shamay Mellanox Technologies Spark Over RDMA: Accelerate Big Data SC Asia 2018 Ido Shamay 1 Apache Spark - Intro Spark within the Big Data ecosystem Data Sources Data Acquisition / ETL Data Storage Data Analysis / ML Serving 3 Apache

More information

Network-on-chip (NOC) Topologies

Network-on-chip (NOC) Topologies Network-on-chip (NOC) Topologies 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and performance

More information

Introduction to MapReduce (cont.)

Introduction to MapReduce (cont.) Introduction to MapReduce (cont.) Rafael Ferreira da Silva rafsilva@isi.edu http://rafaelsilva.com USC INF 553 Foundations and Applications of Data Mining (Fall 2018) 2 MapReduce: Summary USC INF 553 Foundations

More information

Accelerate Applications Using EqualLogic Arrays with directcache

Accelerate Applications Using EqualLogic Arrays with directcache Accelerate Applications Using EqualLogic Arrays with directcache Abstract This paper demonstrates how combining Fusion iomemory products with directcache software in host servers significantly improves

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) Hadoop Lecture 24, April 23, 2014 Mohammad Hammoud Today Last Session: NoSQL databases Today s Session: Hadoop = HDFS + MapReduce Announcements: Final Exam is on Sunday April

More information

MixApart: Decoupled Analytics for Shared Storage Systems

MixApart: Decoupled Analytics for Shared Storage Systems MixApart: Decoupled Analytics for Shared Storage Systems Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto, NetApp Abstract Data analytics and enterprise applications have very

More information

RESTORE: REUSING RESULTS OF MAPREDUCE JOBS. Presented by: Ahmed Elbagoury

RESTORE: REUSING RESULTS OF MAPREDUCE JOBS. Presented by: Ahmed Elbagoury RESTORE: REUSING RESULTS OF MAPREDUCE JOBS Presented by: Ahmed Elbagoury Outline Background & Motivation What is Restore? Types of Result Reuse System Architecture Experiments Conclusion Discussion Background

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

CS 61C: Great Ideas in Computer Architecture. MapReduce

CS 61C: Great Ideas in Computer Architecture. MapReduce CS 61C: Great Ideas in Computer Architecture MapReduce Guest Lecturer: Justin Hsia 3/06/2013 Spring 2013 Lecture #18 1 Review of Last Lecture Performance latency and throughput Warehouse Scale Computing

More information

The Hadoop Distributed File System Konstantin Shvachko Hairong Kuang Sanjay Radia Robert Chansler

The Hadoop Distributed File System Konstantin Shvachko Hairong Kuang Sanjay Radia Robert Chansler The Hadoop Distributed File System Konstantin Shvachko Hairong Kuang Sanjay Radia Robert Chansler MSST 10 Hadoop in Perspective Hadoop scales computation capacity, storage capacity, and I/O bandwidth by

More information

Interconnection Networks: Topology. Prof. Natalie Enright Jerger

Interconnection Networks: Topology. Prof. Natalie Enright Jerger Interconnection Networks: Topology Prof. Natalie Enright Jerger Topology Overview Definition: determines arrangement of channels and nodes in network Analogous to road map Often first step in network design

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models RCFile: A Fast and Space-efficient Data

More information

Application-Aware SDN Routing for Big-Data Processing

Application-Aware SDN Routing for Big-Data Processing Application-Aware SDN Routing for Big-Data Processing Evaluation by EstiNet OpenFlow Network Emulator Director/Prof. Shie-Yuan Wang Institute of Network Engineering National ChiaoTung University Taiwan

More information

Bazaar: Enabling Predictable Performance in Datacenters

Bazaar: Enabling Predictable Performance in Datacenters Bazaar: Enabling Predictable Performance in Datacenters Virajith Jalaparti, Hitesh Ballani, Paolo Costa, Thomas Karagiannis, Antony Rowstron MSR Cambridge, UK Technical Report MSR-TR-22-38 Microsoft Research

More information

An Exploration of Designing a Hybrid Scale-Up/Out Hadoop Architecture Based on Performance Measurements

An Exploration of Designing a Hybrid Scale-Up/Out Hadoop Architecture Based on Performance Measurements This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI.9/TPDS.6.7, IEEE

More information

Network Design Considerations for Grid Computing

Network Design Considerations for Grid Computing Network Design Considerations for Grid Computing Engineering Systems How Bandwidth, Latency, and Packet Size Impact Grid Job Performance by Erik Burrows, Engineering Systems Analyst, Principal, Broadcom

More information

Thrill : High-Performance Algorithmic

Thrill : High-Performance Algorithmic Thrill : High-Performance Algorithmic Distributed Batch Data Processing in C++ Timo Bingmann, Michael Axtmann, Peter Sanders, Sebastian Schlag, and 6 Students 2016-12-06 INSTITUTE OF THEORETICAL INFORMATICS

More information

Big Data 7. Resource Management

Big Data 7. Resource Management Ghislain Fourny Big Data 7. Resource Management artjazz / 123RF Stock Photo Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Storage

More information

Batch Processing Basic architecture

Batch Processing Basic architecture Batch Processing Basic architecture in big data systems COS 518: Distributed Systems Lecture 10 Andrew Or, Mike Freedman 2 1 2 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 3

More information

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc Fuxi Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc {jiamang.wang, yongjun.wyj, hua.caihua, zhipeng.tzp, zhiqiang.lv,

More information

Next-Generation Cloud Platform

Next-Generation Cloud Platform Next-Generation Cloud Platform Jangwoo Kim Jun 24, 2013 E-mail: jangwoo@postech.ac.kr High Performance Computing Lab Department of Computer Science & Engineering Pohang University of Science and Technology

More information

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414 Introduction to Database Systems CSE 414 Lecture 17: MapReduce and Spark Announcements Midterm this Friday in class! Review session tonight See course website for OHs Includes everything up to Monday s

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

April Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model.

April Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model. 1. MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model. MapReduce is a framework for processing big data which processes data in two phases, a Map

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google SOSP 03, October 19 22, 2003, New York, USA Hyeon-Gyu Lee, and Yeong-Jae Woo Memory & Storage Architecture Lab. School

More information

Distributed Computation Models

Distributed Computation Models Distributed Computation Models SWE 622, Spring 2017 Distributed Software Engineering Some slides ack: Jeff Dean HW4 Recap https://b.socrative.com/ Class: SWE622 2 Review Replicating state machines Case

More information

Chisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique

Chisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique Chisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique Prateek Dhawalia Sriram Kailasam D. Janakiram Distributed and Object Systems Lab Dept. of Comp.

More information

Distributed Data Infrastructures, Fall 2017, Chapter 2. Jussi Kangasharju

Distributed Data Infrastructures, Fall 2017, Chapter 2. Jussi Kangasharju Distributed Data Infrastructures, Fall 2017, Chapter 2 Jussi Kangasharju Chapter Outline Warehouse-scale computing overview Workloads and software infrastructure Failures and repairs Note: Term Warehouse-scale

More information

Resource and Performance Distribution Prediction for Large Scale Analytics Queries

Resource and Performance Distribution Prediction for Large Scale Analytics Queries Resource and Performance Distribution Prediction for Large Scale Analytics Queries Prof. Rajiv Ranjan, SMIEEE School of Computing Science, Newcastle University, UK Visiting Scientist, Data61, CSIRO, Australia

More information

What Is Datacenter (Warehouse) Computing. Distributed and Parallel Technology. Datacenter Computing Architecture

What Is Datacenter (Warehouse) Computing. Distributed and Parallel Technology. Datacenter Computing Architecture What Is Datacenter (Warehouse) Computing Distributed and Parallel Technology Datacenter, Warehouse and Cloud Computing Hans-Wolfgang Loidl School of Mathematical and Computer Sciences Heriot-Watt University,

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 2: MapReduce Algorithm Design (2/2) January 14, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

Technical Report. Recomputation-based data reliability for MapReduce using lineage. Sherif Akoush, Ripduman Sohan, Andy Hopper. Number 888.

Technical Report. Recomputation-based data reliability for MapReduce using lineage. Sherif Akoush, Ripduman Sohan, Andy Hopper. Number 888. Technical Report UCAM-CL-TR-888 ISSN 1476-2986 Number 888 Computer Laboratory Recomputation-based data reliability for MapReduce using lineage Sherif Akoush, Ripduman Sohan, Andy Hopper May 2016 15 JJ

More information

PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP

PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP ISSN: 0976-2876 (Print) ISSN: 2250-0138 (Online) PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP T. S. NISHA a1 AND K. SATYANARAYAN REDDY b a Department of CSE, Cambridge

More information

CSE 344 MAY 2 ND MAP/REDUCE

CSE 344 MAY 2 ND MAP/REDUCE CSE 344 MAY 2 ND MAP/REDUCE ADMINISTRIVIA HW5 Due Tonight Practice midterm Section tomorrow Exam review PERFORMANCE METRICS FOR PARALLEL DBMSS Nodes = processors, computers Speedup: More nodes, same data

More information

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568 FLAT DATACENTER STORAGE Paper-3 Presenter-Pratik Bhatt fx6568 FDS Main discussion points A cluster storage system Stores giant "blobs" - 128-bit ID, multi-megabyte content Clients and servers connected

More information

CISC 7610 Lecture 2b The beginnings of NoSQL

CISC 7610 Lecture 2b The beginnings of NoSQL CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone

More information

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross Track Join Distributed Joins with Minimal Network Traffic Orestis Polychroniou Rajkumar Sen Kenneth A. Ross Local Joins Algorithms Hash Join Sort Merge Join Index Join Nested Loop Join Spilling to disk

More information

On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage

On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage Arijit Khan Nanyang Technological University (NTU), Singapore Gustavo Segovia ETH Zurich, Switzerland Donald Kossmann Microsoft

More information

EN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers Prof. Sherief Reda School of Engineering Brown University

EN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers Prof. Sherief Reda School of Engineering Brown University EN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers Prof. Sherief Reda School of Engineering Brown University Material from: The Datacenter as a Computer: An Introduction to

More information

CS-2510 COMPUTER OPERATING SYSTEMS

CS-2510 COMPUTER OPERATING SYSTEMS CS-2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application Functional

More information