Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks

Size: px
Start display at page:

Download "Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks"

Transcription

1 Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks Course: CS655 Rabiet Louis Colorado State University Thursday 19 September / 48

2 1 Motivation and Goal Why use Dryad? 2 Dryad Graph Example Properties 3 Some others Ants Pegasus 4 An improvement: Ciel Dynamic tasks 5 Comparison 6 Conclusion 2 / 48

3 Plan 1 Motivation and Goal Why use Dryad? 2 Dryad Graph Example Properties 3 Some others Ants Pegasus 4 An improvement: Ciel Dynamic tasks 5 Comparison 6 Conclusion 3 / 48

4 Why use Dryad? Parallel Programming Goal Make parallel programs easier for developers. 4 / 48

5 Why use Dryad? Parallel Programming Goal Make parallel programs easier for developers. Current Choices Google s MapReduce is a popular framework for developing cloud-scale applications. 4 / 48

6 Why use Dryad? Parallel Programming Goal Make parallel programs easier for developers. Current Choices Google s MapReduce is a popular framework for developing cloud-scale applications. GPU shader languages. 4 / 48

7 Why use Dryad? Parallel Programming Goal Make parallel programs easier for developers. Current Choices Google s MapReduce is a popular framework for developing cloud-scale applications. GPU shader languages. Parallel databases. 4 / 48

8 Why use Dryad? Solution and Results How esle can we achieve that? Using graphs for describing the scheduling/ relation between jobs. 5 / 48

9 Why use Dryad? Solution and Results How esle can we achieve that? Using graphs for describing the scheduling/ relation between jobs. The system handles the rest i.e. managing execution of stages that compose the graph. 5 / 48

10 Why use Dryad? Solution and Results How esle can we achieve that? Using graphs for describing the scheduling/ relation between jobs. The system handles the rest i.e. managing execution of stages that compose the graph. Results Can be performance efficient and simpler. 5 / 48

11 Plan 1 Motivation and Goal Why use Dryad? 2 Dryad Graph Example Properties 3 Some others Ants Pegasus 4 An improvement: Ciel Dynamic tasks 5 Comparison 6 Conclusion 6 / 48

12 System overview 7 / 48

13 Graph Dryad program is a graph Several operators over them can be specified 8 / 48

14 Graph Dryad program is a graph Several operators over them can be specified Adding and cloning vertices 8 / 48

15 Graph Dryad program is a graph Several operators over them can be specified Adding and cloning vertices 8 / 48

16 Graph Dryad program is a graph Several operators over them can be specified Adding and cloning vertices Adding edges 8 / 48

17 Graph Dryad program is a graph Several operators over them can be specified Adding and cloning vertices Adding edges 8 / 48

18 Graph Dryad program is a graph Several operators over them can be specified Adding and cloning vertices Adding edges Merging two graphs 8 / 48

19 Graph Dryad program is a graph Several operators over them can be specified Adding and cloning vertices Adding edges Merging two graphs 8 / 48

20 Graph Merge 9 / 48

21 Graph Merge 9 / 48

22 Graph Merge Merge with bypass 9 / 48

23 Graph Merge Merge with bypass 9 / 48

24 Graph Merge Merge with bypass Asymmetric join 9 / 48

25 Graph Merge Merge with bypass Asymmetric join 9 / 48

26 Example Code example select distinct p.objid from photoobjall p join neighbors n -- call this join \X" on p.objid = n.objid and n.objid < n.neighborobjid and p.mode = 1 join photoobjall l -- call this join \Y" on l.objid = n.neighborobjid and l.mode = 1 and abs((p.u-p.g)-(l.u-l.g))<0.05 and abs((p.g-p.r)-(l.g-l.r))<0.05 and abs((p.r-p.i)-(l.r-l.i))<0.05 and abs((p.i-p.z)-(l.i-l.z))< / 48

27 Example Corresponding graph 11 / 48

28 Example Corresponding vertex program G r a p h B u i l d e r XSet = modulexˆn; G r a p h B u i l d e r DSet = moduledˆn; G r a p h B u i l d e r MSet = modulem ˆ(N 4) ; GraphBuilder SSet = modules ˆ(N 4) ; G r a p h B u i l d e r YSet = moduleyˆn; GraphBuilder HSet = moduleh ˆ1; GraphBuilder XInputs = ( ugriz1 >= XSet ) ( neighbor >= XSet ) ; G r a p h B u i l d e r Y I n p u t s = u g r i z 2 >= YSet ; G r a p h B u i l d e r XToY = XSet >= DSet >> MSet >= SSet ; f o r ( i = 0 ; i < N 4; ++i ) { XToY = XToY ( SSet. G e t V e r t e x ( i ) >= YSet. G e t V e r t e x ( i /4) ) ; } G r a p h B u i l d e r YToH = YSet >= HSet ; GraphBuilder HOutputs = HSet >= output ; GraphBuilder f i n a l = XInputs YInputs XToY YToH HOutputs ; 12 / 48

29 Properties Some nice properties... Channels Communication can be chosen between Files TCP Pipe Shared-Memory FIFO 13 / 48

30 Properties Some nice properties... Channels Communication can be chosen between Files TCP Pipe Shared-Memory FIFO Support for legacy applications Node/vertices can execute old code not specifically written for Dryad. 13 / 48

31 Properties Some nice properties Runtime aggregation 14 / 48

32 Properties Some nice properties Runtime aggregation Fault tolerance Restart failed vertex. 14 / 48

33 Properties Scalability / 48

34 Properties Scalability 16 / 48

35 Properties SQL 17 / 48

36 Plan 1 Motivation and Goal Why use Dryad? 2 Dryad Graph Example Properties 3 Some others Ants Pegasus 4 An improvement: Ciel Dynamic tasks 5 Comparison 6 Conclusion 18 / 48

37 Ants Some limitations of Dryad 19 / 48

38 Ants Some limitations of Dryad Fault tolerance: what happens if a stage manager fails. 19 / 48

39 Ants Some limitations of Dryad Fault tolerance: what happens if a stage manager fails. No way of defining dynamic graphs. 19 / 48

40 Ants Linking ACO and Dryad 20 / 48

41 Ants Linking ACO and Dryad 20 / 48

42 Ants Linking ACO and Dryad Finding a path between producers and consumers 20 / 48

43 Ants Linking ACO and Dryad Finding a path between producers and consumers Is finding a path sufficient? 20 / 48

44 Ants Linking ACO and Dryad Finding a path between producers and consumers Is finding a path sufficient? What happens if a node is already close to overutilization? 20 / 48

45 Ants Linking ACO and Dryad Finding a path between producers and consumers Is finding a path sufficient? What happens if a node is already close to overutilization? 1 Need a way of evaluating the quality of the path. 20 / 48

46 Ants Linking ACO and Dryad Finding a path between producers and consumers Is finding a path sufficient? What happens if a node is already close to overutilization? 1 Need a way of evaluating the quality of the path. 2 Avoid interference between two tasks. 20 / 48

47 Ants Abstraction of the problem 21 / 48

48 Ants Abstraction of the problem Each cell is a server 21 / 48

49 Ants Abstraction of the problem Each cell is a server In a cell C, we have N(C), the neighbours Φ C D,n: probability if the ant has the destination D that the ant is going to the cell n when currently on C. Routing table: mean and variance of the time to go from C to D. 21 / 48

50 Ants Abstraction of the problem Each cell is a server In a cell C, we have N(C), the neighbours Φ C D,n: probability if the ant has the destination D that the ant is going to the cell n when currently on C. Routing table: mean and variance of the time to go from C to D. 21 / 48

51 Ants First type of ant: routing ants An ant records its path, knows the final destination D and the source C s. At each step, when an ant is on a cell C (different from D). 22 / 48

52 Ants First type of ant: routing ants An ant records its path, knows the final destination D and the source C s. At each step, when an ant is on a cell C (different from D). 1 Record queue waiting time at C. 22 / 48

53 Ants First type of ant: routing ants An ant records its path, knows the final destination D and the source C s. At each step, when an ant is on a cell C (different from D). 1 Record queue waiting time at C. 2 Choose next hop (unknown neighbour first) using Φ C D,n. [Detection of cycles] 22 / 48

54 Ants First type of ant: routing ants An ant records its path, knows the final destination D and the source C s. At each step, when an ant is on a cell C (different from D). 1 Record queue waiting time at C. 2 Choose next hop (unknown neighbour first) using Φ C D,n. [Detection of cycles] 3 En-queue on the link to go to n. 22 / 48

55 Ants First type of ant: routing ants An ant records its path, knows the final destination D and the source C s. At each step, when an ant is on a cell C (different from D). 1 Record queue waiting time at C. 2 Choose next hop (unknown neighbour first) using Φ C D,n. [Detection of cycles] 3 En-queue on the link to go to n. 4 Record waiting time and transmission time. 22 / 48

56 Ants First type of ant: routing ants An ant records its path, knows the final destination D and the source C s. At each step, when an ant is on a cell C (different from D). 1 Record queue waiting time at C. 2 Choose next hop (unknown neighbour first) using Φ C D,n. [Detection of cycles] 3 En-queue on the link to go to n. 4 Record waiting time and transmission time. 5 If the ant has not arrived at D, insert itself in the queue, repeat / 48

57 Ants First type of ant: routing ants An ant records its path, knows the final destination D and the source C s. At each step, when an ant is on a cell C (different from D). 1 Record queue waiting time at C. 2 Choose next hop (unknown neighbour first) using Φ C D,n. [Detection of cycles] 3 En-queue on the link to go to n. 4 Record waiting time and transmission time. 5 If the ant has not arrived at D, insert itself in the queue, repeat If the ant arrived at D, start to go back to C s 22 / 48

58 Ants Second type: scouting ants Try to find alternative path to go from C s to D. 23 / 48

59 Ants Second type: scouting ants Try to find alternative path to go from C s to D. If the computing time of the new tasks L new and the computing time of the old tasks L prev does not surpass a threshold L T (user defined). 23 / 48

60 Ants Second type: scouting ants Try to find alternative path to go from C s to D. If the computing time of the new tasks L new and the computing time of the old tasks L prev does not surpass a threshold L T (user defined). if L T is not defined, the ant just ensures that the load on the server is inferior to α (random variable). g is an other random variable to determine the number of tasks the ant will place on the server. 23 / 48

61 Ants Second type: scouting ants Try to find alternative path to go from C s to D. If the computing time of the new tasks L new and the computing time of the old tasks L prev does not surpass a threshold L T (user defined). if L T is not defined, the ant just ensures that the load on the server is inferior to α (random variable). g is an other random variable to determine the number of tasks the ant will place on the server. The system generate scout ants with various α and g. If some ants are not able to place all the tasks and have reached D then they will crawl back and modify the Φ C D,n (Negative Reinforcement). 23 / 48

62 Ants Third type: enforcement ants 24 / 48

63 Ants Third type: enforcement ants When enough scout ants are reporting success, some enforcement ants are released. 24 / 48

64 Ants Third type: enforcement ants When enough scout ants are reporting success, some enforcement ants are released. Except if extraordinary results are shown by scouts reports, the number of successful scout ants need to be higher than a threshold. 24 / 48

65 Ants Third type: enforcement ants When enough scout ants are reporting success, some enforcement ants are released. Except if extraordinary results are shown by scouts reports, the number of successful scout ants need to be higher than a threshold. The enforcement ants recompute the cost (and compare it with the scout results); if everything is fine the placement is done. 24 / 48

66 Ants Some drawbacks 25 / 48

67 Ants Some drawbacks Evaluation is done via simulation 25 / 48

68 Ants Some drawbacks Evaluation is done via simulation No theoretical proof of correctness for placement algorithm 25 / 48

69 Ants Some drawbacks Evaluation is done via simulation No theoretical proof of correctness for placement algorithm Very few optimizations what about overhead costs? 25 / 48

70 Pegasus 26 / 48

71 Pegasus 27 / 48

72 Pegasus Scheduling horizon 28 / 48

73 Plan 1 Motivation and Goal Why use Dryad? 2 Dryad Graph Example Properties 3 Some others Ants Pegasus 4 An improvement: Ciel Dynamic tasks 5 Comparison 6 Conclusion 29 / 48

74 Dynamic tasks Control Flow 30 / 48

75 Dynamic tasks Publishing: how to handle dynamic graphs 31 / 48

76 Dynamic tasks Lazy evaluation Lazy evaluation is used to schedule task. 32 / 48

77 Dynamic tasks Lazy evaluation Lazy evaluation is used to schedule task. 1 Start by the (expected) output of the root task 32 / 48

78 Dynamic tasks Lazy evaluation Lazy evaluation is used to schedule task. 1 Start by the (expected) output of the root task 2 If T has only concrete dependencies, execute it immediately 32 / 48

79 Dynamic tasks Lazy evaluation Lazy evaluation is used to schedule task. 1 Start by the (expected) output of the root task 2 If T has only concrete dependencies, execute it immediately 3 Else block T until the input becomes concrete. 32 / 48

80 Dynamic tasks An optimization result = spawn_exec(executor, args, n); Unique Naming for the i th output: name = executor:sha1(arg n):i Memoisation If a new task s outputs have already been produced by a previous task, the new task need not be executed at all. 33 / 48

81 Master Ciel support master failure 34 / 48

82 Persistent logging Write when a new job is launched 35 / 48

83 Persistent logging Write when a new job is launched The descriptors (graphs) are periodically written to the disk. 35 / 48

84 Persistent logging Write when a new job is launched The descriptors (graphs) are periodically written to the disk. If the master reboots, it looks for unmatched jobs. They are then restarted (the graph has been stored). 35 / 48

85 Secondary masters 36 / 48

86 Secondary masters Everything is sent to every master (primary and secondaries). 36 / 48

87 Secondary masters Everything is sent to every master (primary and secondaries). Every secondary records workers address. 36 / 48

88 Object table reconstruction In a case of a master rebooting (failure then recovering): 37 / 48

89 Object table reconstruction In a case of a master rebooting (failure then recovering): 1 Workers detect master failure (heartbeat). 37 / 48

90 Object table reconstruction In a case of a master rebooting (failure then recovering): 1 Workers detect master failure (heartbeat). 2 If a failure is detected, workers send registration to the master. 37 / 48

91 Object table reconstruction In a case of a master rebooting (failure then recovering): 1 Workers detect master failure (heartbeat). 2 If a failure is detected, workers send registration to the master. 3 Master resends objects list. 37 / 48

92 Speed comparison 38 / 48

93 Plan 1 Motivation and Goal Why use Dryad? 2 Dryad Graph Example Properties 3 Some others Ants Pegasus 4 An improvement: Ciel Dynamic tasks 5 Comparison 6 Conclusion 39 / 48

94 Map Reduce 40 / 48

95 Map Reduce Simple Efficient for a specific kind of programs: web indexing 40 / 48

96 Map Reduce Simple Efficient for a specific kind of programs: web indexing Reliability very simple 40 / 48

97 Map Reduce Simple Efficient for a specific kind of programs: web indexing Reliability very simple Workflow hardcoded 40 / 48

98 Map Reduce Simple Efficient for a specific kind of programs: web indexing Reliability very simple Workflow hardcoded New? 40 / 48

99 Map Reduce Simple Efficient for a specific kind of programs: web indexing Reliability very simple Workflow hardcoded New? PL-SQL, Teradata. 40 / 48

100 Map Reduce Simple Efficient for a specific kind of programs: web indexing Reliability very simple Workflow hardcoded New? PL-SQL, Teradata. Improvement Instead of having a Map phase then a Reduce phase: do a combination of Map-Reduce. 40 / 48

101 Map Reduce Simple Efficient for a specific kind of programs: web indexing Reliability very simple Workflow hardcoded New? PL-SQL, Teradata. Improvement Instead of having a Map phase then a Reduce phase: do a combination of Map-Reduce. Computation Reuse! 40 / 48

102 Dryad 41 / 48

103 Dryad Some optimizations: aggregation (data locality). 41 / 48

104 Dryad Some optimizations: aggregation (data locality). Much more general than Map Reduce 41 / 48

105 Dryad Some optimizations: aggregation (data locality). Much more general than Map Reduce Not so simple (look at the code) 41 / 48

106 Dryad Some optimizations: aggregation (data locality). Much more general than Map Reduce Not so simple (look at the code) No cycles 41 / 48

107 Dryad Some optimizations: aggregation (data locality). Much more general than Map Reduce Not so simple (look at the code) No cycles Coming from SQL. 41 / 48

108 Dryad Some optimizations: aggregation (data locality). Much more general than Map Reduce Not so simple (look at the code) No cycles Coming from SQL. Improvement Ciel 41 / 48

109 Ants 42 / 48

110 Ants Analogy with nature. 42 / 48

111 Ants Analogy with nature. Distributed 42 / 48

112 Ants Analogy with nature. Distributed General 42 / 48

113 Ants Analogy with nature. Distributed General Account for machine load 42 / 48

114 Ants Analogy with nature. Distributed General Account for machine load Difficult to predict behaviour 42 / 48

115 Ants Analogy with nature. Distributed General Account for machine load Difficult to predict behaviour Overhead!! 42 / 48

116 Ants Analogy with nature. Distributed General Account for machine load Difficult to predict behaviour Overhead!! 42 / 48

117 Ants Analogy with nature. Distributed General Account for machine load Difficult to predict behaviour Overhead!! Improvement 42 / 48

118 Ants Analogy with nature. Distributed General Account for machine load Difficult to predict behaviour Overhead!! Improvement What happens if an ant is faulty? 42 / 48

119 Ants Analogy with nature. Distributed General Account for machine load Difficult to predict behaviour Overhead!! Improvement What happens if an ant is faulty? I would replace it with a simpler search of optimal paths. 42 / 48

120 Pegasus 43 / 48

121 Pegasus Performance issue 43 / 48

122 Pegasus Performance issue Not general enough 43 / 48

123 Pegasus Performance issue Not general enough 43 / 48

124 Pegasus Performance issue Not general enough Improvement Reliability (only partition level) 43 / 48

125 Ciel 44 / 48

126 Ciel Much more general. 44 / 48

127 Ciel Much more general. Dynamic task dependencies 44 / 48

128 Ciel Much more general. Dynamic task dependencies 44 / 48

129 Ciel Much more general. Dynamic task dependencies Improvement Random funtions 44 / 48

130 Ciel Much more general. Dynamic task dependencies Improvement Random funtions How far are you from machine performance? 44 / 48

131 Perpectives 45 / 48

132 Perpectives Adding user s transformations for graphs: tiling, merge, etc / 48

133 Perpectives Adding user s transformations for graphs: tiling, merge, etc... BGP (distance-vector routing protocol) 45 / 48

134 Perpectives Adding user s transformations for graphs: tiling, merge, etc... BGP (distance-vector routing protocol) Look at this problem using maximum flow problem. 45 / 48

135 Plan 1 Motivation and Goal Why use Dryad? 2 Dryad Graph Example Properties 3 Some others Ants Pegasus 4 An improvement: Ciel Dynamic tasks 5 Comparison 6 Conclusion 46 / 48

136 Conclusion Microsoft decided to stop developing Dryad for commercial reasons. 47 / 48

137 Conclusion Summary Several restrictions that make the result possible: Node results are deterministic Job manager doesn t fail Acyclic graph Doesn t add enough to MapReduce 48 / 48

Review: Dryad. Louis Rabiet. September 20, 2013

Review: Dryad. Louis Rabiet. September 20, 2013 Review: Dryad Louis Rabiet September 20, 2013 Who is the intended audi- What problem did the paper address? ence? The paper is proposing to solve the problem of being able to take advantages of distributed

More information

EEDC. Scientific Programming Models. Execution Environments for Distributed Computing. Master in Computer Architecture, Networks and Systems - CANS

EEDC. Scientific Programming Models. Execution Environments for Distributed Computing. Master in Computer Architecture, Networks and Systems - CANS EEDC Execution Environments for Distributed Computing 34330 Master in Computer Architecture, Networks and Systems - CANS Scientific Programming Models Group members: Francesc Lordan francesc.lordan@bsc.es

More information

Vertical Partitioning for Query Processing over Raw Data

Vertical Partitioning for Query Processing over Raw Data Vertical Partitioning for Query Processing over Raw Data University of California, Merced June 30, 2015 Outline 1 SDSS Data and Queries 2 Problem Statement 3 MIP Formulation 4 Heuristic Algorithm 5 Pipeline

More information

DRYAD: DISTRIBUTED DATA- PARALLEL PROGRAMS FROM SEQUENTIAL BUILDING BLOCKS

DRYAD: DISTRIBUTED DATA- PARALLEL PROGRAMS FROM SEQUENTIAL BUILDING BLOCKS DRYAD: DISTRIBUTED DATA- PARALLEL PROGRAMS FROM SEQUENTIAL BUILDING BLOCKS Authors: Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly Presenter: Zelin Dai WHAT IS DRYAD Combines computational

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models CIEL: A Universal Execution Engine for

More information

PREGEL: A SYSTEM FOR LARGE- SCALE GRAPH PROCESSING

PREGEL: A SYSTEM FOR LARGE- SCALE GRAPH PROCESSING PREGEL: A SYSTEM FOR LARGE- SCALE GRAPH PROCESSING G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser, G. Czajkowski Google, Inc. SIGMOD 2010 Presented by Ke Hong (some figures borrowed from

More information

Flat Datacenter Storage. Edmund B. Nightingale, Jeremy Elson, et al. 6.S897

Flat Datacenter Storage. Edmund B. Nightingale, Jeremy Elson, et al. 6.S897 Flat Datacenter Storage Edmund B. Nightingale, Jeremy Elson, et al. 6.S897 Motivation Imagine a world with flat data storage Simple, Centralized, and easy to program Unfortunately, datacenter networks

More information

Authors: Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, L., Leiser, N., Czjkowski, G.

Authors: Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, L., Leiser, N., Czjkowski, G. Authors: Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, L., Leiser, N., Czjkowski, G. Speaker: Chong Li Department: Applied Health Science Program: Master of Health Informatics 1 Term

More information

GFS Overview. Design goals/priorities Design for big-data workloads Huge files, mostly appends, concurrency, huge bandwidth Design for failures

GFS Overview. Design goals/priorities Design for big-data workloads Huge files, mostly appends, concurrency, huge bandwidth Design for failures GFS Overview Design goals/priorities Design for big-data workloads Huge files, mostly appends, concurrency, huge bandwidth Design for failures Interface: non-posix New op: record appends (atomicity matters,

More information

Transactum Business Process Manager with High-Performance Elastic Scaling. November 2011 Ivan Klianev

Transactum Business Process Manager with High-Performance Elastic Scaling. November 2011 Ivan Klianev Transactum Business Process Manager with High-Performance Elastic Scaling November 2011 Ivan Klianev Transactum BPM serves three primary objectives: To make it possible for developers unfamiliar with distributed

More information

Remote Procedure Call. Tom Anderson

Remote Procedure Call. Tom Anderson Remote Procedure Call Tom Anderson Why Are Distributed Systems Hard? Asynchrony Different nodes run at different speeds Messages can be unpredictably, arbitrarily delayed Failures (partial and ambiguous)

More information

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,

More information

TENSORFLOW: LARGE-SCALE MACHINE LEARNING ON HETEROGENEOUS DISTRIBUTED SYSTEMS. by Google Research. presented by Weichen Wang

TENSORFLOW: LARGE-SCALE MACHINE LEARNING ON HETEROGENEOUS DISTRIBUTED SYSTEMS. by Google Research. presented by Weichen Wang TENSORFLOW: LARGE-SCALE MACHINE LEARNING ON HETEROGENEOUS DISTRIBUTED SYSTEMS by Google Research presented by Weichen Wang 2016.11.28 OUTLINE Introduction The Programming Model The Implementation Single

More information

CSE 444: Database Internals. Lectures 26 NoSQL: Extensible Record Stores

CSE 444: Database Internals. Lectures 26 NoSQL: Extensible Record Stores CSE 444: Database Internals Lectures 26 NoSQL: Extensible Record Stores CSE 444 - Spring 2014 1 References Scalable SQL and NoSQL Data Stores, Rick Cattell, SIGMOD Record, December 2010 (Vol. 39, No. 4)

More information

Programming Systems for Big Data

Programming Systems for Big Data Programming Systems for Big Data CS315B Lecture 17 Including material from Kunle Olukotun Prof. Aiken CS 315B Lecture 17 1 Big Data We ve focused on parallel programming for computational science There

More information

References. What is Bigtable? Bigtable Data Model. Outline. Key Features. CSE 444: Database Internals

References. What is Bigtable? Bigtable Data Model. Outline. Key Features. CSE 444: Database Internals References CSE 444: Database Internals Scalable SQL and NoSQL Data Stores, Rick Cattell, SIGMOD Record, December 2010 (Vol 39, No 4) Lectures 26 NoSQL: Extensible Record Stores Bigtable: A Distributed

More information

Dryads. University of Applied Sciences Rapperswil. Date June 1, Supervisor Prof. Dr. Josef Joller. Author Pascal Kesseli.

Dryads. University of Applied Sciences Rapperswil. Date June 1, Supervisor Prof. Dr. Josef Joller. Author Pascal Kesseli. University of Applied Sciences Rapperswil Seminar Fall Semester 2010 Dryads Author Pascal Kesseli Date June 1, 2010 Supervisor Prof. Dr. Josef Joller Abstract This document evaluates the evolution of the

More information

Data-Intensive Distributed Computing

Data-Intensive Distributed Computing Data-Intensive Distributed Computing CS 451/651 431/631 (Winter 2018) Part 2: From MapReduce to Spark (1/2) January 18, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

Cloud Computing CS

Cloud Computing CS Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part

More information

EFFICIENT ALLOCATION OF DYNAMIC RESOURCES IN A CLOUD

EFFICIENT ALLOCATION OF DYNAMIC RESOURCES IN A CLOUD EFFICIENT ALLOCATION OF DYNAMIC RESOURCES IN A CLOUD S.THIRUNAVUKKARASU 1, DR.K.P.KALIYAMURTHIE 2 Assistant Professor, Dept of IT, Bharath University, Chennai-73 1 Professor& Head, Dept of IT, Bharath

More information

CSE 444: Database Internals. Lecture 23 Spark

CSE 444: Database Internals. Lecture 23 Spark CSE 444: Database Internals Lecture 23 Spark References Spark is an open source system from Berkeley Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei

More information

Chapter 4: Apache Spark

Chapter 4: Apache Spark Chapter 4: Apache Spark Lecture Notes Winter semester 2016 / 2017 Ludwig-Maximilians-University Munich PD Dr. Matthias Renz 2015, Based on lectures by Donald Kossmann (ETH Zürich), as well as Jure Leskovec,

More information

Hybrid MapReduce Workflow. Yang Ruan, Zhenhua Guo, Yuduo Zhou, Judy Qiu, Geoffrey Fox Indiana University, US

Hybrid MapReduce Workflow. Yang Ruan, Zhenhua Guo, Yuduo Zhou, Judy Qiu, Geoffrey Fox Indiana University, US Hybrid MapReduce Workflow Yang Ruan, Zhenhua Guo, Yuduo Zhou, Judy Qiu, Geoffrey Fox Indiana University, US Outline Introduction and Background MapReduce Iterative MapReduce Distributed Workflow Management

More information

Performing MapReduce on Data Centers with Hierarchical Structures

Performing MapReduce on Data Centers with Hierarchical Structures INT J COMPUT COMMUN, ISSN 1841-9836 Vol.7 (212), No. 3 (September), pp. 432-449 Performing MapReduce on Data Centers with Hierarchical Structures Z. Ding, D. Guo, X. Chen, X. Luo Zeliu Ding, Deke Guo,

More information

Distributed Systems. Lec 10: Distributed File Systems GFS. Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

Distributed Systems. Lec 10: Distributed File Systems GFS. Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Distributed Systems Lec 10: Distributed File Systems GFS Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung 1 Distributed File Systems NFS AFS GFS Some themes in these classes: Workload-oriented

More information

VIAF: Verification-based Integrity Assurance Framework for MapReduce. YongzhiWang, JinpengWei

VIAF: Verification-based Integrity Assurance Framework for MapReduce. YongzhiWang, JinpengWei VIAF: Verification-based Integrity Assurance Framework for MapReduce YongzhiWang, JinpengWei MapReduce in Brief Satisfying the demand for large scale data processing It is a parallel programming model

More information

Resilient Distributed Datasets

Resilient Distributed Datasets Resilient Distributed Datasets A Fault- Tolerant Abstraction for In- Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin,

More information

Batch Processing Basic architecture

Batch Processing Basic architecture Batch Processing Basic architecture in big data systems COS 518: Distributed Systems Lecture 10 Andrew Or, Mike Freedman 2 1 2 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 3

More information

CSE 417: Algorithms and Computational Complexity

CSE 417: Algorithms and Computational Complexity CSE : Algorithms and Computational Complexity More Graph Algorithms Autumn 00 Paul Beame Given: a directed acyclic graph (DAG) G=(V,E) Output: numbering of the vertices of G with distinct numbers from

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

MapReduce for Data Intensive Scientific Analyses

MapReduce for Data Intensive Scientific Analyses apreduce for Data Intensive Scientific Analyses Jaliya Ekanayake Shrideep Pallickara Geoffrey Fox Department of Computer Science Indiana University Bloomington, IN, 47405 5/11/2009 Jaliya Ekanayake 1 Presentation

More information

Parallel Programming Concepts

Parallel Programming Concepts Parallel Programming Concepts MapReduce Frank Feinbube Source: MapReduce: Simplied Data Processing on Large Clusters; Dean et. Al. Examples for Parallel Programming Support 2 MapReduce 3 Programming model

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Distributed Filesystem

Distributed Filesystem Distributed Filesystem 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributing Code! Don t move data to workers move workers to the data! - Store data on the local disks of nodes in the

More information

CompSci 516: Database Systems

CompSci 516: Database Systems CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements Practice midterm posted on sakai First prepare and

More information

COMP6511A: Large-Scale Distributed Systems. Windows Azure. Lin Gu. Hong Kong University of Science and Technology Spring, 2014

COMP6511A: Large-Scale Distributed Systems. Windows Azure. Lin Gu. Hong Kong University of Science and Technology Spring, 2014 COMP6511A: Large-Scale Distributed Systems Windows Azure Lin Gu Hong Kong University of Science and Technology Spring, 2014 Cloud Systems Infrastructure as a (IaaS): basic compute and storage resources

More information

Mitigating Data Skew Using Map Reduce Application

Mitigating Data Skew Using Map Reduce Application Ms. Archana P.M Mitigating Data Skew Using Map Reduce Application Mr. Malathesh S.H 4 th sem, M.Tech (C.S.E) Associate Professor C.S.E Dept. M.S.E.C, V.T.U Bangalore, India archanaanil062@gmail.com M.S.E.C,

More information

The Google File System (GFS)

The Google File System (GFS) 1 The Google File System (GFS) CS60002: Distributed Systems Antonio Bruto da Costa Ph.D. Student, Formal Methods Lab, Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur 2 Design constraints

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google* 정학수, 최주영 1 Outline Introduction Design Overview System Interactions Master Operation Fault Tolerance and Diagnosis Conclusions

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

Putting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21

Putting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21 Big Processing -Parallel Computation COS 418: Distributed Systems Lecture 21 Michael Freedman 2 Ex: Word count using partial aggregation Putting it together 1. Compute word counts from individual files

More information

Massive Online Analysis - Storm,Spark

Massive Online Analysis - Storm,Spark Massive Online Analysis - Storm,Spark presentation by R. Kishore Kumar Research Scholar Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Kharagpur-721302, India (R

More information

Parallel Computing: MapReduce Jin, Hai

Parallel Computing: MapReduce Jin, Hai Parallel Computing: MapReduce Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology ! MapReduce is a distributed/parallel computing framework introduced by Google

More information

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 Web Search Engine Crawling Indexing Computing

More information

NPTEL Course Jan K. Gopinath Indian Institute of Science

NPTEL Course Jan K. Gopinath Indian Institute of Science Storage Systems NPTEL Course Jan 2012 (Lecture 39) K. Gopinath Indian Institute of Science Google File System Non-Posix scalable distr file system for large distr dataintensive applications performance,

More information

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel

More information

Survey on MapReduce Scheduling Algorithms

Survey on MapReduce Scheduling Algorithms Survey on MapReduce Scheduling Algorithms Liya Thomas, Mtech Student, Department of CSE, SCTCE,TVM Syama R, Assistant Professor Department of CSE, SCTCE,TVM ABSTRACT MapReduce is a programming model used

More information

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 CS 347 Notes 12 5 Web Search Engine Crawling

More information

HADOOP FRAMEWORK FOR BIG DATA

HADOOP FRAMEWORK FOR BIG DATA HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further

More information

Chapter 20: Database System Architectures

Chapter 20: Database System Architectures Chapter 20: Database System Architectures Chapter 20: Database System Architectures Centralized and Client-Server Systems Server System Architectures Parallel Systems Distributed Systems Network Types

More information

Fault-Tolerant Computer System Design ECE 695/CS 590. Putting it All Together

Fault-Tolerant Computer System Design ECE 695/CS 590. Putting it All Together Fault-Tolerant Computer System Design ECE 695/CS 590 Putting it All Together Saurabh Bagchi ECE/CS Purdue University ECE 695/CS 590 1 Outline Looking at some practical systems that integrate multiple techniques

More information

Resource Control and Reservation

Resource Control and Reservation 1 Resource Control and Reservation Resource Control and Reservation policing: hold sources to committed resources scheduling: isolate flows, guarantees resource reservation: establish flows 2 Usage parameter

More information

RSVP 1. Resource Control and Reservation

RSVP 1. Resource Control and Reservation RSVP 1 Resource Control and Reservation RSVP 2 Resource Control and Reservation policing: hold sources to committed resources scheduling: isolate flows, guarantees resource reservation: establish flows

More information

Azure MapReduce. Thilina Gunarathne Salsa group, Indiana University

Azure MapReduce. Thilina Gunarathne Salsa group, Indiana University Azure MapReduce Thilina Gunarathne Salsa group, Indiana University Agenda Recap of Azure Cloud Services Recap of MapReduce Azure MapReduce Architecture Application development using AzureMR Pairwise distance

More information

Distributed Systems. Fall 2017 Exam 3 Review. Paul Krzyzanowski. Rutgers University. Fall 2017

Distributed Systems. Fall 2017 Exam 3 Review. Paul Krzyzanowski. Rutgers University. Fall 2017 Distributed Systems Fall 2017 Exam 3 Review Paul Krzyzanowski Rutgers University Fall 2017 December 11, 2017 CS 417 2017 Paul Krzyzanowski 1 Question 1 The core task of the user s map function within a

More information

Data Modeling and Databases Ch 14: Data Replication. Gustavo Alonso, Ce Zhang Systems Group Department of Computer Science ETH Zürich

Data Modeling and Databases Ch 14: Data Replication. Gustavo Alonso, Ce Zhang Systems Group Department of Computer Science ETH Zürich Data Modeling and Databases Ch 14: Data Replication Gustavo Alonso, Ce Zhang Systems Group Department of Computer Science ETH Zürich Database Replication What is database replication The advantages of

More information

BigData and Map Reduce VITMAC03

BigData and Map Reduce VITMAC03 BigData and Map Reduce VITMAC03 1 Motivation Process lots of data Google processed about 24 petabytes of data per day in 2009. A single machine cannot serve all the data You need a distributed system to

More information

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

MapReduce. U of Toronto, 2014

MapReduce. U of Toronto, 2014 MapReduce U of Toronto, 2014 http://www.google.org/flutrends/ca/ (2012) Average Searches Per Day: 5,134,000,000 2 Motivation Process lots of data Google processed about 24 petabytes of data per day in

More information

Lecture 23 Database System Architectures

Lecture 23 Database System Architectures CMSC 461, Database Management Systems Spring 2018 Lecture 23 Database System Architectures These slides are based on Database System Concepts 6 th edition book (whereas some quotes and figures are used

More information

Camdoop Exploiting In-network Aggregation for Big Data Applications Paolo Costa

Camdoop Exploiting In-network Aggregation for Big Data Applications Paolo Costa Camdoop Exploiting In-network Aggregation for Big Data Applications costa@imperial.ac.uk joint work with Austin Donnelly, Antony Rowstron, and Greg O Shea (MSR Cambridge) MapReduce Overview Input file

More information

Reinforcement learning algorithms for non-stationary environments. Devika Subramanian Rice University

Reinforcement learning algorithms for non-stationary environments. Devika Subramanian Rice University Reinforcement learning algorithms for non-stationary environments Devika Subramanian Rice University Joint work with Peter Druschel and Johnny Chen of Rice University. Supported by a grant from Southwestern

More information

Distributed Programming the Google Way

Distributed Programming the Google Way Distributed Programming the Google Way Gregor Hohpe Software Engineer www.enterpriseintegrationpatterns.com Scalable & Distributed Fault tolerant distributed disk storage: Google File System Distributed

More information

Map-Reduce. Marco Mura 2010 March, 31th

Map-Reduce. Marco Mura 2010 March, 31th Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of

More information

Lecture 7. Reminder: Homework 2, Programming Project 1 due today. Homework 3, Programming Project 2 out, due Thursday next week. Questions?

Lecture 7. Reminder: Homework 2, Programming Project 1 due today. Homework 3, Programming Project 2 out, due Thursday next week. Questions? Lecture 7 Reminder: Homework 2, Programming Project 1 due today. Homework 3, Programming Project 2 out, due Thursday next week. Questions? Thursday, September 15 CS 475 Networks - Lecture 7 1 Outline Chapter

More information

Today s content. Resilient Distributed Datasets(RDDs) Spark and its data model

Today s content. Resilient Distributed Datasets(RDDs) Spark and its data model Today s content Resilient Distributed Datasets(RDDs) ------ Spark and its data model Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing -- Spark By Matei Zaharia,

More information

Efficient Dynamic Resource Allocation Using Nephele in a Cloud Environment

Efficient Dynamic Resource Allocation Using Nephele in a Cloud Environment International Journal of Scientific & Engineering Research Volume 3, Issue 8, August-2012 1 Efficient Dynamic Resource Allocation Using Nephele in a Cloud Environment VPraveenkumar, DrSSujatha, RChinnasamy

More information

Spark Overview. Professor Sasu Tarkoma.

Spark Overview. Professor Sasu Tarkoma. Spark Overview 2015 Professor Sasu Tarkoma www.cs.helsinki.fi Apache Spark Spark is a general-purpose computing framework for iterative tasks API is provided for Java, Scala and Python The model is based

More information

CS 138: Google. CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved.

CS 138: Google. CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved. CS 138: Google CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved. Google Environment Lots (tens of thousands) of computers all more-or-less equal - processor, disk, memory, network interface

More information

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia,

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com Presenter: Alex Hu } Introduction } Architecture } File

More information

MapReduce & Resilient Distributed Datasets. Yiqing Hua, Mengqi(Mandy) Xia

MapReduce & Resilient Distributed Datasets. Yiqing Hua, Mengqi(Mandy) Xia MapReduce & Resilient Distributed Datasets Yiqing Hua, Mengqi(Mandy) Xia Outline - MapReduce: - - Resilient Distributed Datasets (RDD) - - Motivation Examples The Design and How it Works Performance Motivation

More information

Spark and Spark SQL. Amir H. Payberah. SICS Swedish ICT. Amir H. Payberah (SICS) Spark and Spark SQL June 29, / 71

Spark and Spark SQL. Amir H. Payberah. SICS Swedish ICT. Amir H. Payberah (SICS) Spark and Spark SQL June 29, / 71 Spark and Spark SQL Amir H. Payberah amir@sics.se SICS Swedish ICT Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 1 / 71 What is Big Data? Amir H. Payberah (SICS) Spark and Spark SQL June 29,

More information

Data Centers. Tom Anderson

Data Centers. Tom Anderson Data Centers Tom Anderson Transport Clarification RPC messages can be arbitrary size Ex: ok to send a tree or a hash table Can require more than one packet sent/received We assume messages can be dropped,

More information

Chapter 18: Database System Architectures.! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems!

Chapter 18: Database System Architectures.! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Chapter 18: Database System Architectures! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types 18.1 Centralized Systems! Run on a single computer system and

More information

One Trillion Edges. Graph processing at Facebook scale

One Trillion Edges. Graph processing at Facebook scale One Trillion Edges Graph processing at Facebook scale Introduction Platform improvements Compute model extensions Experimental results Operational experience How Facebook improved Apache Giraph Facebook's

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems Announcements CompSci 516 Database Systems Lecture 12 - and Spark Practice midterm posted on sakai First prepare and then attempt! Midterm next Wednesday 10/11 in class Closed book/notes, no electronic

More information

Spark. Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.

Spark. Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica. Spark Cluster Computing with Working Sets Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica UC Berkeley Background MapReduce and Dryad raised level of abstraction in cluster

More information

L22: SC Report, Map Reduce

L22: SC Report, Map Reduce L22: SC Report, Map Reduce November 23, 2010 Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance Google version = Map Reduce; Hadoop = Open source

More information

TensorFlow: A System for Learning-Scale Machine Learning. Google Brain

TensorFlow: A System for Learning-Scale Machine Learning. Google Brain TensorFlow: A System for Learning-Scale Machine Learning Google Brain The Problem Machine learning is everywhere This is in large part due to: 1. Invention of more sophisticated machine learning models

More information

Microsoft Azure Databricks for data engineering. Building production data pipelines with Apache Spark in the cloud

Microsoft Azure Databricks for data engineering. Building production data pipelines with Apache Spark in the cloud Microsoft Azure Databricks for data engineering Building production data pipelines with Apache Spark in the cloud Azure Databricks As companies continue to set their sights on making data-driven decisions

More information

This project must be done in groups of 2 3 people. Your group must partner with one other group (or two if we have add odd number of groups).

This project must be done in groups of 2 3 people. Your group must partner with one other group (or two if we have add odd number of groups). 1/21/2015 CS 739 Distributed Systems Fall 2014 PmWIki / Project1 PmWIki / Project1 The goal for this project is to implement a small distributed system, to get experience in cooperatively developing a

More information

CS 138: Google. CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.

CS 138: Google. CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. CS 138: Google CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. Google Environment Lots (tens of thousands) of computers all more-or-less equal - processor, disk, memory, network interface

More information

CHAPTER 7 CONCLUSION AND FUTURE SCOPE

CHAPTER 7 CONCLUSION AND FUTURE SCOPE 121 CHAPTER 7 CONCLUSION AND FUTURE SCOPE This research has addressed the issues of grid scheduling, load balancing and fault tolerance for large scale computational grids. To investigate the solution

More information

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015 Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document

More information

Data Intensive Scalable Computing

Data Intensive Scalable Computing Data Intensive Scalable Computing Randal E. Bryant Carnegie Mellon University http://www.cs.cmu.edu/~bryant Examples of Big Data Sources Wal-Mart 267 million items/day, sold at 6,000 stores HP built them

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568 FLAT DATACENTER STORAGE Paper-3 Presenter-Pratik Bhatt fx6568 FDS Main discussion points A cluster storage system Stores giant "blobs" - 128-bit ID, multi-megabyte content Clients and servers connected

More information

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela

More information

MapReduce: Simplified Data Processing on Large Clusters. By Stephen Cardina

MapReduce: Simplified Data Processing on Large Clusters. By Stephen Cardina MapReduce: Simplified Data Processing on Large Clusters By Stephen Cardina The Problem You have a large amount of raw data, such as a database or a web log, and you need to get some sort of derived data

More information

Pregel. Ali Shah

Pregel. Ali Shah Pregel Ali Shah s9alshah@stud.uni-saarland.de 2 Outline Introduction Model of Computation Fundamentals of Pregel Program Implementation Applications Experiments Issues with Pregel 3 Outline Costs of Computation

More information

MapReduce Algorithm Design

MapReduce Algorithm Design MapReduce Algorithm Design Bu eğitim sunumları İstanbul Kalkınma Ajansı nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul Mali Destek Programı kapsamında yürütülmekte olan TR10/16/YNY/0036 no lu İstanbul Big

More information

DryadLINQ. Distributed Computation. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India

DryadLINQ. Distributed Computation. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India Dryad Distributed Computation Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Distributed Batch Processing 1/34 Outline Motivation 1 Motivation

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 2: MapReduce Algorithm Design (2/2) January 14, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

PREGEL AND GIRAPH. Why Pregel? Processing large graph problems is challenging Options

PREGEL AND GIRAPH. Why Pregel? Processing large graph problems is challenging Options Data Management in the Cloud PREGEL AND GIRAPH Thanks to Kristin Tufte 1 Why Pregel? Processing large graph problems is challenging Options Custom distributed infrastructure Existing distributed computing

More information

MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1

MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1 MapReduce-II September 2013 Alberto Abelló & Oscar Romero 1 Knowledge objectives 1. Enumerate the different kind of processes in the MapReduce framework 2. Explain the information kept in the master 3.

More information

CLOUD-SCALE FILE SYSTEMS

CLOUD-SCALE FILE SYSTEMS Data Management in the Cloud CLOUD-SCALE FILE SYSTEMS 92 Google File System (GFS) Designing a file system for the Cloud design assumptions design choices Architecture GFS Master GFS Chunkservers GFS Clients

More information

Big Data 7. Resource Management

Big Data 7. Resource Management Ghislain Fourny Big Data 7. Resource Management artjazz / 123RF Stock Photo Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Storage

More information

Process groups and message ordering

Process groups and message ordering Process groups and message ordering If processes belong to groups, certain algorithms can be used that depend on group properties membership create ( name ), kill ( name ) join ( name, process ), leave

More information

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Programming Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Computing Only required amount of CPU and storage can be used anytime from anywhere via network Availability, throughput, reliability

More information