PaaS and Hadoop. Dr. Laiping Zhao ( 赵来平 ) School of Computer Software, Tianjin University


1 PaaS and Hadoop Dr. Laiping Zhao ( 赵来平 ) School of Computer Software, Tianjin University laiping@tju.edu.cn 1

2 Outline PaaS Hadoop: HDFS and Mapreduce YARN Single-Processor Scheduling Hadoop Scheduling Dominant-Resource Fair Scheduling 2

3 PaaS Platform as a Service (PaaS) is a computing platform that abstracts the infrastructure, OS, and middleware to drive developer productivity. 3

4 PaaS Delivers the computing platform as a service: developing applications using programming languages and tools supported by the PaaS provider, and deploying consumer-created applications onto the cloud infrastructure. 4

5 Core Platform PaaS providers provide a runtime environment for the developer platform. The runtime environment is managed automatically so that consumers can focus on their services: Dynamic provisioning On-demand resource provisioning. Load balancing Distribute workload evenly among resources. Fault tolerance Continue operating in the presence of failures. System monitoring Monitor the system status and measure the usage of resources. 5

6 PaaS PaaS Vendors 6

7 Hadoop Distributed File System GFS vs HDFS Distributed Data Processing Mapreduce 7

8 Motivation: Large Scale Data Processing Many tasks consist of processing lots of data to produce lots of other data. Large-Scale Data Processing Want to use 1000s of CPUs, but don't want the hassle of managing things. Storage devices fail: 1.7% in year 1, up to 8.6% by year 3 (Google, 2007). With 10,000 nodes and 7 disks per node, year 1 sees about 1,190 failures/yr, or 3.3 failures/day. 8

9 Example: in Astronomy SKA Square Kilometer Array ( 平方公里阵列望远镜 ) Investment: $2 billion Data volume: over 12TB per second. 9

10 Motivation Data processing system provides: User-defined functions. Automatic parallelization and distribution. Fault tolerance. I/O scheduling. Status and monitoring. 10

11 Google Cloud Computing Google's platform: Distributed File System: Google File System (GFS), a large-scale distributed file system. Paper: The Google File System (2003). Parallel Programming Model: MapReduce. Paper: MapReduce: Simplified Data Processing on Large Clusters (2004). Programming Language: Sawzall. Paper: Interpreting the Data: Parallel Analysis with Sawzall (2005). Distributed Database: BigTable. Paper: Bigtable: A Distributed Storage System for Structured Data (2006). Distributed Lock Service: Chubby (built on Paxos). Paper: The Chubby lock service for loosely-coupled distributed systems (2006). Datacenter Construction. White paper: The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines (2009); Paper: Failure Trends in a Large Disk Drive Population (2007). Application Services: Web Search, Log Analysis, Gmail, Google Maps. 11

12 Google Cloud Computing Google's Technologies vs. Open Source Technologies: MapReduce Hadoop MapReduce; BigTable HBase; Google File System Hadoop Distributed File System. 12

13 Motivation: GFS Google needs a file system for storing massive data. Buy one (probably including both software and hardware)? Expensive! 13

14 Motivation: GFS Why not use an existing file system [2003]? Examples: RedHat GFS, IBM GPFS, Sun Lustre, etc. The problem is different: Different workload: huge files (100s of GB to TB); data is rarely updated in place; reads and appends are common. Running on commodity hardware. Compatible with Google's services. 14

15 Could we build a file system that runs on commodity hardware? 15

16 Motivation: GFS Design overview: The system is built from many inexpensive commodity components that often fail. The system stores a modest number of large files. The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. The workloads also have many large, sequential writes that append data to files. The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. High sustained bandwidth is more important than low latency. 16

17 Google File System Design of GFS Client: implements the file system API, communicates with the master and chunk servers. Master: a single master maintains all file system metadata. Chunk Server: stores data chunks on local disk as Linux files. 17

18 Google File System Minimize the master's involvement in all operations. Decouple the flow of data from the flow of control to use the network efficiently. A large chunk size: 64MB/128MB. Reduces clients' need to interact with the master, because reads and writes on the same chunk require only one initial request to the master for chunk location information. Since a client is more likely to perform many operations on a large chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. Reduces the size of the metadata stored on the master. 18

19 Google File System Master operations: Namespace management and locking. Replica placement By default, each chunk is replicated 3 times. Creation, Re-replication, Rebalancing. Garbage collection After a file is deleted, GFS does not immediately reclaim the available physical storage. Stale replica detection The master removes stale replicas in its regular garbage collection. 19

20 HDFS The Apache Hadoop software library: A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. HDFS (Hadoop Distributed File System): A distributed file system that provides high-throughput access to application data. Written in Java programming language. 20

21 HDFS Namenode Corresponds to GFS's Master. Secondary Namenode The Namenode's backup. Datanode Corresponds to GFS's Chunkserver. 21

22 Hadoop Hadoop Cluster 22

23 Heartbeat The DataNode periodically reports to the NameNode: 1. I am alive; 2. Its blocks table. 23

24 HDFS Read RPC 24

25 HDFS Read Network topology and Hadoop 25

26 HDFS Write 26

27 HDFS Write HDFS replica placement A trade-off among reliability, write bandwidth, and read bandwidth. Default strategy: The first replica is placed on the same node as the client (a random node when the client is outside the cluster). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. 27
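The default placement strategy above can be sketched in a few lines of Python. This is illustrative only: the `cluster` mapping and the function name are made up for this example, not HDFS's actual API, and it assumes the client runs on a node inside the cluster.

```python
import random

def place_replicas(client_node, cluster):
    """Sketch of HDFS's default 3-replica placement policy.

    `cluster` maps rack name -> list of node names (hypothetical structure).
    Assumes `client_node` is a node inside the cluster.
    """
    # First replica: on the same node as the client.
    first = client_node
    first_rack = next(r for r, nodes in cluster.items() if first in nodes)

    # Second replica: a random node on a different rack (off-rack).
    second_rack = random.choice([r for r in cluster if r != first_rack])
    second = random.choice(cluster[second_rack])

    # Third replica: a different node on the same rack as the second.
    third = random.choice([n for n in cluster[second_rack] if n != second])
    return [first, second, third]
```

With a two-rack cluster, the first replica lands on the client's node and the other two land on distinct nodes of the other rack.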

28 Mapreduce Jeff Dean 28

29 Motivation 29

30 Motivation Every search: 200+ CPUs, 200 TB of data, 0.1 second response time, 5¢ revenue. 30

31 Motivation Web data sets can be very large: Tens to hundreds of terabytes Cannot mine on a single server Data processing examples: Word Count Google Trends PageRank 31

32 Motivation Simple problem, difficult to solve: How to solve the problem within bounded time. Divide and Conquer! 32

33 Mapreduce Mapreduce: A programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Map: takes an input pair and produces a set of intermediate key/value pairs. Reduce: accepts an intermediate key I and a set of values for that key. It merges these values together to form a possibly smaller set of values. 33

34 Mapreduce 34

35 Mapreduce 35

36 Example: WordCount Input: Page 1: the weather is good Page 2: today is good Page 3: good weather is good 36

37 Example: WordCount map output: Worker 1: (the 1), (weather 1), (is 1), (good 1). Worker 2: (today 1), (is 1), (good 1). Worker 3: (good 1), (weather 1), (is 1), (good 1). 37

38 Example: WordCount Input of Reduce: Worker 1: (the 1) Worker 2: (is 1), (is 1), (is 1) Worker 3: (weather 1), (weather 1) Worker 4: (today 1) Worker 5: (good 1), (good 1), (good 1), (good 1) 38

39 Example: WordCount Reduce output: Worker 1: (the 1) Worker 2: (is 3) Worker 3: (weather 2) Worker 4: (today 1) Worker 5: (good 4) 39
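The whole WordCount flow on the slides above (map, shuffle/sort, reduce) can be sketched in plain Python. The function names `map_fn`, `reduce_fn`, and `word_count` are illustrative, not Hadoop's API; the point is only to show the three phases.

```python
from itertools import groupby

def map_fn(page):
    # Map: emit (word, 1) for every word in the input split.
    return [(word, 1) for word in page.split()]

def reduce_fn(key, values):
    # Reduce: merge all counts for one key into a single total.
    return key, sum(values)

def word_count(pages):
    # Map phase: run map_fn over every input split.
    intermediate = [pair for page in pages for pair in map_fn(page)]
    # Shuffle/sort phase: sort intermediate pairs so equal keys are adjacent.
    intermediate.sort(key=lambda kv: kv[0])
    # Reduce phase: one reduce_fn call per distinct key.
    return dict(reduce_fn(k, [v for _, v in grp])
                for k, grp in groupby(intermediate, key=lambda kv: kv[0]))
```

Running it on the three example pages reproduces the reduce output on this slide: the 1, is 3, weather 2, today 1, good 4.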


43 Hadoop Mapreduce Programming Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> Maps input key/value pairs to a set of intermediate key/value pairs. map() method: Called once for each key/value pair in the input split. Most applications should override this. 43

44 Hadoop Mapreduce Programming Context object: Allows the Mapper/Reducer to interact with the rest of the Hadoop system. It includes configuration data for the job as well as interfaces which allow it to emit output. Applications can use the Context: to report progress to set application-level status messages update Counters indicate they are alive to get the values that are stored in job configuration across map/reduce phase. 44

45 Hadoop Mapreduce Programming Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> Reduces a set of intermediate values which share a key to a smaller set of values. Reducer has 3 primary phases: Shuffle: Copy the sorted output from each Mapper using HTTP across the network. Sort: Sort Reducer inputs by keys. Reduce reduce() method is called. 45

46 Hadoop Mapreduce Programming reduce() method: This method is called once for each key. Most applications will define their reduce class by overriding this method. 46

47 Apache Hadoop The project includes these modules: Hadoop Common: The common utilities that support the other Hadoop modules. Hadoop YARN: A framework for job scheduling and cluster resource management. Hadoop Distributed File System (HDFS). Hadoop MapReduce. Other Hadoop-related projects at Apache include: Ambari, Avro, Cassandra (a scalable database), HBase (a distributed database), Hive (data summarization and ad hoc querying), Pig (a data-flow language), Spark (a fast and general compute engine), Tez (executes an arbitrary DAG of tasks), Chukwa, Zookeeper (a coordination service). 47

48 What is YARN? Yet Another Resource Negotiator. Provides resource management services Scheduling Monitoring Control Replaces the resource management services of the JobTracker. Bundled with Hadoop 0.23 and Hadoop 2.x. 48

49 Why YARN? 49

50 Why YARN? The Hadoop JobTracker was a barrier to scaling: The primary reason Hadoop 1.x is recommended only for clusters of up to 4000 nodes. Thousands of applications, each running tens of thousands of tasks. The JobTracker was not able to schedule resources as fast as they became available. Distinct map and reduce slots led to artificial bottlenecks and low cluster utilization. 50

51 Why YARN? MapReduce was being abused by other application frameworks: Frameworks trying to work around sort and shuffle. Iterative algorithms were suboptimal. YARN strives to be application-framework agnostic ( 中立的, 不可知的 ). Different application types can share the same cluster. Runs MapReduce out of the box as part of Apache Hadoop. 51

52 YARN High-Level Architecture ResourceManager Single, centralized daemon for scheduling containers; monitors nodes and applications. NodeManager Daemon running on each worker node in the cluster; launches, monitors, and controls containers. ApplicationMaster Provides scheduling, monitoring, and control for an application instance. The RM launches an AM for each application submitted to the cluster; the AM requests containers via the RM and launches containers via the NM. Containers The unit of allocation and control for YARN; the ApplicationMaster and application-specific tasks run within containers. 52

53 YARN High-Level Architecture 53

54 Mapreduce on YARN 54

55 Mapreduce on YARN 55

56 Mapreduce on YARN 56

57 Mapreduce on YARN 57

58 Mapreduce on YARN 58

59 Mapreduce on YARN 59

60 Mapreduce on YARN 60

61 Scheduling 61

62 Why Scheduling? Multiple tasks to schedule The processes on a single-core OS. The tasks of a Hadoop job. The tasks of multiple Hadoop jobs. Limited resources that these tasks require Processor(s) Memory (Less contentious) disk, network Scheduling goals 1. Good throughput or response time for tasks (or jobs) 2. High utilization of resources. 62

63 Single Processor Scheduling 63

64 FIFO Scheduling/FCFS Maintain tasks in a queue in order of arrival. When processor free, dequeue head and schedule it. 64

65 FIFO/FCFS Performance Average completion time may be high. For our example on the previous slides, average completion time of FIFO/FCFS = (Task 1 + Task 2 + Task 3)/3 = (10+15+18)/3 = 43/3 = 14.33. 65

66 STF Scheduling (Shortest Task First) Maintain all tasks in a queue, in increasing order of running time. When processor free, dequeue head and schedule. 66

67 STF Is Optimal! The average completion time of STF is the shortest among all scheduling approaches! For our example on the previous slides, average completion time of STF = (Task 1 + Task 2 + Task 3)/3 = (18+8+3)/3 = 29/3 = 9.66 (versus 14.33 for FIFO/FCFS). In general, STF is a special case of priority scheduling: instead of using time as the priority, the scheduler could use a user-provided priority. 67
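The FIFO and STF averages above can be checked with a short simulation. Task lengths of 10, 5, and 3 (in that arrival order) are consistent with the completion-time totals on these slides; they are used below as an illustrative input.

```python
def avg_completion(task_lengths):
    """Average completion time when tasks run back-to-back in list order."""
    elapsed, total = 0, 0
    for length in task_lengths:
        elapsed += length   # this task completes at time `elapsed`
        total += elapsed    # accumulate its completion time
    return total / len(task_lengths)

# FIFO runs tasks in arrival order; STF runs the same tasks sorted by length.
fifo_avg = avg_completion([10, 5, 3])          # 43/3, about 14.33
stf_avg = avg_completion(sorted([10, 5, 3]))   # 29/3, about 9.67
```

Sorting by length is the only difference between the two policies; it moves short tasks earlier, which lowers every later completion time in the sum.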

68 Round-Robin Scheduling Use a quantum (say 1 time unit) to run a portion of the task at the queue head. Pre-empt tasks by saving their state, and resume them later. After pre-empting, add the task to the end of the queue. 68
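The quantum-and-requeue behaviour above can be simulated directly. This is a minimal sketch that assumes a fixed quantum and zero context-switch cost.

```python
from collections import deque

def round_robin_completions(task_lengths, quantum=1):
    """Completion time of each task under round-robin with a fixed quantum."""
    queue = deque(enumerate(task_lengths))   # (task id, remaining time)
    clock, done = 0, {}
    while queue:
        tid, remaining = queue.popleft()
        run = min(quantum, remaining)        # run one quantum (or less)
        clock += run
        if remaining > run:
            queue.append((tid, remaining - run))  # pre-empt: back of the queue
        else:
            done[tid] = clock                # task finished
    return [done[i] for i in range(len(task_lengths))]
```

For example, a 3-unit task queued ahead of a 1-unit task finishes at time 4, while the short task finishes at time 2: the short interactive task gets served quickly instead of waiting for the long one, which is exactly the property the next slide argues for.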

69 Round-Robin vs. STF/FIFO Round-Robin preferable for: Interactive applications. User needs quick responses from system. FIFO/STF preferable for Batch applications User submits jobs, goes away, comes back to get result. 69

70 Hadoop Scheduling Activities: Mappers, Reducers. Resources: TaskTrackers. Scheduling goal: time efficiency. Scheduler: JobTracker (MRv1) / RM (YARN). Default scheduling algorithm: FIFO. 70

71 FIFO in Hadoop Supports 5 priority levels. Tasks are sorted according to their priority and submission time. Step 1: Select from the list of tasks with the highest priority. Step 2: Select the task with the earliest submission time in that list. Assign the selected task to a TaskTracker nearest to the target data. 71
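The two selection steps above collapse into a single sort key. This is a sketch: the tuple layout and the convention that a larger number means higher priority are assumptions for this example, not Hadoop's internal representation.

```python
def pick_next_task(tasks):
    """FIFO-in-Hadoop selection sketch.

    Each task is a (priority, submission_time, name) tuple, with larger
    priority numbers meaning more important (an assumption here).
    Step 1: keep only the highest priority (negated so min() prefers it).
    Step 2: among those, take the earliest submission time.
    """
    return min(tasks, key=lambda t: (-t[0], t[1]))
```

Given two priority-2 tasks submitted at times 5 and 3 plus a priority-1 task, the rule picks the priority-2 task submitted at time 3.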

72 FIFO in Hadoop Improve data locality to reduce communication: same node, same rack, remote rack. 72

73 FIFO in Hadoop A short task submitted later may have to wait a very long time if a previously submitted task is quite time-consuming. The job queue: User 1 User 2 User 3 User 4 73

74 Hadoop Fair Scheduler Job Scheduling for Multi-User MapReduce Clusters; Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling, M. Zaharia et al., EuroSys 2010. 74

75 Hadoop Fair Scheduler Design goals: Isolation Give each user (job) the illusion of owning (running) a private cluster. Statistical Multiplexing Redistribute capacity unused by some users (jobs) to other users (jobs). 75

76 Hadoop Fair Scheduler A two-level hierarchy. At the top level, FAIR allocates task slots across pools; each pool receives its minimum share. At the second level, each pool allocates its slots among the jobs in the pool. Example: Pools 1 and 3 have minimum shares of 60 and 10 slots, respectively. Because Pool 3 is not using its share, its slots are given to Pool 2. Each user can choose his pool's internal scheduling algorithm (FIFO or Fair). 76

77 Hadoop Fair Scheduler d: the demand capacity; m: the minimum share. 77

78 Hadoop Fair Scheduler FAIR operates in three phases. Phase 1: It fills each unmarked bucket, i.e., it satisfies the demand of each bucket whose minimum share is larger than its demand. Phase 2: It fills all remaining buckets up to their marks. With this step, the isolation property is enforced as each bucket has received either its minimum share, or its demand has been satisfied. Phase 3: FAIR implements statistical multiplexing by pouring the remaining water evenly into unfilled buckets, starting with the bucket with the least water and continuing until all buckets are full or the water runs out. 78
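The three phases above amount to a "water filling" over integer slots. The sketch below folds phases 1 and 2 into one step (every pool first gets min(demand, minimum share), which is what both phases jointly guarantee), then pours the remaining slots into the least-filled unfilled pool; the demands and minimum shares in the example are made up for illustration.

```python
def fair_allocate(slots, pools):
    """Sketch of FAIR's three-phase water filling.

    `pools` is a list of (demand d, minimum share m) tuples.
    Returns the number of slots allocated to each pool.
    """
    # Phases 1 & 2: each pool gets min(demand, minimum share).
    # This enforces isolation: a pool receives its minimum share
    # unless its demand is already satisfied.
    alloc = [min(d, m) for d, m in pools]
    remaining = slots - sum(alloc)
    # Phase 3: statistical multiplexing. Repeatedly give one slot to the
    # unfilled pool with the least water, until slots run out or all full.
    while remaining > 0:
        unfilled = [i for i, (d, _) in enumerate(pools) if alloc[i] < d]
        if not unfilled:
            break
        least = min(unfilled, key=lambda i: alloc[i])
        alloc[least] += 1
        remaining -= 1
    return alloc
```

With 100 slots and pools demanding (70, 40, 10) with minimum shares (60, 0, 10), the first and third pools get their minimum shares (60 and 10) and the 30 leftover slots multiplex to the second pool.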

79 Hadoop Fair Scheduler FAIR uses two timeouts: one for guaranteeing the minimum share (Tmin), and one for guaranteeing the fair share (Tfair), with Tmin < Tfair. If a newly started job does not get its minimum share before Tmin expires, FAIR kills other pools' tasks and re-allocates them to the job. Then, if the job has not achieved its fair share by Tfair, FAIR kills more tasks. It picks the most recently launched tasks in over-scheduled jobs, to minimize wasted computation. 79

80 Estimating Task Lengths HCS/HFS use FIFO May not be optimal (as we know!) Why not use shortest-task-first instead? It's optimal (as we know!) Challenge: it is hard to know the expected running time of a task (before it completes). Solution: estimate the task's length. Some approaches: Within a job: estimate a task's running time as proportional to the size of its input. Across tasks: estimate a task's running time in a given job as the average of the other tasks in that job (weighted by input size). Lots of recent research results in this area! 80

81 Dominant-Resource Fair Scheduling Ali Ghodsi, Matei Zaharia, et al., Dominant Resource Fairness: Fair Allocation of Multiple Resource Types. NSDI 2011. 81

82 Challenge What about scheduling VMs in a cloud (cluster)? Jobs may have multi-resource requirements: Job 1's tasks: 2 CPUs, 8 GB. Job 2's tasks: 6 CPUs, 2 GB. How do you schedule these jobs in a fair manner? That is, how many tasks of each job do you allow the system to run concurrently? What does fairness even mean? 82

83 Dominant Resource Fairness (DRF) Proposed by researchers from U. California Berkeley. Proposes a notion of fairness across jobs with multi-resource requirements. They showed that DRF is: Fair for multi-tenant systems. Strategy-proof: a tenant can't benefit by lying. Envy-free: a tenant can't envy another tenant's allocation. Dominant Resource Fairness: Fair Allocation of Multiple Resource Types. NSDI 2011. 83

84 Where is DRF Useful? DRF is: Usable in scheduling VMs in a cluster. Usable in scheduling Hadoop in a cluster. DRF is used in Mesos, an OS intended for cloud environments. DRF-like strategies are also used in some cloud computing companies' distributed OSes. 84

85 How DRF Works? Our example: Job 1's tasks: 2 CPUs, 8 GB => Job 1's resource vector = <2 CPUs, 8 GB>. Job 2's tasks: 6 CPUs, 2 GB => Job 2's resource vector = <6 CPUs, 2 GB>. Consider a cloud with <18 CPUs, 36 GB RAM>. 85

86 How DRF Works? Our example: Job 1's tasks: 2 CPUs, 8 GB => Job 1's resource vector = <2 CPUs, 8 GB>. Job 2's tasks: 6 CPUs, 2 GB => Job 2's resource vector = <6 CPUs, 2 GB>. Consider a cloud with <18 CPUs, 36 GB RAM>. Each of Job 1's tasks consumes 2/18 = 1/9 of total CPUs and 8/36 = 2/9 of total RAM. 1/9 < 2/9 => Job 1's dominant resource is RAM, i.e., Job 1 is more memory-intensive than it is CPU-intensive. 86

87 How DRF Works? Our example: Job 1's tasks: 2 CPUs, 8 GB => Job 1's resource vector = <2 CPUs, 8 GB>. Job 2's tasks: 6 CPUs, 2 GB => Job 2's resource vector = <6 CPUs, 2 GB>. Consider a cloud with <18 CPUs, 36 GB RAM>. Each of Job 2's tasks consumes 6/18 = 1/3 of total CPUs and 2/36 = 1/18 of total RAM. 1/3 > 1/18 => Job 2's dominant resource is CPU, i.e., Job 2 is more CPU-intensive than it is memory-intensive. 87

88 DRF Fairness For a given job, the % of its dominant resource type that it gets cluster-wide is the same for all jobs: Job 1's % of RAM = Job 2's % of CPU. Can be written as linear equations, and solved. 88

89 DRF Solution DRF ensures Job 1's % of RAM = Job 2's % of CPU. Solution for our example: Job 1 gets 3 tasks, each with <2 CPUs, 8 GB>. Job 2 gets 2 tasks, each with <6 CPUs, 2 GB>. Job 1's % of RAM = number of tasks * RAM per task / total cluster RAM = 3*8/36 = 2/3. Job 2's % of CPU = number of tasks * CPUs per task / total cluster CPUs = 2*6/18 = 2/3. 89
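The solution above can also be reached iteratively, by repeatedly launching one task for the job whose dominant share is currently smallest. The sketch below is an illustrative progressive-filling loop over whole tasks, not the paper's exact pseudocode; on the slides' example it reproduces the 3-task/2-task split.

```python
def drf_allocate(capacity, demands):
    """Progressive-filling sketch of DRF with whole tasks.

    `capacity` and each entry of `demands` are dicts of resource -> amount.
    Returns the number of tasks launched per job.
    """
    used = {r: 0 for r in capacity}
    tasks = [0] * len(demands)

    def dominant_share(job):
        # A job's dominant share: the largest fraction of any one resource
        # consumed by all of its tasks so far.
        return max(tasks[job] * demands[job][r] / capacity[r] for r in capacity)

    while True:
        # Consider jobs in increasing order of dominant share; launch one
        # task for the first job whose task still fits in the cluster.
        for job in sorted(range(len(demands)), key=dominant_share):
            if all(used[r] + demands[job][r] <= capacity[r] for r in capacity):
                for r in capacity:
                    used[r] += demands[job][r]
                tasks[job] += 1
                break
        else:
            return tasks  # no job's task fits any more
```

With capacity <18 CPUs, 36 GB> and task demands <2 CPUs, 8 GB> and <6 CPUs, 2 GB>, the loop ends with 3 and 2 tasks, and both jobs hold 2/3 of their dominant resource.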

90 Other DRF Details DRF generalizes to multiple jobs. DRF also generalizes to more than 2 resource types CPU, RAM, Network, Disk, etc. DRF ensures that each job gets a fair share of that type of resource which the job desires the most. Hence fairness. 90

91 Summary Scheduling is a very important problem in cloud computing: limited resources, lots of jobs requiring access to these resources. Single-processor scheduling: FIFO/FCFS, STF, Priority, Round-Robin. Hadoop scheduling: FIFO scheduler, Fair scheduler. Dominant-Resource Fairness. 91


Key aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling Key aspects of cloud computing Cluster Scheduling 1. Illusion of infinite computing resources available on demand, eliminating need for up-front provisioning. The elimination of an up-front commitment

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2013/14 MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Cluster File

More information

The Google File System (GFS)

The Google File System (GFS) 1 The Google File System (GFS) CS60002: Distributed Systems Antonio Bruto da Costa Ph.D. Student, Formal Methods Lab, Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur 2 Design constraints

More information

The Google File System

The Google File System The Google File System By Ghemawat, Gobioff and Leung Outline Overview Assumption Design of GFS System Interactions Master Operations Fault Tolerance Measurements Overview GFS: Scalable distributed file

More information

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing ΕΠΛ 602:Foundations of Internet Technologies Cloud Computing 1 Outline Bigtable(data component of cloud) Web search basedonch13of thewebdatabook 2 What is Cloud Computing? ACloudis an infrastructure, transparent

More information

GFS: The Google File System. Dr. Yingwu Zhu

GFS: The Google File System. Dr. Yingwu Zhu GFS: The Google File System Dr. Yingwu Zhu Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one big CPU More storage, CPU required than one PC can

More information

Introduction to MapReduce. Instructor: Dr. Weikuan Yu Computer Sci. & Software Eng.

Introduction to MapReduce. Instructor: Dr. Weikuan Yu Computer Sci. & Software Eng. Introduction to MapReduce Instructor: Dr. Weikuan Yu Computer Sci. & Software Eng. Before MapReduce Large scale data processing was difficult! Managing hundreds or thousands of processors Managing parallelization

More information

HADOOP FRAMEWORK FOR BIG DATA

HADOOP FRAMEWORK FOR BIG DATA HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela

More information

Hadoop MapReduce Framework

Hadoop MapReduce Framework Hadoop MapReduce Framework Contents Hadoop MapReduce Framework Architecture Interaction Diagram of MapReduce Framework (Hadoop 1.0) Interaction Diagram of MapReduce Framework (Hadoop 2.0) Hadoop MapReduce

More information

L5-6:Runtime Platforms Hadoop and HDFS

L5-6:Runtime Platforms Hadoop and HDFS Indian Institute of Science Bangalore, India भ रत य व ज ञ न स स थ न ब गल र, भ रत Department of Computational and Data Sciences SE256:Jan16 (2:1) L5-6:Runtime Platforms Hadoop and HDFS Yogesh Simmhan 03/

More information

Google File System. By Dinesh Amatya

Google File System. By Dinesh Amatya Google File System By Dinesh Amatya Google File System (GFS) Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung designed and implemented to meet rapidly growing demand of Google's data processing need a scalable

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

! Design constraints. " Component failures are the norm. " Files are huge by traditional standards. ! POSIX-like

! Design constraints.  Component failures are the norm.  Files are huge by traditional standards. ! POSIX-like Cloud background Google File System! Warehouse scale systems " 10K-100K nodes " 50MW (1 MW = 1,000 houses) " Power efficient! Located near cheap power! Passive cooling! Power Usage Effectiveness = Total

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google SOSP 03, October 19 22, 2003, New York, USA Hyeon-Gyu Lee, and Yeong-Jae Woo Memory & Storage Architecture Lab. School

More information

2/26/2017. For instance, consider running Word Count across 20 splits

2/26/2017. For instance, consider running Word Count across 20 splits Based on the slides of prof. Pietro Michiardi Hadoop Internals https://github.com/michiard/disc-cloud-course/raw/master/hadoop/hadoop.pdf Job: execution of a MapReduce application across a data set Task:

More information

Hadoop An Overview. - Socrates CCDH

Hadoop An Overview. - Socrates CCDH Hadoop An Overview - Socrates CCDH What is Big Data? Volume Not Gigabyte. Terabyte, Petabyte, Exabyte, Zettabyte - Due to handheld gadgets,and HD format images and videos - In total data, 90% of them collected

More information

HDFS: Hadoop Distributed File System. Sector: Distributed Storage System

HDFS: Hadoop Distributed File System. Sector: Distributed Storage System GFS: Google File System Google C/C++ HDFS: Hadoop Distributed File System Yahoo Java, Open Source Sector: Distributed Storage System University of Illinois at Chicago C++, Open Source 2 System that permanently

More information

Yuval Carmel Tel-Aviv University "Advanced Topics in Storage Systems" - Spring 2013

Yuval Carmel Tel-Aviv University Advanced Topics in Storage Systems - Spring 2013 Yuval Carmel Tel-Aviv University "Advanced Topics in About & Keywords Motivation & Purpose Assumptions Architecture overview & Comparison Measurements How does it fit in? The Future 2 About & Keywords

More information

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework Scientific Journal of Impact Factor (SJIF): e-issn (O): 2348- International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 A Study: Hadoop Framework Devateja

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Map Reduce & Hadoop Recommended Text: Hadoop: The Definitive Guide Tom White O Reilly 2010 VMware Inc. All rights reserved Big Data! Large datasets are becoming more common The New York Stock Exchange

More information

Page 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24

Page 1. Goals for Today Background of Cloud Computing Sources Driving Big Data CS162 Operating Systems and Systems Programming Lecture 24 Goals for Today" CS162 Operating Systems and Systems Programming Lecture 24 Capstone: Cloud Computing" Distributed systems Cloud Computing programming paradigms Cloud Computing OS December 2, 2013 Anthony

More information

Google File System. Arun Sundaram Operating Systems

Google File System. Arun Sundaram Operating Systems Arun Sundaram Operating Systems 1 Assumptions GFS built with commodity hardware GFS stores a modest number of large files A few million files, each typically 100MB or larger (Multi-GB files are common)

More information

Chase Wu New Jersey Institute of Technology

Chase Wu New Jersey Institute of Technology CS 644: Introduction to Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Institute of Technology Some of the slides were provided through the courtesy of Dr. Ching-Yung Lin at Columbia

More information

CS November 2017

CS November 2017 Bigtable Highly available distributed storage Distributed Systems 18. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

CS60021: Scalable Data Mining. Sourangshu Bhattacharya

CS60021: Scalable Data Mining. Sourangshu Bhattacharya CS60021: Scalable Data Mining Sourangshu Bhattacharya In this Lecture: Outline: HDFS Motivation HDFS User commands HDFS System architecture HDFS Implementation details Sourangshu Bhattacharya Computer

More information

GFS: The Google File System

GFS: The Google File System GFS: The Google File System Brad Karp UCL Computer Science CS GZ03 / M030 24 th October 2014 Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Google MapReduce

More information

Hadoop Distributed File System(HDFS)

Hadoop Distributed File System(HDFS) Hadoop Distributed File System(HDFS) Bu eğitim sunumları İstanbul Kalkınma Ajansı nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul Mali Destek Programı kapsamında yürütülmekte olan TR10/16/YNY/0036 no lu İstanbul

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) Hadoop Lecture 24, April 23, 2014 Mohammad Hammoud Today Last Session: NoSQL databases Today s Session: Hadoop = HDFS + MapReduce Announcements: Final Exam is on Sunday April

More information

The amount of data increases every day Some numbers ( 2012):

The amount of data increases every day Some numbers ( 2012): 1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect

More information

2/26/2017. The amount of data increases every day Some numbers ( 2012):

2/26/2017. The amount of data increases every day Some numbers ( 2012): The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to

More information

Google File System and BigTable. and tiny bits of HDFS (Hadoop File System) and Chubby. Not in textbook; additional information

Google File System and BigTable. and tiny bits of HDFS (Hadoop File System) and Chubby. Not in textbook; additional information Subject 10 Fall 2015 Google File System and BigTable and tiny bits of HDFS (Hadoop File System) and Chubby Not in textbook; additional information Disclaimer: These abbreviated notes DO NOT substitute

More information

Google File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google fall DIP Heerak lim, Donghun Koo

Google File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google fall DIP Heerak lim, Donghun Koo Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google 2017 fall DIP Heerak lim, Donghun Koo 1 Agenda Introduction Design overview Systems interactions Master operation Fault tolerance

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Winter 215 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Due next Thursday evening Will send out reimbursement codes later

More information

Key aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling

Key aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling Key aspects of cloud computing Cluster Scheduling 1. Illusion of infinite computing resources available on demand, eliminating need for up-front provisioning. The elimination of an up-front commitment

More information

Improving the MapReduce Big Data Processing Framework

Improving the MapReduce Big Data Processing Framework Improving the MapReduce Big Data Processing Framework Gistau, Reza Akbarinia, Patrick Valduriez INRIA & LIRMM, Montpellier, France In collaboration with Divyakant Agrawal, UCSB Esther Pacitti, UM2, LIRMM

More information

A brief history on Hadoop

A brief history on Hadoop Hadoop Basics A brief history on Hadoop 2003 - Google launches project Nutch to handle billions of searches and indexing millions of web pages. Oct 2003 - Google releases papers with GFS (Google File System)

More information

Overview. Why MapReduce? What is MapReduce? The Hadoop Distributed File System Cloudera, Inc.

Overview. Why MapReduce? What is MapReduce? The Hadoop Distributed File System Cloudera, Inc. MapReduce and HDFS This presentation includes course content University of Washington Redistributed under the Creative Commons Attribution 3.0 license. All other contents: Overview Why MapReduce? What

More information

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou The Hadoop Ecosystem EECS 4415 Big Data Systems Tilemachos Pechlivanoglou tipech@eecs.yorku.ca A lot of tools designed to work with Hadoop 2 HDFS, MapReduce Hadoop Distributed File System Core Hadoop component

More information

MapReduce & BigTable

MapReduce & BigTable CPSC 426/526 MapReduce & BigTable Ennan Zhai Computer Science Department Yale University Lecture Roadmap Cloud Computing Overview Challenges in the Clouds Distributed File Systems: GFS Data Process & Analysis:

More information

Hadoop. copyright 2011 Trainologic LTD

Hadoop. copyright 2011 Trainologic LTD Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides

More information

Distributed Computation Models

Distributed Computation Models Distributed Computation Models SWE 622, Spring 2017 Distributed Software Engineering Some slides ack: Jeff Dean HW4 Recap https://b.socrative.com/ Class: SWE622 2 Review Replicating state machines Case

More information

CSE 444: Database Internals. Lecture 23 Spark

CSE 444: Database Internals. Lecture 23 Spark CSE 444: Database Internals Lecture 23 Spark References Spark is an open source system from Berkeley Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei

More information

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems Announcements CompSci 516 Database Systems Lecture 12 - and Spark Practice midterm posted on sakai First prepare and then attempt! Midterm next Wednesday 10/11 in class Closed book/notes, no electronic

More information

Introduction to the Hadoop Ecosystem - 1

Introduction to the Hadoop Ecosystem - 1 Hello and welcome to this online, self-paced course titled Administering and Managing the Oracle Big Data Appliance (BDA). This course contains several lessons. This lesson is titled Introduction to the

More information

Distributed Systems CS6421

Distributed Systems CS6421 Distributed Systems CS6421 Intro to Distributed Systems and the Cloud Prof. Tim Wood v I teach: Software Engineering, Operating Systems, Sr. Design I like: distributed systems, networks, building cool

More information

CS427 Multicore Architecture and Parallel Computing

CS427 Multicore Architecture and Parallel Computing CS427 Multicore Architecture and Parallel Computing Lecture 9 MapReduce Prof. Li Jiang 2014/11/19 1 What is MapReduce Origin from Google, [OSDI 04] A simple programming model Functional model For large-scale

More information

Programming Models MapReduce

Programming Models MapReduce Programming Models MapReduce Majd Sakr, Garth Gibson, Greg Ganger, Raja Sambasivan 15-719/18-847b Advanced Cloud Computing Fall 2013 Sep 23, 2013 1 MapReduce In a Nutshell MapReduce incorporates two phases

More information

CS 345A Data Mining. MapReduce

CS 345A Data Mining. MapReduce CS 345A Data Mining MapReduce Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very large Tens to hundreds of terabytes

More information

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

MOHA: Many-Task Computing Framework on Hadoop

MOHA: Many-Task Computing Framework on Hadoop Apache: Big Data North America 2017 @ Miami MOHA: Many-Task Computing Framework on Hadoop Soonwook Hwang Korea Institute of Science and Technology Information May 18, 2017 Table of Contents Introduction

More information

Distributed System. Gang Wu. Spring,2018

Distributed System. Gang Wu. Spring,2018 Distributed System Gang Wu Spring,2018 Lecture7:DFS What is DFS? A method of storing and accessing files base in a client/server architecture. A distributed file system is a client/server-based application

More information