# TI2736-B Big Data Processing. Claudia Hauff

## Transcription

1 TI2736-B Big Data Processing Claudia Hauff

2 Course outline: Intro, Streams, Streams, MapReduce, HDFS, Pig, Pig, Design Patterns, Hadoop Mix, Graphs, Giraph, Spark, ZooKeeper, Spark

3 But first: Partitioner & Combiner

4 Reminder: map & reduce

- Key/value pairs form the basic data structure.
- Apply a map operation to each record in the input to compute a set of intermediate key/value pairs:
  - $\text{map}: (k_i, v_i) \rightarrow [(k_j, v_j)]$
  - $\text{map}: (k_i, v_i) \rightarrow [(k_j, v_x), (k_m, v_y), (k_j, v_n), \ldots]$
- Apply a reduce operation to all values that share the same key:
  - $\text{reduce}: (k_j, [v_x, v_n]) \rightarrow [(k_h, v_a), (k_h, v_b), (k_l, v_a)]$
- There are no limits on the number of key/value pairs.

5 Combiner overview

- Combiner: local aggregation of key/value pairs after map() and before the shuffle & sort phase (occurs on the same machine as map())
- Also called "mini-reducer"
- Sometimes the reducer code can be used
- Instead of emitting (the,1) 100 times, the combiner emits (the,100)
- Can lead to great speed-ups
- Needs to be employed with care

6 There is more: the combiner

Setup: a mapper which outputs (term, termFreqInDoc) and a combiner which is simply a copy of the reducer.

Task 1: total term frequency of a term in the corpus
- without combiner: (the,2),(the,2),(the,1) → reduce → (the,5)
- with combiner (reducer copy): (the,2),(the,3) → reduce → (the,5), correct!

Task 2: average frequency of a term in the documents
- without combiner: (the,2),(the,2),(the,1) → reduce → (the,(2+2+1)/3 ≈ 1.67)
- with combiner (reducer copy): (the,2),(the,1.5) → reduce → (the,(2+1.5)/2 = 1.75), wrong! The combiner had already averaged two of the values: (2+1)/2 = 1.5.
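One common way to make the averaging task combiner-safe is to let the combiner emit partial (sum, count) pairs and divide only once, in the reducer. The following is a minimal sketch in plain Java without any Hadoop classes; the class and method names (`AverageCombinerSketch`, `Partial`, `combine`, `reduce`) are hypothetical and only illustrate the idea.

```java
import java.util.Arrays;
import java.util.List;

final class AverageCombinerSketch {
    /** Partial aggregate sent from combiner to reducer: a running sum and a count. */
    static final class Partial {
        final double sum;
        final long count;

        Partial(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        // Merging is associative and commutative, so it is safe to apply locally.
        Partial merge(Partial other) {
            return new Partial(sum + other.sum, count + other.count);
        }
    }

    /** Combiner step: fold the values seen on one mapper into a partial aggregate. */
    static Partial combine(List<Integer> localValues) {
        Partial p = new Partial(0, 0);
        for (int v : localValues) {
            p = p.merge(new Partial(v, 1));
        }
        return p;
    }

    /** Reducer step: merge all partials, then divide exactly once. */
    static double reduce(List<Partial> partials) {
        Partial total = new Partial(0, 0);
        for (Partial p : partials) {
            total = total.merge(p);
        }
        return total.sum / total.count;
    }

    public static void main(String[] args) {
        Partial a = combine(Arrays.asList(2));     // one mapper saw (the,2)
        Partial b = combine(Arrays.asList(2, 1));  // another saw (the,2),(the,1)
        // Prints 5/3 (about 1.67) rather than the incorrect 1.75
        System.out.println(reduce(Arrays.asList(a, b)));
    }
}
```

Because merging (sum, count) pairs is associative and commutative, the result does not depend on how often, or whether, the combiner runs.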

7 There is more: the combiner

- Each combiner operates in isolation and has no access to other mappers' key/value pairs
- A combiner cannot be assumed to process all values associated with the same key (it may not run at all; that is Hadoop's decision)
- Emitted key/value pairs must have the same types as those emitted by the mapper
- Most often, combiner code ≠ reducer code
- Exception: associative & commutative reduce operations

8 Hadoop in practice

Specified by the user:
- Mapper
- Reducer
- Combiner (optional)
- Partitioner (optional)
- Driver/job configuration

90% of the code comes from given templates.

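9 Hadoop in practice

Mapper: counting inlinks

```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// input key/value:  (sourceUrl, content)
// output key/value: (targetUrl, 1)
public class InlinkCount extends Mapper<Object, Text, Text, IntWritable> {
    private final IntWritable one = new IntWritable(1);
    // Matches wiki-style links such as [[Target page]]
    private final Pattern linkPattern = Pattern.compile("\\[\\[.+?\\]\\]");

    @Override
    public void map(Object key, Text value, Context con)
            throws IOException, InterruptedException {
        String page = value.toString();
        Matcher m = linkPattern.matcher(page);
        while (m.find()) {
            String match = m.group();
            con.write(new Text(match), one); // emit (targetUrl, 1) per link found
        }
    }
}
```

The template differs slightly between Hadoop versions.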

10 Hadoop in practice

Reducer: counting inlinks

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// input key/value:  (targetUrl, 1)
// output key/value: (targetUrl, count)
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context con)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable iw : values) {
            sum += iw.get();
        }
        con.write(key, new IntWritable(sum)); // emit once, after all values are summed
    }
}
```

The template differs slightly between Hadoop versions.

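11 Hadoop in practice

Driver: counting inlinks

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class InlinkCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

        // Newer Hadoop versions prefer Job.getInstance(conf, "InlinkAccumulator")
        Job job = new Job(conf, "InlinkAccumulator");
        job.setMapperClass(InlinkCount.class);
        job.setCombinerClass(SumReducer.class); // summing is associative & commutative
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/in/"));
        FileOutputFormat.setOutputPath(job, new Path("/user/out/"));
        job.waitForCompletion(true);
    }
}
```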

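12 Hadoop in practice

Partitioner: two URLs that are the same apart from their #fragment should be sent to the same reducer.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// The value type IntWritable assumes the inlink-counting job from the previous slides.
public class CustomPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        int hashPos = s.lastIndexOf('#');
        // Strip the #fragment only if there is one (lastIndexOf returns -1 otherwise)
        String newKey = (hashPos >= 0) ? s.substring(0, hashPos) : s;
        // hashCode() may be negative; mask the sign bit to get a valid partition index
        return (newKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```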

13 GFS / HDFS

14 Learning objectives

- Explain the design considerations behind GFS/HDFS
- Explain the basic procedures for data replication, recovery from failure, reading and writing
- Design alternative strategies to handle the issues GFS/HDFS was created for
- Decide whether GFS/HDFS is a good fit given a usage scenario
- Implement strategies for handling small files

15 GFS introduction

- Hadoop is heavily inspired by it.
- One way (not the only way) to design a distributed file system.

16 History of MapReduce & GFS

- Developed by engineers at Google around 2003
- Built on principles in parallel and distributed processing
- Seminal Google papers:
  - "The Google File System" by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (2003)
  - "MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat (2004)
- Yahoo! paper:
  - "The Hadoop Distributed File System" by Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler (2010)

17 What is a file system?

- File systems determine how data is stored and retrieved
- Distributed file systems manage the storage across a network of machines
  - Added complexity due to the network
- GFS (Google) and HDFS (Hadoop) are distributed file systems
  - HDFS is inspired by GFS

18 GFS

Assumptions based on Google's main use cases (at the time):
- Hardware failures are common (commodity hardware)
- Files are large (GB/TB) and their number is limited (millions, not billions)
- Two main types of reads: large streaming reads and small random reads
- Workloads with sequential writes that append data to files
- Once written, files are seldom modified (≠ append) again; random modification in files is possible, but not efficient in GFS
- High sustained bandwidth trumps low latency

19 Disclaimer

GFS/HDFS are not a good fit for:
- Low latency data access (in the ms range)
  - Solutions: HBase, Hive, ...
- Many small files
  - Solution: stuffing of binary files
- Constantly changing data

Not all details of GFS are public knowledge (HDFS developers "filled in" the details).

20 GFS architecture

- Several clients, a single master (metadata), several data servers
- All are user-level processes: they can run on the same physical machine
- Data does not(!) flow across the GFS master
- Remember: one way, not the only way.

21 GFS: files

22 Files on GFS

- A single file can contain many objects (e.g. Web documents)
- Files are divided into fixed-size chunks (64MB) with unique 64-bit identifiers
  - IDs assigned by the GFS master at chunk creation time
- Chunkservers store chunks on local disk as "normal" Linux files
- Reading & writing of data specified by the tuple (chunk_handle, byte_range)

23 File information at Master level

- Files are replicated (by default 3 times) across all chunk servers
- Master maintains all file system metadata: namespace, access control information, mapping from file to chunks, chunk locations, garbage collection of orphaned chunks, chunk migration, ... (distributed systems are complex!)
- HeartBeat messages between master and chunk servers: Is the chunk server still alive? What chunks are stored at the chunkserver?
- To read/write data: client communicates with master (metadata operations) and chunk servers (data)

24 Files on GFS

- Seek time: 10 ms (0.01 s); transfer rate: 100 MB/s. What chunk size makes the seek time 1% of the transfer time? 100 MB
- Clients cache metadata
- Clients do not cache file data
- Chunkservers do not cache file data (responsibility of the underlying file system: Linux's buffer cache)

Advantages of (large) fixed-size chunks:
- Disk seek time small compared to transfer time
- A single file can be larger than a node's disk space
- Fixed size makes allocation computations easy
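The chunk-size answer on this slide follows from a short calculation. As a worked sketch (the symbols $t_{\text{seek}}$, $t_{\text{transfer}}$ and $r$ are introduced here for illustration and do not appear on the slide):

$$
t_{\text{transfer}} = \frac{t_{\text{seek}}}{0.01} = \frac{0.01\,\text{s}}{0.01} = 1\,\text{s},
\qquad
\text{chunk size} = r \cdot t_{\text{transfer}} = 100\,\text{MB/s} \times 1\,\text{s} = 100\,\text{MB}.
$$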

25 GFS: Master

26 One master

- Single master simplifies the design tremendously
  - Chunk placement and replication with global knowledge
- Single master in a large cluster can become a bottleneck
  - Goal: minimize the master's involvement in reads and writes (thus the separation of metadata vs. data)

27 A read operation (in detail)

1. Client translates filename and byte offset specified by the application into a chunk index within the file. Sends request to master.

28 A read operation (in detail)

2. Master replies with chunk handle and locations.

29 A read operation (in detail)

3. Client caches the metadata.
4. Client sends a data request to one of the replicas (the closest one). Byte range indicates the wanted part of the chunk. More than one chunk can be included in a single request.

30 A read operation (in detail)

5. Contacted chunk server replies with the requested data.
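Step 1 of the read path, translating a byte offset in a file into a chunk index, is plain integer arithmetic when chunks have a fixed size. A minimal sketch, assuming 64 MB chunks; the class and method names are hypothetical and not part of any GFS API:

```java
final class ChunkAddressing {
    // Assumed fixed chunk size of 64 MB
    static final long CHUNK_SIZE = 64L * 1024 * 1024;

    /** Index of the chunk (0-based) that contains the given byte offset of a file. */
    static long chunkIndex(long fileOffset) {
        return fileOffset / CHUNK_SIZE;
    }

    /** Position of that byte inside its chunk; this is where the byte range starts. */
    static long offsetInChunk(long fileOffset) {
        return fileOffset % CHUNK_SIZE;
    }

    public static void main(String[] args) {
        long offset = 200_000_000L;
        // The client would send (filename, chunkIndex) to the master and then
        // request bytes starting at offsetInChunk from one of the replicas.
        System.out.println(chunkIndex(offset));    // 2
        System.out.println(offsetInChunk(offset)); // 65782272
    }
}
```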

31 Metadata on the master

- 3 types of metadata:
  - Files and chunk namespaces
  - Mapping from files to chunks
  - Locations of each chunk's replicas
- All metadata is kept in the master's memory (fast random access)
  - Sets limits on the entire system's capacity
- Operation log is kept on the master's local disk: in case of the master's crash, the master state can be recovered
  - Namespaces and mappings are logged
  - Chunk locations are not logged

32 GFS: Chunks

33 Chunks

- 1 chunk = 64MB or 128MB (can be changed); a chunk is stored as a plain Linux file on a chunk server
- Advantages of large (but not too large) chunk size:
  - Reduced need for client/master interaction
  - 1 request per chunk suits the target workloads
  - Client can cache all the chunk locations for a multi-TB working set
  - Reduced size of metadata on the master (kept in memory)
- Disadvantage: a chunkserver can become a hotspot for popular file(s)
- Question: how could the hotspot issue be solved?

34 Chunk locations

- Master does not keep a persistent record of chunk replica locations
- Master polls chunkservers about their chunks at startup
- Master keeps up to date through periodic HeartBeat messages
- Master/chunkservers easily kept in sync when chunk servers leave/join/fail/restart [regular event]
- Chunkserver has the final word over what chunks it has

35 Operation log

- Persistent record of critical metadata changes
- Critical to the recovery of the system
- Changes to metadata are only made visible to clients after they have been written to the operation log
- Operation log is replicated on multiple remote machines
  - Before responding to a client operation, the log record must have been flushed locally and remotely
- Master recovers its file system state from checkpoint + operation log

[Figure: the master writes to the operation log; when the log grows too large, a checkpoint is created and the log file is reset; after a crash, the master recovers from the checkpoint and the log.]

36 Operation log

- Question: when does the master relay the new information to the clients? Before or after having written it to the op. log?
- Changes to metadata are only made visible to clients after they have been written to the operation log

[Figure: clients C1, C2 and C3 issue `mv file_name1 file_name2` and `get file_name1` requests to the master, which then crashes.]
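The log-then-apply rule and the recovery from checkpoint plus operation log can be illustrated with a toy in-memory model. This is only a sketch under simplifying assumptions: "durable" state is modelled as fields that survive `crashAndRecover()`, the log is not replicated, and all names (`OperationLogSketch`, `mutate`, `checkpoint`, `crashAndRecover`, `lookup`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class OperationLogSketch {
    // "Durable" state in this toy model: survives a crash.
    private Map<String, String> checkpointState = new HashMap<>(); // file name -> chunk handle
    private final List<String[]> log = new ArrayList<>();          // e.g. {"mv", from, to}

    // Volatile state: the master's in-memory namespace, lost on a crash.
    private Map<String, String> namespace = new HashMap<>();

    private static void apply(Map<String, String> state, String[] rec) {
        if (rec[0].equals("create")) {
            state.put(rec[1], rec[2]);
        } else if (rec[0].equals("mv")) {
            String handle = state.remove(rec[1]);
            if (handle != null) {
                state.put(rec[2], handle);
            }
        }
    }

    /** Log first, then make the change visible to clients. */
    void mutate(String... rec) {
        log.add(rec);
        apply(namespace, rec);
    }

    /** Snapshot the current state and reset the log so that it stays small. */
    void checkpoint() {
        checkpointState = new HashMap<>(namespace);
        log.clear();
    }

    /** Simulate a crash: rebuild the namespace from checkpoint + operation log. */
    void crashAndRecover() {
        namespace = new HashMap<>(checkpointState);
        for (String[] rec : log) {
            apply(namespace, rec);
        }
    }

    String lookup(String name) {
        return namespace.get(name);
    }

    public static void main(String[] args) {
        OperationLogSketch master = new OperationLogSketch();
        master.mutate("create", "file_name1", "chunk-1");
        master.checkpoint();
        master.mutate("mv", "file_name1", "file_name2");
        master.crashAndRecover();
        System.out.println(master.lookup("file_name2")); // chunk-1
        System.out.println(master.lookup("file_name1")); // null
    }
}
```

Because the rename was logged before it became visible, a client that saw `file_name2` before the crash still sees it after recovery.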

37 Chunk replica placement

- Creation of (initially empty) chunks
  - Use under-utilised chunk servers; spread across racks
  - Limit number of recent creations on each chunk server
- Re-replication
  - Started once the available replicas fall below the configured setting
  - Master instructs chunkserver to copy chunk data directly from an existing valid replica
  - Number of active clone operations/bandwidth is limited
- Re-balancing
  - Changes in replica distribution for better load balancing; gradual filling of new chunk servers

38 GFS: Data integrity

39 Garbage collection

- Question: how can a file be deleted from the cluster?
- Deletion logged by master
- File renamed to a hidden file, deletion timestamp kept
- Periodic scan of the master's file system namespace
  - Hidden files older than 3 days are deleted from the master's memory (no further connection between file and its chunks)
- Periodic scan of the master's chunk namespace
  - Orphaned chunks (not reachable from any file) are identified, their metadata deleted
- HeartBeat messages used to synchronise deletion between master/chunkserver
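The lazy deletion described on this slide (rename to a hidden file, remove it in a later namespace scan) can be sketched as follows. This is a toy model only: the `.deleted.` prefix, the class name and the method names are hypothetical, and the chunk-namespace scan and the HeartBeat exchange are left out.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

final class LazyDeletionSketch {
    static final long GRACE_MILLIS = 3L * 24 * 60 * 60 * 1000; // 3 days

    private final Map<String, String> namespace = new HashMap<>(); // file name -> chunk handle
    private final Map<String, Long> hiddenSince = new HashMap<>(); // hidden name -> deletion time

    void create(String name, String handle) {
        namespace.put(name, handle);
    }

    /** Deleting only renames the file to a hidden name and records the timestamp. */
    void delete(String name, long nowMillis) {
        String handle = namespace.remove(name);
        if (handle == null) {
            return; // nothing to delete
        }
        String hidden = ".deleted." + name; // hypothetical naming scheme
        namespace.put(hidden, handle);
        hiddenSince.put(hidden, nowMillis);
    }

    /** Periodic namespace scan: drop hidden files older than the grace period. */
    int scan(long nowMillis) {
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it = hiddenSince.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (nowMillis - e.getValue() > GRACE_MILLIS) {
                namespace.remove(e.getKey()); // the chunk is now orphaned
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    boolean exists(String name) {
        return namespace.containsKey(name);
    }

    public static void main(String[] args) {
        LazyDeletionSketch fs = new LazyDeletionSketch();
        fs.create("a.txt", "chunk-1");
        fs.delete("a.txt", 0L);
        System.out.println(fs.scan(1_000L));           // 0: still within the grace period
        System.out.println(fs.scan(GRACE_MILLIS + 1)); // 1: hidden file removed
    }
}
```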

40 Stale replica detection

- Scenario: a chunkserver misses a change ("mutation") applied to a chunk, e.g. a chunk was appended
- Master maintains a chunk version number to distinguish up-to-date and stale replicas
- Before an operation on a chunk, master ensures that the version number is advanced
- Stale replicas are removed in the regular garbage collection cycle
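The version-number comparison can be sketched in a few lines. The example assumes that each chunkserver reports the version it holds for a chunk (for instance in a HeartBeat message) and that the master compares it with its own current version; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class StaleReplicaSketch {
    /**
     * Returns the chunkservers whose reported version is older than the
     * master's current version for the chunk.
     */
    static List<String> staleReplicas(int masterVersion, Map<String, Integer> reportedVersions) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Integer> e : reportedVersions.entrySet()) {
            if (e.getValue() < masterVersion) {
                stale.add(e.getKey());
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        Map<String, Integer> reported = new TreeMap<>();
        reported.put("chunkserver-1", 7);
        reported.put("chunkserver-2", 6); // missed the last mutation
        reported.put("chunkserver-3", 7);
        System.out.println(staleReplicas(7, reported)); // [chunkserver-2]
    }
}
```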

41 Data corruption

- Data corruption or loss can occur at the read and write stage
- Question: how can chunk servers detect whether or not their stored data is corrupt?
- Chunkservers use checksums to detect corruption of stored data
  - Alternative: compare replicas across chunk servers
- A chunk is broken into 64KB blocks, each with a 32-bit checksum, kept in memory and stored persistently
- Read requests: chunkserver verifies the checksum of data blocks that overlap the read range (i.e. corruptions are not sent to clients)
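Per-block checksumming and verification of only the blocks that overlap a read range might look like the sketch below. The slide only states that each 64KB block has a 32-bit checksum; the use of CRC32 from `java.util.zip` is an assumption made for illustration, as are the class and method names.

```java
import java.util.zip.CRC32;

final class BlockChecksumSketch {
    static final int BLOCK_SIZE = 64 * 1024; // 64KB blocks

    /** Computes one checksum per 64KB block; the last block may be shorter. */
    static long[] checksums(byte[] chunk) {
        int blocks = (chunk.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
        long[] sums = new long[blocks];
        for (int b = 0; b < blocks; b++) {
            sums[b] = checksumOf(chunk, b);
        }
        return sums;
    }

    private static long checksumOf(byte[] chunk, int block) {
        int start = block * BLOCK_SIZE;
        int len = Math.min(BLOCK_SIZE, chunk.length - start);
        CRC32 crc = new CRC32(); // assumed checksum function
        crc.update(chunk, start, len);
        return crc.getValue();
    }

    /** Verifies only the blocks overlapping [offset, offset + length). */
    static boolean verifyRange(byte[] chunk, long[] stored, int offset, int length) {
        if (length <= 0) {
            return true;
        }
        int first = offset / BLOCK_SIZE;
        int last = (offset + length - 1) / BLOCK_SIZE;
        for (int b = first; b <= last; b++) {
            if (checksumOf(chunk, b) != stored[b]) {
                return false; // corrupt block: do not send the data to the client
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] chunk = new byte[200_000];
        long[] stored = checksums(chunk);
        chunk[70_000] ^= 0x5A; // simulate corruption inside block 1
        System.out.println(verifyRange(chunk, stored, 0, 1_000));      // true: only block 0 is read
        System.out.println(verifyRange(chunk, stored, 65_000, 2_000)); // false: the read touches block 1
    }
}
```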

42 HDFS: Hadoop Distributed File System

43 GFS vs. HDFS

| GFS | HDFS |
| --- | --- |
| Master | NameNode |
| chunkserver | DataNode |
| operation log | journal, edit log |
| chunk | block |
| random file writes possible | only append is possible |
| multiple writer, multiple reader model | single writer, multiple reader model |
| chunk: 64KB data and 32-bit checksum pieces | per HDFS block, two files created on a DataNode: data file & metadata file (checksums, timestamp) |
| default block size: 64MB | default block size: 128MB |

44 MapReduce 1: Hadoop's architecture 0.X and 1.X

- NameNode: master of HDFS, directs the slave DataNode daemons to perform low-level I/O tasks; keeps track of file splitting into blocks, replication, block location, etc.
- Secondary NameNode: takes snapshots of the NameNode
- DataNode: each slave machine hosts a DataNode daemon

46 MapReduce 1: JobTracker and TaskTracker

47 MapReduce 1: What about the jobs?

- "Hadoop job": unit of work to be performed (by a client), consisting of
  - Input data
  - MapReduce program
  - Configuration information
- Hadoop divides the job into tasks (two types: map, reduce)
- Hadoop divides the input data into fixed-size input splits
  - One map task per split
  - One map function call for each record in the split
  - Splits are processed in parallel (if enough DataNodes exist)
- Job execution is controlled by the JobTracker and TaskTrackers
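With fixed-size splits and one map task per split, the number of map tasks is the input size divided by the split size, rounded up. The sketch below is a simplification with hypothetical names; Hadoop's actual split computation takes further factors (such as file boundaries) into account.

```java
final class InputSplitSketch {
    /** Number of fixed-size splits needed to cover the input; the last split may be shorter. */
    static long numSplits(long inputBytes, long splitBytes) {
        if (splitBytes <= 0) {
            throw new IllegalArgumentException("splitBytes must be positive");
        }
        return (inputBytes + splitBytes - 1) / splitBytes; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // A 1 GB input with 128 MB splits yields 8 map tasks
        System.out.println(numSplits(1024 * mb, 128 * mb));     // 8
        // One extra byte requires one more (very short) split
        System.out.println(numSplits(1024 * mb + 1, 128 * mb)); // 9
    }
}
```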

48 Hadoop in practice: Yahoo! (2010)

HDFS cluster with 3,500 nodes:
- 40 nodes/rack sharing one IP switch
- 16GB RAM per cluster node, 1-gigabit Ethernet
- 70% of disk space allocated to HDFS; remainder: operating system, data emitted by Mappers (not in HDFS)
- NameNode: up to 64GB RAM
- Total storage: 9.8PB → 3.3PB net storage (replication: 3)
- 60 million files, 63 million blocks
- 54,000 blocks hosted per DataNode
- 1-2 nodes lost per day
- Time for cluster to re-replicate lost blocks: 2 minutes
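Two of these figures can be cross-checked with simple arithmetic, assuming that every block is replicated exactly 3 times and that replicas are spread evenly over the 3,500 nodes:

$$
\frac{9.8\,\text{PB}}{3} \approx 3.27\,\text{PB} \approx 3.3\,\text{PB net storage},
\qquad
\frac{63 \times 10^{6}\ \text{blocks} \times 3\ \text{replicas}}{3{,}500\ \text{nodes}} = 54{,}000\ \text{block replicas per DataNode}.
$$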

49 YARN (MapReduce 2)

- The JobTracker/TaskTrackers setup becomes a bottleneck in clusters with thousands of nodes
- In response, YARN (Yet Another Resource Negotiator) was developed
- YARN splits the JobTracker's tasks (job scheduling and task progress monitoring) into two daemons:
  - Resource manager (RM)
  - Application master (negotiates with the RM for cluster resources; each Hadoop job has a dedicated master)

50 YARN Advantages

- Scalability: larger clusters are supported
- Availability: high availability (high uptime) supported
- Utilization: more fine-grained use of resources
- Multitenancy: MapReduce is just one application among many

51 Recommended reading: Chapter 3

52 THE END


### Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Hadoop: The Definitive Guide Tom White O Reilly 2010 VMware Inc. All rights reserved Big Data! Large datasets are becoming more common The New York Stock Exchange

### Java in MapReduce. Scope

Java in MapReduce Kevin Swingler Scope A specific look at the Java code you might use for performing MapReduce in Hadoop Java program recap The map method The reduce method The whole program Running on

### Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 2: MapReduce Algorithm Design (1/2) January 10, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

### Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

Introduction to MapReduce Adapted from Jimmy Lin (U. Maryland, USA) Motivation Overview Need for handling big data New programming paradigm Review of functional programming mapreduce uses this abstraction

### Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R

Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R Table of Contents Introduction... 3 Topology Awareness in Hadoop... 3 Virtual Hadoop... 4 HVE Solution... 5 Architecture...

### Improved MapReduce k-means Clustering Algorithm with Combiner

2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation Improved MapReduce k-means Clustering Algorithm with Combiner Prajesh P Anchalia Department Of Computer Science and Engineering

### MapReduce: Algorithm Design for Relational Operations

MapReduce: Algorithm Design for Relational Operations Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec Projection π Projection in MapReduce Easy Map over tuples, emit

### CSE-E5430 Scalable Cloud Computing Lecture 9

CSE-E5430 Scalable Cloud Computing Lecture 9 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 15.11-2015 1/24 BigTable Described in the paper: Fay

### CS370 Operating Systems

CS370 Operating Systems Colorado State University Yashwant K Malaiya Spring 2018 Lecture 25 RAIDs, HDFS/Hadoop Slides based on Text by Silberschatz, Galvin, Gagne (not) Various sources 1 1 FAQ Striping:

International Journal of Computational Engineering Research Vol, 03 Issue, 12 Survey Paper on Traditional Hadoop and Pipelined Map Reduce Dhole Poonam B 1, Gunjal Baisa L 2 1 M.E.ComputerAVCOE, Sangamner,

### Introduction to MapReduce

Introduction to MapReduce Based on slides by Juliana Freire Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec Required Reading! Data-Intensive Text Processing with MapReduce,

### Introduction to the Hadoop Ecosystem - 1

Hello and welcome to this online, self-paced course titled Administering and Managing the Oracle Big Data Appliance (BDA). This course contains several lessons. This lesson is titled Introduction to the

### BigTable: A System for Distributed Structured Storage

BigTable: A System for Distributed Structured Storage Jeff Dean Joint work with: Mike Burrows, Tushar Chandra, Fay Chang, Mike Epstein, Andrew Fikes, Sanjay Ghemawat, Robert Griesemer, Bob Gruber, Wilson

### Big Data Technologies

Big Data Technologies Big Data Technologies Syllabus 1. Introduction to Big Data 1.1 Big data overview 1.2 Background of data analytics 1.3 Role of distributed system in big data 1.4 Role of data scientist

Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

### Introduction to MapReduce Algorithms and Analysis

Introduction to MapReduce Algorithms and Analysis Jeff M. Phillips October 25, 2013 Trade-Offs Massive parallelism that is very easy to program. Cheaper than HPC style (uses top of the line everything)

### HDFS What is New and Futures

HDFS What is New and Futures Sanjay Radia, Founder, Architect Suresh Srinivas, Founder, Architect Hortonworks Inc. Page 1 About me Founder, Architect, Hortonworks Part of the Hadoop team at Yahoo! since

### Introduction to MapReduce

Introduction to MapReduce Jacqueline Chame CS503 Spring 2014 Slides based on: MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. OSDI 2004. MapReduce: The Programming

### Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 16. Big Data Management VI (MapReduce Programming)

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 16 Big Data Management VI (MapReduce Programming) Credits: Pietro Michiardi (Eurecom): Scalable Algorithm

### Cloud Computing. Hwajung Lee. Key Reference: Prof. Jong-Moon Chung s Lecture Notes at Yonsei University

Cloud Computing Hwajung Lee Key Reference: Prof. Jong-Moon Chung s Lecture Notes at Yonsei University Cloud Computing Cloud Introduction Cloud Service Model Big Data Hadoop MapReduce HDFS (Hadoop Distributed

### Improving Distributed Filesystem Performance by Combining Replica and Network Path Selection

Improving Distributed Filesystem Performance by Combining Replica and Network Path Selection by Xi Li A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree

### Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI Presented by Xiang Gao

Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI 2006 Presented by Xiang Gao 2014-11-05 Outline Motivation Data Model APIs Building Blocks Implementation Refinement

### Big Data XML Parsing in Pentaho Data Integration (PDI)

Big Data XML Parsing in Pentaho Data Integration (PDI) Change log (if you want to use it): Date Version Author Changes Contents Overview... 1 Before You Begin... 1 Terms You Should Know... 1 Selecting

### How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,

How Apache Hadoop Complements Existing BI Systems Dr. Amr Awadallah Founder, CTO Cloudera, Inc. Twitter: @awadallah, @cloudera 2 The Problems with Current Data Systems BI Reports + Interactive Apps RDBMS

### BigTable. CSE-291 (Cloud Computing) Fall 2016

BigTable CSE-291 (Cloud Computing) Fall 2016 Data Model Sparse, distributed persistent, multi-dimensional sorted map Indexed by a row key, column key, and timestamp Values are uninterpreted arrays of bytes

Table of contents 1 Purpose...2 2 Pre-requisites...2 3 Installation...2 4 Configuration... 2 4.1 Configuration Files...2 4.2 Site Configuration... 3 5 Cluster Restartability... 10 5.1 Map/Reduce...10 6

### A Survey on Big Data

A Survey on Big Data D.Prudhvi 1, D.Jaswitha 2, B. Mounika 3, Monika Bagal 4 1 2 3 4 B.Tech Final Year, CSE, Dadi Institute of Engineering & Technology,Andhra Pradesh,INDIA ---------------------------------------------------------------------***---------------------------------------------------------------------

### Distributed Face Recognition Using Hadoop

Distributed Face Recognition Using Hadoop A. Thorat, V. Malhotra, S. Narvekar and A. Joshi Dept. of Computer Engineering and IT College of Engineering, Pune {abhishekthorat02@gmail.com, vinayak.malhotra20@gmail.com,