TI2736-B Big Data Processing. Claudia Hauff


TI2736-B Big Data Processing Claudia Hauff ti2736b-ewi@tudelft.nl

Course roadmap (figure): Intro, Streams, Map Reduce, HDFS, Pig, Design Pattern, Hadoop Mix, Graphs, Giraph, Spark, ZooKeeper

But first Partitioner & Combiner

Reminder: map & reduce

Key/value pairs form the basic data structure.

Apply a map operation to each record in the input to compute a set of intermediate key/value pairs:
  map: (k_i, v_i) → [(k_j, v_j)]
  map: (k_i, v_i) → [(k_j, v_x), (k_m, v_y), (k_j, v_n), ...]

Apply a reduce operation to all values that share the same key:
  reduce: (k_j, [v_x, v_n]) → [(k_h, v_a), (k_h, v_b), (k_l, v_a)]

There are no limits on the number of key/value pairs.
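A concrete word-count trace of this model, written in the notation above (the document IDs and text are made up for illustration):

  map: (doc1, "the cat saw the dog") → [(the,1), (cat,1), (saw,1), (the,1), (dog,1)]
  map: (doc2, "the dog") → [(the,1), (dog,1)]
  reduce: (the, [1,1,1]) → [(the,3)]
  reduce: (dog, [1,1]) → [(dog,2)]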

Combiner overview

Combiner: local aggregation of key/value pairs after map() and before the shuffle & sort phase (occurs on the same machine as map()). Also called a mini-reducer. Sometimes the reducer code can be used: instead of emitting (the,1) 100 times, the combiner emits (the,100). Can lead to great speed-ups, but needs to be employed with care.

There is more: the combiner

Setup: a mapper which outputs (term, termFreqInDoc) and a combiner which is simply a copy of the reducer.

Task 1: total term frequency of a term in the corpus
  without combiner: (the,2),(the,2),(the,1) → reduce → (the,5)
  with combiner (reducer copy): (the,2),(the,3) → reduce → (the,5)  correct!

Task 2: average frequency of a term in the documents
  without combiner: (the,2),(the,2),(the,1) → reduce → (the,(2+2+1)/3=1.66)
  with combiner (reducer copy): (the,2),(the,(2+1)/2=1.5) → reduce → (the,(2+1.5)/2=1.75)  wrong!

There is more: the combiner

Each combiner operates in isolation and has no access to other mappers' key/value pairs. A combiner cannot be assumed to process all values associated with the same key (it may not run at all! Hadoop's decision). Emitted key/value pairs must be of the same types as those emitted by the mapper. Most often, combiner code != reducer code. Exception: associative & commutative reduce operations.
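To make the averaging task from the previous slide combiner-safe, the usual trick is to propagate partial sums and counts instead of partial averages, so the aggregation becomes associative and commutative. A minimal sketch (hypothetical class; values encoded as "sum,count" Text pairs for brevity, a custom Writable would be the cleaner choice):

  import java.io.IOException;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  // Merges (sum,count) pairs only, so it can run zero, one or many times
  // as a combiner without changing the final result.
  public class MeanCombiner extends Reducer<Text,Text,Text,Text> {
    public void reduce(Text key, Iterable<Text> values, Context con)
        throws IOException, InterruptedException {
      long sum = 0, count = 0;
      for (Text v : values) {                       // each value is "partialSum,partialCount"
        String[] parts = v.toString().split(",");
        sum += Long.parseLong(parts[0]);
        count += Long.parseLong(parts[1]);
      }
      con.write(key, new Text(sum + "," + count));  // output is still a (sum,count) pair
    }
  }

The mapper would then emit (term, "termFreqInDoc,1"), and only the final reducer divides sum by count to produce the average.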

Hadoop in practice

Specified by the user:
- Mapper
- Reducer
- Combiner (optional)
- Partitioner (optional)
- Driver/job configuration

90% of the code comes from given templates.

Hadoop in practice
Mapper: counting inlinks
input key/value: (sourceURL, content)
output key/value: (targetURL, 1)

  import java.io.IOException;
  import java.util.regex.*;
  import org.apache.hadoop.io.*;
  import org.apache.hadoop.mapreduce.*;

  public class InlinkCount extends Mapper<Object,Text,Text,IntWritable> {

    IntWritable one = new IntWritable(1);
    Pattern linkPattern = Pattern.compile("\\[\\[.+?\\]\\]");

    public void map(Object key, Text value, Context con)
        throws IOException, InterruptedException {
      String page = value.toString();
      Matcher m = linkPattern.matcher(page);
      while (m.find()) {
        String match = m.group();         // an outgoing [[...]] link found in the page
        con.write(new Text(match), one);  // emit (targetURL, 1)
      }
    }
  }

(template differs slightly in different Hadoop versions)

Hadoop in practice
Reducer: counting inlinks
input key/value: (targetURL, 1)
output key/value: (targetURL, count)

  import java.io.IOException;
  import org.apache.hadoop.io.*;
  import org.apache.hadoop.mapreduce.*;

  public class SumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context con)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable iw : values)
        sum += iw.get();                    // add up the 1s for this targetURL
      con.write(key, new IntWritable(sum));
    }
  }

(template differs slightly in different Hadoop versions)

Hadoop in practice
Driver: counting inlinks

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.*;
  import org.apache.hadoop.mapreduce.*;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  import org.apache.hadoop.util.GenericOptionsParser;

  public class InlinkCountDriver {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      String[] otherArgs =
          new GenericOptionsParser(conf, args).getRemainingArgs();

      Job job = new Job(conf, "InlinkAccumulator");
      job.setMapperClass(InlinkCount.class);     // mapper and reducer from the previous slides
      job.setCombinerClass(SumReducer.class);
      job.setReducerClass(SumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);

      FileInputFormat.addInputPath(job, new Path("/user/in/"));
      FileOutputFormat.setOutputPath(job, new Path("/user/out/"));
      job.waitForCompletion(true);
    }
  }

Hadoop in practice
Partitioner: two URLs that are the same apart from their #fragment should be sent to the same reducer.

  import org.apache.hadoop.io.*;
  import org.apache.hadoop.mapreduce.*;

  public class CustomPartitioner extends Partitioner<Text,IntWritable> {

    public int getPartition(Text key, IntWritable value, int numPartitions) {
      String s = key.toString();
      int idx = s.lastIndexOf('#');
      String newKey = (idx >= 0) ? s.substring(0, idx) : s;           // strip the #fragment, if any
      return (newKey.hashCode() & Integer.MAX_VALUE) % numPartitions; // keep the partition non-negative
    }
  }
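The custom partitioner is not picked up automatically. Assuming the driver from the previous slide, it would be registered there; the numPartitions it sees equals the number of reduce tasks:

  job.setPartitionerClass(CustomPartitioner.class);
  job.setNumReduceTasks(4);   // getPartition() is then called with numPartitions = 4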

GFS / HDFS

Learning objectives

- Explain the design considerations behind GFS/HDFS
- Explain the basic procedures for data replication, recovery from failure, reading and writing
- Design alternative strategies to handle the issues GFS/HDFS was created for
- Decide whether GFS/HDFS is a good fit given a usage scenario
- Implement strategies for handling small files

GFS introduction

Hadoop's HDFS is heavily inspired by it. GFS is one way (not the only way) to design a distributed file system.

History of MapReduce & GFS

Developed by engineers at Google around 2003, built on principles in parallel and distributed processing.

Seminal Google papers:
- "The Google file system" by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (2003)
- "MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat (2004)

Yahoo! paper:
- "The Hadoop distributed file system" by Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler (2010)

What is a file system?

- File systems determine how data is stored and retrieved
- Distributed file systems manage the storage across a network of machines; added complexity due to the network
- GFS (Google) and HDFS (Hadoop) are distributed file systems; HDFS is inspired by GFS

GFS assumptions, based on Google's main use cases (at the time):

- Hardware failures are common (commodity hardware)
- Files are large (GB/TB) and their number is limited (millions, not billions)
- Two main types of reads: large streaming reads and small random reads
- Workloads with sequential writes that append data to files
- Once written, files are seldom modified (appending is not counted as modification); random modification of files is possible, but not efficient in GFS
- High sustained bandwidth trumps low latency

Disclaimer

GFS/HDFS are not a good fit for:
- Low-latency data access (in the ms range); solutions: HBase, Hive, ...
- Many small files; solution: stuffing many small files into larger binary files
- Constantly changing data

Not all details of GFS are public knowledge (HDFS developers filled in the details).

GFS architecture

Remember: this is one way, not the only way.

- several clients
- a single master (metadata)
- several data servers (chunkservers)
- these are user-level processes: they can run on the same physical machine
- data does not(!) flow across the GFS master

Image source: http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf

GFS: files

Files on GFS

- A single file can contain many objects (e.g. Web documents)
- Files are divided into fixed-size chunks (64MB) with unique 64-bit identifiers; IDs are assigned by the GFS master at chunk creation time
- Chunkservers store chunks on local disk as normal Linux files
- Reading & writing of data is specified by the tuple (chunk_handle, byte_range)

File information at master level

- Files are replicated (by default 3 times) across all chunkservers
- The master maintains all file system metadata: namespace, access control information, mapping from files to chunks, chunk locations, garbage collection of orphaned chunks, chunk migration, ... (distributed systems are complex!)
- Heartbeat messages between master and chunkservers: Is the chunkserver still alive? What chunks are stored at the chunkserver?
- To read/write data: the client communicates with the master (metadata operations) and the chunkservers (data)

Files on GFS

Quick calculation: seek time 10ms (0.01s), transfer rate 100MB/s. What chunk size makes the seek time 1% of the transfer time? The transfer time must then be 0.01s / 0.01 = 1s, and at 100MB/s that is a 100 MB chunk.

- Clients cache metadata
- Clients do not cache file data
- Chunkservers do not cache file data (responsibility of the underlying file system: Linux's buffer cache)

Advantages of (large) fixed-size chunks:
- Disk seek time is small compared to transfer time
- A single file can be larger than a node's disk space
- Fixed size makes allocation computations easy

GFS: Master

One master

- A single master simplifies the design tremendously: chunk placement and replication decisions are made with global knowledge
- A single master in a large cluster can become a bottleneck; goal: minimize its involvement in reads and writes (hence the master handles metadata, not data)

A read operation (in detail)

1. The client translates the file name and byte offset specified by the application into a chunk index within the file, and sends the request to the master.

Image source: http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf

A read operation (in detail)

2. The master replies with the chunk handle and the replica locations.

Image source: http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf

A read operation (in detail)

3. The client caches the metadata.
4. The client sends a data request to one of the replicas (the closest one). The byte range indicates the wanted part of the chunk. More than one chunk can be included in a single request.

Image source: http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf

A read operation (in detail)

5. The contacted chunkserver replies with the requested data.

Image source: http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
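As a small illustration of steps 1 and 4 (not GFS code; the offset and length are made up), this is the arithmetic a client performs to turn an application-level byte offset into a chunk index for the master and a byte range for the chunkserver, assuming 64 MB chunks:

  public class ChunkMath {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;   // 64 MB chunks

    public static void main(String[] args) {
      long offset = 200L * 1024 * 1024;        // application reads at byte 200 MB of some file
      int length = 1024;                       // ... a 1 KB range
      long chunkIndex = offset / CHUNK_SIZE;   // -> chunk index 3 (sent to the master, step 1)
      long startInChunk = offset % CHUNK_SIZE; // -> offset 8 MB within that chunk (sent to a replica, step 4)
      System.out.println("chunk " + chunkIndex + ", bytes ["
          + startInChunk + ", " + (startInChunk + length) + ")");
    }
  }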

Metadata on the master

Three types of metadata:
- File and chunk namespaces
- Mapping from files to chunks
- Locations of each chunk's replicas

All metadata is kept in the master's memory (fast random access); this sets limits on the entire system's capacity.

The operation log is kept on the master's local disk: in case of a master crash, the master state can be recovered. Namespaces and mappings are logged; chunk locations are not logged.

GFS: Chunks

Chunks

1 chunk = 64MB or 128MB (can be changed); a chunk is stored as a plain Linux file on a chunkserver.

Advantages of a large (but not too large) chunk size:
- Reduced need for client/master interaction: 1 request per chunk suits the target workloads
- The client can cache all the chunk locations for a multi-TB working set
- Reduced size of metadata on the master (kept in memory)

Disadvantage: a chunkserver can become a hotspot for popular file(s).

Question: how could the hotspot issue be solved?

Chunk locations

- The master does not keep a persistent record of chunk replica locations
- The master polls chunkservers about their chunks at startup
- The master keeps up to date through periodic HeartBeat messages
- Master and chunkservers are easily kept in sync when chunkservers leave/join/fail/restart [a regular event]
- The chunkserver has the final word over what chunks it has

Operation log

- Persistent record of critical metadata changes; critical to the recovery of the system (the master writes every such change to the operation log)
- Changes to metadata are only made visible to clients after they have been written to the operation log
- The operation log is replicated on multiple remote machines; before responding to a client operation, the log record must have been flushed locally and remotely
- If the log grows too large, a checkpoint is created and the log file is reset
- After a crash, the master recovers its file system state from the latest checkpoint plus the operation log

Operation log: example

Clients C1, C2 and C3 interact with the master:
- C1: mv file_name1 file_name2
- C2: get file_name1
- C3: get file_name1

Question: when does the master relay the new information to the clients? Before or after having written it to the operation log?
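The ordering stated two slides earlier (changes become visible only after the log write) is the classic write-ahead-log pattern. A minimal, generic sketch, not GFS code, just to make the ordering concrete:

  import java.io.*;

  // Append the record and force it to stable storage (locally; GFS additionally
  // waits for remote log replicas) BEFORE applying the change in memory and
  // before acknowledging it to any client.
  public class WalSketch {
    private final FileOutputStream log;

    public WalSketch(File logFile) throws IOException {
      this.log = new FileOutputStream(logFile, true);   // append mode
    }

    public synchronized void apply(String record, Runnable inMemoryUpdate) throws IOException {
      log.write((record + "\n").getBytes("UTF-8"));
      log.getFD().sync();          // flush the log record to disk first
      inMemoryUpdate.run();        // only now update the in-memory metadata
      // only after this point is the change visible / acknowledged to clients
    }
  }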

Chunk replica placement

Creation of (initially empty) chunks:
- Use under-utilised chunkservers; spread across racks
- Limit the number of recent creations on each chunkserver

Re-replication:
- Started once the number of available replicas falls below a setting
- The master instructs a chunkserver to copy the chunk data directly from an existing valid replica
- The number of active clone operations/the bandwidth used is limited

Re-balancing:
- Changes in replica distribution for better load balancing; gradual filling of new chunkservers

GFS: Data integrity

Garbage collection

Question: how can a file be deleted from the cluster?

- Deletion is logged by the master
- The file is renamed to a hidden file; the deletion timestamp is kept
- Periodic scan of the master's file system namespace: hidden files older than 3 days are deleted from the master's memory (no further connection between the file and its chunks)
- Periodic scan of the master's chunk namespace: orphaned chunks (not reachable from any file) are identified and their metadata is deleted
- HeartBeat messages are used to synchronise deletion between master and chunkservers

Stale replica detection

Scenario: a chunkserver misses a change ("mutation") applied to a chunk, e.g. a chunk was appended to.

- The master maintains a chunk version number to distinguish up-to-date and stale replicas
- Before an operation on a chunk, the master ensures that the version number is advanced
- Stale replicas are removed in the regular garbage collection cycle

Data corruption

Data corruption or loss can occur at the read and write stage.

Question: how can chunkservers detect whether or not their stored data is corrupt?

- Chunkservers use checksums to detect corruption of stored data (alternative: compare replicas across chunkservers)
- A chunk is broken into 64KB blocks, each with a 32-bit checksum
- Checksums are kept in memory and stored persistently
- Read requests: the chunkserver verifies the checksums of the data blocks that overlap the read range (i.e. corrupted data is not sent to clients)
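A small illustration of this scheme (hypothetical helper, not GFS or HDFS code; CRC32 is used here simply because it is a readily available 32-bit checksum):

  import java.util.zip.CRC32;

  // Split a chunk's bytes into 64KB blocks and compute one 32-bit checksum per
  // block, so a read only needs to verify the blocks overlapping its byte range.
  public class BlockChecksums {
    static final int BLOCK_SIZE = 64 * 1024;

    static long[] checksum(byte[] chunk) {
      int blocks = (chunk.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
      long[] sums = new long[blocks];
      CRC32 crc = new CRC32();
      for (int i = 0; i < blocks; i++) {
        crc.reset();
        int from = i * BLOCK_SIZE;
        int len = Math.min(BLOCK_SIZE, chunk.length - from);
        crc.update(chunk, from, len);
        sums[i] = crc.getValue();     // the lower 32 bits hold the block's checksum
      }
      return sums;
    }
  }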

HDFS: Hadoop Distributed File System

GFS vs. HDFS

- Master (GFS) = NameNode (HDFS)
- chunkserver (GFS) = DataNode (HDFS)
- operation log (GFS) = journal, edit log (HDFS)
- chunk (GFS) = block (HDFS)
- GFS: random file writes possible; HDFS: only append is possible
- GFS: multiple writer, multiple reader model; HDFS: single writer, multiple reader model
- GFS: a chunk is stored as 64KB data pieces, each with a 32-bit checksum; HDFS: per block, two files are created on a DataNode: a data file and a metadata file (checksums, timestamp)
- GFS: default chunk size 64MB; HDFS: default block size 128MB
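On the Hadoop side, a client program never talks to the NameNode or DataNodes explicitly; the FileSystem API hides the block lookups, much like the GFS client library does. A minimal read example (the path is made up; the Configuration picks up the cluster's core-site.xml/hdfs-site.xml from the classpath):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsRead {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();          // cluster settings from the classpath
      FileSystem fs = FileSystem.get(conf);
      try (FSDataInputStream in = fs.open(new Path("/user/in/pages.txt"))) {
        byte[] buf = new byte[4096];
        int n = in.read(buf);                            // block locations are resolved behind the scenes
        System.out.println("read " + n + " bytes");
      }
    }
  }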

MapReduce 1: Hadoop's architecture in versions 0.X and 1.X

- NameNode: master of HDFS, directs the slave DataNode daemons to perform low-level I/O tasks; keeps track of file splitting into blocks, replication, block locations, etc.
- Secondary NameNode: takes snapshots of the NameNode
- DataNode: each slave machine hosts a DataNode daemon

MapReduce 1: JobTracker and TaskTracker

JobTracker (job scheduling + task progress monitoring):
- One JobTracker per Hadoop cluster
- Middleman between your application and Hadoop (single point of contact)
- Determines the execution plan for the application (files to process, assignment of nodes to tasks, task monitoring)
- Takes care of (supposed) task failures

TaskTracker:
- One TaskTracker per DataNode
- Manages individual tasks
- Keeps in touch with the JobTracker (via HeartBeats): sends progress reports & signals empty task slots

MapReduce 1: JobTracker and TaskTracker (figure)

Image source: http://lintool.github.io/mapreducealgorithms/

MapReduce 1: what about the jobs?

- A Hadoop "job" is a unit of work to be performed (by a client): input data, MapReduce program, configuration information
- Hadoop divides a job into tasks (two types: map, reduce)
- Hadoop divides the input data into fixed-size input splits: one map task per split, one map function call for each record in the split
- Splits are processed in parallel (if enough DataNodes exist)
- Job execution is controlled by the JobTracker and the TaskTrackers
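A quick worked example (assuming splits that coincide with HDFS blocks, which is the common case): a 1 GB input file stored with a 128 MB block size yields 1024 MB / 128 MB = 8 input splits, hence 8 map tasks that can run on up to 8 DataNodes in parallel; the number of reduce tasks is set independently in the job configuration.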

Hadoop in practice: Yahoo! (2010)

HDFS cluster with 3,500 nodes:
- 40 nodes/rack sharing one IP switch
- 16GB RAM per cluster node, 1-gigabit Ethernet
- 70% of disk space allocated to HDFS; remainder: operating system, data emitted by Mappers (not stored in HDFS)
- NameNode: up to 64GB RAM
- Total storage: 9.8PB -> 3.3PB net storage (replication factor: 3)
- 60 million files, 63 million blocks
- 54,000 blocks hosted per DataNode
- 1-2 nodes lost per day
- Time for the cluster to re-replicate the lost blocks: 2 minutes

YARN (MapReduce 2)

The JobTracker/TaskTracker setup becomes a bottleneck in clusters with thousands of nodes. In answer, YARN (Yet Another Resource Negotiator) was developed. YARN splits the JobTracker's tasks (job scheduling and task progress monitoring) into two daemons:
- Resource manager (RM)
- Application master (negotiates with the RM for cluster resources; each Hadoop job has a dedicated application master)

YARN advantages

- Scalability: larger clusters are supported
- Availability: high availability (high uptime) is supported
- Utilization: more fine-grained use of resources
- Multitenancy: MapReduce is just one application among many

Recommended reading: Chapter 3

THE END