CS60021: Scalable Data Mining. Sourangshu Bhattacharya

Similar documents
Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

CS November 2017

Distributed Systems. 09r. Map-Reduce Programming on AWS/EMR (Part I) 2017 Paul Krzyzanowski. TA: Long Zhao Rutgers University Fall 2017

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017

UNIT-IV HDFS. Ms. Selva Mary. G

HDFS Architecture. Gregory Kesden, CSE-291 (Storage Systems) Fall 2017

CA485 Ray Walshe Google File System

HDFS Architecture Guide

Distributed Systems 16. Distributed File Systems II

Distributed Filesystem

Hadoop and HDFS Overview. Madhu Ankam

The Google File System. Alexandru Costan

Hadoop Distributed File System(HDFS)

MI-PDB, MIE-PDB: Advanced Database Systems

CS370 Operating Systems

A BigData Tour HDFS, Ceph and MapReduce

MapReduce. U of Toronto, 2014

Overview. Why MapReduce? What is MapReduce? The Hadoop Distributed File System Cloudera, Inc.

Introduction to MapReduce

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia,

Dept. Of Computer Science, Colorado State University

A brief history on Hadoop

BigData and Map Reduce VITMAC03

Distributed Systems. CS422/522 Lecture17 17 November 2014

Introduction to MapReduce. Instructor: Dr. Weikuan Yu Computer Sci. & Software Eng.

Google File System (GFS) and Hadoop Distributed File System (HDFS)

Lecture 11 Hadoop & Spark

CSE 124: Networked Services Lecture-16

CS370 Operating Systems

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

CLOUD-SCALE FILE SYSTEMS

HDFS: Hadoop Distributed File System. Sector: Distributed Storage System

Introduction to Cloud Computing

CSE 124: Networked Services Fall 2009 Lecture-19

COSC 6397 Big Data Analytics. Distributed File Systems (II) Edgar Gabriel Spring HDFS Basics

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework

HADOOP FRAMEWORK FOR BIG DATA

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2017

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

Data Storage Infrastructure at Facebook

COSC 6397 Big Data Analytics. Distributed File Systems (II) Edgar Gabriel Fall HDFS Basics

Distributed File Systems II

Cloud Computing. Hwajung Lee. Key Reference: Prof. Jong-Moon Chung s Lecture Notes at Yonsei University

The Google File System

Chase Wu New Jersey Institute of Technology

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E.

Enhanced Hadoop with Search and MapReduce Concurrency Optimization

CS /30/17. Paul Krzyzanowski 1. Google Chubby ( Apache Zookeeper) Distributed Systems. Chubby. Chubby Deployment.

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2016

Yuval Carmel Tel-Aviv University "Advanced Topics in Storage Systems" - Spring 2013

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Service and Cloud Computing Lecture 10: DFS2 Prof. George Baciu PQ838

Mixing and matching virtual and physical HPC clusters. Paolo Anedda

Programming Systems for Big Data

Authors : Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung Presentation by: Vijay Kumar Chalasani

CS370 Operating Systems

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University

Parallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014

Dept. Of Computer Science, Colorado State University

Chuck Cartledge, PhD. 24 September 2017

Introduction to Hadoop and MapReduce

The Analysis Research of Hierarchical Storage System Based on Hadoop Framework Yan LIU 1, a, Tianjian ZHENG 1, Mingjiang LI 1, Jinpeng YUAN 1

The Google File System

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14

50 Must Read Hadoop Interview Questions & Answers

Hadoop/MapReduce Computing Paradigm

Hadoop Lab 2 Exploring the Hadoop Environment

Data Management. Parallel Filesystems. Dr David Henty HPC Training and Support

Analytics in the cloud

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

Lecture 12 DATA ANALYTICS ON WEB SCALE

Chapter 4: Apache Spark

The Google File System (GFS)

Google File System and BigTable. and tiny bits of HDFS (Hadoop File System) and Chubby. Not in textbook; additional information

Getting Started with Hadoop

Google File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google fall DIP Heerak lim, Donghun Koo

Mounica B, Aditya Srivastava, Md. Faisal Alam

NPTEL Course Jan K. Gopinath Indian Institute of Science

The Google File System

18-hdfs-gfs.txt Thu Nov 01 09:53: Notes on Parallel File Systems: HDFS & GFS , Fall 2012 Carnegie Mellon University Randal E.

Performance Enhancement of Data Processing using Multiple Intelligent Cache in Hadoop

Introduction to MapReduce

Hadoop An Overview. - Socrates CCDH

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

The Google File System

Google File System. By Dinesh Amatya

Introduction to MapReduce

Distributed System. Gang Wu. Spring,2018

A Review Approach for Big Data and Hadoop Technology

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies

The Google File System

CS435 Introduction to Big Data FALL 2018 Colorado State University. 11/7/2018 Week 12-B Sangmi Lee Pallickara. FAQs

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,

Map-Reduce. Marco Mura 2010 March, 31th

Transcription:

CS60021: Scalable Data Mining Sourangshu Bhattacharya

In this Lecture: Outline: HDFS Motivation HDFS User commands HDFS System architecture HDFS Implementation details Sourangshu Bhattacharya Computer Science and Engg.

Hadoop Map Reduce q Provides: q Automatic parallelization and Distribution q Fault Tolerance q Methods for interfacing with HDFS for colocation of computation and storage of output. q Status and Monitoring tools q API in Java q Ability to define the mapper and reducer in many languages through Hadoop streaming.

What is Hadoop? q A scalable fault-tolerant distributed system for data storage and processing. q Core Hadoop: q Hadoop Distributed File System (HDFS) q Hadoop YARN: Job Scheduling and Cluster Resource Management q Hadoop Map Reduce: Framework for distributed data processing. q Open Source system with large community support. https://hadoop.apache.org/

HDFS

6 What s HDFS HDFS is a distributed file system that is fault tolerant, scalable and extremely easy to expand. HDFS is the primary distributed storage for Hadoop applications. HDFS provides interfaces for applications to move themselves closer to data. HDFS is designed to just work, however a working knowledge helps in diagnostics and improvements.

HDFS q Design Assumptions q Hardware failure is the norm. q Streaming data access. q Write once, read many times. q High throughput, not low latency. q Large datasets. q Characteristics: q Performs best with modest number of large files q Optimized for streaming reads q Layer on top of native file system.

HDFS q Data is organized into file and directories. q Files are divided into blocks and distributed to nodes. q Block placement is known at the time of read q Computation moved to same node. q Replication is used for: q Speed q Fault tolerance q Self healing.

Components of HDFS There are two (and a half) types of machines in a HDFS cluster NameNode : is the heart of an HDFS filesystem, it maintains and manages the file system metadata. E.g; what blocks make up a file, and on which datanodes those blocks are stored. DataNode :- where HDFS stores the actual data, there are usually quite a few of these.

HDFS Architecture

HDFS User Commands (dfs) List directory contents hdfs dfs ls hdfs dfs -ls / hdfs dfs -ls -R /var Display the disk space used by files hdfs dfs -du /hbase/data/hbase/namespace/ hdfs dfs -du -h /hbase/data/hbase/namespace/ hdfs dfs -du -s /hbase/data/hbase/namespace/

HDFS User Commands (dfs) Copy data to HDFS hdfs dfs -mkdir tdata hdfs dfs -ls hdfs dfs -copyfromlocal tutorials/data/geneva.csv tdata hdfs dfs -ls R Copy the file back to local filesystem cd tutorials/data/ hdfs dfs copytolocal tdata/geneva.csv geneva.csv.hdfs md5sum geneva.csv geneva.csv.hdfs

HDFS User Commands (acls) List acl for a file hdfs dfs -getfacl tdata/geneva.csv List the file statistics (%r replication factor) hdfs dfs -stat "%r" tdata/geneva.csv Write to hdfs reading from stdin echo "blah blah blah" hdfs dfs -put - tdataset/tfile.txt hdfs dfs -ls R hdfs dfs -cat tdataset/tfile.txt

Goals of HDFS Very Large Distributed File System 10K nodes, 100 million files, 10 PB Assumes Commodity Hardware Files are replicated to handle hardware failure Detect failures and recovers from them Optimized for Batch Processing Data locations exposed so that computations can move to where data resides Provides very high aggregate bandwidth User Space, runs on heterogeneous OS

Distributed File System Single Namespace for entire cluster Data Coherency Write-once-read-many access model Client can only append to existing files Files are broken up into blocks Typically 128 MB block size Each block replicated on multiple DataNodes Intelligent Client Client can find location of blocks Client accesses data directly from DataNode

NameNode Metadata Meta-data in Memory The entire metadata is in main memory No demand paging of meta-data Types of Metadata List of files List of Blocks for each file List of DataNodes for each block File attributes, e.g creation time, replication factor A Transaction Log Records file creations, file deletions. etc

DataNode A Block Server Stores data in the local file system (e.g. ext3) Stores meta-data of a block (e.g. CRC) Serves data and meta-data to Clients Block Report Periodically sends a report of all existing blocks to the NameNode Facilitates Pipelining of Data Forwards data to other specified DataNodes

HDFS read client Source: Hadoop: The Definitive Guide

HDFS write Client Source: Hadoop: The Definitive Guide

Block Placement Current Strategy -- One replica on local node -- Second replica on a remote rack -- Third replica on same remote rack -- Additional replicas are randomly placed Clients read from nearest replica Would like to make this policy pluggable

NameNode Failure A single point of failure Transaction Log stored in multiple directories A directory on the local file system A directory on a remote file system (NFS/CIFS) Need to develop a real HA solution

Data Pipelining Client retrieves a list of DataNodes on which to place replicas of a block Client writes block to the first DataNode The first DataNode forwards the data to the next DataNode in the Pipeline When all replicas are written, the Client moves on to write the next block in file

Conclusion: We have seen: The structure of HDFS. The shell commands. The architecture of HDFS system. Internal functioning of HDFS. Sourangshu Bhattacharya Computer Science and Engg. 24

References: Jure Leskovec, Anand Rajaraman, Jeff Ullman. Mining of Massive Datasets. 2 nd edition. - Cambridge University Press. http://www.mmds.org/ Tom White. Hadoop: The definitive Guide. Oreilly Press. Sourangshu Bhattacharya Computer Science and Engg. 25