Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

1 Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

2 Cluster Computing Motivations
  Large-scale data processing on clusters
    Scanning a multi-terabyte data set on a single node takes days
    The same scan on a thousand-node-scale cluster takes minutes
  Cost-efficiency
    Commodity nodes and network
    Automatic fault-tolerance (fewer admins)
    Easy to use (fewer programmers)
  Functions
    Automatic parallelization and distribution
    A clean abstraction for programmers
    Fault-tolerance
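The days-versus-minutes argument is simple arithmetic. The sketch below works one instance of it; the data size, per-node throughput and node count are hypothetical values chosen for illustration, not the figures from the slide, and the model assumes the data is spread evenly so that all nodes scan in parallel.

```python
def scan_seconds(data_bytes: float, bytes_per_sec_per_node: float, nodes: int = 1) -> float:
    """Time to scan a data set that is spread evenly over `nodes` machines."""
    return data_bytes / (bytes_per_sec_per_node * nodes)


TB = 10**12
MB = 10**6

# Hypothetical workload: 100 TB read at 50 MB/s per node.
single = scan_seconds(100 * TB, 50 * MB)
cluster = scan_seconds(100 * TB, 50 * MB, nodes=1000)

print(f"1 node:     {single / 86400:.1f} days")
print(f"1000 nodes: {cluster / 60:.1f} minutes")
```

Under these assumptions the single node needs about 23 days and the cluster about 33 minutes. The model ignores network and coordination overhead, so it is an upper bound on the speed-up.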

3 Typical Cluster
  An aggregation switch connects the per-rack switches
  Tens of nodes per rack, thousands of nodes per cluster
  Gbps-class bandwidth within a rack, 8 Gbps out of the rack
  Node specs: multi-core CPUs, gigabytes of RAM, multi-terabyte disks

4 Why Cloud Computing?
  "Cloud" refers to large Internet services running on tens of thousands of machines (e.g. Facebook)
  "Cloud computing" refers to services from these companies that let external customers rent cycles
    Amazon EC2: virtual machines, billed hourly
    Amazon S3: storage, billed per GB per month
    Windows Azure: applications using the Azure API
  Attractive features
    Scale: 100s of nodes available in minutes
    Elastic computing with fine-grained billing: pay only for what you use
    Ease of use: sign up with a credit card, get root access

5 Distributed Filesystems
  The interface is the same as a single-machine file system
    create(), open(), read(), write(), close()
  Distribute file data across a number of machines (storage units)
    Support replication
    Support concurrent data access
    Fetch content from remote servers, with local caching
  Google File System and Hadoop HDFS
    Optimized for batch processing
    Provide redundant storage of massive amounts of data on cheap and unreliable computers

6 API for Hadoop File System
  Shell commands: mkdir, ls, cat, cp
    hadoop fs -mkdir /user/deepak/dir
    hadoop fs -ls /user/deepak
    hadoop fs -cat /user/deepak/file.txt
    hadoop fs -cp /user/deepak/dir/abc.txt /user/deepak/dir
  Copy data from the local file system to HDFS
    hadoop fs -copyFromLocal <src:localfilesystem> <dest:hdfs>
    Ex: hadoop fs -copyFromLocal /home/hduser/def.txt /user/deepak/dir
  Copy data from HDFS to the local file system
    hadoop fs -copyToLocal <src:hdfs> <dest:localfilesystem>
  Other APIs
    Mainly Java; Python access
    A C language wrapper
    An HTTP browser interface to view HDFS files

7 Assumptions of GFS/Hadoop DFS
  High component failure rates
    Inexpensive commodity components fail all the time. Why?
  Modest number of HUGE files
    Common in big data applications
    Just a few million files
    Each is 100 MB or larger; multi-GB files are typical
  Files are write-once, mostly appended to
    Perhaps concurrently
  Large streaming reads
  High sustained throughput favored over low latency
    Good for batch processing

8 Hadoop Distributed File System
  Files are split into 64 MB blocks
  Blocks are replicated across several DataNodes, which act as slaves
  The NameNode stores metadata (file names, locations, etc.) and acts as the master
  Files are append-only; the system is optimized for large files and sequential reads
    Read: use any copy (preferably a nearby one)
    Write: append to all replicas
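As a rough illustration of block splitting and replication, the sketch below computes the block layout of one file. The 64 MiB block size, the 200 MiB file and the replication factor of 3 are assumptions for the example, and `block_ranges` and `raw_bytes_stored` are hypothetical helper names, not part of any Hadoop API.

```python
MIB = 1024 * 1024
BLOCK_SIZE = 64 * MIB  # assumed block size


def block_ranges(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (offset, length) for every block of a file; the last block may be short."""
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]


def raw_bytes_stored(file_size: int, replication: int) -> int:
    """Total bytes written across the cluster when every block is replicated."""
    return file_size * replication


ranges = block_ranges(200 * MIB)
for index, (offset, length) in enumerate(ranges):
    print(f"block {index}: offset={offset // MIB} MiB, length={length // MIB} MiB")
print("raw storage:", raw_bytes_stored(200 * MIB, replication=3) // MIB, "MiB")
```

A 200 MiB file yields three full blocks and one 8 MiB tail block; only the bytes actually written are stored, so the short last block does not waste a full block of disk.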

9 HDFS Architecture
  Components: Client, NameNode, Secondary NameNode, DataNodes (the NameNode tracks cluster membership)
  NameNode: maps a file to a file-id and a list of block ids and DataNodes
  DataNode: maps a block-id to a physical location on disk
  Secondary NameNode: backup; periodically merges the transaction log

10 NameNode Metadata
  Metadata in memory
    The entire metadata is kept in main memory
    No demand paging of metadata
  Types of metadata
    List of files
    List of blocks for each file
    List of DataNodes for each block
    File attributes, e.g. creation time, replication factor
  A transaction log
    Records actions such as file creation and deletion
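The metadata types listed above map naturally onto a few in-memory dictionaries plus an append-only log. The following is a toy sketch under that assumption; `ToyNameNode`, `FileMeta` and their methods are hypothetical names for illustration and do not mirror the real NameNode classes.

```python
import time
from dataclasses import dataclass, field


@dataclass
class FileMeta:
    replication: int                              # file attribute: replication factor
    created: float                                # file attribute: creation time
    blocks: list = field(default_factory=list)    # ordered block ids of the file


class ToyNameNode:
    def __init__(self):
        self.files = {}            # path -> FileMeta (list of files)
        self.block_locations = {}  # block id -> set of DataNode names
        self.log = []              # append-only transaction log

    def create(self, path, replication=3):
        self.files[path] = FileMeta(replication, time.time())
        self.log.append(("create", path))

    def add_block(self, path, block_id, datanodes):
        self.files[path].blocks.append(block_id)
        self.block_locations[block_id] = set(datanodes)
        self.log.append(("add_block", path, block_id))

    def delete(self, path):
        for block_id in self.files.pop(path).blocks:
            self.block_locations.pop(block_id, None)
        self.log.append(("delete", path))

    def locate(self, path):
        """Return [(block id, sorted DataNode names)] for a file, in block order."""
        return [(b, sorted(self.block_locations[b])) for b in self.files[path].blocks]


nn = ToyNameNode()
nn.create("/user/deepak/file.txt")
nn.add_block("/user/deepak/file.txt", "blk_0", ["dn1", "dn2", "dn3"])
print(nn.locate("/user/deepak/file.txt"))
```

Because every lookup is a dictionary access in memory, metadata operations are fast; the log is what would let the state be rebuilt after a restart.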

11 DataNode
  A block server (acts as a key-value store)
    Stores data in the local file system (e.g. ext)
    Stores metadata of a block (e.g. CRC)
    Serves data and metadata to clients
  Block report
    Periodically sends a report of all existing blocks to the NameNode
  Facilitates pipelining of data
    Forwards data to other specified DataNodes

12 Block Placement: Where to Place Replicas
  Current strategy
    -- One replica on the local node
    -- Second replica on a remote rack
    -- Third replica on the same remote rack
    -- Additional replicas are randomly placed
  Clients read from the nearest replica
  Would like to make this policy pluggable
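The placement strategy above can be sketched in a few lines. This is a simplified illustration, not HDFS's placement code: `place_replicas` and the topology dictionary are hypothetical, and the sketch assumes at least two racks with at least two nodes each.

```python
import random


def place_replicas(writer, topology, replication=3, rng=random):
    """Pick DataNodes for one block.

    topology: dict mapping rack name -> list of node names.
    writer:   the node the client runs on (receives the first replica).
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    chosen = [writer]                                   # first replica: local node
    if replication >= 2:
        remote_racks = [r for r in topology if r != rack_of[writer]]
        remote = rng.choice(remote_racks)
        second = rng.choice(topology[remote])           # second replica: a remote rack
        chosen.append(second)
        if replication >= 3:
            others = [n for n in topology[remote] if n != second]
            chosen.append(rng.choice(others))           # third replica: same remote rack
    remaining = [n for n in rack_of if n not in chosen]
    rng.shuffle(remaining)                              # additional replicas: random nodes
    chosen.extend(remaining[: replication - len(chosen)])
    return chosen


topology = {"rack1": ["a1", "a2", "a3"], "rack2": ["b1", "b2", "b3"]}
print(place_replicas("a1", topology))
```

Putting two replicas on one remote rack means a whole-rack failure never loses all copies, while the write pipeline crosses the aggregation switch only once.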

13 DataNode Failure Detection with Heartbeats
  A network partition can cause a subset of DataNodes to lose connectivity with the NameNode
  The NameNode detects this condition by the absence of heartbeat messages
  The NameNode marks DataNodes without a recent heartbeat as dead and does not send any I/O requests to them
  Any data registered to a failed DataNode is no longer available to HDFS
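Heartbeat-based detection reduces to remembering when each DataNode was last heard from. The sketch below shows the idea; `HeartbeatMonitor` is a hypothetical class, and the 30-second timeout is an arbitrary value for illustration, not an HDFS default. Timestamps are passed in explicitly so the behaviour is easy to follow.

```python
class HeartbeatMonitor:
    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen = {}  # DataNode name -> time of last heartbeat

    def heartbeat(self, datanode: str, now: float) -> None:
        """Record a heartbeat received from `datanode` at time `now`."""
        self.last_seen[datanode] = now

    def live_nodes(self, now: float) -> set:
        """Nodes whose last heartbeat is within the timeout; only these get I/O requests."""
        return {d for d, t in self.last_seen.items() if now - t <= self.timeout_s}

    def dead_nodes(self, now: float) -> set:
        """Nodes that were registered but have been silent for longer than the timeout."""
        return set(self.last_seen) - self.live_nodes(now)


monitor = HeartbeatMonitor(timeout_s=30)
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
monitor.heartbeat("dn1", now=25)   # dn2 stops reporting, e.g. after a network partition
print(sorted(monitor.dead_nodes(now=40)))
```

Note that the NameNode cannot distinguish a crashed DataNode from a partitioned one; both simply go silent.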

14 Data Pipelining during a Block Write to All Replicas
  The client retrieves a list of DataNodes on which to place replicas of a block
  The client writes the block to the first DataNode
  The first DataNode forwards the data to the next DataNode in the pipeline
  When all replicas are written, the client moves on to write the next block in the file
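The write pipeline can be modelled as each node storing the data and handing it to the next node in the list. This is a toy, in-process sketch: `ToyDataNode` and `write_block` are hypothetical names, and a real pipeline streams data in small packets over the network rather than passing whole blocks in function calls.

```python
class ToyDataNode:
    def __init__(self, name: str):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def write(self, block_id: str, data: bytes, downstream: list) -> bool:
        """Store the block locally, then forward it to the next node in the pipeline."""
        self.blocks[block_id] = data
        if downstream:
            return downstream[0].write(block_id, data, downstream[1:])
        return True  # last node: the acknowledgement travels back up the call chain


def write_block(block_id: str, data: bytes, pipeline: list) -> bool:
    """Client side: send the block to the first DataNode only."""
    return pipeline[0].write(block_id, data, pipeline[1:])


nodes = [ToyDataNode("dn1"), ToyDataNode("dn2"), ToyDataNode("dn3")]
acked = write_block("blk_0", b"hello hdfs", nodes)
print(acked, [n.name for n in nodes if "blk_0" in n.blocks])
```

The client uploads each block once, so its outgoing bandwidth does not grow with the replication factor; the DataNodes share the forwarding work.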

15 How to Ensure Data Correctness during Reading
  Use checksums (CRC32) to validate data
  File creation
    The client computes a checksum per 512 bytes
    The DataNode stores the checksums
  File access
    The client retrieves the data and checksums from a DataNode
    If validation fails, the client tries other replicas

16 Some Properties of Hadoop DFS
  HDFS provides a write-once-read-many, append-only access model for data
  HDFS is optimized for sequential reads of large files with large blocks (e.g. 64 MB)
  HDFS maintains multiple copies of the data for fault tolerance
  HDFS is designed for high throughput rather than low latency
  Hadoop jobs (e.g. MapReduce) tend to run for several minutes to hours

17 Questions: Hadoop
  Q: True _ False _ Hadoop is good at managing a large number of small files.
  Q: _ machine failures can be tolerated by Hadoop with a given replication degree.
  Q: True _ False _ Hadoop is a good fit for supporting an online shopping web service.

18 Summary
  Why cloud computing
    Large scale: 100s of nodes available in minutes
    Elastic computing: pay only for what you use
    Ease of use
  Hadoop: a petabyte-scale file system for handling big data sets
    Provides redundant storage of massive amounts of data on cheap and unreliable computers
    Optimized for batch processing
    Replication for fault tolerance
  Useful links: HDFS Design, Hadoop API
