UNIT-IV HDFS. Ms. Selva Mary. G


HDFS ARCHITECTURE When a dataset outgrows the storage capacity of a single machine, it is partitioned across a number of separate machines. File systems that manage storage across a network of machines are called distributed file systems; Hadoop's distributed file system is the Hadoop Distributed File System (HDFS).

The Design of HDFS: HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. Very large files: very large means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.

Streaming data access: the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from a source, and then various analyses are performed on that dataset over time.

Commodity hardware: Hadoop doesn't require expensive, highly reliable hardware. It's designed to run on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors) for which the chance of node failure across the cluster is high, at least for large clusters.

Low-latency data access: applications that require low-latency access to data will not work well with HDFS. HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency. HBase is currently a better choice for low-latency access.

Lots of small files: since the namenode holds filesystem metadata in memory, the limit to the number of files in a file system is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes of namenode memory.
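As a rough worked illustration of that rule of thumb (the figures are assumed only for this example): one million files, each occupying a single block, need roughly one million file entries plus one million block entries, i.e. about 2,000,000 x 150 bytes, or around 300 MB of namenode memory. Scaling this to a billion small files would require hundreds of gigabytes of memory on a single namenode, which is why HDFS is a poor fit for very large numbers of small files.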

Features of HDFS It is suitable for distributed storage and processing. Hadoop provides a command-line interface to interact with HDFS. The built-in servers of the namenode and datanode help users easily check the status of the cluster. HDFS provides file permissions and authentication.

Goals of HDFS Fault detection and recovery: since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore, HDFS should have mechanisms for quick and automatic fault detection and recovery. Huge datasets: HDFS should scale to hundreds of nodes per cluster in order to manage applications with huge datasets. Hardware at data: a requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces the network traffic and increases the throughput.

HDFS Architecture

Namenode The namenode is commodity hardware running an operating system and the namenode software. The system hosting the namenode acts as the master server, and it does the following tasks: it manages the file system namespace, regulates clients' access to files, and executes file system operations such as renaming, closing, and opening files and directories.
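These namespace operations are what a client exercises through the FileSystem API; each call below results in a request to the namenode. The following is a minimal, hedged sketch (the directory names are hypothetical, and a reachable HDFS cluster configuration on the classpath is assumed):

// NamespaceOps.java - illustrative namenode-backed namespace operations; paths are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());        // connects to the file system named in fs.defaultFS
        fs.mkdirs(new Path("/user/demo"));                           // create a directory (namespace change)
        fs.rename(new Path("/user/demo"), new Path("/user/demo2"));  // rename a directory
        fs.delete(new Path("/user/demo2"), true);                    // delete it recursively
        fs.close();
    }
}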

Secondary Namenode HDFS is based on a master/slave architecture. Having a single NameNode simplifies the overall HDFS architecture, but it also creates a single point of failure: losing the NameNode effectively means losing HDFS. To somewhat alleviate this problem, Hadoop implements a Secondary NameNode.

The Secondary NameNode is not a backup NameNode: it cannot take over the primary NameNode's function. Rather, it serves as a checkpointing mechanism for the primary NameNode. To store the state of HDFS, the NameNode maintains two on-disk data structures: an image file and an edit log. The image file represents the metadata state at a point in time, and the edit log is a transactional log of every filesystem metadata change since the image file was created. The Secondary NameNode periodically merges the edit log into the image file, producing a fresh checkpoint.

Datanode The datanode is commodity hardware running an operating system and the datanode software. For every node in a cluster, there is a datanode. These nodes manage the data storage of their system. Datanodes perform read-write operations on the file system, as per client requests. They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.

Block A file is divided into one or more segments, which are stored on individual datanodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB, but it can be increased as needed by changing the HDFS configuration.

Block advantages First, a file can be larger than any single disk in the network. Second, making the unit of abstraction a block rather than a file simplifies the storage subsystem (it is easy to calculate how many blocks can be stored on a given disk). Furthermore, blocks fit well with replication for providing fault tolerance and availability: if a block becomes unavailable, a copy can be read from another location.
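To make the block abstraction concrete, the sketch below (a hedged illustration; the file path /user/input/file.txt is hypothetical and a reachable cluster is assumed) uses the Java FileSystem API to list each block of a file together with the datanodes that hold its replicas:

// ListBlocks.java - illustrative listing of a file's blocks and replica locations.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/input/file.txt"));  // hypothetical file
        // One BlockLocation per block: its offset, its length, and the hosts storing replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset() + " length=" + b.getLength()
                    + " hosts=" + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}

Each printed line corresponds to one block of the file, which is why a file larger than any single disk can still be stored: its blocks are spread over many datanodes.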

HDFS FEDERATION

HDFS has two main layers. Namespace: consists of directories, files and blocks. It supports file system operations such as create, delete, modify and list files and directories. Block Storage Service, which has two parts: Block Management (performed in the Namenode): provides Datanode cluster membership by handling registrations, processes block reports, maintains the location of blocks, supports block-related operations (create, delete and get block location), and manages replica placement and block replication. Storage: provided by Datanodes, which store blocks on the local file system and allow read/write access.

The prior HDFS architecture allows only a single namespace for the entire cluster, and a single Namenode manages that namespace. HDFS Federation addresses this limitation by adding support for multiple Namenodes/namespaces to HDFS.

Multiple Namenodes/Namespaces Federation uses multiple independent Namenodes/namespaces. The Namenodes are federated: they are independent and do not require coordination with each other. The Datanodes are used as common storage for blocks. Each Datanode registers with all the Namenodes in the cluster, sends periodic heartbeats and block reports, and handles commands from the Namenodes.

Block Pool A Block Pool is a set of blocks that belong to a single namespace. Datanodes store blocks for all the block pools in the cluster. Each Block Pool is managed independently, so a Namenode failure does not prevent the Datanodes from serving other Namenodes in the cluster. A namespace and its block pool together are called a Namespace Volume; it is a self-contained unit of management. When a Namenode/namespace is deleted, the corresponding block pool at the Datanodes is deleted.

ClusterID A ClusterID identifier is used to identify all the nodes in the cluster.

Key Benefits Namespace Scalability - more Namenodes can be added to the cluster, allowing the namespace to scale horizontally. Performance - adding more Namenodes to the cluster scales the file system read/write throughput. Isolation - a single Namenode offers no isolation in a multi-user environment; by using multiple Namenodes, different categories of applications and users can be isolated into different namespaces.

Hadoop - HDFS Operations Starting HDFS Initially you have to format the configured HDFS file system. Open the namenode (HDFS server) and execute the following command. $ hadoop namenode -format After formatting HDFS, start the distributed file system. The following command will start the namenode as well as the datanodes as a cluster. $ start-dfs.sh

Listing Files in HDFS After loading the information into the server, we can find the list of files in a directory and the status of a file using ls. Given below is the syntax of ls; you can pass a directory or a filename as an argument. $ $HADOOP_HOME/bin/hadoop fs -ls <args>

Inserting Data into HDFS Assume we have data in a file called file.txt on the local system that ought to be saved in the HDFS file system. Follow the steps given below to insert the required file into the Hadoop file system. Step 1 Create an input directory. $ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input Step 2 Transfer and store the data file from the local system to the Hadoop file system using the put command. $ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input Step 3 You can verify the file using the ls command. $ $HADOOP_HOME/bin/hadoop fs -ls /user/input
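The same transfer can also be done programmatically. The sketch below is a hedged Java equivalent of the three steps above (it reuses the same illustrative paths and assumes a cluster configuration on the classpath):

// PutFile.java - illustrative Java equivalent of mkdir, put and ls; paths are the example paths used above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.mkdirs(new Path("/user/input"));                           // Step 1: hadoop fs -mkdir /user/input
        fs.copyFromLocalFile(new Path("/home/file.txt"),              // Step 2: hadoop fs -put /home/file.txt /user/input
                             new Path("/user/input/file.txt"));
        for (FileStatus s : fs.listStatus(new Path("/user/input")))   // Step 3: hadoop fs -ls /user/input
            System.out.println(s.getPath() + "  " + s.getLen() + " bytes");
        fs.close();
    }
}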

Retrieving Data from HDFS Assume we have a file in HDFS called outfile. Given below is a simple demonstration of retrieving the required file from the Hadoop file system. Step 1 Initially, view the data from HDFS using the cat command. $ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile Step 2 Get the file from HDFS to the local file system using the get command. $ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/

Shutting Down the HDFS You can shut down the HDFS by using the following command. $ stop-dfs.sh

Other basic commands Running ./bin/hadoop dfs with no additional arguments will list all the commands that can be run with the FsShell system. Furthermore, $HADOOP_HOME/bin/hadoop fs -help commandname will display a short usage summary for the operation in question, if you are stuck.

Sr.No - Command and description
1. ls <path> - Lists the contents of the directory specified by path, showing the names, permissions, owner, size and modification date for each entry.
2. lsr <path> - Behaves like -ls, but recursively displays entries in all subdirectories of path.
3. du <path> - Shows disk usage, in bytes, for all the files which match path.
4. dus <path> - Like -du, but prints a summary of disk usage of all files/directories in the path.
5. mv <src> <dest> - Moves the file or directory indicated by src to dest, within HDFS.
6. cp <src> <dest> - Copies the file or directory identified by src to dest, within HDFS.
7. rm <path> - Removes the file or empty directory identified by path.
8. rmr <path> - Removes the file or directory identified by path. Recursively deletes any child entries (i.e., files or subdirectories of path).
9. put <localsrc> <dest> - Copies the file or directory from the local file system identified by localsrc to dest within the DFS.
10. copyFromLocal <localsrc> <dest> - Identical to put.
11. moveFromLocal <localsrc> <dest> - Copies the file or directory from the local file system identified by localsrc to dest within HDFS, and then deletes the local copy on success.
12. get [-crc] <src> <localdest> - Copies the file or directory in HDFS identified by src to the local file system path identified by localdest.
13. getmerge <src> <localdest> - Retrieves all files that match the path src in HDFS, and copies them to a single, merged file in the local file system identified by localdest.
14. cat <filename> - Displays the contents of filename on stdout.
15. copyToLocal <src> <localdest> - Identical to get.
16. moveToLocal <src> <localdest> - Works like -get, but deletes the HDFS copy on success.
17. mkdir <path> - Creates a directory named path in HDFS. Creates any parent directories in path that are missing (like mkdir -p in Linux).
18. test -[ezd] <path> - Returns 1 if path exists, has zero length, or is a directory; 0 otherwise.
19. stat [format] <path> - Prints information about path. Format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
20. help <cmd-name> - Returns usage information for one of the commands listed above. You must omit the leading '-' character in cmd.

DATA FLOW - ANATOMY OF A FILE READ

Step 1: The client opens the file by calling open() on the DistributedFileSystem. Step 2: DistributedFileSystem calls the namenode using RPCs to determine the locations of the first few blocks in the file. For each block, the namenode returns the addresses of the datanodes that have a copy of that block. Step 3: The DistributedFileSystem returns an FSDataInputStream, and the client then calls read() on the stream. Step 4: DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, connects to the first (closest) datanode for the first block in the file; the client calls read() repeatedly on the stream.

Step 5: When the end of a block is reached, DFSInputStream closes the connection to that datanode and finds the best datanode for the next block. Blocks are read in order. Step 6: When the client has finished reading, it calls close() on the stream. If DFSInputStream encounters an error while communicating with a datanode, it tries the next closest one for that block and remembers datanodes that have failed. It also verifies checksums for the data transferred to it from the datanode; if it finds a corrupted block, it attempts to read a replica of the block from another datanode and reports the corrupted block to the namenode.
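From the client's point of view, this whole read path is hidden behind a few calls. The following hedged sketch (the path /user/output/outfile is the illustrative one used earlier, and a reachable cluster configuration is assumed) opens a file, obtaining the FSDataInputStream described above, and streams its contents to standard output:

// ReadFile.java - a minimal client-side HDFS read; the path is illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());               // DistributedFileSystem when fs.defaultFS is an hdfs:// URI
        FSDataInputStream in = fs.open(new Path("/user/output/outfile"));  // Steps 1-3: open() returns the input stream
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);                // Steps 4-5: read() pulls data block by block
        } finally {
            IOUtils.closeStream(in);                                       // Step 6: close()
            fs.close();
        }
    }
}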

Finding nearest data node and block

distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)
distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)
distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center)
distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers)
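Hadoop models the network as a tree (data center / rack / node) and takes the distance between two nodes to be the number of tree edges to their closest common ancestor. The sketch below is an illustration of that rule only, not Hadoop's own NetworkTopology implementation; it reproduces the four distances listed above from topology paths such as /d1/r1/n1:

// NetworkDistance.java - illustrative tree-distance rule used for picking the closest datanode.
public class NetworkDistance {
    // Distance = (levels from a up to the common ancestor) + (levels from b up to the common ancestor).
    static int distance(String a, String b) {
        String[] pa = a.split("/");
        String[] pb = b.split("/");
        int common = 0;
        while (common < pa.length && common < pb.length && pa[common].equals(pb[common])) {
            common++;                                  // count shared path components
        }
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0: same node
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2: same rack
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4: different racks, same data center
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // 6: different data centers
    }
}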

ANATOMY OF A FILE WRITE

We're going to consider the case of creating a new file, writing data to it, then closing the file. Step 1: The client calls create() on the DistributedFileSystem. Step 2: DistributedFileSystem makes an RPC call to the namenode to create a new file in the file system's namespace. The namenode performs various checks to make sure the file doesn't already exist; if the checks pass, the namenode makes a record of the new file; otherwise, file creation fails and the client is thrown an IOException. Step 3: The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to; it handles communication with the datanodes and namenode. As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The list of datanodes forms a pipeline; assuming the replication level is three, there are three nodes in the pipeline.

Step 4: The DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards it to the second datanode in the pipeline. Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline. Step 5: The DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline. Step 6: When the client has finished writing data, it calls close() on the stream. Step 7: This flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete. The namenode already knows which blocks the file is made up of (because the DataStreamer asks for block allocations), so it only has to wait for blocks to be minimally replicated before returning successfully.
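Seen from the client, the entire write pipeline is hidden behind a few calls. The following hedged sketch (the target path /user/input/newfile.txt and its content are purely illustrative) creates a file, writes some bytes through the FSDataOutputStream described above, and closes it:

// WriteFile.java - a minimal client-side HDFS write; path and content are illustrative.
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());                     // DistributedFileSystem for an hdfs:// URI
        FSDataOutputStream out = fs.create(new Path("/user/input/newfile.txt")); // Steps 1-3: create() + namenode RPC
        try {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));          // Steps 4-5: data is split into packets and pipelined
        } finally {
            out.close();                                                          // Steps 6-7: flush, wait for acks, complete the file
            fs.close();
        }
    }
}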

Replica pipeline

Hadoop's default strategy is to place the first replica on the same node as the client. The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes, although the system tries to avoid placing too many replicas on the same rack. Once the replica locations have been chosen, a pipeline is built. Overall, this strategy gives a good balance among reliability, write bandwidth, read performance and block distribution across the cluster (clients only write a single block on the local rack).

End of Unit IV