HADOOP FRAMEWORK FOR BIG DATA

Mr. K. Srinivas Babu 1, Dr. K. Rameshwaraiah 2
1 Research Scholar, S V University, Tirupathi
2 Professor and Head, NNRESGI, Hyderabad

Abstract - Data must be stored for future requirements, and we need to process and analyze it for business decision making. But data is growing: the size of the databases produced by different sources increases rapidly day by day, and many applications need to process terabytes of data or more per day. This leads to the problem known as Big Data, which cannot be handled by traditional database systems and software tools; new technologies and tools are needed. One such technology is Hadoop, an open-source framework maintained by Apache. It enables distributed, data-intensive, parallel applications by dividing large jobs into smaller jobs that are processed in parallel. Hadoop consists of two core components: the Hadoop Distributed File System (HDFS) for storing data, and MapReduce as a programming model for processing it. This paper provides an overview of the architecture and components of Hadoop.

Keywords - Big Data, Hadoop, HDFS, MapReduce

I. INTRODUCTION

Data is generated by users and, to a large extent, by machines; such fast-growing data is called Big Data. Handling big datasets of terabytes or more is challenging. Hadoop uses MapReduce as its data processing engine because of its simplicity, scalability, and fault tolerance. MapReduce consists of two processing functions: a Map function and a Reduce function. The Map function takes the input data, applies the user's map logic, and stores its results on local disk.
These intermediate results are then given as input to the Reduce function, and the final output is stored back on HDFS. The Hadoop Distributed File System (HDFS) is the file system where the actual data is distributed and stored. HDFS partitions the data into small blocks, which are replicated and stored on different nodes for reliability and fault tolerance. Other tools such as Avro, Sqoop, and Pig provide abstraction as well as faster access to the data. The Hadoop architecture is a master/slave model and follows a distributed approach.

II. BIG DATA

Big Data is not only large in quantity; it also comes in different varieties and must be processed at varying velocities. There are therefore three dimensional attributes of Big Data:

Volume = gigabytes, terabytes, petabytes
Velocity = time sensitivity, streaming, real time
Variety = unstructured/semi-structured

@IJRTER-2015, All Rights Reserved 7
Figure 1: Three dimensions of Big Data

Big Data is typically a large volume of unstructured (or semi-structured) and structured data created by various organized and unorganized applications, activities, and channels such as emails, Twitter, web logs, Facebook, etc. The main difficulties with Big Data include capture, storage, search, sharing, analysis, and visualization. At the core of Big Data is Hadoop, a platform for distributing computing problems across a number of servers. Hadoop brings new and more accessible programming methods for working on massive data sets containing both structured and unstructured data.

III. HADOOP

Hadoop is a batch processing system for a cluster of nodes that supports Big Data analytics. It is a project of the Apache Software Foundation, written in Java to support data-intensive distributed applications. Its inspiration comes from Google's MapReduce and Google File System papers. Yahoo! is Hadoop's biggest contributor, and Hadoop is used extensively across its business and scientific platforms. Hadoop acts like an umbrella: it contains sub-projects around distributed computing, and although it is best known as a runtime environment for MapReduce programs and for its distributed file system HDFS, the other sub-projects provide complementary services, faster access, and higher-level abstractions. Some of the active sub-projects are:

1. MapReduce: Hadoop MapReduce is a programming model consisting of two phases, map and reduce, that rapidly processes massive amounts of data in parallel on large clusters of compute nodes. It works on a master/slave model: the JobTracker manages all map/reduce jobs written by users, and the TaskTrackers take orders from the JobTracker and do the actual work. MapReduce takes its input from HDFS and stores its results back into it. 2.
Hadoop Distributed File System (HDFS): HDFS is the basic file system for storage, an implementation of Google's file system. It is a distributed file system that provides high throughput and creates multiple replicas of data blocks for reliable, fault-tolerant computation.

3. Hadoop Streaming: A utility/API for writing MapReduce code in any language.

4. Hive: Hive converts SQL-like queries into MapReduce jobs and sits on top of Hadoop. It provides a mechanism to put structure on data, along with a simple query language called HiveQL, based on SQL, enabling users familiar with SQL to query the data.

5. Pig: A high-level scripting language environment for MapReduce coding. The structure of Pig programs can be substantially parallelized, with a simple syntax and built-in functionality
which provides an abstraction that makes Hadoop job development faster and easier than Java MapReduce.

6. Sqoop: Provides bidirectional data transfer between Hadoop and a relational database of interest.

7. Oozie: Manages Hadoop workflows. It provides if-then-else branching and control within Hadoop jobs.

8. HBase: A super-scalable key-value store. It is a column-oriented database that works like a hash map or dictionary (despite the name, HBase is not a relational database).

9. HCatalog: A Hadoop storage management layer. The HCatalog table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS), so users need not worry about the format in which the data is stored.

Figure 2: High-level architecture of Hadoop

The Hadoop architecture follows a master/slave pattern and is shown in the diagram below:

Figure 3: Master/slave model of the Hadoop architecture

Hadoop is a master/slave architecture. The master is called the Name Node and the slaves are called Data Nodes. The Name Node controls the data accesses made by clients, while the Data Nodes store the actual data. Hadoop splits each file into one or more blocks, and these blocks are stored on the Data Nodes. Each data block is replicated to 3 different Data Nodes to provide high availability of the Hadoop system; the block replication factor is configurable.
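The splitting and replication described above can be sketched in a few lines. This is an illustrative simulation, not HDFS itself: the block size, node names, and round-robin placement are simplifying assumptions (real HDFS placement is rack-aware).

```python
# Sketch of HDFS-style block splitting and replica placement.
# Block size and node names are illustrative assumptions.

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic HDFS default
REPLICATION = 3                # configurable in real HDFS

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the number of blocks a file of file_size bytes occupies."""
    return (file_size + block_size - 1) // block_size  # ceiling division

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file -> 4 blocks
print(blocks)                                  # 4
print(place_replicas(blocks, nodes)[0])        # ['dn1', 'dn2', 'dn3']
```

Losing any single node here still leaves two replicas of every block, which is the availability property the replication factor buys.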
IV. HADOOP COMPONENTS

Each node in a Hadoop cluster is either a master or a slave. Slave nodes are always both a DataNode and a TaskTracker, while it is possible for the same node to be both a NameNode and a JobTracker. The major components of Hadoop are:

NameNode: The NameNode is the heart of the Hadoop system. It controls the file system namespace and stores all the metadata for the data blocks. This metadata is stored permanently on local disk in the form of a namespace image and an edit log. The NameNode also knows the locations of the data blocks on the DataNodes; however, it does not store this information persistently, but recreates the block-to-DataNode mapping when it is restarted. If the NameNode crashes, the entire Hadoop system goes down.

Secondary NameNode: It periodically copies and merges the namespace image and edit log. If the NameNode crashes, the namespace image stored on the Secondary NameNode can be used to restart the NameNode.

DataNode: It stores and retrieves the blocks of data. The DataNodes also report their block information to the NameNode periodically to signal their liveness.

JobTracker: Its responsibility is to schedule the clients' jobs. The JobTracker creates map and reduce tasks and schedules them to run on the DataNodes (TaskTrackers). It checks for failed tasks and automatically reschedules them on another DataNode. The JobTracker can run on the NameNode or on a separate node.

TaskTracker: It runs on the DataNodes. The TaskTracker's responsibility is to run the map or reduce tasks allotted by the JobTracker and to report the status of those tasks back to it.

Hadoop Distributed File System (HDFS): An HDFS cluster consists of two kinds of nodes working in a master/slave pattern: a NameNode acting as the master and DataNodes acting as slaves. The NameNode controls the file system namespace and maintains the metadata for files and directories.
It also knows which DataNodes hold the blocks of data. The DataNodes are the actual workers: when a client makes a request, the NameNode handles the metadata and directs the client to the respective DataNodes, which store and retrieve the blocks. Each DataNode periodically sends a heartbeat to the NameNode along with the list of blocks it is holding. The NameNode decides which blocks should be stored on which DataNodes, and it also decides the replication factor of the data. Normally every file is divided into blocks of 64 MB, with a replication factor of 3. Hadoop MapReduce applications use HDFS storage, which is generally different from a normal file system.

To read a file from HDFS, the client simply uses a Java input stream; internally, the request is sent to the NameNode. If the NameNode grants access, it returns the ids of the file's blocks together with the DataNodes holding them. The client then opens a connection to the closest DataNode and requests a specific block id. The requested HDFS block is returned to the client over the same connection, and the data is delivered without further interference from the NameNode. To write data to HDFS, the client uses an output stream, which internally divides the data into HDFS-sized blocks; the NameNode assigns the blocks to the respective DataNodes, and the DataNodes acknowledge to the NameNode that the data has been committed.
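The read path above can be sketched as a small simulation: the NameNode holds only metadata (block ids and their locations), while the DataNodes hold the bytes. All names, paths, and block contents below are invented for illustration; this is not the real HDFS client API.

```python
# Illustrative simulation of the HDFS read path: the client asks the
# NameNode for a file's block ids and locations, then fetches each block
# directly from a DataNode. All identifiers here are made up.

namenode = {  # file path -> ordered list of (block_id, [replica locations])
    "/logs/app.log": [("blk_1", ["dn1", "dn3"]), ("blk_2", ["dn2", "dn1"])],
}
datanodes = {  # (datanode, block_id) -> block contents
    ("dn1", "blk_1"): b"first block ",
    ("dn2", "blk_2"): b"second block",
}

def read_file(path):
    data = b""
    for block_id, locations in namenode[path]:   # metadata from the NameNode
        for dn in locations:                     # try replicas in order
            if (dn, block_id) in datanodes:
                data += datanodes[(dn, block_id)]  # bytes come from a DataNode
                break
    return data

print(read_file("/logs/app.log"))  # b'first block second block'
```

Note that once the block list is known, the NameNode is out of the data path entirely, which is what keeps it from becoming a throughput bottleneck.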
Figure 4: Hadoop distributed cluster file system architecture

V. MAPREDUCE

MapReduce is a data processing model introduced by Google. Hadoop MapReduce is a powerful processing engine and a software framework that makes it easy to write applications processing massive amounts of data in parallel on large clusters. A MapReduce job normally breaks the input data set into small chunks, which are then processed by mappers in parallel. The outputs of the mappers are sorted and shuffled and given as input to the reducers. All task scheduling is done by the framework itself, which also monitors the tasks and automatically re-executes any that fail. Generally the storage nodes and the compute nodes are the same; that is, HDFS and the MapReduce framework run on the same nodes. This configuration allows the data processing task to be moved to the node where the data resides, though it requires high bandwidth across the cluster to connect the nodes. The Hadoop MapReduce architecture consists of a single JobTracker and one or more TaskTracker nodes per cluster. The master, the JobTracker, is responsible for scheduling and monitoring the tasks; the slave, a TaskTracker, is responsible for doing the work directed by the master. MapReduce refers to two distinct and separate tasks, or functions, written by clients: a map function and a reduce function. First is the map task, which takes the input dataset and applies the map function to produce another set of data, in which individual elements are broken down into tuples (key/value pairs) stored on local disk. The reduce task takes the output of the map as its input, aggregates those data tuples into a smaller set of tuples, and stores the final result back on HDFS.
The reduce job is always performed after the map job. Although MapReduce tasks are usually written in Java, clients can also write them in other languages using Hadoop Streaming.

Figure 5: MapReduce architecture
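The map/reduce flow described in this section can be sketched as a word count, the canonical MapReduce example, in the Hadoop Streaming style just mentioned. The shuffle/sort step that the framework performs between the phases is simulated in-process here; with real Hadoop Streaming, the mapper and reducer would instead read stdin and write stdout as separate scripts.

```python
# Word count in the MapReduce style: the mapper emits (word, 1) pairs and
# the reducer sums the counts for each word. The sort + groupby below
# stands in for the framework's shuffle/sort phase.

from itertools import groupby

def mapper(line):
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    return (word, sum(counts))

def run_job(lines):
    # map phase: apply the mapper to every input record
    pairs = [kv for line in lines for kv in mapper(line)]
    # shuffle/sort: group intermediate pairs by key, as the framework does
    pairs.sort(key=lambda kv: kv[0])
    results = {}
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        key, total = reducer(word, (count for _, count in group))
        results[key] = total
    return results

print(run_job(["Hadoop stores data", "Hadoop processes data"]))
# {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

Because each (word, 1) pair is independent, the map phase parallelizes trivially across input chunks, which is exactly what lets Hadoop scale this pattern to terabytes.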
Advantages of Hadoop

Scalable: Nodes can be added to and removed from the cluster as required.
Cost effective: Clusters are built with commodity hardware, which is low in cost.
Flexible: Hadoop can be used with business applications as well as other applications.
Fast: Hadoop moves the computation to the node where the data is stored, which makes it fast.
Fault tolerant: Because the data is replicated, another node takes over automatically in case of failure.