BIG DATA & HDFS. Anuradha Bhatia



OUTLINE
o Big Data
o Characteristics of Big Data
o Traditional vs. Streaming Data
o Hadoop
o Hadoop Architecture

BIG DATA
Big data is a collection of both structured and unstructured data that is too large, too fast, and too distinct to be managed by traditional database management tools or traditional data processing applications. For example:
o Data managed by e-commerce websites for search requests, consumer/customer recommendations, current trends, and merchandising
o Data managed by social media for providing a social network platform
o Data managed by real-time auctions/bidding in online environments

IMPLEMENTATION
o Natural Systems: wildfire management, water management
o Stock Market: impact of weather on securities prices; 5 million messages per second, a trade in 150 microseconds
o Law Enforcement: real-time multimodal surveillance
o Transportation: intelligent traffic management, global air traffic management
o Manufacturing: process control for microchip fabrication
o Health & Life Sciences: neonatal ICU monitoring, epidemic early warning systems, remote healthcare monitoring
o Fraud Prevention: detecting multi-party fraud, real-time fraud prevention
o Radio Astronomy: detection of transient events

BIG DATA USES

BIG DATA CONSISTS OF

WHICH PLATFORM DO YOU CHOOSE?
(Figure: the platforms Hadoop, analytic database, and general-purpose RDBMS matched against structured, semi-structured, and unstructured data.)

CHARACTERISTICS OF BIG DATA
o Volume: systems and users generating terabytes, petabytes, and zettabytes of data (challenge: storage)
o Velocity: system-generated streams of data; multiple sources feeding data into one system (challenge: processing)
o Variety: structured data, and unstructured data such as blogs, images, audio, etc. (challenge: presentation)

VALUE CHAIN OF BIG DATA
o Data Generation: sources of data, e.g., users, enterprises, systems
o Data Collection: companies, tools, and sites aggregating data, e.g., IMS
o Data Analysis: research and analytics firms, e.g., Mu Sigma
o Application of Insights: management consulting firms, MNCs

TRADITIONAL COMPUTING
o Historical fact finding with data-at-rest
o Batch paradigm, pull model
o Query-driven: submits queries to static data
o Relies on databases and data warehouses
(Figure: queries are submitted against static data and return results.)

STREAM COMPUTING
o Real-time analysis of data-in-motion
o Streaming data: a stream of structured or unstructured data-in-motion
o Stream computing: analytic operations on streaming data in real time

STREAM COMPUTING EVENTS
Large spectrum of events/data:
o Structured data: RFID, click streams, financial data, network packet traces, text and transactional data, ATM transactions, pervasive sensor data
o Unstructured data: phone conversations, satellite data, instant messages
o Unknown data/signals: web searches, news broadcasts, digital audio, video, and image data
At the structured end: high usefulness density, simple analytics, well-defined events, high speed (millions of events per second), very low latency.
At the unstructured end: low usefulness density, complex analytics, events that must first be detected, high volume (TB/sec), low latency.

HADOOP
o An ecosystem of open-source projects hosted by the Apache Software Foundation
o Based on concepts Google developed and published
o A distributed file system that scales out on commodity servers with direct-attached storage and automatic failover

HADOOP SYSTEM
(Figure; source: Hortonworks.)

HADOOP DISTRIBUTED FILE SYSTEM - HDFS
HDFS is Hadoop's implementation of the Java abstract class org.apache.hadoop.fs.FileSystem, which represents a file system in Hadoop. HDFS is designed to work efficiently in conjunction with MapReduce.
Definition: a distributed file system that provides a big data storage solution through high-throughput access to application data. When data can potentially outgrow the storage capacity of a single machine, partitioning it across a number of separate machines becomes necessary for storage and processing. This is what a distributed file system achieves.
Potential challenges:
o Ensuring data integrity
o Data retention in case of node failures
o Integration across multiple nodes and systems
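As an illustration of programming against the FileSystem abstraction rather than HDFS directly, here is a minimal Java sketch; the host name and path are hypothetical, and it assumes the Hadoop client libraries are on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsDemo {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS tells the client which file system implementation to use;
        // an hdfs:// URI selects HDFS, file:// would select the local file system.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // hypothetical host
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/user/demo"); // hypothetical path
        if (!fs.exists(dir)) {
            fs.mkdirs(dir); // create the directory tree
        }
        fs.close();
    }
}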

HADOOP DISTRIBUTED FILE SYSTEM - HDFS
HDFS is designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
o Very large files: files that are hundreds of megabytes, gigabytes, terabytes, or petabytes in size.
o Streaming data access: HDFS implements a write-once, read-many-times pattern. Data is copied from a source and analyzed over time. Each analysis involves a large portion of the dataset, so the time to read the whole dataset matters more than the latency of reading the first record.
o Commodity hardware: Hadoop runs on clusters of commodity (commonly available) hardware. HDFS is designed to carry on working without noticeable interruption to the user in the case of node failure.

HADOOP DISTRIBUTED FILE SYSTEM - HDFS
Where HDFS doesn't work well. HDFS is not designed for the following scenarios:
o Low-latency data access: HDFS is optimised for delivering high throughput of data, and this may come at the expense of latency.
o Lots of small files: file system metadata is held in memory, so the number of files a file system can hold is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes.
o Multiple writers and arbitrary file updates: files in HDFS are written by a single writer, and writes always go to the end of the file. There is no support for multiple writers or for modifications at arbitrary offsets in the file.
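A rough worked example of the small-files limit, using the rule of thumb above: 10 million files of 1 KB each hold only about 10 GB of data, yet cost roughly 20 million namespace objects (one file entry plus one block each), and 20,000,000 x 150 B is about 3 GB of NameNode heap. The same 10 GB stored as a single large file needs only about 160 blocks at 64 MB, i.e., a few tens of kilobytes of metadata.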

HADOOP DISTRIBUTED FILE SYSTEM - CONCEPTS
o Blocks
o NameNode
o DataNodes
o HDFS Federation
o HDFS High Availability

HADOOP DISTRIBUTED FILE SYSTEM - BLOCKS
o Files in HDFS are broken into blocks of 64 MB (the historical default; 128 MB in newer versions) and stored as independent units.
o A file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage.
o HDFS blocks are large compared to disk blocks in order to minimize the cost of seeks.
o Map tasks in MapReduce operate on one block at a time.
o Making the block, rather than the file, the unit of abstraction simplifies the storage subsystem and the handling of metadata.
o Blocks fit well with replication for providing fault tolerance and availability.
o HDFS's fsck command understands blocks. For example, to list the blocks that make up each file in the system:
% hadoop fsck / -files -blocks
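A quick worked example of blocking: a 1 GB file at the 64 MB default splits into 16 blocks (1024 / 64), each replicated independently; a 100 MB file becomes one full 64 MB block plus one 36 MB block, and the second block consumes only 36 MB of underlying storage.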

HDFS NAMENODES & DATANODES
An HDFS cluster consists of:
o a NameNode
o DataNodes

HDFS NAMENODES & DATANODES
(Figure: the HDFS read path. The client's DistributedFileSystem object asks the NameNode for block locations, then streams the data from the DataNodes through an FSDataInputStream.)
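A minimal Java sketch of this read path (the path is hypothetical; the block-location lookup and DataNode selection happen inside the client library):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // open() asks the NameNode for the block locations of the file;
        // the returned stream then reads each block from a nearby DataNode.
        try (FSDataInputStream in = fs.open(new Path("/user/demo/input.txt"))) { // hypothetical path
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}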

HDFS ARCHITECTURE

WHAT DOES IT DO?
o Hadoop implements Google's MapReduce, using HDFS.
o HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.
o Hadoop's target is to run on clusters on the order of 10,000 nodes.

DATA CHARACTERISTICS
o Batch processing rather than interactive user access
o Large data sets and files: gigabytes to terabytes in size
o High aggregate data bandwidth
o Scales to hundreds of nodes in a cluster
o Tens of millions of files in a single instance
o Write-once-read-many: a file, once created, written, and closed, need not be changed; this assumption simplifies coherency
o A MapReduce application or a web-crawler application fits perfectly with this model

DATA BLOCKS
o HDFS supports write-once-read-many semantics with reads at streaming speeds.
o A typical block size is 64 MB (or even 128 MB).
o A file is chopped into 64 MB chunks and stored.

FILESYSTEM NAMESPACE
o Hierarchical file system with directories and files: create, remove, move, rename, etc.
o The Namenode maintains the file system namespace; any metadata change to the file system is recorded by the Namenode.
o An application can specify the number of replicas a file needs: the replication factor of the file. This information is also stored by the Namenode.

FS SHELL, ADMIN AND BROWSER INTERFACE
o HDFS organizes its data in files and directories.
o It provides a command-line interface called the FS shell that lets the user interact with data in HDFS. The syntax of the commands is similar to bash and csh. For example, to create a directory /foodir:
% bin/hadoop dfs -mkdir /foodir
o A DFSAdmin interface is also available.
o A browser interface is also available for viewing the namespace.

Steps for HDFS
o BLOCKS
o REPLICATION
o STAGING

DATA REPLICATION
o HDFS is designed to store very large files across machines in a large cluster.
o Each file is a sequence of blocks; all blocks in a file except the last are the same size.
o Blocks are replicated for fault tolerance; block size and replication factor are configurable per file.
o The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster; the BlockReport lists all the blocks on that Datanode.
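Both knobs are visible to applications: the cluster-wide default comes from the dfs.replication property, and the per-file factor can be changed afterwards through the Java API. A minimal sketch with a hypothetical path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        // The cluster-wide default comes from dfs.replication (typically 3, set in hdfs-site.xml).
        FileSystem fs = FileSystem.get(new Configuration());
        // Per-file override: keep only 2 replicas of this (hypothetical) scratch file.
        fs.setReplication(new Path("/user/demo/scratch.dat"), (short) 2);
        fs.close();
    }
}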

REPLICA PLACEMENT
o The placement of replicas is critical to HDFS reliability and performance.
o Optimizing replica placement distinguishes HDFS from most other distributed file systems.
o Rack-aware replica placement: one replica on a node in the local rack, one on a different node in the local rack, and one on a node in a different rack.
o With this policy, one third of the replicas are on one node, two thirds of the replicas are on one rack, and the remaining third are distributed evenly across the remaining racks.

REPLICA SELECTION
o For a READ operation, HDFS tries to minimize bandwidth consumption and latency.
o If there is a replica on the reader's own node, that replica is preferred.
o An HDFS cluster may span multiple data centers: a replica in the local data center is preferred over a remote one.

STAGING
o A client request to create a file does not reach the Namenode immediately: the HDFS client first caches the data in a temporary local file.
o When the accumulated data reaches the HDFS block size, the client contacts the Namenode.
o The Namenode inserts the filename into its hierarchy and allocates a data block for it, then responds to the client with the identity of the Datanodes that will hold the block and its replicas.
o The client then flushes the block from its local storage to those Datanodes.

STAGING
o When all data is written, the client tells the Namenode that the file is closed.
o The Namenode then commits the file creation operation into its persistent store. If the Namenode dies before the file is closed, the file is lost.
o This client-side caching is used to avoid network congestion.
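From the application's point of view the staging is invisible: the client simply writes to an output stream, and the client library handles local buffering, block allocation, and the commit at close time. A minimal sketch with a hypothetical path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // create() stages data client-side; blocks are shipped to DataNodes as they fill.
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt"))) { // hypothetical path
            out.writeUTF("hello hdfs");
        }
        // Only after close() does the NameNode commit the file creation.
        fs.close();
    }
}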

SAFEMODE STARTUP
o On startup the Namenode enters Safemode; replication of data blocks does not occur in Safemode.
o Each DataNode checks in with a Heartbeat and a BlockReport, and the Namenode verifies that each block has an acceptable number of replicas.
o After a configurable percentage of blocks are confirmed safely replicated, the Namenode exits Safemode.
o It then builds the list of blocks that still need replication and proceeds to replicate them to other Datanodes.

NAMENODE
o Keeps an image of the entire file system namespace and the file Blockmap in memory; 8 GB of local RAM is sufficient to support these data structures for a huge number of files and directories.
o When the Namenode starts up, it reads the FsImage and EditLog from its local file system, applies the EditLog transactions to the FsImage, and stores the updated FsImage back to its file system as a checkpoint.
o Checkpointing is done periodically, so the system can recover to the last checkpointed state after a crash.

FILESYSTEM METADATA
o The HDFS namespace is stored by the Namenode.
o The Namenode uses a transaction log called the EditLog to record every change to file system metadata, for example creating a new file or changing the replication factor of a file. The EditLog is stored in the Namenode's local file system.
o The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called FsImage, also in the Namenode's local file system.

DATANODE
o A Datanode stores HDFS data in files in its local file system; it has no knowledge of the HDFS file system itself.
o It stores each block of HDFS data in a separate file, and does not create all files in the same directory: it uses heuristics to determine the optimal number of files per directory and creates subdirectories appropriately (a research issue in itself).
o When the file system starts up, the Datanode generates a list of all its HDFS blocks and sends this report, the Blockreport, to the Namenode.

NAMENODES & DATANODES
o Master/slave architecture: an HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates client access to files.
o There are a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on.
o HDFS exposes a file system namespace and allows user data to be stored in files. A file is split into one or more blocks, and the blocks are stored on DataNodes.
o DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the Namenode.

NAMENODES & DATANODES
NameNode:
o Manages namespace operations such as opening, creating, and renaming files
o Maps each file name to its list of blocks and their locations
o Holds file metadata
o Handles authorization and authentication
o Collects block reports from DataNodes on block locations
o Re-replicates missing blocks
o Keeps ALL namespace state in memory, plus checkpoints and a journal

NAMENODES & DATANODES
DataNode:
o Handles block storage on multiple volumes, and data integrity
o Clients read and write blocks directly from/to the data nodes
o Periodically sends block reports to the NameNode
o Performs block creation, deletion, and replication upon instruction from the NameNode

CLIENT

FAULT TOLERANCE
o Failure is the norm rather than the exception.
o An HDFS instance may consist of thousands of low-end machines, each storing part of the file system's data.
o With a huge number of components, each having a non-trivial probability of failure, some component is always non-functional.
o Detection of faults and quick, automatic recovery from them is therefore a core architectural goal of HDFS.

DATANODE FAILURE & HEARTBEAT
o A network partition can cause a subset of Datanodes to lose connectivity with the Namenode, which detects this condition by the absence of Heartbeat messages.
o The Namenode marks Datanodes without recent Heartbeats as dead and stops sending I/O requests to them; any data registered to a failed Datanode becomes unavailable to HDFS.
o The death of a Datanode may also cause the replication factor of some blocks to fall below their specified value.

RE-REPLICATION
The necessity for re-replication may arise because:
o a Datanode becomes unavailable,
o a replica becomes corrupted,
o a hard disk on a Datanode fails, or
o the replication factor of a block is increased.

HDFS FAULT TOLERANCE
o The input data (on HDFS) is stored on the local disks of the machines in the cluster. HDFS divides each file into 64 MB blocks and stores several copies of each block (typically 3) on different machines.
o Worker failure: the master pings every worker periodically. If no response is received from a worker within a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset to their initial idle state and become eligible for scheduling on other workers. Similarly, any map or reduce task in progress on a failed worker is reset to idle and becomes eligible for rescheduling.
o Master failure: the master can write periodic checkpoints of its data structures. If the master task dies, a new copy can be started from the last checkpointed state; in most cases, however, the user simply restarts the job.

CLUSTER REBALANCING
o The HDFS architecture is compatible with data rebalancing schemes.
o A scheme might move data from one Datanode to another if the free space on a Datanode falls below a certain threshold.
o In the event of sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster.

DATA INTEGRITY
o Consider the situation where a block of data fetched from a Datanode arrives corrupted. The corruption may be caused by faults in a storage device, network faults, or buggy software.
o The HDFS client computes a checksum of every block of its file and stores the checksums in hidden files in the HDFS namespace.
o When a client retrieves the contents of a file, it verifies that the data matches the corresponding checksums. If it does not match, the client can retrieve the block from a replica.
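Verification happens automatically on read, but a client can also request a whole-file checksum explicitly, for example to compare two copies. A minimal sketch (paths are hypothetical; getFileChecksum may return null on file systems that do not support it):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileChecksum a = fs.getFileChecksum(new Path("/user/demo/copy1.dat"));
        FileChecksum b = fs.getFileChecksum(new Path("/user/demo/copy2.dat"));
        // Equal checksums strongly suggest the two copies hold identical bytes.
        System.out.println(a != null && a.equals(b) ? "match" : "differ or unsupported");
        fs.close();
    }
}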

METADATA DISK FAILURE
o The FsImage and EditLog are the central data structures of HDFS; corrupting them can render an HDFS instance non-functional.
o For this reason, the Namenode can be configured to maintain multiple copies of the FsImage and EditLog, updated synchronously. Since metadata is not data-intensive, this is affordable.
o The Namenode remains a potential single point of failure: automatic failover is NOT supported (in the HDFS version described here; the High Availability feature later addressed this).

BACKUP


APPLICATION PROGRAMMING INTERFACE
o HDFS provides a Java API for applications to use; Python access is also used in many applications.
o A C-language wrapper for the Java API is also available.
o An HTTP browser can be used to browse the files of an HDFS instance.

SPACE RECLAMATION
o When a file is deleted by a client, HDFS renames it into the /trash directory for a configurable amount of time. A client can request an undelete within this window; after the time expires, the file is deleted and its space reclaimed.
o When the replication factor of a file is reduced, the Namenode selects excess replicas that can be deleted. The next Heartbeat carries this information to the Datanode, which then removes the blocks and frees the space.

HADOOP ECOSYSTEM
o HDFS is the file system; MapReduce (MR) jobs run on the file system and let the user ask questions of HDFS files.
o Pig and Hive are two projects built to replace hand-coding MapReduce: their interpreters turn scripts and SQL queries into MR jobs.
o To query HDFS without the map-and-reduce dependency there are Impala and Hive: Impala is optimized for low-latency, near-real-time queries, while Hive is optimized for batch processing jobs.
o Sqoop moves data from a relational DB into the Hadoop ecosystem.
o Flume ships data generated by external systems into HDFS; apt for high-volume logging.
o Hue: graphical front end to the cluster.
o Oozie: workflow management tool.
o Mahout: machine learning library.

HADOOP ECOSYSTEM
(Figure: the Cloudera CDH stack, with Pig, Hive, and Impala translating queries such as "SELECT * FROM ..." onto MapReduce, HBase, and HDFS; Sqoop and Flume feed data in, with Hue, Oozie, and Mahout alongside.)

STORAGE OF FILE IN HDFS

STORAGE OF FILE IN HDFS
o When a 150 MB file is fed into the Hadoop ecosystem, it is broken into multiple parts to achieve parallelism.
o The file is split into chunks, where the default chunk size is 64 MB.
o The data node is the daemon that takes care of everything happening on an individual node.
o The name node keeps track of what goes where and, when required, how to gather the pieces back together.
o Now think hard: what could be the possible challenges?

HDFS APPLICATION

HDFS APPLICATION
Application of HDFS: moving a data file in for analysis.
(Figures: a file being moved into Hadoop; the file in the Hadoop ecosystem after the move.)

MAPREDUCE STRUCTURE

NoSQL
o What is NoSQL
o CAP Theorem
o What is lost
o Types of NoSQL
o Data Model
o Frameworks
o Demo
o Wrap-up

SCALING UP
o Issues arise with scaling up when the dataset is just too big, and RDBMSs were not designed to be distributed.
o People began to look at multi-node database solutions, known as scaling out or horizontal scaling.
o Different approaches include master-slave and sharding.

RDBMS - MASTER/SLAVE
o All writes go to the master; all reads are performed against the replicated slave databases.
o Critical reads may be incorrect, as writes may not yet have been propagated down to the slaves.
o Large data sets can pose problems, as the master needs to duplicate data to all slaves.

RDBMS - SHARDING
o Scales well for both reads and writes.
o Not transparent: the application needs to be partition-aware.
o Can no longer have relationships/joins across partitions.
o Loss of referential integrity across shards.

SCALING RDBMS
o Multi-master replication
o INSERT only, no UPDATEs/DELETEs
o No JOINs, thereby reducing query time (this involves de-normalizing data)
o In-memory databases

NoSQL
o Stands for Not Only SQL.
o A class of non-relational data storage systems.
o Usually does not require a fixed table schema, nor use the concept of joins.
o All NoSQL offerings relax one or more of the ACID properties.

WHY NoSQL?
o For data storage, an RDBMS cannot be the be-all and end-all.
o Just as there are different programming languages, we need other data storage tools in the toolbox.
o A NoSQL solution is more acceptable to a client now than it was even a year ago.

BIG TABLE
Three major publications were the seeds of the NoSQL movement:
o BigTable (Google)
o Dynamo (Amazon): gossip protocol (discovery and error detection), distributed key-value data store, eventual consistency
o The CAP Theorem

CAP THEOREM
o Three properties of a system: Consistency, Availability, and Partition tolerance.
o You can have at most two of these three properties in any shared-data system.
o To scale out, you have to partition; that leaves a choice between consistency and availability. In almost all such cases, you would choose availability over consistency.

CHARACTERISTICS OF NoSQL
NoSQL solutions fall into two major areas:
o Key/value, or "the big hash table": Amazon S3 (Dynamo), Voldemort, Scalaris
o Schema-less, which comes in multiple flavors (column-based, document-based, or graph-based): Cassandra (column-based), CouchDB (document-based), Neo4J (graph-based), HBase (column-based)

KEY VALUE
Pros:
o Very fast
o Very scalable
o Simple model
o Able to distribute horizontally
Cons:
o Many data structures (objects) can't be easily modeled as key-value pairs

SCHEMA-LESS
Pros:
o The schema-less data model is richer than key/value pairs
o Eventual consistency
o Many are distributed
o Still provide excellent performance and scalability
Cons:
o Typically no ACID transactions or joins

SQL TO NoSQL
What is lost in the move from SQL to NoSQL:
o Joins
o GROUP BY
o ORDER BY
o ACID transactions
o SQL as a sometimes frustrating but still powerful query language
o Easy integration with other applications that support SQL

SEARCHING
Relational:
o SELECT `column` FROM `database`.`table` WHERE `id` = key;
o SELECT product_name FROM rockets WHERE id = 123;
Cassandra (standard):
o keyspace.getSlice(key, columnFamily, "column")
o keyspace.getSlice(123, new ColumnParent("rockets"), getSlicePredicate());

NoSQL API
Basic API access:
o get(key) -- extract the value given a key
o put(key, value) -- create or update the value given its key
o delete(key) -- remove the key and its associated value
o execute(key, operation, parameters) -- invoke an operation on the value (given its key), which is a special data structure (e.g., List, Set, Map, etc.)
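A hypothetical Java interface capturing this minimal contract; the names are illustrative and do not correspond to any particular product's API:

import java.util.Map;

// Illustrative key-value store contract; V is the stored value type.
public interface KeyValueStore<V> {
    V get(String key);                 // extract the value given a key
    void put(String key, V value);     // create or update the value for a key
    void delete(String key);           // remove the key and its associated value
    // invoke a named operation on a structured value (e.g., appending to a list)
    Object execute(String key, String operation, Map<String, Object> parameters);
}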

DATA MODEL
Within a Cassandra data set:
o Column: the smallest data element, a tuple with a name and a value. Fetching key 'hadoop' might return:
{'name' => 'Hadoop Model', 'toon' => 'Ready Set Zoom', 'inventoryQty' => 5, 'productUrl' => 'hadoop\1.gif'}

DATA MODEL
o ColumnFamily: the single structure used to group both Columns and SuperColumns. Called a ColumnFamily (think table), it has two types, Standard and Super. Column families must be defined at startup.
o Key: the permanent name of the record.
o Keyspace: the outermost level of organization, usually the name of the application, for example 'Acme' (think database name).

HASHING
(Figure: nodes A, B, C, and D and keys V, S, R, H, and M placed around a consistent-hashing ring.)

HASHING
Partition using consistent hashing:
o Keys hash to a point on a fixed circular space.
o The ring is partitioned into a set of ordered slots, and servers and keys are hashed over these slots.
o Nodes take positions on the circle. With A, B, and D present: B is responsible for the AB range, D for the BD range, and A for the DA range.
o When C joins, B and D split their ranges: C takes over the BC portion from D.
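A minimal Java sketch of such a ring, using a sorted map as the circle; the hash function and node names are illustrative, and real systems add virtual nodes and replication on top of this:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

public class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // Hash to a point on the circle (first 8 bytes of MD5, an illustrative choice).
    private static long hash(String s) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        long h = 0;
        for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff);
        return h;
    }

    public void addNode(String node) throws Exception { ring.put(hash(node), node); }
    public void removeNode(String node) throws Exception { ring.remove(hash(node)); }

    // A key belongs to the first node clockwise from its hash point
    // (assumes at least one node has been added).
    public String nodeFor(String key) throws Exception {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    public static void main(String[] args) throws Exception {
        HashRing r = new HashRing();
        r.addNode("A"); r.addNode("B"); r.addNode("D");
        System.out.println(r.nodeFor("some-key")); // served by A, B, or D
        r.addNode("C"); // C takes over part of one neighbour's range only
        System.out.println(r.nodeFor("some-key"));
    }
}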

DATA TYPE
Columns are always sorted by their name. Sorting supports:
o BytesType
o UTF8Type
o LexicalUUIDType
o TimeUUIDType
o AsciiType
o LongType
Each of these options treats the Columns' names as a different data type.

CASE STUDY
Facebook search over > 50 GB of data:
o MySQL: writes average ~300 ms, reads average ~350 ms
o Rewritten with NoSQL: writes average 0.12 ms, reads average 15 ms

IMPLEMENTATION OF NoSQL
o Log analysis
o Social networking feeds (many firms hook in through Facebook or Twitter)
o External feeds from partners (EAI)
o Data that is not easily analyzed in an RDBMS, such as time-based data
o Large data feeds that need to be massaged before entry into an RDBMS

SHARED DATA ARCHITECTURE
(Figure: nodes A, B, C, and D all access one shared data store holding partitions 1-4.)

SHARED NOTHING ARCHITECTURE
(Figure: nodes A, B, C, and D each own their data; nothing is shared between them.)

LIST OF NoSQL DATABASES
Wide column stores:
o Hadoop / HBase
o Cloudera
o Cassandra
o Hypertable
o Accumulo
o Amazon SimpleDB
o Cloudata
o MonetDB

LIST OF NoSQL DATABASES
Document stores:
o OrientDB
o MongoDB
o Couchbase Server
o CouchDB
o RavenDB
o MarkLogic Server
o JSON ODM

LIST OF NoSQL DATABASES
Key-value stores:
o DynamoDB
o Azure
o Riak
o Redis
o Aerospike
o LevelDB
o RocksDB

LIST OF NoSQL DATABASES
Graph databases:
o Neo4J
o ArangoDB
o Infinite Graph
o Sparksee
o Titan
o InfoGrid
o GraphBase

MapReduce
o Map operations
o Reduce operations
o Submitting a MapReduce job
o Shuffle
o Data types

MapReduce
o MapReduce is a programming model for processing and generating large data sets.
o Its functional model, with user-specified map and reduce operations, allows large computations to be parallelized.
o map(k1, v1) -> list(k2, v2)

Map OPERATION
The common array operation:
var a = [1, 2, 3];
for (i = 0; i < a.length; i++)
    a[i] = a[i] * 2;
The output is a = [2, 4, 6].

Map OPERATION
When fn is passed as a function argument:
function map(fn, a) {
    for (i = 0; i < a.length; i++)
        a[i] = fn(a[i]);
}
The map function is invoked as:
map(function(x) { return x * 2; }, a);

Reduce FUNCTION
o Accepts an intermediate key and the set of values for that key.
o Merges these values together to form a possibly smaller set of values.
o reduce(k2, list(v2)) -> list(v2)
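The same two operations exist in most languages' standard libraries; a small Java sketch of the functional model that MapReduce borrows from:

import java.util.Arrays;

public class MapReduceModel {
    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        // map: apply a user-supplied function to every element -> [2, 4, 6]
        int[] doubled = Arrays.stream(a).map(x -> x * 2).toArray();
        // reduce: merge the values into a smaller result -> 12
        int sum = Arrays.stream(doubled).reduce(0, Integer::sum);
        System.out.println(Arrays.toString(doubled) + " " + sum);
    }
}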

EXECUTION OVERVIEW
o Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits.
o Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function.
o The number of partitions (R) and the partitioning function are specified by the user.

Map FUNCTION FOR WORDCOUNT
Key: document name. Value: document contents.
map(String key, String value):
    for each word w in value:
        EmitIntermediate(w, "1");

Reduce FUNCTION FOR WORDCOUNT
Key: a word. Values: a list of counts.
reduce(String key, Iterator values):
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
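Putting the two functions together: a sketch of WordCount against the Hadoop Java MapReduce API, closely following the standard example shipped with Hadoop (input and output paths are supplied on the command line):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map(docOffset, line) -> (word, 1) for every word in the line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce(word, [counts]) -> (word, total)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combine locally before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}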

MapReduce AT HIGH LEVEL
(Figure: a MapReduce job submitted from the client machine goes to the JobTracker on the master node; TaskTrackers on the slave nodes run the individual task instances.)

ANATOMY OF MapReduce
(Figure: input data is partitioned across nodes 1-3; each node runs a map producing interim data, which is reduced and written to output nodes.)

SUMMARY
o A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner.
o The framework sorts the outputs of the maps and shuffles them by key to form the input to the reduce tasks.
o The input and the output of the job are stored in a file system.
o The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.

QUESTIONS
