Introduction to Big Data. Hadoop. Instituto Politécnico de Tomar. Ricardo Campos

1 Instituto Politécnico de Tomar Introduction to Big Data Hadoop Ricardo Campos Mestrado EI-IC Análise e Processamento de Grandes Volumes de Dados Tomar, Portugal, 2016

2 Part of the slides used in this presentation were adapted from presentations found on the internet and from the reference bibliography:

5 AGENDA What is this talk about?
1 Overview
2 HDFS
3 MapReduce
4 Ecosystem
5 Hadoop Distributions
6 Q&A

6 Hadoop, currently the most widely adopted Big Data platform, is an Apache-managed software framework derived from Google's MapReduce and Google File System (GFS) designs. Its companion database, HBase, follows Bigtable (a distributed storage system developed by Google and intended to manage highly scalable structured data).
Please consider reading the following articles:
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google File System, Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP '03 (2003).
Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation (2004).
Luiz Andre Barroso, Jeffrey Dean, Urs Hölzle, Web Search for a Planet: The Google Cluster Architecture, IEEE Micro, vol. 23 (2003).

7 It was originally built by a Yahoo! engineer named Doug Cutting in 2006 and is now an open source project managed by the Apache Software Foundation. It is designed to enable the distributed processing of large, complex data sets (huge amounts of structured and unstructured data) across a set of clustered computers.

9 Scalable: New nodes can be added as needed, without changing data formats, how data is loaded, how jobs are written, or the applications on top.
Cost effective: Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
Flexible: Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.
Fault tolerant: When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.

11 Hadoop consists of three primary resources:
Hadoop Distributed File System: A massively scalable distributed file system that can support petabytes of data and the management of related files across machines.
MapReduce engine: A high-performance parallel / distributed data-processing implementation of the MapReduce algorithm.
Hadoop ecosystem: A collection of tools that use or sit beside MapReduce and HDFS to store and organize data, and manage the machines that run Hadoop.

13 Operating System layer: The first layer is the Operating System on the host machine. Hadoop is installed on top of the operating system and runs the same regardless of the host operating system.
Hadoop layer: This is the base installation of Hadoop, which includes the file system and MapReduce components.
DBMS layer: On top of Hadoop, the various Hadoop DBMS and related applications are installed. Typically, Hadoop installations include a data warehousing or database package, such as Hive or HBase.
Application layer: The Application layer is the top layer, which includes the tools that provide data management, analysis, and other capabilities.

14 To store data, Hadoop utilizes its own distributed filesystem, HDFS, which makes data available to multiple computing nodes. A distributed file system is a file system that can store large files spread across the nodes of a cluster. A typical Hadoop usage pattern involves three stages: loading data into HDFS, running MapReduce operations, and retrieving results from HDFS.

15 Because the data is written once and then read many times thereafter, rather than undergoing the constant read-writes of other file systems, HDFS is an excellent choice for supporting big data analysis. It includes a NameNode and multiple DataNodes running on a commodity hardware cluster. In essence, the NameNode keeps track of where data is physically stored. HDFS works by breaking large files into smaller pieces called blocks (64 MB or 128 MB by default, depending on the Hadoop version). When data is stored in Hadoop, these blocks are automatically replicated across several DataNodes, and the NameNode records where each copy lives. This is done to ensure fault tolerance and high availability.
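
The block arithmetic described on this slide can be sketched in a few lines of Python. This is an illustration only, not HDFS code: the 128 MB block size is one of the defaults named on the slide, while the replication factor of 3 and the function name plan_blocks are assumptions made for the example.

```python
import math

BLOCK_SIZE_MB = 128   # one of the default block sizes named on the slide
REPLICATION = 3       # assumed replication factor for this sketch


def plan_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, size of last block in MB, raw storage in MB)."""
    if file_size_mb <= 0:
        return 0, 0, 0
    blocks = math.ceil(file_size_mb / block_size_mb)
    # The last block holds whatever is left after the full-size blocks.
    last_block = file_size_mb - (blocks - 1) * block_size_mb
    # Every byte of the file is stored `replication` times across the cluster.
    raw_storage = file_size_mb * replication
    return blocks, last_block, raw_storage
```

Under these assumptions, a 1000 MB file is split into eight blocks, the last of which is only partly filled, and occupies three times its size in raw cluster storage.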

16 Hadoop implementations rely on a master-slave style of distribution, where the master node stores all the metadata, access rights, mapping and location of files and blocks, and so on. The slaves are nodes where the actual data is stored. All the requests go to the master and then are handled by the appropriate slave node.

17 Data nodes are not very smart, but the NameNode is. The data nodes constantly ask the NameNode whether there is anything for them to do. This continuous behavior also tells the NameNode what data nodes are out there and how busy they are.
HDFS metadata is stored in the NameNode:
When the file was created, accessed, modified, deleted, and so on.
Where the blocks of the file are stored in the cluster.
Who has the rights to view or modify the file.

18 How many files are stored on the cluster.
How many data nodes exist in the cluster.
The location of the transaction log for the cluster.
Data nodes are not smart, but they are resilient. They are servers that contain the blocks for a given set of files. It is reasonable to think of data nodes as block servers.

19 What exactly does a block server do?
Stores (and retrieves) the data blocks in the local file system of the server. HDFS is available on many different operating systems and behaves the same whether on Windows, Mac OS, or Linux.
Stores the metadata of a block in the local file system based on the metadata template in the NameNode.
Sends regular reports to the NameNode about what blocks are available for file operations.
Tutorial Links: Yahoo! has published an excellent guide for configuring and exploring a basic system.
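
The relationship between block reports and the NameNode's knowledge of block locations can be sketched as follows. This is a toy model in plain Python, not the Hadoop API: the class NameNodeSketch and its method names are hypothetical, and real block reports carry far more information.

```python
from collections import defaultdict


class NameNodeSketch:
    """Toy stand-in for the NameNode's block-location bookkeeping."""

    def __init__(self):
        # block id -> set of data node names that reported holding it
        self.block_locations = defaultdict(set)

    def receive_block_report(self, data_node, block_ids):
        """Record a report; it replaces whatever this node reported before."""
        for holders in self.block_locations.values():
            holders.discard(data_node)
        for block_id in block_ids:
            self.block_locations[block_id].add(data_node)

    def locate(self, block_id):
        """Return the sorted list of data nodes known to hold a block."""
        return sorted(self.block_locations.get(block_id, ()))
```

A client asking where a block lives would call locate; if a data node stops reporting a block, the next report removes it from the answer.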

20 Hadoop MapReduce is the heart of the Hadoop system, an implementation of the algorithm (MapReduce). It is helpful to think about this implementation as a MapReduce engine, because that is exactly how it works. You provide input (fuel), the engine converts the input into output quickly and efficiently, and you get the answers you need. The process starts with a user request to run a MapReduce program and continues until the results are written back to the HDFS.

21 In the early 2000s, some engineers at Google looked into the future and determined that while their current solutions for applications such as web crawling, query frequency, and so on were adequate for most existing requirements, they were inadequate for the complexity they anticipated as the web scaled to more and more users. These engineers determined that if work could be distributed across inexpensive computers and then connected on the network in the form of a cluster, they could solve the problem. MapReduce, which was designed by Google and popularized by Yahoo! through Hadoop, is a software framework that enables developers to write programs that can process massive amounts of unstructured data in parallel across a distributed group of processors.

22 Parallel data processing involves the simultaneous execution of multiple sub-tasks that collectively comprise a larger task. The goal is to reduce the execution time by dividing a single larger task into multiple smaller tasks that run concurrently. Although parallel data processing can be achieved through multiple networked machines, it is more typically achieved within the confines of a single machine with multiple processors or cores

23 Distributed data processing is closely related to parallel data processing in that the same principle of divide-and-conquer is applied. However, distributed data processing is always achieved through physically separate machines that are networked together as a cluster. Take a very large problem and break it into smaller, more manageable chunks, operate on each chunk independently, and then pull it all together at the end.
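
The divide-and-conquer pattern described above (split, process each chunk independently, combine) can be sketched on a single machine. In this illustration a thread pool stands in for the machines of a cluster, which is an assumption made to keep the example self-contained; the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, chunk_size):
    """Break a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def chunked_sum(numbers, chunk_size=1000, workers=4):
    """Sum a list by summing each chunk independently, then combining."""
    chunks = split_into_chunks(numbers, chunk_size)
    # Each chunk is handled by whichever worker is free; the chunks do not
    # depend on one another, so the order of execution does not matter.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    # Pull the partial results together at the end.
    return sum(partial_sums)
```

The final result is the same as summing the whole list in one pass, which is the property that lets the work be spread out in the first place.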

24 Hadoop's MapReduce involves distributing a dataset among multiple servers and operating on the data: the map stage. The partial results are then recombined: the reduce stage.
Map: The map component distributes the programming problem or tasks across a large number of systems and handles the placement of the tasks in a way that balances the load and manages recovery from failures. The map function is commutative; in other words, the order in which the function is executed doesn't matter. MapReduce can perform its work on different machines in a network and get the same result as if all the work was done on a single machine.

25 Reduce: After the distributed computation is completed, another function called reduce aggregates all the elements back together to provide a result.
Tutorial Links: A good place to start is the official Apache documentation, but Yahoo! has also put together a tutorial module.

26 Check this example:
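
A minimal sketch of the map and reduce stages, assuming the classic word-count task and plain Python rather than the Hadoop API. The function names and the explicit shuffle step are illustrative choices; in Hadoop the framework performs the grouping between the two stages.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as happens between the two stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all counts for one word into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map over every line, group by word, then reduce each group."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each line is mapped independently and each word is reduced independently, both stages could be spread across many machines without changing the result.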

27 While MapReduce is great for certain categories of tasks, it falls short with others. This led to fracturing in the ecosystem and a variety of tools that live outside of your Hadoop cluster but attempt to communicate with HDFS.
Engine: Spark

28 Spark
MapReduce is the primary workhorse at the core of most Hadoop clusters. While highly effective for very large batch-analytic jobs, MapReduce has proven to be suboptimal for applications like graph analysis. Unlike Pig and Hive, Spark is not a tool for making MapReduce easier to use. It is a complete replacement for MapReduce that includes its own work execution engine, which enables in-memory data processing on Hadoop.
Tutorial Links: A quick start for Spark can be found on the project home page.

29 Database and Data Management: Cassandra; HBase; MongoDB; Hive

30 Cassandra
Oftentimes you may need to simply organize some of your big data for easy retrieval. One common way to do this is to use a key-value datastore. Cassandra is a distributed key-value database designed with simplicity and scalability in mind. Your data is organized by a unique key, and values are associated with that key. Cassandra is an all-inclusive system, which means it does not require a Hadoop environment or any other big data tools.
Tutorial Links: DataStax, a company that provides commercial support for Cassandra, offers a set of freely available videos.

31 HBase
HBase is a NoSQL database system included in the standard Hadoop distributions that works with HDFS for data storage and access. It is based on Google's BigTable. Logically, it is a key-value store: rows are defined by a key and have associated with them a number of bins (or columns) where the associated values are stored. Physically, groups of similar columns are stored together in column families. Most often, HBase is accessed via Java code, but APIs exist for using HBase with Pig. HBase is often used for applications that may require sparse rows; that is, each row may use only a few of the defined columns.
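
The logical model described above (a row key, column families, and sparse columns) can be sketched with nested dictionaries. This is a toy illustration in plain Python, not the HBase client API; the class SparseTable and its methods are hypothetical.

```python
from collections import defaultdict


class SparseTable:
    """Toy row key -> column family -> column -> value store."""

    def __init__(self, column_families):
        # Column families are fixed up front; columns inside them are not.
        self.column_families = set(column_families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][column] = value

    def get(self, row_key, family, column, default=None):
        # Use .get() throughout so lookups never create empty rows.
        return self.rows.get(row_key, {}).get(family, {}).get(column, default)
```

Only the cells that are actually written take up space, which is what makes sparse rows cheap in this kind of layout.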

32 While Cassandra and MongoDB might still be the predominant NoSQL databases today, HBase is gaining in popularity and may well be the leader in the near future.
Tutorial Links: The folks at Coreservlets.com have put together a handful of Hadoop tutorials, including an excellent series on HBase. There's also a handful of video tutorials available on the Internet, including one which we found particularly helpful.

33 MongoDB
MongoDB is a document-oriented database, the document being a JSON object. In relational databases, you have tables and rows. In MongoDB, the equivalent of a row is a JSON document, and the analog to a table is a collection, a set of JSON documents. As of the start of 2015, it was one of the most popular NoSQL databases. Unlike some other database systems, MongoDB supports secondary indexes, meaning it is possible to quickly search on fields other than the primary key that uniquely identifies each document in the Mongo database.
Tutorial Links: The tutorials section on the official project page is a great place to get started. There are also plenty of videos available on the Internet, including an informative series.
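
To make the idea of documents, collections, and secondary indexes concrete, here is a toy model in plain Python. It is not the MongoDB driver API: the Collection class and its methods are hypothetical, documents are ordinary dicts, and updates to existing documents are not handled.

```python
from collections import defaultdict


class Collection:
    """Toy collection: documents are dicts keyed by a unique _id."""

    def __init__(self):
        self.documents = {}   # primary key (_id) -> document
        self.indexes = {}     # field name -> {value -> set of _ids}

    def create_index(self, field):
        """Build a secondary index over documents already stored."""
        index = defaultdict(set)
        for doc_id, doc in self.documents.items():
            if field in doc:
                index[doc[field]].add(doc_id)
        self.indexes[field] = index

    def insert(self, doc):
        """Store a new document and keep every existing index current."""
        self.documents[doc["_id"]] = doc
        for field, index in self.indexes.items():
            if field in doc:
                index[doc[field]].add(doc["_id"])

    def find(self, field, value):
        """Use the secondary index if one exists, else scan every document."""
        if field in self.indexes:
            ids = self.indexes[field].get(value, set())
            return [self.documents[i] for i in sorted(ids)]
        return [d for d in self.documents.values() if d.get(field) == value]
```

With an index on a field, find goes straight to the matching documents; without one, it has to examine the whole collection, which is the cost a secondary index avoids.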

34 Hive
At first, all access to data in your Hadoop cluster came through MapReduce jobs written in Java. This worked fine during Hadoop's infancy, when all Hadoop users had a stable of Java-savvy coders. However, as Hadoop emerged into the broader world, many wanted to adopt Hadoop but had stables of SQL coders for whom writing MapReduce would be a steep learning curve.

35 The goal of Hive is to allow SQL access to data in HDFS. Hive defines a simple SQL-like query language, called HQL, that enables users familiar with SQL to query the data. Queries written in HQL are converted into MapReduce code by Hive and executed by Hadoop.
Tutorial Links: A couple of great resources are the official Hive tutorial and a video published by the folks at HortonWorks.

36 Analytics
Once you've ingested your data into the system, you may be satisfied to simply push it into a more traditional data store, such as a relational database, and consider your big data work to be done. On the other hand, you may want to continue to work with your data, running specialized machine-learning algorithms to categorize your data.
Pig; Mahout

37 Pig
Pig is a framework for executing MapReduce on HDFS data using its own scripting language. It creates an abstraction layer on top of MapReduce to enable simpler and faster analysis. The scripting language is designed to facilitate query-like data operations that can be expressed in just a few lines of code.

38 Why would you want to use Pig rather than MapReduce? Native MapReduce applications written in Java are effective and powerful tools, but developing and testing them is time-consuming and complex. Pig solves this problem by offering a simpler development and testing process that takes advantage of the power of MapReduce, without the need to build large Java applications. Whereas an equivalent Java program may require many more lines, Pig scripts often take ten lines of code or fewer. In many ways, Pig is an admirable extract, transform, and load (ETL) tool. Pig scripts are translated or compiled into MapReduce code.

39 Tutorial Links: There's a fairly complete guide (https://pig.apache.org/docs/r0.13.0/start.) to get you through the process of installing Pig and writing your first couple of scripts. "Working with Pig" is a great overview of the Pig technology.

40 Data Transfer
Data transfer tools provide three basic capabilities:
File transfer: help move files and flat text, such as log entries, into your Hadoop cluster;
Database transfer: provide a simple mechanism for moving data between traditional relational databases, such as Oracle or SQL Server, and your Hadoop cluster;
Data triage: can be used to quickly evaluate and categorize new data as it arrives onto your Hadoop system.

41 Data Transfer: Sqoop; Flume

42 Sqoop
It's likely that some of your data originates in a relational database management system (RDBMS) that is normally accessed through SQL. Sqoop (meaning SQL to Hadoop) is designed to transfer data between relational databases and Hadoop clusters. You'll start your import to Hadoop with a database table that is read into Hadoop as a text file. You can also export an HDFS file into an RDBMS.
Tutorial Links: There's an excellent series of lectures on this topic available on YouTube.
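
The core idea of an import (a database table read out as delimited text) can be illustrated without Sqoop itself. The sketch below uses Python's built-in sqlite3 module with an in-memory database as a stand-in for a production RDBMS; the table, its contents, and the function names are made up for the example, and no Sqoop command or option is being shown.

```python
import sqlite3


def table_to_text_lines(connection, table, delimiter=","):
    """Read every row of a table and render it as one delimited text line."""
    # The table name is interpolated directly, so only pass trusted names.
    cursor = connection.execute(f"SELECT * FROM {table}")
    return [delimiter.join(str(value) for value in row) for row in cursor]


def demo():
    """Create a small example table and export it as text lines."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE employees (id INTEGER, name TEXT)")
    connection.executemany(
        "INSERT INTO employees VALUES (?, ?)", [(1, "Ana"), (2, "Rui")]
    )
    lines = table_to_text_lines(connection, "employees")
    connection.close()
    return lines
```

Each returned line corresponds to one table row; written to a file, such lines are the kind of flat text that MapReduce jobs can then read from HDFS.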

43 Flume
Flume is a reliable distributed system for collecting, aggregating, and moving large amounts of log data from multiple sources into HDFS.
Tutorial Links: Dr. Dobb's Journal published an informative article on Flume. Readers who enjoy a lecture should check out an interesting presentation from 2011.

44 Mahout
You have a bunch of data in your Hadoop cluster. What are you going to do with it? You might want to do some analytics, or data science, or machine learning. Much of this can be done with the tools that come with the standard Apache distribution, such as Pig, MapReduce, or Hive. But more sophisticated uses will involve algorithms that you will not want to code yourself. So you turn to Mahout. Mahout is a collection of scalable machine-learning algorithms that run on Hadoop.
Tutorial Links: The Mahout folks have an entire page of curated links to books, tutorials, and talks.

45 Cloudera; Hortonworks; MapR; IBM; Intel (hadoop.intel.com); EMC; Amazon (aws.amazon.com/ec2); Apache Bigtop

46 Cloudera vs. Hortonworks vs. MapR Please consider reading the following articles:

47 Why install it yourself? By installing it yourself, you learn more about how it all fits together and gain a better understanding of the whole Hadoop ecosystem. If you prefer, you can use a VM distribution that you can download and play with: Cloudera QuickStart VM; Hortonworks Sandbox; MapR

48 The Hadoop modes deploy these components as follows: Local standalone mode; Pseudo-distributed mode; Fully distributed mode

49 Local standalone mode
This is the default mode if you don't configure anything else. In this mode, all the components of Hadoop, such as NameNode, DataNode, JobTracker, and TaskTracker, run in a single Java process. Local, or standalone, mode is the easiest to set up, but you interact with it in a different manner than you would with the fully distributed mode.

50 Pseudo-distributed mode In this mode, a separate JVM is spawned for each of the Hadoop components and they communicate across network sockets, effectively giving a fully functioning minicluster on a single host. We shall generally prefer the pseudo-distributed mode even when using examples on a single host, as everything done in the pseudo-distributed mode is almost identical to how it works on a much larger cluster.

51 Fully distributed mode
In this mode, Hadoop is spread across multiple machines, some of which will be general-purpose workers and others will be dedicated hosts for components such as NameNode and JobTracker. Fully distributed mode is obviously the only one that can scale Hadoop across a cluster of machines, but it requires more configuration work, not to mention the cluster of machines.
