Introduction to Big Data analytics
Lecture 3: Hadoop ecosystem
Janusz Szwabiński
Outline of today's talk
- Apache Hadoop Project
- Common use cases
- Getting started with Hadoop
- Single node cluster
Further reading: D. deRoos, P. C. Zikopoulos, R. B. Melnyk, B. Brown and R. Coss, Hadoop for Dummies
Apache Hadoop Project
http://hadoop.apache.org/
- open-source software for reliable, scalable, distributed computing
- a software library (a framework) that allows for the distributed processing of large data sets across clusters of computers using simple programming models
- designed to scale up from single servers to thousands of machines, each offering local computation and storage
- rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures
Apache Hadoop Project
The project includes:
- Hadoop Common - common utilities that support the other Hadoop modules
- Hadoop Distributed File System (HDFS) - a distributed file system that provides high-throughput access to application data
- Hadoop YARN - a job scheduler and cluster resource manager
- Hadoop MapReduce - a YARN-based system for parallel processing of large data sets
Apache Hadoop Project
Other Hadoop-related projects at Apache:
- Ambari - a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters (support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop); includes a dashboard for viewing cluster health (e.g. heatmaps) and the ability to view MapReduce, Pig and Hive applications visually
- Avro - a data serialization system
- Cassandra - a scalable multi-master database with no single points of failure
- Chukwa - a data collection system for managing large distributed systems
- Flume - a data flow service for the movement of large volumes of log data into Hadoop
- Giraph - an iterative graph processing system built for high scalability
- HBase - a scalable, distributed database that supports structured data storage for large tables
- HCatalog - a service for providing a relational view of data stored in Hadoop, including a standard approach for tabular data
- Hive - a data warehouse infrastructure that provides data summarization and ad hoc querying
- Hue - a Hadoop administration interface with handy GUI tools for browsing files, issuing Hive and Pig queries, and developing Oozie workflows
Apache Hadoop Project
Other Hadoop-related projects at Apache (continued):
- Mahout - a scalable machine learning and data mining library
- Oozie - a workflow management tool that can handle the scheduling and chaining together of Hadoop applications
- Pig - a high-level data-flow language and execution framework for parallel computation
- Spark - a fast and general compute engine for Hadoop data with a simple and expressive programming model for ETL, machine learning, stream processing, and graph computation
- Sqoop - a tool for efficiently moving large amounts of data between relational databases and HDFS
- Tez - a generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases
- ZooKeeper - a high-performance coordination service for distributed applications
Apache Hadoop Project
Common use cases
Log data analysis
- the most common use case for an inaugural Hadoop project
- fits HDFS perfectly: the write once & read often scenario
- log data often grows quickly, and because of the high volumes produced, it can be tedious to analyze
- consider a typical web-based browsing and buying experience:
  - you surf the site, looking for items to buy
  - you click to read descriptions of a product that catches your eye
  - eventually, you add an item to your shopping cart and proceed to the checkout (the buying action)
  - after seeing the cost of shipping, however, you decide that the item isn't worth the price and you close the browser window
Common use cases
Log data analysis (continued)
- every click you've made (and then stopped making) has the potential to offer valuable insight to the company behind this e-commerce site
Common use cases
Data Warehouse Modernization
- the rapid rise in the amount of data generated in the world affects data warehouses (the volumes of data they manage are increasing)
- processing power in data warehouses is often used to perform transformations of the relational data as it either enters the warehouse itself or is loaded into a child data mart
- the need is increasing for analysts to issue new queries against the structured data stored in warehouses, and these ad hoc queries can often use significant data processing resources
- Hadoop can live alongside data warehouses and fulfill some of the purposes that they aren't designed for
Common use cases
Fraud detection
- a major concern across all industries
- traditional approaches to fraud prevention aren't particularly efficient: sampling data and using the sample to build a set of fraud-prediction and -detection models
- Hadoop-based solution:
  - no data sampling, the full data set is used
  - manages new varieties of data
  - enables different kinds of analysis and changes to existing models
Common use cases
Risk modeling
- closely matches the use case of fraud detection (a model-based discipline)
- risk can take on a lot of meanings
- Hadoop-based solution:
  - offers the opportunity to extend the data sets used to build the models
  - is not bound by the data models used in data warehouses
  - can free up the warehouse for regular business reporting
  - can handle unstructured data (raw text in particular)
Common use cases
Social sentiment analysis
- the most overhyped of the Hadoop use cases
- leverages content from forums, blogs, and other social media resources to develop a sense of what people are doing (for example, life events) and how they're reacting to the world around them (sentiment)
- text-based data doesn't naturally fit into a relational database, so Hadoop is a practical place to explore and run analytics on this data
Common use cases Social sentiment analysis (continued)
Common use cases
Image classification
- requires a training set used by computers to learn how to identify and classify what they're looking at
- having more data helps systems to better classify images
- a significant amount of data processing resources is required
- a hot topic in the Hadoop world: until Hadoop came along, no mainstream technology was capable of opening the door to this kind of expensive processing on such a massive and efficient scale
Common use cases
Image classification (continued)
- Hadoop provides a massively parallel processing environment to create classifier models (iterating over training sets)
- it provides nearly limitless scalability to process and run those classifiers across massive volumes of unstructured data
Common use cases Image classification (continued)
Common use cases
Graph analysis
- graphs can represent any kind of relationship
- one of the most common applications for graph processing now is mapping the Internet
- most PageRank algorithms use a form of graph processing to calculate the weightings of each page, which is a function of how many other pages point to it
Common use cases
Repeating patterns of the use cases
- when you use more data, you can make better decisions and predictions and guide better outcomes
- in cases where you need to retain data for regulatory purposes and provide a level of query access, Hadoop is a cost-effective solution
- the more a business depends on new and valuable analytics discovered in Hadoop, the more it wants (new purposes for Hadoop clusters)
Setting up Hadoop
Supported platforms:
- GNU/Linux - Hadoop has been demonstrated on clusters with 2000 nodes
- Windows - https://wiki.apache.org/hadoop/Hadoop2OnWindows
Required software:
- Java - for recommended versions look at https://wiki.apache.org/hadoop/HadoopJavaVersions
- ssh
Optional software:
- pdsh - issue commands to groups of hosts in parallel
Setting up Hadoop
Choosing the architecture:
- local (standalone) mode on a single node
  - default configuration
  - a single Java process
  - useful for debugging
- pseudo-distributed mode on a single node
  - all Hadoop services, including the master and slave services, run on a single node
  - useful for quick testing
  - a convenient way to experiment with Hadoop
- fully distributed mode on a cluster of nodes
  - the master and slave services run on different nodes in the cluster
  - appropriate for development and production environments
Setting up Hadoop
- download the software: http://www.apache.org/dyn/closer.cgi/hadoop/common/
- unpack the downloaded distribution:
  tar zxvf hadoop-3.0.0.tar.gz
- set the root of the Java installation: edit the file etc/hadoop/hadoop-env.sh and add the following lines:
  # set to the root of your Java installation
  export JAVA_HOME=/usr/java/latest
Setting up Hadoop
- test the Java configuration: in the distribution directory try
  bin/hadoop
  This will display the usage documentation for the hadoop script.
Standalone mode
- default configuration, no additional steps required to run Hadoop
- example:
  mkdir input
  cp etc/hadoop/*.xml input
  bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0.jar grep input output 'dfs[a-z.]+'
  cat output/*
Pseudo-distributed mode
- each Hadoop daemon runs in a separate Java process
- Hadoop configuration: edit etc/hadoop/core-site.xml and etc/hadoop/hdfs-site.xml
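The minimal single-node settings from the official Hadoop setup guide point the default file system at a local HDFS instance and, since there is only one DataNode, lower the block replication factor to 1:

```xml
<!-- etc/hadoop/core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```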
Pseudo-distributed mode
- check if you can ssh to localhost without a passphrase:
  ssh localhost
- if you cannot, execute the following commands
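The passphraseless-ssh setup recommended by the Hadoop single-node guide is:

```shell
# generate an RSA key pair with an empty passphrase
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
# authorize the new key for logins to localhost
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# ssh ignores an authorized_keys file with loose permissions
chmod 0600 ~/.ssh/authorized_keys
```

After this, `ssh localhost` should log in without prompting, which the Hadoop start/stop scripts rely on.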
Pseudo-distributed mode
To run a MapReduce job locally:
1. format the file system:
   bin/hdfs namenode -format
2. start the NameNode and DataNode daemons:
   sbin/start-dfs.sh
3. make the HDFS directories required to execute MapReduce jobs:
   bin/hdfs dfs -mkdir /user
   bin/hdfs dfs -mkdir /user/<username>
Pseudo-distributed mode
To run a MapReduce job locally (continued):
4. copy the input files into the distributed file system:
   bin/hdfs dfs -mkdir input
   bin/hdfs dfs -put etc/hadoop/*.xml input
5. run an example:
   bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0.jar grep input output 'dfs[a-z.]+'
6. copy the output files from the distributed file system and examine them:
   bin/hdfs dfs -get output output
   cat output/*
Pseudo-distributed mode
To run a MapReduce job locally (continued):
- alternatively, you can view the output files directly on the distributed file system:
  bin/hdfs dfs -cat output/*
- stop the daemons when you are done:
  sbin/stop-dfs.sh
Pseudo-distributed mode
Running a MapReduce job with YARN:
- steps 1-4 from the previous example have to be executed first
- two additional daemons are needed: ResourceManager and NodeManager
- configure the daemons in etc/hadoop/mapred-site.xml:
  <configuration>
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>
  </configuration>
Pseudo-distributed mode
Running a MapReduce job with YARN:
- configure the daemons in etc/hadoop/yarn-site.xml:
  <configuration>
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.env-whitelist</name>
      <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
  </configuration>
Pseudo-distributed mode
Running a MapReduce job with YARN:
- start the daemons:
  sbin/start-yarn.sh
- browse the web interface for the ResourceManager; by default it is available at http://localhost:8088/
- run a MapReduce job
- stop the daemons when you are done:
  sbin/stop-yarn.sh
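The "run a MapReduce job" step can reuse, for example, the grep job shown earlier; with mapreduce.framework.name set to yarn, the same command is now scheduled by the ResourceManager instead of running in a local process:

```shell
# same example jar as before, now submitted to YARN;
# progress can be followed in the ResourceManager UI at http://localhost:8088/
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0.jar \
    grep input output 'dfs[a-z.]+'
```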
A shortcut: Hadoop appliances
- Hadoop distributions: various combinations of open source components from the ASF and elsewhere, integrated into a single product
- vendors typically offer proprietary software, support, consulting services and training
- not all distributions have the same components
- not all components in one particular distribution are compatible with other distributions
- some vendors offer a virtual machine appliance for quick and easy setup
Hortonworks HDP Sandbox
Prerequisites:
- Oracle VM VirtualBox - https://www.virtualbox.org/wiki/Downloads
- VMWare Workstation for Linux/Windows or VMWare Fusion for Mac - https://www.vmware.com/products/workstation-player.html
- Docker for Linux, Windows or Mac - https://docs.docker.com/install/
Hortonworks HDP Sandbox
- install VirtualBox
- download the Hortonworks Sandbox
- import the Hortonworks Sandbox into VirtualBox:
  - open VirtualBox
  - navigate to File -> Import Appliance
  - select the downloaded Sandbox image and click Open
Hortonworks HDP Sandbox
Hortonworks HDP Sandbox
- click Import and wait for VirtualBox to import the Sandbox
- once the Sandbox has been imported, start the virtual machine
Hortonworks HDP Sandbox
Hortonworks HDP Sandbox login credentials may be found at https://hortonworks.com/tutorial/learning-the-ropes-of-the-hortonworks-sandbox/#login-credentials