Introduction into Big Data analytics Lecture 3 Hadoop ecosystem. Janusz Szwabiński


Outline of today's talk
- Apache Hadoop Project
- common use cases
- getting started with Hadoop
- single node cluster
- further reading: D. deRoos, P. C. Zikopoulos, R. B. Melnyk, B. Brown and R. Coss, Hadoop for Dummies

Apache Hadoop Project
http://hadoop.apache.org/
- open-source software for reliable, scalable, distributed computing
- a software library (a framework) that allows for the distributed processing of large data sets across clusters of computers using simple programming models
- designed to scale up from single servers to thousands of machines, each offering local computation and storage
- rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures

Apache Hadoop Project
the project includes:
- Hadoop Common - common utilities that support the other Hadoop modules
- Hadoop Distributed File System (HDFS) - a distributed file system that provides high-throughput access to application data
- Hadoop YARN - a job scheduler and cluster resource manager
- Hadoop MapReduce - a YARN-based system for parallel processing of large data sets

Apache Hadoop Project
other Hadoop-related projects at Apache:
- Ambari - a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters (support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop); a dashboard for viewing cluster health (e.g. heatmaps); ability to view MapReduce, Pig and Hive applications visually
- Avro - a data serialization system
- Cassandra - a scalable multi-master database with no single points of failure
- Chukwa - a data collection system for managing large distributed systems
- Flume - a data flow service for the movement of large volumes of log data into Hadoop
- Giraph - an iterative graph processing system built for high scalability
- HBase - a scalable, distributed database that supports structured data storage for large tables
- HCatalog - a service for providing a relational view of data stored in Hadoop, including a standard approach for tabular data
- Hive - a data warehouse infrastructure that provides data summarization and ad hoc querying

Apache Hadoop Project
other Hadoop-related projects at Apache (continued):
- Hue - a Hadoop administration interface with handy GUI tools for browsing files, issuing Hive and Pig queries, and developing Oozie workflows
- Mahout - a scalable machine learning and data mining library
- Oozie - a workflow management tool that can handle the scheduling and chaining together of Hadoop applications
- Pig - a high-level data-flow language and execution framework for parallel computation
- Spark - a fast and general compute engine for Hadoop data with a simple and expressive programming model for ETL, machine learning, stream processing, and graph computation
- Sqoop - a tool for efficiently moving large amounts of data between relational databases and HDFS
- Tez - a generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases
- ZooKeeper - a high-performance coordination service for distributed applications

Apache Hadoop Project

Common use cases: Log data analysis
- the most common use case for an inaugural Hadoop project
- fits the HDFS scenario perfectly: write once & read often
- log data often grows quickly, and because of the high volumes produced, it can be tedious to analyze
- consider a typical web-based browsing and buying experience:
  - you surf the site, looking for items to buy
  - you click to read descriptions of a product that catches your eye
  - eventually, you add an item to your shopping cart and proceed to the checkout (the buying action)
  - after seeing the cost of shipping, however, you decide that the item isn't worth the price and you close the browser window

Common use cases: Log data analysis (continued)
- every click you've made - and then stopped making - has the potential to offer valuable insight to the company behind this e-commerce site
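A minimal illustrative sketch (not from the original slides) of how such clickstream logs could land in HDFS; the paths and file name are hypothetical, the hdfs commands are the standard CLI:

    # create a landing directory in HDFS and ingest one day's web-server log
    bin/hdfs dfs -mkdir -p /data/weblogs
    bin/hdfs dfs -put /var/log/httpd/access_log /data/weblogs/access_log-2018-01-15
    # written once, read often: later jobs scan the logs without rewriting them
    bin/hdfs dfs -ls /data/weblogs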

Common use cases: Data warehouse modernization
- the rapid rise in the amount of data generated in the world affects data warehouses (the volumes of data they manage are increasing)
- processing power in data warehouses is often used to perform transformations of the relational data as it either enters the warehouse itself or is loaded into a child data mart
- the need is increasing for analysts to issue new queries against the structured data stored in warehouses, and these ad hoc queries can often use significant data processing resources
- Hadoop can live alongside data warehouses and fulfill some of the purposes that they aren't designed for

Common use cases: Fraud detection
- a major concern across all industries
- traditional approaches to fraud prevention aren't particularly efficient: they sample the data and use the sample to build a set of fraud-prediction and fraud-detection models
- a Hadoop-based solution:
  - no data sampling, the full data set is used
  - manages new varieties of data
  - enables different kinds of analysis and changes to existing models

Common use cases: Risk modeling
- closely matches the fraud detection use case (a model-based discipline)
- risk can take on a lot of meanings
- a Hadoop-based solution:
  - offers the opportunity to extend the data sets used to build the models
  - is not bound by the data models used in data warehouses
  - can free up the warehouse for regular business reporting
  - can handle unstructured data (raw text in particular)

Common use cases: Social sentiment analysis
- the most overhyped of the Hadoop use cases
- leverages content from forums, blogs, and other social media resources to develop a sense of what people are doing (for example, life events) and how they're reacting to the world around them (sentiment)
- text-based data doesn't naturally fit into a relational database
- Hadoop is a practical place to explore and run analytics on this data

Common use cases: Social sentiment analysis (continued)

Common use cases: Image classification
- requires a training set used by computers to learn how to identify and classify what they're looking at
- having more data helps systems to better classify images
- a significant amount of data processing resources is required
- a hot topic in the Hadoop world: until Hadoop came along, no mainstream technology was capable of this kind of expensive processing on such a massive and efficient scale

Common use cases: Image classification (continued)
- Hadoop provides a massively parallel processing environment to create classifier models (iterating over training sets)
- it provides nearly limitless scalability to process and run those classifiers across massive sets of unstructured data volumes

Common use cases: Image classification (continued)

Common use cases: Graph analysis
- graphs can represent any kind of relationship
- one of the most common applications for graph processing now is mapping the Internet
- most PageRank algorithms use a form of graph processing to calculate the weighting of each page, which is a function of how many other pages point to it (see the formula below)
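For reference (not on the original slide), the classic PageRank update makes this concrete: the weight of a page p is computed from the weights of the pages q that link to it,

    PR(p) = \frac{1 - d}{N} + d \sum_{q \to p} \frac{PR(q)}{L(q)}

where N is the total number of pages, L(q) is the number of outbound links on page q, and d (typically 0.85) is the damping factor; the scores are iterated until they converge.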

Common use cases: Repeating patterns across the use cases
- when you use more data, you can make better decisions and predictions and guide better outcomes
- in cases where you need to retain data for regulatory purposes and provide a level of query access, Hadoop is a cost-effective solution
- the more a business depends on new and valuable analytics discovered in Hadoop, the more it wants from it (new purposes for Hadoop clusters emerge)

Setting up Hadoop
- supported platforms:
  - GNU/Linux (Hadoop has been demonstrated on clusters with 2000 nodes)
  - Windows: https://wiki.apache.org/hadoop/hadoop2onwindows
- required software:
  - Java (for recommended versions look at https://wiki.apache.org/hadoop/hadoopjavaversions)
  - ssh
- optional software:
  - pdsh - issue commands to groups of hosts in parallel
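A quick, illustrative sanity check of these prerequisites on a *nix box (the commands are standard; the versions reported depend on your system):

    java -version   # should report one of the recommended JDK versions
    ssh -V          # should report the installed OpenSSH version
    pdsh -V         # optional, only if you installed pdsh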


Setting up Hadoop
- choosing the architecture:
  - local (standalone) mode on a single node
    - default configuration
    - a single Java process
    - useful for debugging
  - pseudo-distributed mode on a single node
    - all Hadoop services, including the master and slave services, run on a single node
    - useful for quick testing
    - a convenient way to experiment with Hadoop
  - fully distributed mode on a cluster of nodes
    - the master and slave services run on different nodes in the cluster
    - appropriate for development and production environments

Setting up Hadoop
- download the software: http://www.apache.org/dyn/closer.cgi/hadoop/common/
- unpack the downloaded distribution:
    tar zxvf hadoop-3.0.0.tar.gz
- set the root of the Java installation: edit the file etc/hadoop/hadoop-env.sh and add the following lines:
    # set to the root of your Java installation
    export JAVA_HOME=/usr/java/latest

Setting up Hadoop
- test the Java configuration: in the distribution directory try
    bin/hadoop
  this will display the usage documentation for the hadoop script

Standalone mode
- default configuration, no additional steps required to run Hadoop
- example:
    mkdir input
    cp etc/hadoop/*.xml input
    bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0.jar grep input output 'dfs[a-z.]+'
    cat output/*

Pseudo-distributed mode
- each Hadoop daemon runs in a separate Java process
- Hadoop configuration (see the files below)
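For reference, the standard single-node settings from the Apache Hadoop setup guide (the slide itself does not show them):

    etc/hadoop/core-site.xml:
        <configuration>
            <property>
                <name>fs.defaultFS</name>
                <value>hdfs://localhost:9000</value>
            </property>
        </configuration>

    etc/hadoop/hdfs-site.xml:
        <configuration>
            <property>
                <name>dfs.replication</name>
                <value>1</value>
            </property>
        </configuration>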

Pseudo-distributed mode
- check if you can ssh to localhost without a passphrase
- if you cannot, execute the commands shown below
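For reference, the standard sequence from the Apache Hadoop single-node setup guide:

    # check passphraseless ssh
    ssh localhost
    # if that prompts for a password, set up key-based login:
    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 0600 ~/.ssh/authorized_keys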

Pseudo-distributed mode
- to run a MapReduce job locally:
  1. format the file system
       bin/hdfs namenode -format
  2. start the NameNode and DataNode daemons
       sbin/start-dfs.sh
  3. make the HDFS directories required to execute MapReduce jobs
       bin/hdfs dfs -mkdir /user
       bin/hdfs dfs -mkdir /user/<username>

Pseudo-distributed mode
- to run a MapReduce job locally (continued):
  4. copy the input files into the distributed file system
       bin/hdfs dfs -mkdir input
       bin/hdfs dfs -put etc/hadoop/*.xml input
  5. run an example
       bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0.jar grep input output 'dfs[a-z.]+'
  6. copy the output files from the distributed file system and examine them
       bin/hdfs dfs -get output output
       cat output/*

Pseudo-distributed mode
- to run a MapReduce job locally (continued):
  - alternatively, you can view the output files directly on the distributed file system
      bin/hdfs dfs -cat output/*
  - stop the daemons when you are done
      sbin/stop-dfs.sh

Pseudo-distributed mode
- running a MapReduce job with YARN:
  - steps 1-4 from the previous example have to be executed already
  - two additional daemons are needed: ResourceManager and NodeManager
  - configure the daemons in etc/hadoop/mapred-site.xml:
      <configuration>
          <property>
              <name>mapreduce.framework.name</name>
              <value>yarn</value>
          </property>
      </configuration>

Pseudo-distributed mode
- running a MapReduce job with YARN (continued):
  - configure the daemons in etc/hadoop/yarn-site.xml:
      <configuration>
          <property>
              <name>yarn.nodemanager.aux-services</name>
              <value>mapreduce_shuffle</value>
          </property>
          <property>
              <name>yarn.nodemanager.env-whitelist</name>
              <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
          </property>
      </configuration>

Pseudo-distributed mode
- running a MapReduce job with YARN (continued):
  - start the daemons
      sbin/start-yarn.sh
  - browse the web interface for the ResourceManager; by default it is available at http://localhost:8088/
  - run a MapReduce job (see below)
  - stop the daemons when you are done
      sbin/stop-yarn.sh
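The slide leaves the submission step implicit; a reasonable reading is that it is the same grep example as in the local run, now scheduled by YARN:

    # remove the output directory from any earlier run first, otherwise the job fails
    bin/hdfs dfs -rm -r output
    bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0.jar grep input output 'dfs[a-z.]+'
    bin/hdfs dfs -cat output/*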

A shortcut: Hadoop distributions and appliances
- various combinations of open source components from the ASF and elsewhere, integrated into one single product
- vendors typically offer proprietary software, support, consulting services and training
- not all distributions have the same components
- not all components in one particular distribution are compatible with other distributions
- some of them offer a virtual machine appliance for quick and easy setup

Hortonworks HDP Sandbox
- prerequisites (one of the following):
  - Oracle VM VirtualBox: https://www.virtualbox.org/wiki/downloads
  - VMWare Workstation for Linux/Windows or VMWare Fusion for Mac: https://www.vmware.com/products/workstation-player.html
  - Docker for Linux, Windows or Mac: https://docs.docker.com/install/

Hortonworks HDP Sandbox
- install VirtualBox
- download the Hortonworks Sandbox
- import the Hortonworks Sandbox into VirtualBox:
  - open VirtualBox
  - navigate to File > Import Appliance
  - select the downloaded Sandbox image and click Open

Hortonworks HDP Sandbox

Hortonworks HDP Sandbox
- click Import and wait for VirtualBox to import the Sandbox
- once the Sandbox has been imported, start the virtual machine

Hortonworks HDP Sandbox

Hortonworks HDP Sandbox
- login credentials may be found at https://hortonworks.com/tutorial/learning-the-ropes-of-the-hortonworks-sandbox/#login-credentials