Hadoop Introduction

Topics
Big Data Analytics
What is and why Hadoop?
Comparison to other technologies
Hadoop architecture
Hadoop ecosystem
Hadoop usage examples

Big Data Analytics

What is Big Data?
Big data is a collection of data sets so large and complex that they become difficult to process using traditional data processing technologies
> How do you capture, store, search, share, transfer, analyze, and visualize big data?
Unstructured data is exploding

Big Data Examples
Facebook
> Has ~400 terabytes of data
> ~20 terabytes of new data per day
New York Stock Exchange (NYSE)
> Generates ~1 terabyte of trade data per day
Internet Archive
> Stores ~2 petabytes of data
> Grows at a rate of 20 terabytes per month
> http://archive.org/web/web.php
Skybox Imaging (satellite images)
> ~1 terabyte per day
Ancestry.com
> Stores ~2.5 petabytes of data

Big Data Evolution

Challenges in Big Data (Using Traditional Data Processing Systems)
Slow to process
> Reading ~100 TB on a single computer takes about 11 days (see the sketch below)
Slow to move data
> Moving big data over the network is slow
Can't scale
> Scaling up (more memory, more powerful hardware) has its limits
Limited hard drive capacity
> A single hard drive cannot accommodate the size of big data
Unreliable
> Failures in computing devices are inevitable
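The 11-day figure comes from simple throughput arithmetic. A back-of-the-envelope sketch (assuming a sustained single-disk read rate of ~100 MB/s, a typical commodity-drive figure; the 1,000-disk count is hypothetical):

```python
# Why reading big data on one computer is slow: throughput arithmetic.
data_bytes = 100 * 10**12      # 100 TB to read
disk_rate = 100 * 10**6        # assumed ~100 MB/s sustained per disk

seconds = data_bytes / disk_rate
print(f"one disk:    {seconds / 86400:.1f} days")         # ~11.6 days

# The same read spread across 1,000 disks working in parallel:
print(f"1,000 disks: {seconds / 1000 / 60:.1f} minutes")  # ~16.7 minutes
```

This is exactly the argument for spreading both storage and reads across many nodes, which is what HDFS does.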

Big Data Characteristics
Very large, distributed aggregations of loosely structured (or unstructured) data
Petabytes/exabytes of data
Flat schemas with few complex interrelationships
Often involving time-stamped events
Often made up of incomplete data
Often including connections between data elements that must be probabilistically inferred

Challenges in Big Data Handling
Moving data from a storage cluster to a computation cluster is not feasible
> Instead, moving the computing logic to the data is more efficient (see the arithmetic below)
In a large cluster, failure is expected
> Machines and networks break down
It is expensive to build reliability into each application
> Reliability should be handled by the framework
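The first point is also simple arithmetic: shipping the data to the code costs orders of magnitude more than shipping the code to the data. A minimal sketch (the ~1 Gbps link rate is an assumed figure):

```python
# Moving 100 TB over a 1 Gbps link vs. moving the program to the data.
data_bits = 100 * 10**12 * 8          # 100 TB expressed in bits
link_rate = 1 * 10**9                 # assumed ~1 Gbps network link

print(f"ship the data: {data_bits / link_rate / 86400:.1f} days")  # ~9.3 days
print("ship the code: a few kilobytes, effectively instantaneous")
```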

Things to Consider in Big Data

Examples of Public Data Sets
U.S. Government
> http://usgovxml.com/
Amazon
> http://aws.amazon.com/public-data-sets/
Weather data from NCDC
> http://www.ncdc.noaa.gov/data-access
Million Song Dataset
> http://labrosa.ee.columbia.edu/millionsong/

What is and Why Hadoop?

What is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model
It is designed to scale up from a single node to thousands of nodes, each providing computation and storage

Hadoop Historical Background
Started as a sub-project of Apache Nutch
> Nutch indexes the web and exposes it for searching
> An open-source alternative to Google's search technology
> Started by Doug Cutting
In 2003 and 2004, Google published its Google File System (GFS) and MapReduce papers
Doug Cutting and the Nutch team implemented GFS and MapReduce in Nutch
In 2006, Yahoo! hired Doug Cutting to work on Hadoop
In 2008, Hadoop became an Apache top-level project
> http://hadoop.apache.org

Why Hadoop?
Scalable: grows by simply adding new nodes to the cluster
Economical: runs on commodity machines
Efficient: runs tasks where the data is located
Flexible: handles schema-less, unstructured data
Fault tolerant: discovers failed nodes and self-heals by redistributing failed jobs, and keeps data reliable through replication
Simple: uses a simple programming model
Evolving: backed by a vibrant ecosystem

Hadoop Design Principles
Store and process large amounts of data
Scale in performance, storage, and processing
Computing ("code") moves to the data, not the other way around
Self-healing: recovery from failures is built in
Designed to run on commodity hardware

When to Use or Not to Use Hadoop?
Hadoop is good for
> Indexing data
> Log analysis
> Image manipulation
> Sorting large-scale data
> Data mining
Hadoop is NOT good for
> Real-time processing (Hadoop is batch-oriented)
> Random access (Hadoop is not a database)
> Processing-intensive tasks with little data
Many of these limitations are, however, addressed by Hadoop ecosystem technologies

Comparison to Other Technologies

RDBMS vs. Hadoop

RDBMS (small data)                  Hadoop (big data)
Strictly structured                 Loosely structured
Schema on write                     Schema on read
Relational                          Non-relational
Transaction-oriented                Analytics-oriented
ACID                                Non-ACID
Real-time query                     Batch-oriented
Frequent write/update               Initial write, no update/delete
Centralized                         Distributed
Limited scalability (vertical)      High scalability (horizontal)
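The schema-on-write / schema-on-read row above deserves a concrete illustration. A minimal sketch (the log format and field names are hypothetical, not a real Hadoop API): an RDBMS validates every row against a fixed schema when it is written, while Hadoop stores raw lines untouched and each job imposes its own structure at read time.

```python
# Schema on read: raw, loosely structured lines are stored as-is;
# each job imposes its own structure when it parses them.
raw_lines = [
    "2016-03-01T10:15:00 alice GET /index.html 200",
    "2016-03-01T10:15:02 bob POST /login 302",
    "a malformed line a schema-on-write system would reject at insert",
]

def parse(line):
    """Apply a schema at read time; tolerate records that don't fit."""
    fields = line.split()
    if len(fields) != 5:
        return None                     # skip instead of failing the load
    timestamp, user, method, path, status = fields
    return {"user": user, "path": path, "status": int(status)}

records = [r for r in map(parse, raw_lines) if r is not None]
print(records)  # only the two well-formed records survive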

Data Warehouse vs. Hadoop

Hadoop Architecture

Hadoop Architecture
Hadoop is built on MapReduce for processing and HDFS as its distributed file system
> MapReduce is a framework for high-performance distributed data processing using the divide-and-aggregate programming paradigm
> HDFS is a reliable distributed file system that provides high-throughput access to data
Hadoop has a master/slave architecture for both MapReduce and HDFS
> MapReduce has a JobTracker as the master and multiple TaskTrackers as slaves
> HDFS has a NameNode as the master and multiple DataNodes as slaves

Hadoop Architecture: Hadoop = MapReduce + HDFS
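To make the divide-and-aggregate model concrete, here is the canonical word-count job written for Hadoop Streaming, which lets any executable act as mapper or reducer. This is a minimal sketch under that assumption, not the only way to write it; the file name is illustrative.

```python
#!/usr/bin/env python
# wordcount.py -- canonical MapReduce word count for Hadoop Streaming.
# Run with "wordcount.py map" as the mapper and "wordcount.py reduce"
# as the reducer. Streaming passes records on stdin; between the two
# phases the framework sorts all (key, value) pairs by key.
import sys

def mapper():
    # Divide: emit ("word", 1) for every word in this input split.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Aggregate: input arrives grouped (sorted) by key, so a single
    # sequential pass is enough to total each word's count.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current and current is not None:
            print(f"{current}\t{count}")
            count = 0
        current = word
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

A typical (distribution-dependent) launch passes both roles to the streaming jar, e.g. hadoop jar hadoop-streaming-*.jar -input <in> -output <out> -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py; HDFS itself is driven with commands such as hadoop fs -put and hadoop fs -cat.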

Hadoop Ecosystem

Hadoop Ecosystem

Cloudera Hadoop Distribution

Cloudera Manager

Hadoop Ecosystem Development

Hadoop Usage Examples

Institutions that are using Hadoop
> http://wiki.apache.org/hadoop/poweredby

Institutions that are using Hadoop: Facebook
> We use Apache Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning
> Currently we have two major clusters:
> A 1100-machine cluster with 8800 cores and about 12 PB raw storage
> A 300-machine cluster with 2400 cores and about 3 PB raw storage
> Each (commodity) node has 8 cores and 12 TB of storage
> We are heavy users of both streaming and the Java APIs
> We have built a higher-level data warehousing framework using Hive
> We have also developed a FUSE implementation over HDFS

Institutions that are using Hadoop: eBay
> 532-node cluster (8 * 532 cores, 5.3 PB)
> Heavy usage of Java MapReduce, Apache Pig, Apache Hive, Apache HBase
> Used for search optimization and research

Institutions that are using Hadoop: Twitter
> We use Apache Hadoop to store and process tweets, log files, and many other types of data generated across Twitter
> We store all data as compressed LZO files
> We use both Scala and Java to access Hadoop's MapReduce APIs
> We use Apache Pig heavily for both scheduled and ad-hoc jobs, due to its ability to accomplish a lot with few statements
> We employ committers on Apache Pig, Apache Avro, Apache Hive, and Apache Cassandra, and contribute much of our internal Hadoop work to open source (see hadoop-lzo)

Institutions that are using Hadoop: Yahoo!
> More than 100,000 CPUs in >40,000 computers running Hadoop
> Our biggest cluster: 4500 nodes (2*4-CPU boxes with 4*1 TB disks & 16 GB RAM)
> Used to support research for ad systems and web search
> Also used for scaling tests to support development of Apache Hadoop on larger clusters
> More than 60% of Hadoop jobs within Yahoo! are Apache Pig jobs