How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah, Founder and CTO, Cloudera


How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah, Founder and CTO, Cloudera, Inc. Twitter: @awadallah, @cloudera

The Problems with Current Data Systems
The pipeline today: Instrumentation → Collection (mostly append) → Storage-Only Grid (original raw data) → ETL Compute Grid → RDBMS (aggregated data) → BI Reports + Interactive Apps.
1. Can't explore the original high-fidelity raw data.
2. Moving data to compute doesn't scale.
3. Archiving = premature data death.

The Solution: A Combined Storage/Compute Layer
Hadoop collapses the storage-only grid and the ETL compute grid into a single storage + compute grid: collection (mostly append) feeds Hadoop, which in turn feeds the RDBMS (aggregated data) and the BI reports + interactive apps.
1. Data exploration and advanced analytics.
2. Scalable throughput for ETL and aggregation.
3. Keep data alive forever.

So What is Apache Hadoop?
A scalable, fault-tolerant distributed system for data storage and processing (open source under the Apache license). Core Hadoop has two main systems:
- Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage.
- MapReduce: distributed, fault-tolerant resource management and scheduling, coupled with a scalable data-programming abstraction.
Key business values:
- Flexibility: store any data, run any analysis.
- Scalability: start at 1 TB on 3 nodes and grow to petabytes on thousands of nodes.
- Economics: cost per TB at a fraction of traditional options.
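To make the storage half concrete, here is a minimal sketch of writing and reading a raw file through the HDFS Java FileSystem API. The path and the record are hypothetical, and the cluster address is assumed to come from the usual core-site.xml on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsPutGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up fs.defaultFS from core-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/raw/events.log"); // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            // The record is written as-is: no schema, no load-time transformation.
            out.writeBytes("2012-01-01\tclick\tuser42\n");
        }
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false); // stream it straight back
        }
        fs.close();
    }
}
```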

The Key Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):
- The schema must be created before any data can be loaded.
- An explicit load operation has to take place which transforms the data into the database's internal structure.
- New columns must be added explicitly before new data for those columns can be loaded.
- Pros: reads are fast; standards and governance.
Schema-on-Read (Hadoop):
- Data is simply copied to the file store; no transformation is needed.
- A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns (late binding).
- New data can start flowing at any time and will appear retroactively once the SerDe is updated to parse it.
- Pros: loads are fast; flexibility and agility.
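The late-binding idea can be shown with a toy reader in the spirit of a Hive SerDe (an illustration only, not the actual org.apache.hadoop.hive.serde2 interface; the column layout is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Raw lines stay untouched in the file store; columns are bound at read time.
public class TabDelimitedReader {
    public Map<String, String> deserialize(String rawLine) {
        String[] fields = rawLine.split("\t", -1);
        Map<String, String> row = new HashMap<>();
        row.put("ts",    fields.length > 0 ? fields[0] : null);
        row.put("event", fields.length > 1 ? fields[1] : null);
        // Added to the reader after the data was written: the "user" column now
        // appears retroactively for every previously stored file that carries it.
        row.put("user",  fields.length > 2 ? fields[2] : null);
        return row;
    }
}
```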

Flexibility: Complex Data Processing
1. Java MapReduce: the most flexibility and performance, but a tedious development cycle (the assembly language of Hadoop); a minimal example follows this list.
2. Streaming MapReduce (aka Pipes): lets you develop in any programming language of your choice, with slightly lower performance and less flexibility than native Java MapReduce.
3. Crunch: a library for multi-stage MapReduce pipelines in Java (modeled after Google's FlumeJava; similar to Cascading).
4. Pig Latin: a high-level language out of Yahoo!, suitable for batch data-flow workloads.
5. Hive: a SQL interpreter out of Facebook, which also includes a metastore mapping files to their schemas and associated SerDes.
6. Oozie: a workflow engine with an XML process-definition language, which lets you chain jobs built from any of the above.
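As a sketch of option 1, here is the classic word count in the org.apache.hadoop.mapreduce API (input and output paths come from the command line):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: one input line in, one (word, 1) pair out per token.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) { word.set(token); ctx.write(word, ONE); }
            }
        }
    }

    // Reduce: sum the counts for each word (also reused as a combiner).
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```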

Scalability: Scalable Software Development
The cluster grows without requiring developers to re-architect their data or applications.

Economics: Return on Byte
Return on Byte (ROB) = the value extracted from a byte divided by the cost of storing that byte. If ROB < 1, the byte ends up buried in the tape wasteland; hence the need for more economical active storage.
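A quick worked example (the dollar figures are hypothetical, purely to show how the threshold behaves):

```latex
\mathrm{ROB} = \frac{\text{value extracted}}{\text{storage cost}}
             = \frac{\$300/\mathrm{TB\cdot yr}}{\$1200/\mathrm{TB\cdot yr}} = 0.25 < 1
% At 0.25 the terabyte heads for tape; cut active storage to $200/TB-year and
% the very same data has ROB = 300/200 = 1.5, i.e. worth keeping alive.
```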

Economics: Keep All Data Alive Forever
Instead of archiving data to tape and never seeing it again, keep it in Hadoop and extract value from all of it.

Use the Right Tool for the Right Job
Relational databases, use when you need:
- Interactive OLAP analytics
- Multistep ACID transactions
- 100% SQL compliance
Hadoop, use when you need:
- Structured or unstructured data (flexibility)
- Scalability of storage/compute
- Complex data processing

CDH: Cloudera's Distribution Including Apache Hadoop
Built and tested with Apache Bigtop. Components:
- File system mount: FUSE-NFS
- Web console: Hue
- Data mining: Apache Mahout
- BI connectivity: ODBC/JDBC
- Job workflow: Apache Oozie
- Metadata management: Apache Hive
- Data integration: Apache Flume, Apache Sqoop
- Analytical languages: Apache Pig, Apache Hive
- Data serving: Apache HBase
- Cluster provisioning: Apache Whirr
- Coordination: Apache ZooKeeper
Packaged with Cloudera Manager Free Edition (installation wizard).

HBase versus HDFS
HDFS is optimized for large files, sequential access (high throughput), and append-only writes. Use it for fact tables that are mostly append-only and require sequential full-table scans.
HBase is optimized for small records, random access (low latency), and atomic record updates. Use it for dimension tables that are updated frequently and require random low-latency lookups. It is not suitable for low-latency interactive OLAP.
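The HBase access pattern above looks like this in the org.apache.hadoop.hbase.client Java API (table, column family, and row key are hypothetical; this is the Connection/Table API from HBase 1.0+, slightly newer than the HTable-style client contemporary with these slides):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DimensionLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table dim = conn.getTable(TableName.valueOf("dim_customer"))) {
            // Atomic update of a single dimension record.
            Put put = new Put(Bytes.toBytes("cust-00042"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("segment"), Bytes.toBytes("premium"));
            dim.put(put);
            // Low-latency random read by row key.
            Result row = dim.get(new Get(Bytes.toBytes("cust-00042")));
            System.out.println(Bytes.toString(
                row.getValue(Bytes.toBytes("d"), Bytes.toBytes("segment"))));
        }
    }
}
```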

Hadoop in the Enterprise Data Stack
Engineers, data scientists, analysts, business users, data architects, and system operators reach the cluster through IDEs, modeling tools, BI/analytics and enterprise reporting tools, metadata/ETL tools, and Cloudera Manager. Flume streams logs, files, and web data in; Sqoop moves data to and from the enterprise data warehouse, relational databases, and online serving systems; ODBC, JDBC, NFS, and HParser round out the connectivity. Downstream, web and mobile applications serve customers.

Informatica HParser Environment
Demo: http://www.informatica.com/videos/demos/1846_hparser_simpleedi/default.htm

ODBC Connector for MicroStrategy/Tableau
The BI tool (in-memory I-Cubes, multi-source alongside the enterprise data warehouse) speaks ODBC to the Cloudera ODBC driver, which submits HiveQL to Hive; Hive compiles the SQL into MapReduce jobs over HDFS and HBase.
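The ODBC driver serves BI tools; the equivalent route from Java is Hive's JDBC driver. A minimal sketch, assuming the HiveServer2-era driver (org.apache.hive.jdbc.HiveDriver) with a hypothetical host, port, and table:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://gateway.example.com:10000/default", "analyst", "");
             Statement stmt = conn.createStatement();
             // Hive compiles this SQL into MapReduce jobs over files in HDFS.
             ResultSet rs = stmt.executeQuery(
                 "SELECT event, COUNT(*) FROM events GROUP BY event")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```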

Three Deployment Options
Option 1, the complete solution: Cloudera Enterprise (Cloudera Manager + support) plus CDH. Fastest rollout; easiest to deploy, manage, and configure; fully supported; annual subscription.
Option 2, the no-cost trial: Cloudera Manager Free Edition plus CDH. No software cost; no support included; manages up to 50 nodes; limited functionality in Cloudera Manager.
Option 3, the free download: CDH only. Self-install and configure; no software cost; no support included; more management and installation complexity; longer rollout for most customers.
Cloudera Enterprise (Cloudera Manager + Cloudera Support) is proprietary software; CDH (Cloudera's Distribution Including Apache Hadoop) is the 100% open-source core of Apache Hadoop plus selected components.

Conclusion
The key benefits of Apache Hadoop:
- Agility (dynamic run-time schemas, any language).
- Scalability of storage/compute (freedom to grow).
- Economics (keep all your data alive forever).
Cloudera has the most complete technology offering, the most experienced team, the most diversified track record, and the richest partner ecosystem to enable your success with Apache Hadoop.

Appendix: Backup Slides

Hadoop Creation History
Fastest sort of a TB: 62 seconds over 1,460 nodes. Sorted a PB in 16.25 hours over 3,658 nodes.

HDFS: Hadoop Distributed File System
A given file is broken down into blocks (default 64 MB), then the blocks are replicated across the cluster (default 3 replicas).
Optimized for throughput, put/get/delete, and appends.
Block replication provides durability, availability, and throughput; block replicas are distributed across servers and racks.
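The per-file block size and replication factor named above can be set at create time. A minimal sketch via FileSystem.create(...); the path is hypothetical, and 64 MB with 3 replicas simply mirrors the defaults cited on this slide:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettings {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long blockSize = 64L * 1024 * 1024; // 64 MB blocks
        short replication = 3;              // three replicas across servers and racks
        try (FSDataOutputStream out = fs.create(
                new Path("/data/facts/clicks.log"), // hypothetical path
                true,                               // overwrite if present
                4096,                               // write buffer size
                replication, blockSize)) {
            out.writeBytes("2012-01-01\tclick\tuser42\n");
        }
    }
}
```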

MapReduce: Resource Manager / Scheduler
A given job is broken down into tasks, then the tasks are scheduled as close to the data as possible. Three levels of data locality:
- Same server as the data (local disk).
- Same rack as the data (rack/leaf switch).
- Wherever there is a free slot (cross-rack).
Optimized for batch processing and failure recovery: the system detects laggard tasks and speculatively executes parallel tasks on the same slice of data.

MapReduce Next Gen (YARN)
The main idea is to split up the JobTracker's functions:
- Cluster resource management (tracking and allocating nodes).
- Application life-cycle management (MapReduce scheduling and execution).
This enables high availability, better scalability, efficient slot allocation, rolling upgrades, and non-MapReduce applications.

Books
