Hadoop Overview. Lars George Director EMEA Services


About Me: Director EMEA Services @ Cloudera, consulting on Hadoop projects (everywhere). Apache Committer on HBase and Whirr. O'Reilly author of HBase: The Definitive Guide (now available in Japanese, too). Contact: lars@cloudera.com, @larsgeorge

Agenda. Part 1: Why Hadoop? Part 2: Hadoop in the Enterprise Infrastructure. Part 3: What Is Hadoop? Part 4: Use-Cases.

Why Hadoop? Part 1

The Progression to Big Data, then vs. now. VOLUME: gigabytes then, petabytes now. VARIETY: structured then, structured plus unstructured now. VELOCITY: a trickle then, a torrent now. VALUE: operational reporting then, reporting plus data discovery now.

Pain Points: Data Management. Data can't be ingested fast enough, costs too much to store, and exists in different places, while archived data is lost.

Pain Points: Data Exploration & Analysis. Analysis and processing take too long, data exists in silos, and it is impossible to ask new questions or to analyze unstructured data.

Apache Hadoop: A Revolutionary Platform for Big Data. The platform spans ingest, store, explore, process, analyze, and serve. VOLUME: the distributed architecture scales cost-effectively. VARIETY: store data in any format. VELOCITY: load raw data and define how you look at it later. VALUE: process data faster and ask any question.

Hadoop and Relational Databases: Schema-on-Write vs. Schema-on-Read. Schema-on-Write (relational databases): the schema must be created before any data can be loaded; an explicit load operation transforms data into the database's internal structure; and new columns must be added explicitly before data for those columns can be loaded. Pros: reads are fast, and standards and governance are enforced. Schema-on-Read (Hadoop): data is simply copied to the file store, with no transformation needed; a SerDe (Serializer/Deserializer) is applied at read time to extract the required columns (late binding); and new data can start flowing at any time, appearing retroactively once the SerDe is updated to parse it. Pros: loads are fast, and the approach offers flexibility and agility.
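The schema-on-read idea can be sketched in a few lines of Python (a hypothetical illustration of late binding, not Hive's actual SerDe interface): raw records are written untouched, and a parser chosen at read time decides which fields exist.

```python
# Schema-on-read sketch: store raw lines as-is, parse only when reading.
raw_store = []  # stands in for raw files sitting on HDFS

def ingest(line):
    """Write path: no schema check, no transformation -- just copy the data."""
    raw_store.append(line)

def read(serde):
    """Read path: apply a SerDe-like parser at query time (late binding)."""
    return [serde(line) for line in raw_store]

# Data starts flowing before anyone has defined a schema.
ingest("2013-01-01,click,42")
ingest("2013-01-02,view,7")

# Later, a "SerDe" is written; the existing data appears retroactively.
def csv_serde(line):
    date, event, count = line.split(",")
    return {"date": date, "event": event, "count": int(count)}

rows = read(csv_serde)
print(rows[0]["event"], rows[1]["count"])  # click 7
```

A different SerDe passed to read() would expose different columns over the same stored bytes, which is exactly the flexibility the slide contrasts with schema-on-write.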

Hadoop and Relational Databases: you need both. Relational databases are best used for canonical structured data, interactive OLAP analytics (<1 sec), multistep ACID transactions, and 100% SQL compliance. Hadoop is best used for structured or unstructured data (flexibility), exploratory analysis (1 sec to 5 min), scalability of storage and compute, and complex data processing.

Hadoop in the Enterprise Infrastructure Part 2

Cloudera's Vision for Hadoop. Legacy: multiple platforms (complex, fragmented, costly). New: a single data platform (simplified, unified, efficient).

Hadoop in the Enterprise. [Diagram: the Hadoop platform sits alongside the enterprise data warehouse and data-serving systems, fed by logs, files, web data, and relational databases, and serving web/mobile applications and ultimately customers. Users span operators, data architects, and engineers (management tools, metadata/ETL tools, developer tools) through data scientists, analysts, and business users (data modeling, BI/analytics, enterprise reporting).]

What Is Hadoop? Part 3

The Origins of Hadoop. [Timeline figure; source: Credit Suisse.]

Core Hadoop: The Basics

What is Apache Hadoop? Apache Hadoop is an open-source platform for data storage and processing that is scalable, fault tolerant, and distributed. Its core system components are the Hadoop Distributed File System (HDFS), providing self-healing, high-bandwidth clustered storage, and MapReduce/YARN (MRv2), a distributed computing framework. Hadoop works with every type of data, brings computation to storage, and changes the economics of data management.

Hadoop Components. Hadoop consists of two core components: the Hadoop Distributed File System (HDFS) and a distributed processing framework (MapReduce etc.). There are many other projects based around core Hadoop, often referred to as the Hadoop Ecosystem: Pig, Hive, HBase, Flume, Oozie, Sqoop, etc. (more on this later). A set of machines running HDFS and MapReduce is known as a Hadoop cluster, and individual machines are known as nodes. A cluster can have as few as one node or as many as several thousand; more nodes mean more capacity and better performance.

Core Hadoop Concepts. Data is spread among machines in advance, and computation happens where the data is stored wherever possible. Data is replicated multiple times on the system for increased availability and reliability. Nodes talk to each other as little as possible (a shared-nothing architecture); the system, rather than developers or applications, handles communication between nodes. Applications are written in high-level code: developers do not worry about network programming, temporal dependencies, and the like, and applications can be written in virtually any programming language.

Hadoop Components: HDFS. HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster. Data files are split into blocks and distributed across multiple nodes in the cluster, and each block is replicated multiple times (the default is three). Replicas are stored on different nodes, which ensures both reliability and availability.
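The splitting and replication just described can be sketched as a toy simulation in Python (hypothetical node names and a tiny block size for illustration; real HDFS placement is also rack-aware, which this ignores): a file is cut into fixed-size blocks, and each block's replicas land on distinct nodes.

```python
import itertools

BLOCK_SIZE = 4    # bytes, tiny for illustration; HDFS defaults are 64/128 MB
REPLICATION = 3   # the HDFS default replication factor
nodes = ["node1", "node2", "node3", "node4", "node5"]

def split_into_blocks(data):
    """Cut a file into fixed-size blocks, as HDFS does on write."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def place_replicas(num_blocks):
    """Assign each block REPLICATION distinct nodes (round-robin toy policy)."""
    ring = itertools.cycle(nodes)
    placement = {}
    for b in range(num_blocks):
        placement[b] = [next(ring) for _ in range(REPLICATION)]
    return placement

blocks = split_into_blocks(b"the cat sat on the mat")
placement = place_replicas(len(blocks))
for replicas in placement.values():
    # Replicas of one block must sit on distinct nodes for availability.
    assert len(set(replicas)) == REPLICATION
print(len(blocks), placement[0])  # 6 ['node1', 'node2', 'node3']
```

Losing any single node leaves at least two copies of every block, which is why a failed node can be "self-healed" by re-replicating from the survivors.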

HDFS Basic Concepts. HDFS is a filesystem written in Java, based on Google's GFS. It sits on top of a native filesystem (ext3, xfs, etc.) and provides redundant storage for massive amounts of data using cheap, unreliable computers.

HDFS Basic Concepts (cont'd). HDFS performs best with a modest number of large files: millions, rather than billions, of files, each typically 100 MB or more. Files in HDFS are write-once; no random writes to files are allowed. HDFS is optimized for large, streaming reads of files rather than random reads.

Getting Data in and out of HDFS. The Hadoop API (hadoop fs) works with data in HDFS directly. Ecosystem projects help as well: Flume collects data from log-generating sources (e.g., websites, syslogs, STDOUT), and Sqoop extracts and/or inserts data between HDFS and an RDBMS. Business intelligence tools can also connect.

Hadoop Components: MapReduce. MapReduce is the system used to process data in the Hadoop cluster. It consists of two phases: Map, and then Reduce. Each Map task operates on a discrete portion of the overall dataset, typically one HDFS data block. After all Maps are complete, the MapReduce system distributes the intermediate data to nodes which perform the Reduce phase. Much more on this later!

Features of MapReduce. MapReduce provides automatic parallelization and distribution, fault tolerance, status and monitoring tools, and a clean abstraction for programmers. MapReduce programs are usually written in Java. MapReduce abstracts all the housekeeping away from the developer, who can concentrate simply on writing the Map and Reduce functions.

How MapReduce Works: Word Count Example. Mapper input: "The cat sat on the mat" and "The aardvark sat on the sofa". Mapping emits one (word, 1) pair per word: (the, 1), (cat, 1), (sat, 1), (on, 1), (the, 1), (mat, 1), (the, 1), (aardvark, 1), (sat, 1), (on, 1), (the, 1), (sofa, 1). Shuffling groups the values by key: aardvark [1], cat [1], mat [1], on [1, 1], sat [1, 1], sofa [1], the [1, 1, 1, 1]. Reducing sums each group, giving the final result: (aardvark, 1), (cat, 1), (mat, 1), (on, 2), (sat, 2), (sofa, 1), (the, 4).
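The three phases above can be simulated in plain Python (a sketch of the dataflow only, not Hadoop's actual Java API), reproducing the slide's example end to end:

```python
from collections import defaultdict

def mapper(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce: sum the counts collected for one word."""
    return (key, sum(values))

lines = ["The cat sat on the mat", "The aardvark sat on the sofa"]
intermediate = [pair for line in lines for pair in mapper(line)]
result = dict(reducer(k, v) for k, v in shuffle(intermediate).items())
print(sorted(result.items()))
# [('aardvark', 1), ('cat', 1), ('mat', 1), ('on', 2), ('sat', 2), ('sofa', 1), ('the', 4)]
```

In a real cluster the mapper calls run in parallel near their HDFS blocks, the shuffle moves intermediate data across the network, and the reducer calls run in parallel on the receiving nodes; the logic per phase, however, is exactly this simple.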

The Hadoop Ecosystem: Making Hadoop Function as Part of an Enterprise Infrastructure

Introduction. The term Hadoop is taken to mean the combination of HDFS and MapReduce. There are numerous other projects surrounding Hadoop, typically referred to as the Hadoop Ecosystem. Most are incorporated into Cloudera's Distribution Including Apache Hadoop (CDH), and all use either HDFS, MapReduce, or both.

Preview of CDH (100% open source). [Component diagram:] User interface: Hue. Workflow management: Oozie. Metadata: the Hive metastore. Integration: Sqoop, Flume, FUSE-DFS, WebHDFS/HttpFS (REST), and ODBC/JDBC access. Batch processing: Hive and Pig over the batch compute engines MapReduce and MapReduce2. Resource management and coordination: YARN and ZooKeeper. Libraries: Mahout and DataFu. Real-time access and compute: Impala, Search, and HBase. Cloud: Whirr. Storage: HDFS (the Hadoop DFS) and HBase.

Data Lifecycle. [Diagram: data is stored (HDFS, Sqoop, Flume), explored and analyzed (Impala, Hive, Pig), and processed and served (MapReduce, Impala, Hive, Pig, Mahout, HBase), reaching business analysts, business users, and customers.]

Beyond Batch: Real-Time Query for Hadoop with Cloudera Impala. Speed to insight: get answers as fast as you can ask questions, with interactive analytics directly on source data and no jumping between data silos. Cost savings: reduce duplicate storage alongside the EDW, reduce data movement for analysis, and leverage existing tools and employee skills. Full-fidelity analysis: ask questions of all your data, with no loss of fidelity from aggregation or from conforming to fixed schemas. Discoverability: a single metadata store from source to analysis. Impala supports the familiar SQL language and existing BI tools, enabling more users to interact with data.

Use-Cases Part 4

Ask Bigger Questions: How do we prevent mobile device returns? A leading manufacturer of mobile devices gleans new insights & delivers instant software bug fixes.

Cloudera complements the data warehouse. The challenge: a fast-growing Oracle DW was difficult and expensive to maintain at scale, and massive volumes of unstructured data had to be ingested very quickly. The mobile technology leader identified a hidden software bug causing a sudden spike in returns. The solution: Cloudera Enterprise with RTD provides data processing, storage, and analysis on 25 years of data, integrated with Oracle for a closed-loop analytical process; device data is collected every minute, loading 1 TB/day into Cloudera. Read the case study: http://www.cloudera.com/content/cloudera/en/resources/library/casestudy/drivinginnovation-in-mobile-devices-with-cloudera-and-oracle.html

Ask Bigger Questions: How do we feed the world? A Fortune 500 company specializing in agriculture and genomics can automate data-driven R&D decisions to reduce time to market from years to months.

Fortune 500 agriculture company, the situation. Opportunity: more than 1,000 research scientists are building product development algorithms, yet time to market for new products is 5-10 years. Barriers: algorithms are built in silos, a data processing bottleneck slows development, and the R&D data pipeline for each product involves a series of questions and decisions.

Fortune 500 agriculture company, the solution. Cloudera Enterprise Core + RTD, RTQ: a PB-scale platform giving a consolidated view of all R&D data, integrated with Oracle Exadata and Lucene/Solr, with spatial awareness and visualization. Hadoop components used: Avro, HDFS, HBase, Hive, Hue, MapReduce, Oozie, Pig, Sqoop.

Fortune 500 agriculture company, the results. Benefits: PB scale, increased usability (scientists directly access Hadoop), flexibility, and a consolidated view of all data within R&D. Measured impact: data-driven decisions in the R&D pipeline are automated, reducing time to market for new products. The pipeline answers questions such as: Which traits do we want to integrate into this germplasm? Which male and female plants should be brought together to create a child plant? Where should the child plant be tested?
