Apache Hadoop 3. Balazs Gaspar Sales Engineer CEE & CIS Cloudera, Inc. All rights reserved.

Apache Hadoop 3
Balazs Gaspar, Sales Engineer CEE & CIS
balazs@cloudera.com

We believe data can make what is impossible today, possible tomorrow

We empower people to transform complex data into clear and actionable insights
- Drive customer insights
- Connect products & services (IoT)
- Protect business

We deliver the modern platform for machine learning and advanced analytics
- Runs anywhere: cloud, multi-cloud, on-prem
- Scalable: elastic, cost-effective, lower TCO
- Enterprise grade: secure, performant, compliant

Cloudera Enterprise
The modern platform for machine learning and analytics optimized for the cloud
Diagram labels: Core services, Data science, Analytic database, Operational database, Data engineering, Extensible services, Security, Governance, Workload management, Ingest & replication, Data catalog, Storage services (S3, ADLS, HDFS, Kudu)

The Enterprise Platform for Data Science and Machine Learning
- 440x more data
- 30B connected devices
- The data is now here
- Modern platform for machine learning and advanced analytics
- Cloudera first to integrate Spark
- 500 customers run Spark on
- Leading adoption among enterprises

Data Science at the Enterprise
- Access: For sensitive data, secure clusters are difficult to access, and IT typically doesn't want random packages installed on a secure cluster. Popular open source tools don't easily connect to these environments or always support Hadoop data formats.
- Scale: Laptops rarely have capacity for medium, let alone big, data, which leads to a lot of sampling. Popular frameworks don't easily parallelize on a cluster, and code typically has to be rewritten for production.
- Developer Experience: Notebooks, while awesome, don't easily support virtual environments and dependency management, especially for teams. This makes sharing and reproducibility hard. Notebooks are also challenging to put into production.

Introducing Cloudera Data Science Workbench
Self-service data science for the enterprise
Accelerates data science from development to production with:
- Secure self-service environments for data scientists to work against Cloudera clusters
- Support for Python, R, and Scala, plus project dependency isolation for multiple library versions
- Workflow automation, version control, collaboration and sharing

Hadoop 3 in a Nutshell

An Abbreviated History of Hadoop Releases

Date         Release   Major Notes
2007-11-04   0.14.1    First release at the ASF
2011-12-27   1.0.0     Security, HBase support
2012-05-23   2.0.0     YARN, NameNode HA, wire compat
2014-11-18   2.6.0     HDFS encryption, rolling upgrade, node labels
2015-04-21   2.7.0     Most recent production-quality release line
2017-03-22   2.8.0     Current release line

Motivation for Hadoop 3
- Upgrade minimum Java version to Java 8
  - Java 7 end-of-life in April 2015
  - Many Java libraries now only support Java 8
- HDFS erasure coding
  - Major feature that refactored core pieces of HDFS
  - Too big to backport to 2.x
- Classpath isolation
  - Potentially impacts all clients
- Other miscellaneous incompatible bugfixes and improvements
  - Hadoop 2.x was branched in 2011
  - 6 years of changes waiting for 3.0

Hadoop 3 Status and Release Plan
- A series of alphas and betas leading up to GA
- Alpha4: feature complete
- Beta1: compatibility freeze
- GA by the end of the year

Release        Date
3.0.0-alpha1   2016-09-03
3.0.0-alpha2   2017-01-25
3.0.0-alpha3   2017-05-16
3.0.0-alpha4   2017-07-07
3.0.0-beta1    2017 Q3
...
3.0.0 GA       2017 Q4

https://cwiki.apache.org/confluence/display/hadoop/hadoop+3.0.0+release

HDFS & Hadoop Features

3x replication vs. Erasure coding
/foo.csv - 3 block file: b1 b2 b3
- With 3x replication, each of the 3 blocks is stored as 3 replicas: 3 x 3 = 9 total replicas
- 6 extra replicas / 3 blocks = 200% overhead!

Erasure coding (HDFS-7285)
- Motivation: improve storage efficiency of HDFS
  - ~2x the storage efficiency compared to 3x replication
  - Reduction of overhead from 200% to 40%
- Uses Reed-Solomon(k,m) erasure codes instead of replication
  - Support for multiple erasure coding policies
  - RS(3,2), RS(6,3), RS(10,4)
- Can improve data durability
  - RS(6,3) can tolerate 3 failures
  - RS(10,4) can tolerate 4 failures
- Missing blocks reconstructed from remaining blocks
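The overhead and durability figures on this slide follow from simple arithmetic. The sketch below reproduces them; the function names are illustrative and are not part of any Hadoop API.

```python
def replication_overhead(replicas):
    """Extra storage for n-way replication, as a fraction of the data size.

    n-way replication keeps (n - 1) additional copies of every block.
    """
    return float(replicas - 1)


def rs_overhead(k, m):
    """Extra storage for RS(k, m): m parity blocks per k data blocks."""
    return m / k


if __name__ == "__main__":
    print(f"3x replication: {replication_overhead(3):.0%} overhead")
    for k, m in [(3, 2), (6, 3), (10, 4)]:
        # An RS(k, m) group survives the loss of any m of its k + m blocks.
        print(f"RS({k},{m}): {rs_overhead(k, m):.0%} overhead, tolerates {m} failures")
```

Run as a script, this prints 200% for 3x replication and 67%, 50%, and 40% for the three Reed-Solomon policies, matching the move from 200% to 40% overhead quoted above.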

3x replication vs. Erasure coding
/foo.csv - 3 block file: b1 b2 b3 p1 p2
- 3 data blocks + 2 parity blocks = 5 replicas
- 2 / 3 = 67% overhead!
/bigfoo.csv - 10 block file: b1 b2 ... b10 p1 ... p4
- 10 data blocks + 4 parity blocks = 14 replicas
- 4 / 10 = 40% overhead!

EC Reconstruction
/foo.csv - 3 block file: b1 b2 b3 p1 p2, Reed-Solomon (3,2)
- b3 is lost (X)
- Read 3 remaining blocks
- Run RS to recover b3
- New copy of b3 recovered
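The idea of rebuilding a lost block from the survivors can be illustrated with a much simpler code than Reed-Solomon. The sketch below uses a single XOR parity block, which tolerates only one lost block; Reed-Solomon generalizes this to m lost blocks using finite-field arithmetic. It is a simplified stand-in and not the HDFS implementation, and the block contents are made up.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks of a hypothetical file.
b1, b2, b3 = b"aaaa", b"bbbb", b"cccc"

# Encoding: the parity block is the XOR of all data blocks.
p1 = xor_blocks([b1, b2, b3])

# b3 is lost. XOR-ing the surviving data blocks with the parity block
# cancels out b1 and b2 and leaves b3.
recovered = xor_blocks([b1, b2, p1])
assert recovered == b3
```

As on the slide, recovery has to read every surviving block of the group, which is why reconstruction is network-intensive.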

EC implications
- File data is striped across multiple nodes and racks
  - Reads and writes are remote and cross-rack
  - Reconstruction is network-intensive, reads k blocks cross-rack
- Important to use Intel's optimized ISA-L for performance
  - 1+ GB/s encode/decode speed, much faster than Java implementation
  - CPU is no longer a bottleneck
- Need to combine data into larger files to avoid an explosion in replica count
  - Bad: 1x1GB file -> RS(10,4) -> 14x100MB EC blocks (4.6x # replicas)
  - Good: 10x1GB file -> RS(10,4) -> 14x1GB EC blocks (0.46x # replicas)
- Works best for archival / cold data use cases
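The "Bad" and "Good" replica counts can be reproduced with a small calculation. This sketch adopts the slide's simplification that a file is measured in 1 GB units, each stored as one block under replication, and that every erasure-coded group is written as k + m blocks even when the file is smaller than k units. The function names are illustrative.

```python
import math


def replicated_block_count(file_units, replicas=3):
    """Blocks stored under n-way replication (one block per 1 GB unit)."""
    return file_units * replicas


def ec_block_count(file_units, k=10, m=4):
    """Blocks stored under RS(k, m).

    Assumption: each group of up to k units is striped into k data blocks
    plus m parity blocks, so even a tiny file produces k + m blocks.
    """
    groups = math.ceil(file_units / k)
    return groups * (k + m)


def replica_ratio(file_units):
    """EC block count relative to the 3x-replicated block count."""
    return ec_block_count(file_units) / replicated_block_count(file_units)


if __name__ == "__main__":
    print(f"1 GB file:  {replica_ratio(1):.2f}x the replicas")
    print(f"10 GB file: {replica_ratio(10):.2f}x the replicas")
```

The ratios are 14/3 and 14/30, which correspond to the roughly 4.6x and 0.46x quoted on the slide.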

EC performance (two chart slides; the charts are not reproduced in this transcription)

Other new HDFS features
- Classpath isolation: shaded HDFS clients
- Shell script rewrite
- Support for multiple Standby NameNodes
- Intra-DataNode balancer
- Support for Microsoft Azure Data Lake and Aliyun OSS (Alibaba Cloud)
- Move default ports out of the ephemeral range

YARN Features

Application Timeline Service v2 (diagram slides)
- Before: jobs, Resource Manager, Node Manager, HDFS, Job History Server, Spark History Server
- ATSv2 components: Resource Manager, Node Manager, Application Master, Timeline Collector, Timeline Reader, HBase
- After: jobs, Resource Manager, Application Timeline Service, HBase

Application Timeline Service v2
- Store for application and system events and data
  - Distributed
  - Scalable
  - Structured data model
- Updated in real time
  - Application status
  - Application metrics
  - System metrics
- Fed by resource manager, node manager, and application masters
- REST API

Application Timeline Service v2: Flows

New YARN UI
- Rich client application
  - Built on Node.js and Ember
- Improved visibility into cluster usage
  - Memory, CPU
  - By queues and applications
  - Sunburst graphs for hierarchical queues
  - NodeManager heatmap
- ATSv2 integration
  - Plot container start/stop events
  - Easy to capture delays in app execution

New YARN UI: Cluster Overview

New YARN UI: Queues

Resource Types
- Before Hadoop 3, memory and CPU were the only managed resources
- Resource Types allow adding new managed resources
  - Countable resources: GPUs, disks, etc.
- Resource profiles
  - Conceptually similar to EC2 instance types
  - Capture complex resource requests
- Existing virtual CPU core and memory resources work as before
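One way to picture resource types and profiles is as named vectors of countable resources. The sketch below is only a conceptual model: the profile names, resource names, and sizes are hypothetical, and none of this is the YARN API.

```python
# Hypothetical profiles, in the spirit of EC2 instance types.
PROFILES = {
    "small": {"memory_mb": 2048, "vcores": 1},
    "gpu_large": {"memory_mb": 16384, "vcores": 8, "gpus": 2},
}


def fits(request, available):
    """A request fits if every requested resource type is sufficiently available.

    Resource types missing on the node count as zero, so a GPU request
    never fits on a node that does not advertise GPUs.
    """
    return all(available.get(name, 0) >= amount for name, amount in request.items())


def allocate(request, available):
    """Return the node's remaining resources after granting the request."""
    if not fits(request, available):
        raise ValueError("request does not fit on this node")
    return {name: amount - request.get(name, 0) for name, amount in available.items()}


# A hypothetical node with memory, CPU, and GPUs.
node = {"memory_mb": 32768, "vcores": 16, "gpus": 2}
remaining = allocate(PROFILES["gpu_large"], node)
assert remaining == {"memory_mb": 16384, "vcores": 8, "gpus": 0}
assert fits(PROFILES["small"], remaining)
assert not fits(PROFILES["gpu_large"], remaining)
```

Requests that name only memory and vcores behave exactly as they would without the extra resource types, which mirrors the last bullet above.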

YARN Federation (diagram slide)
Diagram labels: Admin, Policy, Router, two Resource Managers, eight Node Managers

YARN Federation
- YARN scalability
  - Twitter runs a 10k node cluster with fair scheduler
  - Yahoo! runs 4k node cluster with capacity scheduler
- Federation
  - Restrict users to sub-clusters based on policy
  - Scalability to 100k nodes and beyond
  - Heterogeneous scheduling policies

Opportunistic Containers
- Scheduler's job is to keep all resources busy
- Scheduling gaps
  - Nothing to run
  - Resource contention
  - Resource reservations
- Opportunistic containers fill those gaps
  - Requested explicitly
  - Dedicated scheduler
  - Queued at the node managers
  - Scheduled locally when resources are available
  - Preempted when guaranteed containers need to run
- Coming in 2.9 and 3.0
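The queueing and preemption behaviour described above can be sketched as a toy model of a single node manager. This is a simplified illustration, not YARN code: the class names are made up, resources are reduced to abstract "slots", and putting a preempted container back in the local queue is an assumption of the sketch.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Container:
    name: str
    slots: int
    guaranteed: bool


@dataclass
class NodeManager:
    capacity: int
    running: list = field(default_factory=list)
    queue: deque = field(default_factory=deque)  # opportunistic containers waiting locally

    def free(self):
        return self.capacity - sum(c.slots for c in self.running)

    def submit(self, container):
        if container.guaranteed:
            # Preempt opportunistic containers until the guaranteed one fits.
            while self.free() < container.slots:
                victim = next((c for c in reversed(self.running) if not c.guaranteed), None)
                if victim is None:
                    raise RuntimeError("not enough capacity for guaranteed container")
                self.running.remove(victim)
                self.queue.appendleft(victim)  # assumption: re-queue instead of kill
            self.running.append(container)
        else:
            self.queue.append(container)
            self.schedule_queued()

    def finish(self, name):
        self.running = [c for c in self.running if c.name != name]
        self.schedule_queued()

    def schedule_queued(self):
        # Start queued opportunistic containers, in order, while they fit.
        while self.queue and self.queue[0].slots <= self.free():
            self.running.append(self.queue.popleft())


nm = NodeManager(capacity=4)
nm.submit(Container("g1", 2, guaranteed=True))
nm.submit(Container("o1", 2, guaranteed=False))  # fills the idle gap
nm.submit(Container("g2", 2, guaranteed=True))   # preempts o1
assert [c.name for c in nm.running] == ["g1", "g2"]
nm.finish("g1")                                  # o1 is scheduled locally again
assert [c.name for c in nm.running] == ["g2", "o1"]
```

The point of the model is that opportunistic work only ever uses capacity that guaranteed containers are not using at that moment.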

Oversubscription
- Resource utilization is typically low in most clusters (20-50%)
  - Provision for peak usage
  - Usage < Allocation
  - Mean Usage = ½ Peak Usage

Other new YARN features
- New YARN UI
  - Node.js application, improved visibility into cluster usage
- New resource types beyond CPU and memory
- YARN Federation
  - Improved scalability to 100k node clusters and beyond
- Opportunistic containers / Oversubscription

Summary: What's new in Hadoop 3.0?
- Storage Optimization
  - HDFS: Erasure codes
- Improved Visibility into Cluster Operations
  - YARN: ATSv2
  - YARN: New UI
- Scalability & Multi-tenancy
  - YARN: Federation
- Improved Utilization
  - YARN: Opportunistic Containers
  - YARN: Oversubscription
- Refactor Base
  - Lots of Trunk content
  - JDK8 and newer dependent libraries

Thank you
Balazs Gaspar
balazs@cloudera.com

Compatibility and Testing

Compatibility
- Strong feedback from large users on the need for compatibility
- Preserves wire-compatibility with Hadoop 2 clients
  - Impossible to coordinate upgrading off-cluster Hadoop clients
- Will support rolling upgrade from Hadoop 2 to Hadoop 3
  - Can't take downtime to upgrade a business-critical cluster
- Not fully preserving API compatibility
  - Dependency version bumps
  - Removal of deprecated APIs and tools
  - Shell script rewrite, rework of Hadoop tools scripts
  - Incompatible bug fixes

Conclusion
- Expect Hadoop 3.0.0 GA by the end of the year
- Shiny new features
  - HDFS erasure coding
  - Client classpath isolation
  - YARN ATSv2
  - YARN Federation
  - Opportunistic containers and oversubscription
- Great time to get involved in testing and validation