Apache Hadoop.Next What it takes and what it means

Similar documents
Namenode HA. Sanjay Radia - Hortonworks

HDFS What is New and Futures

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

HDFS Federation. Sanjay Radia Founder and Hortonworks. Page 1

Big Data Hadoop Stack

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

The Evolving Apache Hadoop Ecosystem What it means for Storage Industry

Hadoop & Big Data Analytics Complete Practical & Real-time Training

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

docs.hortonworks.com

A brief history on Hadoop

Automatic-Hot HA for HDFS NameNode Konstantin V Shvachko Ari Flink Timothy Coulter EBay Cisco Aisle Five. November 11, 2011

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

Hortonworks Data Platform

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data

Hadoop. Introduction / Overview

Distributed Systems 16. Distributed File Systems II

MapR Enterprise Hadoop

Introduction to BigData, Hadoop:-

Automation of Rolling Upgrade for Hadoop Cluster without Data Loss and Job Failures. Hiroshi Yamaguchi & Hiroyuki Adachi

HDFS Design Principles

Hortonworks and The Internet of Things

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia,

Scalable Web Programming. CS193S - Jan Jannink - 2/25/10

Introduction to MapReduce. Instructor: Dr. Weikuan Yu Computer Sci. & Software Eng.

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture

April Copyright 2013 Cloudera Inc. All rights reserved.

A BigData Tour HDFS, Ceph and MapReduce

Hadoop An Overview. - Socrates CCDH

Introduction To YARN. Adam Kawa, Spotify The 9 Meeting of Warsaw Hadoop User Group 2/23/13

Apache Hadoop Goes Realtime at Facebook. Himanshu Sharma

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2016

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017

Chapter 5. The MapReduce Programming Model and Implementation

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2017

Innovatus Technologies

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS

Stinger Initiative. Making Hive 100X Faster. Page 1. Hortonworks Inc. 2013

Hadoop. copyright 2011 Trainologic LTD

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Lecture 11 Hadoop & Spark

Welcome to. uweseiler

Hadoop-PR Hortonworks Certified Apache Hadoop 2.0 Developer (Pig and Hive Developer)

CS /30/17. Paul Krzyzanowski 1. Google Chubby ( Apache Zookeeper) Distributed Systems. Chubby. Chubby Deployment.

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here

Timeline Dec 2004: Dean/Ghemawat (Google) MapReduce paper 2005: Doug Cutting and Mike Cafarella (Yahoo) create Hadoop, at first only to extend Nutch (

Big Data and Cloud Computing

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam

A Glimpse of the Hadoop Echosystem

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam

HCatalog. Table Management for Hadoop. Alan F. Page 1

HDInsight > Hadoop. October 12, 2017

Certified Big Data and Hadoop Course Curriculum

Big Data Hadoop Course Content

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E.

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

Configuring and Deploying Hadoop Cluster Deployment Templates

MixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp

Big Data Analytics using Apache Hadoop and Spark with Scala

18-hdfs-gfs.txt Thu Nov 01 09:53: Notes on Parallel File Systems: HDFS & GFS , Fall 2012 Carnegie Mellon University Randal E.

Introduction into Big Data analytics Lecture 3 Hadoop ecosystem. Janusz Szwabiński

The amount of data increases every day Some numbers ( 2012):

2/26/2017. The amount of data increases every day Some numbers ( 2012):

BigData and Map Reduce VITMAC03

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

The State of Apache HBase. Michael Stack

Hortonworks University. Education Catalog 2018 Q1

MapReduce. U of Toronto, 2014

HBASE + HUE THE UI FOR APACHE HADOOP

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

April Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model.

Getting Started with Hadoop and BigInsights

Apache Hadoop 3. Balazs Gaspar Sales Engineer CEE & CIS Cloudera, Inc. All rights reserved.

Actual4Dumps. Provide you with the latest actual exam dumps, and help you succeed

ACTIAN PRODUCTS by Platform - Vector, Vector in Hadoop as of October 18, 2017

Apache BookKeeper. A High Performance and Low Latency Storage Service

Gain Insights From Unstructured Data Using Pivotal HD. Copyright 2013 EMC Corporation. All rights reserved.

Big Data and Hadoop. Course Curriculum: Your 10 Module Learning Plan. About Edureka

HDP Security Overview

HDP Security Overview

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism

VMware vsphere Big Data Extensions Administrator's and User's Guide

Hadoop Overview. Lars George Director EMEA Services

Data Analysis Using MapReduce in Hadoop Environment

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing

Hadoop Online Training

Hadoop/MapReduce Computing Paradigm

HADOOP FRAMEWORK FOR BIG DATA

L5-6:Runtime Platforms Hadoop and HDFS

Big Data with Hadoop Ecosystem

Certified Big Data Hadoop and Spark Scala Course Curriculum

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

DATA SCIENCE USING SPARK: AN INTRODUCTION

Transcription:

Apache Hadoop.Next What it takes and what it means Arun C. Murthy Founder & Architect, Hortonworks @acmurthy (@hortonworks) Page 1

Hello! I m Arun Founder/Architect at Hortonworks Inc. Lead, Map-Reduce Formerly, Architect Hadoop MapReduce, Yahoo Responsible for running Hadoop MR as a service for all of Yahoo (50k nodes footprint) Yes, I took the 3am calls! Apache Hadoop, ASF VP, Apache Hadoop, ASF (Chair of Apache Hadoop PMC) Long-term Committer/PMC member (full time ~6 years) Release Manager - hadoop-0.23 (i.e. Hadoop.Next) Page 2

Yahoo!, Apache Hadoop & Hortonworks http://www.wired.com/wiredenterprise/2011/10/how-yahoo-spawned-hadoop Yahoo! embraced Apache Hadoop, an open source platform, to crunch epic amounts of data using an army of dirt-cheap servers 2006 Hadoop at Yahoo! 40K+ Servers 170PB Storage 5M+ Monthly Jobs 1000+ Active Users 2011 Yahoo! spun off 22+ engineers into Hortonworks, a company focused on advancing open source Apache Hadoop for the broader market Page 3

Hortonworks Data Platform Fully Supported Integrated Platform Challenge: Integrate, manage, and support changes across a wide range of open source projects that power the Hadoop platform; each with their own release schedules, versions, & dependencies. Time intensive, Complex, Expensive Solution: Hortonworks Data Platform Integrated certified platform distributions Extensive Q/A process Industry-leading Support with clear service levels for updates and patches Continuity via multi-year Support and Maintenance Policy Hadoop Core Pig Zookeeper Hive = New Version HCatalog HBase Page 4

HDP1: Hadoop.Now Based on Hadoop 1.0 (formerly known as 0.20.205) Merges various Hadoop 0.20.* branches (security, HBase support,..) Goal: ensure patching and back-porting happen on one stable line Highlights: First Apache line that supports Security, HBase, and WebHDFS Common table and schema management via HCatalog (based on Hive) Open APIs for metadata access, data movement, app job management Consumable standard Hadoop stack: Hadoop 0.20.205* (HDFS, MapReduce) Pig 0.9.* data flow programming language Hive 0.8.* SQL-like language HBase 0.92.* bigtable datastore HCatalog 0.3.* common table and schema management ZooKeeper 3.4.* coordinator Hortonworks Inc. 2011 Page 5

HDP2: Hadoop.Next Based on Hadoop 0.23.* Include latest stable standard Hadoop components Goal: ensure next-generation architecture happens on one stable line Highlights: HDFS Federation Clear separation of Namespace and Block Storage Improved scalability and isolation HDFS HA Next Generation MapReduce architecture New architecture enables other application types to plug in Ex. Streaming, Graph, Bulk Sync Processing, MPI (Message Passing Interface) New Resource Manager enhances enterprise viability Scalability, HA, FT, and SPoF Client/cluster version compatibility Performance Hortonworks Inc. 2011 Page 6

Releases so far Started for Nutch Yahoo picked it up in early 2006, hired Doug Cutting Initially, we did monthly releases (0.1, 0.2 ) Quarterly after hadoop-0.15 until hadoop-0.20 in 04/2009 hadoop-0.20 is still the basis of all current, stable, Hadoop distributions Apache Hadoop 1.0.0 CDH3.* HDP1.* hadoop-0.20.203 (security) 05/2011 hadoop-1.0.0 (security + append + webhdfs) 12/2011 hadoop-0.1.0 hadoop-0.10.0 hadoop-0.20.0 hadoop-0.23 hadoop-1.0.0 2006 2009 2012 Page 7

hadoop-0.23 First stable release off Apache Hadoop trunk in over 30 months Currently alpha quality (hadoop-0.23.0) Significant major features HDFS Federation NextGen Map-Reduce (YARN) HDFS HA Wire protocol compatibility Performance Page 8

HDFS - Federation Significant scaling Separation of Namespace mgmt and Block mgmt Page 9

MapReduce - YARN NextGen Hadoop Data Processing Framework Support MR and other paradigms High Availability Node Manager Container App Mstr Client Client Resource Manager Node Manager App Mstr Container MapReduce Status Job Submission Node Status Resource Request Container Node Manager Container Page 10

HDFS NameNode HA Nearly code-complete Automatic failover Multiple-options for failover (shared disk, Linux HA, Zookeeper) Heartbeat ZK ZK ZK Heartbeat FailoverController Active FailoverController Standby Monitor Health of NN. OS, HW Cmds NN Active Shared NN state with single writer (fencing) NN Standby Monitor Health of NN. OS, HW Block Reports to Active & Standby DN fencing: Update cmds from one D N D N D N Page 11

Performance 2x+ across the board HDFS read/write CRC32 fadvise Shortcut for local reads MapReduce Unlock lots of improvements from Terasort record (Owen/Arun, 2009) Shuffle 30%+ Small Jobs Uber AM Page 12

More Compatible Wire Protocols On the way to rolling upgrades HDFS Write pipeline improvements for HBase Append/flush etc. Build - Full Mavenization EditLogs re-write https://issues.apache.org/jira/browse/hdfs-1073 Lots more Page 13

Deployment goals Clusters of 6,000 machines Each machine with 16+ cores, 48G/96G RAM, 24TB/36TB disks 200+ PB (raw) per cluster 100,000+ concurrent tasks 10,000 concurrent jobs Yahoo: 50,000+ machines Page 14

What does it take to get there? Testing, *lots* of it Benchmarks (Every release should be at least as good as the last one) Integration testing HBase Pig Hive Oozie Deployment discipline Page 15

Testing Why is it hard? Map-Reduce is, effectively, very wide api Add Streaming Add Pipes Oh, Pig/Hive etc. etc. Functional tests Nightly Over1000 functional tests for Map-Reduce alone Several hundred for Pig/Hive etc. Scale tests Simulation Longevity tests Stress tests Page 16

Benchmarks Benchmark every part of the HDFS & MR pipeline HDFS read/write throughput NN operations Scan, Shuffle, Sort GridMixv3 Run production traces in test clusters Thousands of jobs Stress mode v/s Replay mode Page 17

Integration Testing Several projects in the ecosystem HBase Pig Hive Cycle Oozie Functional Scale Rinse, repeat Page 18

Deployment Alpha/Test (early UAT) Started in Dec, 2011 - ongoing Alpha Beta Small scale (500-800 nodes) Feb, 2012 Majority of users ~1000 nodes per cluster, > 2,000 nodes in all Misnomer: 100s of PB, Millions of user applications Significantly wide variety of applications and load 4000+ nodes per cluster, > 15000 nodes in all Late Q1, 2012 Production Well, it s production Mid-to-late Q2 2012 Page 19

HDP2: Hadoop.Next Based on Hadoop 0.23.* Include latest stable standard Hadoop components Goal: ensure next-generation architecture happens on one stable line Highlights: HDFS Federation HDFS HA Next Generation MapReduce architecture Performance Apache Release hadoop-0.23.0 (pre-alpha) http://hadoop.apache.org/common/releases.html http://hadoop.apache.org/common/docs/r0.23.0/ hadoop-0.23.1 (alpha) Jan, 2012 Hortonworks Inc. 2011 Page 20

How Hortonworks Can Help Training and Certification www.hortonworks.com/training/ Hortonworks Data Platform and Support www.hortonworks.com/hortonworksdataplatform/ www.hortonworks.com/technology/techpreview/ www.hortonworks.com/support/ Educational Webinars www.hortonworks.com/webinars/ Page 21

Questions? hadoop-0.23.0 (alpha release): http://hadoop.apache.org/common/releases.html Release Documentation: http://hadoop.apache.org/common/docs/r0.23.0/ Thank You. @acmurthy Page 22