Apache Hadoop 3
Balazs Gaspar, Sales Engineer CEE & CIS
balazs@cloudera.com
We believe data can make what is impossible today, possible tomorrow

We empower people to transform complex data into clear and actionable insights:
- Drive customer insights
- Connect products & services (IoT)
- Protect business

We deliver the modern platform for machine learning and advanced analytics:
- Runs anywhere: cloud, multi-cloud, on-premises
- Scalable: elastic, cost-effective, lower TCO
- Enterprise grade: secure, performant, compliant
Cloudera Enterprise: the modern platform for machine learning and analytics, optimized for the cloud
- Core services: Data Science, Analytic Database, Operational Database, Data Engineering
- Extensible services: security, governance, workload management, ingest & replication, data catalog
- Storage services: S3, ADLS, HDFS, Kudu
The Enterprise Platform for Data Science and Machine Learning
- The data is now here: 440x more data, 30B connected devices
- Modern platform for machine learning and advanced analytics: Cloudera was first to integrate Spark
- Leading adoption among enterprises: 500 customers run Spark on Cloudera
Data Science at the Enterprise
- Access: For sensitive data, secure clusters are difficult to access, and IT typically doesn't want random packages installed on a secure cluster. Popular open source tools don't easily connect to these environments, or don't always support Hadoop data formats.
- Scale: Laptops rarely have capacity for medium, let alone big, data. This leads to a lot of sampling. Popular frameworks don't easily parallelize on a cluster, so code typically has to be rewritten for production.
- Developer experience: Notebooks, while awesome, don't easily support virtual environments and dependency management, especially for teams. This makes sharing and reproducibility hard. Notebooks are also challenging to put into production.
Introducing Cloudera Data Science Workbench: self-service data science for the enterprise
Accelerates data science from development to production with:
- Secure self-service environments for data scientists to work against Cloudera clusters
- Support for Python, R, and Scala, plus project dependency isolation for multiple library versions
- Workflow automation, version control, collaboration and sharing
Hadoop 3 in a Nutshell
An Abbreviated History of Hadoop Releases

Date       | Release | Major Notes
2007-11-04 | 0.14.1  | First release at the ASF
2011-12-27 | 1.0.0   | Security, HBase support
2012-05-23 | 2.0.0   | YARN, NameNode HA, wire compatibility
2014-11-18 | 2.6.0   | HDFS encryption, rolling upgrade, node labels
2015-04-21 | 2.7.0   | Most recent production-quality release line
2017-03-22 | 2.8.0   | Current release line
Motivation for Hadoop 3
- Upgrade the minimum Java version to Java 8: Java 7 reached end-of-life in April 2015, and many Java libraries now only support Java 8
- HDFS erasure coding: a major feature that refactored core pieces of HDFS; too big to backport to 2.x
- Classpath isolation: potentially impacts all clients
- Other miscellaneous incompatible bug fixes and improvements
- Hadoop 2.x was branched in 2011: 6 years of changes waiting for 3.0
Hadoop 3 Status and Release Plan
A series of alphas and betas leading up to GA: alpha4 is feature complete, beta1 is the compatibility freeze, and GA is expected by the end of the year.

Release      | Date
3.0.0-alpha1 | 2016-09-03
3.0.0-alpha2 | 2017-01-25
3.0.0-alpha3 | 2017-05-16
3.0.0-alpha4 | 2017-07-07
3.0.0-beta1  | 2017 Q3
3.0.0 GA     | 2017 Q4

https://cwiki.apache.org/confluence/display/hadoop/hadoop+3.0.0+release
HDFS & Hadoop Features
3x replication vs. erasure coding
/foo.csv - 3 block file: b1 b2 b3
With 3x replication, each of the 3 blocks is stored 3 times: 3 x 3 = 9 total replicas, 6 / 3 = 200% overhead!
Erasure coding (HDFS-7285)
- Motivation: improve the storage efficiency of HDFS; ~2x the storage efficiency compared to 3x replication; overhead drops from 200% to 40%
- Uses Reed-Solomon(k,m) erasure codes instead of replication
- Support for multiple erasure coding policies: RS(3,2), RS(6,3), RS(10,4)
- Can improve data durability: RS(6,3) can tolerate 3 failures, RS(10,4) can tolerate 4 failures
- Missing blocks are reconstructed from the remaining blocks
3x replication vs. erasure coding, continued
/foo.csv - 3 block file: b1 b2 b3, plus parity p1 p2
3 data blocks + 2 parity blocks: 3 + 2 = 5 replicas, 2 / 3 = 67% overhead!
/bigfoo.csv - 10 block file: b1 b2 ... b10, plus parity p1 ... p4
10 data blocks + 4 parity blocks: 10 + 4 = 14 replicas, 4 / 10 = 40% overhead!
The sketch below generalizes this arithmetic.
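To make the overhead numbers on these slides reproducible, here is a minimal sketch in plain Java (no Hadoop dependencies) that computes the storage overhead and failure tolerance of an RS(k,m) policy versus 3x replication:

```java
/**
 * Storage overhead and fault tolerance of Reed-Solomon(k, m)
 * erasure coding vs. 3x replication. Pure arithmetic matching
 * the numbers on the slides above.
 */
public class EcOverhead {

    static void compare(int k, int m) {
        // RS(k, m): k data blocks plus m parity blocks per stripe.
        double ecOverhead = 100.0 * m / k;  // extra storage relative to raw data
        System.out.printf("RS(%d,%d): %.0f%% overhead, tolerates %d lost blocks%n",
                k, m, ecOverhead, m);
    }

    public static void main(String[] args) {
        // 3x replication: 2 extra copies per block, tolerates 2 lost replicas.
        System.out.println("3x replication: 200% overhead, tolerates 2 lost replicas");
        compare(3, 2);   // 67% overhead, as on the /foo.csv slide
        compare(6, 3);   // 50% overhead
        compare(10, 4);  // 40% overhead, as on the /bigfoo.csv slide
    }
}
```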
EC Reconstruction
/foo.csv - 3 block file with Reed-Solomon(3,2): b1 b2 b3 p1 p2
If b3 is lost: read 3 of the remaining blocks, run RS decoding, and a new copy of b3 is recovered.
EC implications
- File data is striped across multiple nodes and racks, so reads and writes are remote and cross-rack
- Reconstruction is network-intensive: it reads k of the remaining blocks cross-rack
- Important to use Intel's optimized ISA-L for performance: 1+ GB/s encode/decode speed, much faster than the pure-Java implementation, so CPU is no longer a bottleneck
- Need to combine data into larger files to avoid an explosion in replica count. Bad: 1 x 1GB file -> RS(10,4) -> 14 x 100MB EC blocks (4.6x the replica count). Good: 1 x 10GB file -> RS(10,4) -> 14 x 1GB EC blocks (0.46x the replica count)
- Works best for archival / cold data use cases; opting a directory in is sketched below
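As a concrete illustration of the cold-data use case, here is a minimal sketch of opting a directory in to erasure coding through the HDFS client API. The directory path is hypothetical; "RS-10-4-1024k" is the built-in Hadoop 3 policy name corresponding to RS(10,4):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SetEcPolicy {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS in the loaded configuration points at HDFS.
        Configuration conf = new Configuration();
        DistributedFileSystem dfs =
                (DistributedFileSystem) FileSystem.get(conf);

        // Hypothetical archive directory for cold data.
        Path coldData = new Path("/archive/cold");

        // Files written under this directory from now on are erasure coded
        // with RS(10,4) instead of being 3x replicated.
        dfs.setErasureCodingPolicy(coldData, "RS-10-4-1024k");
        System.out.println("Policy: " + dfs.getErasureCodingPolicy(coldData));
    }
}
```

The same operation is available from the command line via `hdfs ec -setPolicy`.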
EC performance (benchmark charts; figures not reproduced here)
Other new HDFS features
- Classpath isolation: shaded HDFS clients
- Shell script rewrite
- Support for multiple standby NameNodes
- Intra-DataNode balancer
- Support for Microsoft Azure Data Lake and Aliyun OSS (Alibaba Cloud)
- Default ports moved out of the ephemeral range
YARN Features
Application Timeline Service v2
Before ATSv2, information about jobs was scattered across separate services: the Resource Manager, the Job History Server (backed by HDFS and fed by the Node Managers), the Spark History Server, and so on, with no single place to query across all of them.
With ATSv2, timeline collectors running in the Resource Manager and on each Node Manager (alongside the Application Masters) write to an HBase-backed store, and timeline readers serve queries over it.
Application Timeline Service v2
- A store for application and system events and data: distributed, scalable, with a structured data model, updated in real time
- Stores application status, application metrics, and system metrics
- Fed by the Resource Manager, the Node Managers, and the Application Masters
- REST API (see the sketch below)
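As a sketch of what consuming the REST API looks like, the code below lists the flows known to the timeline reader. The host, port, and cluster name are assumptions for illustration (the v2 reader serves under /ws/v2/timeline); adjust them for a real cluster:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class AtsV2FlowsQuery {
    public static void main(String[] args) throws Exception {
        // Assumed timeline reader location and cluster id -- illustrative only.
        URL url = new URL("http://timeline-reader.example.com:8188"
                + "/ws/v2/timeline/clusters/my-cluster/flows");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");

        // Print the JSON response listing flow entities.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```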
Application Timeline Service v2: Flows (UI screenshot)
New YARN UI
- Rich client application, built on Node.js and Ember
- Improved visibility into cluster usage: memory and CPU, by queues and by applications
- Sunburst graphs for hierarchical queues; NodeManager heatmap
- ATSv2 integration: plots container start/stop events, making it easy to spot delays in app execution
New YARN UI: Cluster Overview (screenshot)
New YARN UI: Queues (screenshot)
Resource Types
- Before Hadoop 3, memory and CPU are the only managed resources
- Resource types allow adding new managed resources: countable resources such as GPUs, disks, etc. (see the sketch below)
- Resource profiles: conceptually similar to EC2 instance types, they capture complex resource requests
- The current virtual CPU core and memory resources work as before
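A hedged sketch of how a container request carries a custom countable resource under resource types. The "gpu" resource name is hypothetical: it only works if the cluster admin has registered such a type, and the exact API shape should be checked against your Hadoop version:

```java
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceInformation;

public class GpuResourceRequest {
    public static Resource buildContainerResource() {
        // Memory and vcores work exactly as before: 4 GB, 2 vcores.
        Resource resource = Resource.newInstance(4096, 2);

        // Additional countable resource. "gpu" is a hypothetical type that
        // must match a resource the admin registered on the cluster.
        resource.setResourceInformation("gpu",
                ResourceInformation.newInstance("gpu", 2));
        return resource;
    }
}
```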
YARN Federation (architecture diagram): a Router, driven by an admin-defined policy, sits in front of multiple sub-clusters, each with its own Resource Manager and Node Managers.
YARN Federation
- YARN scalability today: Twitter runs a 10k node cluster with the fair scheduler; Yahoo! runs a 4k node cluster with the capacity scheduler
- Federation: restrict users to sub-clusters based on policy; scalability to 100k nodes and beyond; heterogeneous scheduling policies
Opportunistic Containers
- The scheduler's job is to keep all resources busy, but scheduling gaps occur: nothing to run, resource contention, resource reservations
- Opportunistic containers fill those gaps: requested explicitly (see the sketch below), handled by a dedicated scheduler, queued at the node managers, scheduled locally when resources are available, and preempted when guaranteed containers need to run
- Coming in 2.9 and 3.0
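A hedged sketch of how an application master explicitly requests an opportunistic container; the ContainerRequest constructor taking an ExecutionTypeRequest is an assumption against the 2.9/3.0 client API:

```java
import org.apache.hadoop.yarn.api.records.ExecutionType;
import org.apache.hadoop.yarn.api.records.ExecutionTypeRequest;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class OpportunisticContainerRequest {
    public static ContainerRequest build() {
        Resource capability = Resource.newInstance(1024, 1); // 1 GB, 1 vcore

        // OPPORTUNISTIC marks the container as gap-filling work that is
        // queued at the node manager and preempted if a GUARANTEED
        // container needs the resources.
        return new ContainerRequest(
                capability,
                null,                     // no node constraint
                null,                     // no rack constraint
                Priority.newInstance(1),
                true,                     // relax locality
                null,                     // no node label expression
                ExecutionTypeRequest.newInstance(ExecutionType.OPPORTUNISTIC));
    }
}
```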
Oversubscription
- Resource utilization is typically low in most clusters (20-50%)
- Clusters are provisioned for peak usage, so usage < allocation; mean usage is roughly half of peak usage
Other new YARN features
- New YARN UI: a Node.js application with improved visibility into cluster usage
- New resource types beyond CPU and memory
- YARN Federation: improved scalability to 100k node clusters and beyond
- Opportunistic containers / oversubscription
Summary: What's new in Hadoop 3.0?
- Storage optimization: HDFS erasure coding
- Improved visibility into cluster operations: YARN ATSv2 and the new YARN UI
- Scalability & multi-tenancy: YARN Federation
- Improved utilization: YARN opportunistic containers and oversubscription
- Refactored base: lots of trunk content; JDK8 and newer dependent libraries
Thank you
Balazs Gaspar
balazs@cloudera.com
Compatibility and Testing
Compatibility
- Strong feedback from large users on the need for compatibility
- Preserves wire compatibility with Hadoop 2 clients: it is impossible to coordinate upgrading off-cluster Hadoop clients
- Will support rolling upgrade from Hadoop 2 to Hadoop 3: you can't take downtime to upgrade a business-critical cluster
- API compatibility is not fully preserved: dependency version bumps, removal of deprecated APIs and tools, the shell script rewrite and rework of the Hadoop tools scripts, and incompatible bug fixes
Conclusion
- Expect Hadoop 3.0.0 GA by the end of the year
- Shiny new features: HDFS erasure coding, client classpath isolation, YARN ATSv2, YARN Federation, opportunistic containers and oversubscription
- A great time to get involved in testing and validation