Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Kellie Floyd
5 years ago

1 Hadoop 2.x Core: YARN, Tez, and Spark

2 YARN

3 Hadoop Machine Types
- Client machines have client-side software used to access a cluster to process data.
- Master nodes run Hadoop master processes to manage and coordinate cluster services and tasks.
- Slave nodes run Hadoop slave processes and provide cluster resources to perform data processing.
[Figure labels: top-of-rack switches, core switch.]

4 How Hadoop Processes Data
Hadoop has historically processed data using MapReduce, which has been the basis for Hadoop's data processing scalability. MapReduce processes the data on each slave node in parallel and then aggregates the results. The secret to performance and scalability is to move the processing to the data rather than move the data to the processing. Doing so significantly reduces network I/O traffic.
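The map-then-aggregate flow described on this slide can be illustrated with a minimal, single-process Python sketch. It is not Hadoop API code: the partition contents and the function names (`map_partition`, `reduce_counts`, `word_count`) are hypothetical, and threads merely stand in for slave nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands in for the block stored on one slave node.
partitions = [
    ["yarn tez spark", "spark yarn"],
    ["tez tez hive"],
    ["spark hive yarn"],
]

def map_partition(lines):
    """Map phase: count words locally, where the data lives."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(partials):
    """Reduce phase: aggregate the small per-node results."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def word_count(parts):
    # Threads simulate slave nodes working in parallel on their own blocks.
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partials = list(pool.map(map_partition, parts))
    return reduce_counts(partials)

result = word_count(partitions)
print(dict(result))
```

Only the compact per-partition counts are passed to the reduce step, not the raw lines, which is the sense in which the processing moves to the data.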

5 Hadoop Version 2.x
Hadoop 2.x has two core components:
- HDFS provides distributed, scalable, and highly available data storage.
- YARN provides distributed, scalable, and highly available processing.
[Figure: the Hadoop 2.x stack. Data access engines run on YARN, the data operating system, over HDFS (Hadoop Distributed File System): Batch (MapReduce), Script (Pig on Tez), SQL (Hive and HCatalog on Tez), NoSQL (HBase), Stream (Storm), Search (Solr), and others (in-memory analytics, ISV engines).]

6 HDFS is a Distributed File System
HDFS automatically:
- splits large files into blocks
- spreads blocks across the cluster
- tracks block locations
- replicates blocks (not shown)
Distributed applications like MapReduce get block information to access and analyze data.
[Figure labels: large data file, file split, block locations, blocks A, B, and C, master node (NameNode), slave nodes (DataNodes), MR.]
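The split, spread, track, and replicate steps listed on this slide can be sketched in a few lines of Python. This is a toy model, not the HDFS implementation: the function names, the DataNode names (`dn1` to `dn3`), the 10-byte block size, and the round-robin placement are all assumptions made for illustration. Real HDFS uses far larger blocks and a rack-aware placement policy.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(num_blocks: int, datanodes, replication: int):
    """Return a NameNode-style map: block index -> DataNodes holding a replica."""
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    locations = {}
    for block_id in range(num_blocks):
        # Round-robin placement (an assumption): consecutive nodes get the replicas.
        locations[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return locations

# 25 bytes of toy data standing in for a large file.
large_file = b"A" * 10 + b"B" * 10 + b"C" * 5
blocks = split_into_blocks(large_file, block_size=10)
block_locations = place_blocks(len(blocks), ["dn1", "dn2", "dn3"], replication=2)
print(block_locations)
```

An application would consult `block_locations` to decide which node should process which block, much as the slide describes MapReduce getting block information.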

7 Hadoop Data Operating System
Apache Hadoop YARN is the data operating system for Hadoop 2. YARN is:
- Responsible for scheduling tasks and managing CPU and memory resources
- Designed to enable multiple distributed applications to utilize cluster resources in a shared, secure, and multi-tenant manner
[Figure: the Hadoop 2.x stack, with YARN as the data operating system between the data access engines and HDFS (Hadoop Distributed File System).]
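To make "scheduling tasks and managing CPU and memory resources" concrete, here is a toy allocator in Python. It is a sketch under stated assumptions, not YARN code: the class names (`Node`, `ToyScheduler`), the first-fit policy, and the node sizes are hypothetical. Real YARN delegates placement to pluggable schedulers.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    free_mem_mb: int
    free_vcores: int

@dataclass
class ToyScheduler:
    nodes: list
    granted: list = field(default_factory=list)

    def request_container(self, app: str, mem_mb: int, vcores: int):
        """Grant a container on the first node with enough free memory and CPU."""
        for node in self.nodes:
            if node.free_mem_mb >= mem_mb and node.free_vcores >= vcores:
                node.free_mem_mb -= mem_mb
                node.free_vcores -= vcores
                self.granted.append((app, node.name, mem_mb, vcores))
                return node.name
        return None  # no capacity: the request would have to wait

# Two hypothetical slave nodes shared by several applications.
scheduler = ToyScheduler([Node("slave1", 4096, 4), Node("slave2", 2048, 2)])
first = scheduler.request_container("mapreduce-job", 2048, 2)
second = scheduler.request_container("spark-app", 2048, 2)
third = scheduler.request_container("tez-query", 2048, 1)
fourth = scheduler.request_container("storm-topology", 1024, 1)
print(first, second, third, fourth)
```

Different application types draw on one shared pool of resources here, which is the multi-tenant behavior the slide attributes to YARN.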

8 A Little History
In Hadoop version 1.x, MapReduce was more than just a data processing application. MapReduce was also the Hadoop cluster's scheduler and resource manager. In Hadoop 2.x, YARN replaced MapReduce for scheduling and resource management.
[Figure: the Hadoop 1.x and Hadoop 2.x stacks side by side. In Hadoop 1.x, MapReduce handles scheduling and resource management over HDFS (Hadoop Distributed File System), with Batch (MapReduce), Script (Pig), SQL (Hive, HCatalog), and NoSQL (HBase) on top. In Hadoop 2.x, YARN, the data operating system, sits over HDFS and also supports Tez, Stream (Storm), Search (Solr), and others (in-memory analytics, ISV engines).]

9 Why the Move to YARN?
YARN is a generic scheduler and resource manager that supports applications other than just MapReduce. MapReduce is not suitable for every type of data processing workload because it is by nature batch processing. Batch is not suitable for:
- Processing streaming data
- Performing real-time analytics
- Record fetching
- High-speed iterative processing

10 Hadoop Before YARN
Many times separate clusters were deployed that:
- Ensured different workloads received sufficient resources
- Wasted time and money on additional deployment and management tasks
- Created data silos that forced additional data transfers
[Figure labels: ingest, cluster A, cluster B, batch processing, interactive processing, data transfer, results.]

11 Hadoop After YARN
YARN transformed Hadoop into a generic, distributed operating system. HDFS is a distributed file system. YARN is a distributed scheduler. The combination gives a single Hadoop cluster multi-tenant capability to run distributed applications of many types.
[Figure: batch, streaming, iterative, and real-time applications on YARN (distributed processing) over HDFS (distributed storage).]

12 Tez

13 Tez, an Alternative to MapReduce
Tez is an alternative to the traditional MapReduce framework. It meets the demands for fast response times and extreme throughput at petabyte scale.
[Figure: the Hadoop 1.x and Hadoop 2.x stacks side by side, with Tez appearing in the Hadoop 2.x data access layer.]

14 Inefficiencies in MapReduce
To understand how Tez accelerates query processing, it is helpful to understand some inefficiencies in MapReduce. These inefficiencies make MapReduce suitable only for batch processing. Causes of MapReduce inefficiencies are:
- HDFS and local storage use
- Requirement of a map phase before a reduce phase
- Hadoop containers
A container is an abstraction used to represent a discrete amount of slave node CPU and memory resources. Resources in one container are logically isolated from other container resources. Applications run inside containers.

15 MapReduce and HDFS
MapReduce uses HDFS storage to store temporary data between MapReduce jobs. Local storage is used to store temporary data between map and reduce phases. Storage I/O adds significant overhead to the overall job.
[Figure: chained map (M) and reduce (R) tasks with temporary data stored between them.]

16 Tez and HDFS
[Figure: the same map (M) and reduce (R) workload drawn twice for comparison, labeled "Map and Reduce over MapReduce" and "Map and Reduce over Tez".]
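The comparison in this figure can be approximated with a simplified accounting of storage writes. The sketch below assumes, as Tez is generally described, that a multi-stage job runs as one directed acyclic graph whose intermediate results pass between stages without being persisted, whereas a chain of MapReduce jobs persists every job's output. The function names and the write-log format are hypothetical, and real engines are considerably more involved.

```python
def run_as_mapreduce_chain(stages, records, storage_log):
    """Run each stage as its own job; every job persists its output."""
    data = records
    for i, stage in enumerate(stages):
        data = [stage(x) for x in data]
        storage_log.append(f"hdfs-write:job{i}")
    return data

def run_as_single_dag(stages, records, storage_log):
    """Run all stages as one graph; only the final output is persisted."""
    data = records
    for stage in stages:
        data = [stage(x) for x in data]  # handed directly to the next stage
    storage_log.append("hdfs-write:final")
    return data

# Three hypothetical processing stages applied to toy records.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
mr_log, dag_log = [], []
mr_result = run_as_mapreduce_chain(stages, [1, 2, 3], mr_log)
dag_result = run_as_single_dag(stages, [1, 2, 3], dag_log)
print(mr_result, len(mr_log), dag_result, len(dag_log))
```

Both runs produce the same answer. The difference lies only in how many times intermediate data is written to storage.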

17 Tez is Simple
Tez is a completely client-side implementation: a set of client-side libraries with no server to deploy or manage. Tez is not meant for end users. Developers use the Tez API to create better end-user applications. Tez applications:
- Support batch and interactive data processing
- Integrate with YARN
- Perform well in a mixed application workload cluster

18 Spark

19 Apache Spark
Apache Spark is an open source, general purpose processing engine used to build and run fast and sophisticated applications.
- It features a simple set of APIs to write applications in Scala, Java, or Python.
- The processing engine and applications run on Hadoop 2.
- It leverages Hadoop's horizontal scale-out capabilities.
- It is YARN-ready.
- You can process a single copy of data in multiple ways using the same cluster.

20 Spark RDD Scalability and Performance
To leverage Hadoop's horizontal scalability, Spark processes data in a Resilient Distributed Dataset (RDD):
- An RDD is a fault-tolerant collection of data elements.
- An RDD is stored in memory or on disk.
- Each RDD is distributed across Hadoop slave nodes, which enables parallel processing across the cluster.
[Figure: an on-disk RDD is labeled "10x MapReduce performance"; an in-memory RDD, held in RAM across nodes, is labeled "100x MapReduce performance".]
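The idea of a partitioned, fault-tolerant collection can be illustrated with a toy class in plain Python. This is not the Spark API: `ToyRDD` and its methods are hypothetical stand-ins, and the recovery strategy shown (recomputing a lost partition from its source data and recorded transformations) is a simplified rendering of how RDD fault tolerance is usually explained.

```python
class ToyRDD:
    """Partitioned data plus the lineage needed to rebuild any partition."""

    def __init__(self, source_partitions, transforms=()):
        self._source = source_partitions      # original partitioned input
        self._transforms = tuple(transforms)  # lineage: per-partition functions
        self._cache = {}                      # partition index -> result held "in memory"

    def map(self, fn):
        step = lambda part: [fn(x) for x in part]
        return ToyRDD(self._source, self._transforms + (step,))

    def filter(self, pred):
        step = lambda part: [x for x in part if pred(x)]
        return ToyRDD(self._source, self._transforms + (step,))

    def compute_partition(self, index):
        # Recompute from lineage only if the partition is not cached.
        if index not in self._cache:
            part = list(self._source[index])
            for transform in self._transforms:
                part = transform(part)
            self._cache[index] = part
        return self._cache[index]

    def lose_partition(self, index):
        """Simulate losing a cached partition, for example when a node fails."""
        self._cache.pop(index, None)

    def collect(self):
        out = []
        for i in range(len(self._source)):
            out.extend(self.compute_partition(i))
        return out

rdd = ToyRDD([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
evens_squared = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
first = evens_squared.collect()
evens_squared.lose_partition(1)   # pretend the node holding partition 1 failed
second = evens_squared.collect()  # partition 1 is rebuilt from its lineage
print(first, second)
```

Each partition is computed independently, which is what allows the work to be spread across nodes in a real cluster.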

21 Spark High-Level Tools
The Spark Engine supports four high-level tools to build applications:
- Spark SQL: SQL
- Spark Streaming: streaming
- MLlib: machine learning
- GraphX: graph computation
[Figure: the four tools layered on the Apache Spark Engine.]

22 Spark SQL
- Use Spark SQL for interactive or batch queries on streaming or historical data.
- Perform queries in Scala, Java, and Python programs using integrated APIs.
- It queries structured data as a SchemaRDD. A SchemaRDD is an RDD of row objects that has an associated schema.
- SchemaRDDs are registered as tables and used in FROM clauses in SQL statements.
- SchemaRDDs can be used in relational queries, as well as in standard RDD functions.
- Spark SQL reuses an existing Apache Hive frontend and metastore. This makes it compatible with existing Hive data, queries, and UDFs.
- Spark SQL includes a server mode with standard ODBC and JDBC connectors.
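The pattern of pairing rows with a schema, registering them under a table name, and then naming that table in a FROM clause can be shown by analogy with Python's standard-library `sqlite3` module. This is deliberately not Spark SQL code: SQLite is swapped in so the example runs anywhere, and the `people` table, its columns, and its rows are hypothetical.

```python
import sqlite3

# Hypothetical schema and rows; in Spark SQL these would be row objects in a SchemaRDD.
schema = ("name TEXT", "age INTEGER")
rows = [("Ann", 34), ("Bob", 19), ("Cy", 42)]

conn = sqlite3.connect(":memory:")
# "Registering" the rows under a table name makes them usable in a FROM clause.
conn.execute(f"CREATE TABLE people ({', '.join(schema)})")
conn.executemany("INSERT INTO people VALUES (?, ?)", rows)

adults = conn.execute(
    "SELECT name FROM people WHERE age >= 21 ORDER BY name"
).fetchall()
print(adults)
conn.close()
```

In Spark SQL the same query would run against distributed data, and the result could be passed on to ordinary RDD functions, as the slide notes.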

23 Decisions, decisions, decisions

24 Data Processing Options: Spark vs. Tez
Three common options:
- Hive on Tez
- Hive on Spark
- Spark SQL

25 Hive on Tez vs. Hive on Spark
Hive on Tez outperforms Hive on Spark:
- Hive tends to be bound by CPU rather than I/O, especially with the introduction of columnar file formats.
- Spark spends time translating from RDDs to Hive's native Row Containers, so it ends up consuming more CPU, disk, and network I/O.
- Tez is a framework for building special-purpose engines, whereas Spark is a general-purpose engine.
- Hive on Tez is optimized for typical Hive operations.

26 Hive on Tez vs. Spark SQL
The outcome depends on the size of the dataset:
- Less than 200 GB: Spark SQL wins.
- 200 GB and greater: Hive on Tez wins.
- The larger the dataset, the greater the discrepancy in performance.
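The slide's rule of thumb can be written down as a small Python function. The function name `suggest_engine` is hypothetical, and the 200 GB threshold is simply the figure quoted on the slide, exposed as a parameter because such benchmarks shift as the projects evolve.

```python
def suggest_engine(dataset_gb: float, threshold_gb: float = 200.0) -> str:
    """Return the engine this rule of thumb favors for a dataset of the given size."""
    if dataset_gb < 0:
        raise ValueError("dataset size must be non-negative")
    # Below the threshold the slide favors Spark SQL; at or above it, Hive on Tez.
    return "Spark SQL" if dataset_gb < threshold_gb else "Hive on Tez"

small = suggest_engine(50)
boundary = suggest_engine(200)
large = suggest_engine(1024)
print(small, boundary, large)
```

Treat the result as a starting point for your own benchmarking rather than a fixed answer.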

27 Tez vs. Spark

28 BUT
- Spark, like all other Hadoop projects, is evolving.
- Performance metrics for Spark are likely to change, as are those for Tez applications.
- Your mileage will vary, and performance variance today may not be the same as performance variance tomorrow.
- Beware of the word "always".

29 Thank you!
