Hadoop Execution Environment

Avis Watson
5 years ago

1 Hadoop Execution Environment

2 Hadoop Execution Environment Learn about execution environments in Hadoop. Limitations of the classic MapReduce framework. New frameworks like YARN, Tez, and Spark that complement classic MapReduce.

3 Recall Hadoop Architecture Data distributed across nodes. [Diagram: Node 1, Node 2, ..., Node n hold data blocks B1, B2, ..., Bn.]

4 Recall Hadoop Architecture Data distributed across nodes. Keep each compute task on the node that holds its data. [Diagram: Node 1, Node 2, ..., Node n run Task 1, Task 2, ..., Task n next to their local blocks B1, B2, ..., Bn.]
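The locality idea on this slide can be illustrated with a small sketch. It is not Hadoop's scheduler; the block names follow the diagram, while the node identifiers and the `assign_tasks` helper are hypothetical names chosen for illustration.

```python
# Hypothetical block-to-node placement, mirroring the slide's diagram.
block_locations = {
    "B1": ["node1"],
    "B2": ["node2"],
    "Bn": ["nodeN"],
}

def assign_tasks(block_locations):
    """Place one task per block on a node that already stores that block."""
    assignments = {}
    for block, nodes in block_locations.items():
        # Choosing a node that holds the block avoids moving data over the network.
        assignments[f"task_for_{block}"] = nodes[0]
    return assignments

assignments = assign_tasks(block_locations)
for task, node in assignments.items():
    print(task, "->", node)
```

A real scheduler would also weigh node load and fall back to a remote node when the local one is busy; this sketch ignores both.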

5 MapReduce Execution Framework Software framework. Schedules, monitors, and manages tasks.

6 MapReduce Execution Framework Works for applications that fit the MapReduce paradigm.
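As a reminder of what "fitting the MapReduce paradigm" looks like, here is a minimal single-process word count split into map, shuffle, and reduce steps. It only mimics the structure of a MapReduce job in plain Python; the function names and input lines are illustrative assumptions, not Hadoop APIs.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)

lines = ["hadoop runs mapreduce", "spark runs on hadoop"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Problems that decompose into independent per-record work followed by per-key aggregation map well onto this shape; iterative or interactive workloads generally do not.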

7 NextGen Execution Frameworks What if an application doesn't fit the MapReduce paradigm, or is not efficient in it?

8 NextGen Execution Frameworks What if an application doesn't fit the MapReduce paradigm, or is not efficient in it? Examples: interactive data exploration, iterative data processing.

9 NextGen Execution Frameworks Enter: execution frameworks like YARN, Tez, and Spark. Support for complex directed acyclic graphs (DAGs) of tasks. In-memory caching of data.
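To make "DAG of tasks" and "in-memory caching" concrete, the sketch below runs a small task graph in dependency order and keeps every intermediate result in a dictionary so that downstream tasks reuse it. The task names and data are hypothetical, and the code uses Python's standard `graphlib` module (Python 3.9+) rather than Tez or Spark.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (illustrative names).
dag = {
    "load": set(),
    "clean": {"load"},
    "count": {"clean"},
    "total": {"clean"},
    "mean": {"count", "total"},
}

def run(task, results):
    """Execute one task, reading its inputs from the in-memory results."""
    if task == "load":
        return [3, 1, None, 4]
    if task == "clean":
        return [x for x in results["load"] if x is not None]
    if task == "count":
        return len(results["clean"])
    if task == "total":
        return sum(results["clean"])
    if task == "mean":
        return results["total"] / results["count"]
    raise ValueError(f"unknown task: {task}")

results = {}  # in-memory cache of every task's output
for task in TopologicalSorter(dag).static_order():
    results[task] = run(task, results)

print(results["mean"])
```

Here "clean" is computed once and read by both "count" and "total"; in a chain of separate MapReduce jobs that shared intermediate would typically be written to and re-read from HDFS.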

10 Lesson 2, Video #2

11 Hadoop Execution Environment Layout of new frameworks (YARN, Tez, Spark) in Hadoop environment. Optimization strategies used in new frameworks. Examples illustrating use of Tez, Spark.

12 YARN, Tez, Spark Execution frameworks: YARN, Tez, and Spark. [Stack diagram, top to bottom: applications (MR, Pig, Hive, MLlib, GraphX, HBase, other apps); Tez and Spark (Spark can also run without YARN); YARN; HDFS2.]

16 YARN Supports: MapReduce; open-source and commercial applications; user-developed applications; frameworks like Tez and Spark.

17 Tez Dataflow graphs. Custom data types. Can run a complex DAG of tasks. Dynamic DAG changes. Resource usage efficiency.

18 HIVE on Tez example

    SELECT a.vendor, COUNT(*), AVG(c.cost)
    FROM a
    JOIN b ON (a.id = b.id)
    JOIN c ON (a.itemid = c.itemid)
    GROUP BY a.vendor

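19 HIVE Example - MapReduce [Diagram: the query runs as several separate MapReduce jobs. Stages shown: SELECT a.vendor; SELECT b.id; SELECT c.cost; JOIN (a,b); JOIN (a,c); GROUP BY a.vendor with COUNT(*) and AVG(c.cost). M marks map tasks and R marks reduce tasks; HDFS appears between the jobs.]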

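20 HIVE Example - Tez [Diagram: the same query runs as one graph of map (M) and reduce (R) tasks. Stages shown: SELECT a.vendor, c.itemid; SELECT b.id; JOIN (a,c); JOIN (a,b); GROUP BY a.vendor with COUNT(*) and AVG(c.cost). No HDFS steps appear between stages.]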

21 Spark Advanced DAG execution engine. Supports cyclic data flow. In-memory computing. Java, Scala, Python, and R. Existing optimized libraries.

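22 Spark Example Logistic Regression example

    import numpy
    from numpy import exp

    # spark, parsePoint, D, and iterations are defined elsewhere (not shown on the slide)
    points = spark.textFile(...).map(parsePoint).cache()  # cache: reused on every iteration
    w = numpy.random.ranf(size=D)  # current separating plane
    for i in range(iterations):
        # per-point gradient of the logistic loss, summed across the dataset
        gradient = points.map(
            lambda p: (1 / (1 + exp(-p.y * (w.dot(p.x)))) - 1) * p.y * p.x
        ).reduce(lambda a, b: a + b)
        w -= gradient
    print("Final separating plane: %s" % w)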
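The slide's loop can be tried without a cluster. The sketch below is a plain-Python stand-in, not Spark code: it applies the same per-point gradient expression to synthetic data held in a list. The data generator, `true_w`, and the `learning_rate` factor are assumptions added here so that the example runs stably on its own; the slide itself subtracts the raw summed gradient.

```python
import math
import random

random.seed(0)
D = 2
iterations = 50
learning_rate = 0.01  # assumption: small step so the summed gradient does not overshoot

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Synthetic, linearly separable points with labels in {-1, +1} (hypothetical data).
true_w = [2.0, -1.0]
points = []
for _ in range(200):
    x = [random.gauss(0, 1) for _ in range(D)]
    y = 1.0 if dot(true_w, x) > 0 else -1.0
    points.append((x, y))

w = [random.random() for _ in range(D)]  # current separating plane
for _ in range(iterations):
    gradient = [0.0] * D
    for x, y in points:  # plays the role of points.map(...).reduce(...)
        scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
        for j in range(D):
            gradient[j] += scale * x[j]
    w = [wj - learning_rate * gj for wj, gj in zip(w, gradient)]

accuracy = sum((dot(w, x) > 0) == (y > 0) for x, y in points) / len(points)
print("Final separating plane:", w)
print("Training accuracy:", accuracy)
```

The inner loop touches every point on every iteration, which is why the Spark version caches `points` in memory instead of re-reading the input each time.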

24 Lesson 2, Video #3

25 Hadoop Resource Scheduling Learn about resource management. Different kinds of scheduling algorithms. Types of parameters that can be controlled.

26 Motivation for Schedulers Various execution engines/options. Scheduling, performance. Control of resources between components.

27 Schedulers Default: First In, First Out (FIFO). Fairshare. Capacity.

28 Capacity Scheduler

    Queue     Users                  Capacity
    Queue 1   user1, user2           20%
    Queue 2   user2, user4, user5    30%
    Queue 3   user3                  10%
    Queue 4   user1, user4, user5    40%

29 Capacity Scheduler Queues and sub-queues. Capacity guarantee with elasticity. ACLs for security. Runtime changes/draining apps. Resource-based scheduling.
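"Capacity guarantee with elasticity" can be sketched in a few lines. This is a deliberate simplification and not the YARN Capacity Scheduler's algorithm: each queue first receives up to its guaranteed share, and any idle capacity is then lent to queues that still have demand. The queue percentages come from the slides; the 100-container cluster size, the demand figures, and the `allocate` helper are hypothetical.

```python
# Guaranteed capacity per queue, in percent (values from the slide).
capacities = {"queue1": 20, "queue2": 30, "queue3": 10, "queue4": 40}

def allocate(total, capacities, demand):
    """Give each queue its guarantee first, then lend idle capacity."""
    # Step 1: every queue gets min(its demand, its guaranteed share).
    alloc = {
        q: min(demand.get(q, 0), total * pct // 100)
        for q, pct in capacities.items()
    }
    spare = total - sum(alloc.values())
    # Step 2 (elasticity): hand spare containers to queues with unmet demand,
    # visiting queues in their listed order.
    for q in capacities:
        extra = min(spare, demand.get(q, 0) - alloc[q])
        alloc[q] += extra
        spare -= extra
    return alloc

demand = {"queue1": 50, "queue2": 10, "queue3": 10, "queue4": 20}
alloc = allocate(100, capacities, demand)
print(alloc)
```

In this run queue1 is guaranteed 20 containers but ends up with 50, because the other queues leave 40 containers idle. A production scheduler would also reclaim lent capacity when the owning queue needs it back, which this sketch omits.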

30 Fairshare Scheduler [Timeline diagram of cluster share over time]

    App1 running alone:   App1 100%
    App2 submitted:       App1 75%, App2 25%
    App3 submitted:       App1 50%, App2 25%, App3 25%
    Later:                App1 33%, App2 33%, App3 33%

31 Fairshare Scheduler Balances out resource allocation among apps over time. Can organize into queues/sub-queues. Guarantee minimum shares. Limits per user/app. Weighted app priorities.
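The target that a fair scheduler converges toward can be written as a one-line formula: each running app's share is its weight divided by the total weight. The sketch below computes only that steady-state target, not the gradual rebalancing over time; the `fair_shares` helper and the example weights are illustrative assumptions.

```python
def fair_shares(apps, weights=None):
    """Return each app's target fraction of the cluster (default weight 1.0)."""
    weights = weights or {}
    total_weight = sum(weights.get(app, 1.0) for app in apps)
    return {app: weights.get(app, 1.0) / total_weight for app in apps}

one = fair_shares(["App1"])
three = fair_shares(["App1", "App2", "App3"])
weighted = fair_shares(["App1", "App2"], weights={"App1": 3.0})

print(one)
print(three)
print(weighted)
```

With equal weights, three apps each target one third of the cluster, matching the end state on the previous slide; giving App1 a weight of 3 against a default of 1 shifts the target to 75% and 25%.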

32 Summary of resource scheduling Default is FIFO. Fairshare and Capacity schedulers. Queues/sub-queues possible. User/app-based limits. Resource limits. Vendors usually provide additional mechanisms to allocate resources.
