Dr. Chuck Cartledge. 11 Feb. 2015

1 CS-495/595 Hadoop (part 2), Lecture #5
Dr. Chuck Cartledge
11 Feb. 2015

2 Table of contents
1. Miscellanea
2. The Book
3. Chapter 3
4. Chapter 4
5. Chapter 6
6. Chapter 8
7. Break
8. Assignment #2
9. Exam
10. Conclusion
11. References

3 Corrections and additions since last lecture
- Updated assignment #2 (check the assignment write-up)
- Google is serious about taking on telecom [3]
- White House working on Safeguarding American Consumers and Families [5, 11]
- Samsung Smart TV is listening to you [10, 9]

4 Hadoop: The Definitive Guide
- Version 3 is specified in the syllabus [12]
- Version 4 came out in November 2015
- We'll use Version 3 as much as possible

5 The HDFS
Things that HDFS is good at:
- LARGE files (> terabyte)
- Streaming access (write once, read many: WORM)
- Commodity hardware (failures are common)
Image from [7].
HDFS is a robust, reliable distributed file system.

6 The HDFS
Things that HDFS is not good at:
- Low-latency data access (HDFS is based on an RPC model)
- Lots of small files (overhead per file is constant)
- Multiple writers (writes happen only at the end of a file)
Image from [7].
File access can be slow because of RPC overhead.

7 The HDFS
Namenode and datanodes
- Namenode contains metadata about files
- Datanodes contain blocks
- File blocks are replicated across datanodes
- Loss of a datanode can be detected and a new replica created
- Loss of a namenode is catastrophic
Image from [4].
All communication is via RPC.
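
The replication behavior on this slide can be sketched as a toy model. This is only an illustration, not HDFS code: the block names, node names, the `rereplicate` helper, and the target of 3 replicas are all assumptions made for the example.

```python
# Toy model: the namenode's metadata maps each block to the datanodes holding it.
REPLICATION = 3  # assumed target replica count for this sketch


def rereplicate(block_map, live_nodes, failed_node, replication=REPLICATION):
    """Drop failed_node from every block, then copy under-replicated blocks."""
    live = sorted(n for n in live_nodes if n != failed_node)
    for holders in block_map.values():
        holders.discard(failed_node)
        # Only nodes that do not already hold the block can take a new replica.
        candidates = [n for n in live if n not in holders]
        while len(holders) < replication and candidates:
            holders.add(candidates.pop(0))
    return block_map


# Hypothetical cluster state before datanode "dn2" is lost.
block_map = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}
nodes = {"dn1", "dn2", "dn3", "dn4", "dn5"}
rereplicate(block_map, nodes, failed_node="dn2")
```

After the call, every block is back to three replicas and none of them lives on the failed node. Losing the namenode is different: the map itself is gone, which is why that loss is catastrophic.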

8 The HDFS
A better view

9 The HDFS
What happens if you lose your namenode?
- Namenode is a single point of failure.
- Have secondary namenode available
- Namenode keeps edit log of datanode actions
- Namenode monitors health of datanodes
- New primary namenode has to read edit log, get state of all datanodes
- Large cluster can take up to 30 minutes to become fully functional.
Namenode should be run on a highly reliable hardware suite.

10 The HDFS
Remote Procedure Call (RPC)
- Client makes a procedure call
- Data is serialized
- Data is sent to server
- Data is deserialized
- Data is processed
- Any returned data is handled in the same way
Programmers write to procedures and network messiness is hidden.
Attributed to Birrell and Nelson [1]. Image from [6].
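
The RPC steps above can be walked through in a few lines. This is a minimal sketch under stated assumptions: JSON stands in for the wire format (Hadoop uses its own serialization), the "network" is just a function call, and `rpc_call`, `server_dispatch`, and the `add` procedure are hypothetical names.

```python
import json


def server_dispatch(request_bytes):
    """Server side: deserialize, run the procedure, serialize the result."""
    request = json.loads(request_bytes.decode("utf-8"))
    procedures = {"add": lambda a, b: a + b}  # hypothetical remote procedure
    result = procedures[request["method"]](*request["args"])
    return json.dumps({"result": result}).encode("utf-8")


def rpc_call(method, *args, transport=server_dispatch):
    """Client stub: looks like a local call, hides serialization and transport."""
    request_bytes = json.dumps({"method": method, "args": list(args)}).encode("utf-8")
    response_bytes = transport(request_bytes)  # stands in for the network hop
    return json.loads(response_bytes.decode("utf-8"))["result"]


total = rpc_call("add", 2, 3)
```

The caller only sees `rpc_call("add", 2, 3)`; every serialize, send, and deserialize step is hidden inside the stub, which is also where the latency comes from.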

11 Hadoop I/O
Hadoop supports file I/O. HDFS's primary concerns:
- Data integrity: ensuring that data is complete and intact
  1. Checks CRCs
  2. Detects bit rot
  3. Creates new replicas when and where necessary
- Data compression
  1. Minimizes data size
  2. Reduces network activity
  3. Adds processing time
Hadoop ensures data integrity and minimizes data size, but is not fast.
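
A CRC check of the kind mentioned above can be sketched as follows. This is an illustration only: `zlib.crc32` stands in for whatever checksum the file system uses, and the 512-byte chunk size and helper names are assumptions for the example.

```python
import zlib

CHUNK = 512  # bytes covered by each checksum; an assumed value for this sketch


def checksums(data, chunk=CHUNK):
    """Compute one CRC-32 per fixed-size chunk of the data."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def corrupt_chunks(data, expected, chunk=CHUNK):
    """Return the indexes of chunks whose CRC no longer matches the stored one."""
    actual = checksums(data, chunk)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


original = bytes(range(256)) * 8      # 2048 bytes, so 4 chunks
stored = checksums(original)          # saved when the data was written

damaged = bytearray(original)
damaged[600] ^= 0x01                  # simulate bit rot: flip one bit in chunk 1
bad = corrupt_chunks(bytes(damaged), stored)
```

Here `bad` identifies the single damaged chunk, which is the signal a file system would need in order to discard that copy and build a fresh replica from a good one.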

12 Hadoop I/O
Not all data compression algorithms are the same
- All compression routines have the same basic goal.
- File extensions are assumed/expected
- Not all algorithms are CLI compatible
- Not all compressed files are splittable
Splittable is important. If a file cannot be split, there can only be one reader.
Low-level routines are available; some work better than others.

13 Hadoop I/O
Comparing compression algorithms
- All algorithms trade off space and time
- Compressing 145,293,291 bytes.

Algo.   Bytes      Comp.   Dec.
bzip2   11,676,
gzip    59,029,
lzop    91,913,

gzip is a good, middle of the road performer.
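
A comparison like the one on this slide can be reproduced in miniature with standard-library codecs. This is a sketch, not the lecture's benchmark: the sample text and the `measure` helper are assumptions, lzop is left out because Python's standard library has no LZO codec, and timings will vary by machine.

```python
import bz2
import gzip
import time


def measure(name, compress, decompress, data):
    """Compress and decompress once, recording output size and elapsed times."""
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    restored = decompress(packed)
    t2 = time.perf_counter()
    if restored != data:
        raise ValueError(f"{name} did not round-trip the data")
    return {"algo": name, "bytes": len(packed), "comp_s": t1 - t0, "dec_s": t2 - t1}


# Hypothetical, highly repetitive sample input.
sample = b"And wonne thy loue, doing thee iniuries:\n" * 20000

results = [
    measure("bzip2", bz2.compress, bz2.decompress, sample),
    measure("gzip", gzip.compress, gzip.decompress, sample),
]
for row in results:
    print(f'{row["algo"]:6} {row["bytes"]:>10,} {row["comp_s"]:.3f}s {row["dec_s"]:.3f}s')
```

Running it on representative data is the only reliable way to see the space and time trade-off for a given workload.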

14 Hadoop I/O
Design considerations
- Raw (uncompressed) can be split on 64M boundaries
- Compressed and unsplittable file will support one reader
- Compressed and splittable file can support multiple readers
Options:
- Store file uncompressed
- Use compression that supports splitting
- Use Mapper to split file
Unsplittable files result in a single Mapper instance.
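
The effect of splittability on parallelism is simple arithmetic, sketched below. This is a simplification: the `mapper_count` helper is hypothetical, it uses the 64 MB boundary from the slide, and it ignores any rounding rules a real input format may apply.

```python
import math

SPLIT_BYTES = 64 * 1024 * 1024  # 64 MB boundary from the slide


def mapper_count(file_bytes, splittable, split_bytes=SPLIT_BYTES):
    """Estimate how many Mapper instances can read one file in parallel."""
    if file_bytes <= 0:
        return 0
    if not splittable:
        return 1  # a single reader must consume the whole compressed stream
    return math.ceil(file_bytes / split_bytes)


one_gib = 1024 ** 3
raw_mappers = mapper_count(one_gib, splittable=True)          # uncompressed file
unsplittable_mappers = mapper_count(one_gib, splittable=False)  # e.g. a gzip file
```

For a 1 GiB file this gives 16 parallel readers when the file can be split and exactly 1 when it cannot.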

15 Hadoop I/O
Serialization
The process of turning structured objects into a byte stream. Used extensively in Hadoop inter-process communications (RPC). It should be:
- Compact
- Fast
- Extensible
- Interoperable
Hadoop serialization is not Java serialization.
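
What "compact" means for a byte stream can be shown with a hand-rolled record format. This is a sketch under stated assumptions, not Hadoop's actual wire format: the fixed 4-byte big-endian length prefix and the `write_int`, `write_text`, and `read_record` helpers are choices made for the example.

```python
import struct


def write_int(value):
    """Serialize a signed 32-bit integer as 4 big-endian bytes."""
    return struct.pack(">i", value)


def write_text(text):
    """Serialize a string as a 4-byte length followed by its UTF-8 bytes."""
    data = text.encode("utf-8")
    return struct.pack(">i", len(data)) + data


def read_record(buf):
    """Deserialize one (int, text) record produced by the writers above."""
    (number,) = struct.unpack_from(">i", buf, 0)
    (length,) = struct.unpack_from(">i", buf, 4)
    text = buf[8:8 + length].decode("utf-8")
    return number, text


record = write_int(163) + write_text("hadoop")
number, text = read_record(record)
```

The whole record is 14 bytes with no class names or type metadata in the stream, which is the kind of overhead a general-purpose object serializer would add.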

16 Hadoop I/O
Summary. The HDFS:
- Is optimized for LARGE files
- Is distributed, robust, and resilient
- Supports multiple readers
- Has limited support for writers
- Has native support for raw and compressed files
Most file operations are RPC based.
The HDFS should be considered a WORM system.

17 How MapReduce works
Classic organization
- Client: our CLI
- Nodes may be on different machines
- Communication between machines is via HDFS
- Heartbeat messages
  1. Tasktracker: every 5 seconds
  2. No heartbeat after 10 minutes: node is considered down and won't be used
  3. Child process: every few seconds
  4. Jobtracker: every second
  5. Progress: every second (just indicates nothing is stuck)
- Remember: Hadoop is in Java, Mappers and Reducers may not be
Lots of timers and periodics to monitor activity and detect when something is hung or dead.
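
The heartbeat timing on this slide can be simulated with plain numbers instead of real clocks. This is only a sketch: the `HeartbeatMonitor` class and the tracker names are hypothetical, while the 5-second interval and 10-minute limit come from the slide.

```python
HEARTBEAT_TIMEOUT = 10 * 60  # seconds; the 10-minute limit from the slide


class HeartbeatMonitor:
    """Remembers when each node last reported and flags silent nodes."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def dead_nodes(self, now):
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)


monitor = HeartbeatMonitor()
monitor.heartbeat("tracker-a", now=0)
monitor.heartbeat("tracker-b", now=0)   # tracker-b never reports again
for t in range(5, 700, 5):              # tracker-a keeps reporting every 5 s
    monitor.heartbeat("tracker-a", now=t)

dead = monitor.dead_nodes(now=700)
```

At simulated time 700 s, only the tracker that went silent has exceeded the limit, so only it would be excluded from further scheduling.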

18 How MapReduce works
Progress: what is it and how is it reported?
Not possible to show absolute progress because there may not be any way to know ahead of time how much work needs to be done. Have to report something:
- Reading by a Mapper or Reducer
- Writing by a Mapper or Reducer
- Setting the status description
- Incrementing a counter (expensive operation)
- Using the progress() function
If progress is not being made, Hadoop will terminate the processes.

19 How MapReduce works
Classic Hadoop vs. YARN
- Classic Hadoop: Hadoop version 1
- Yet Another Resource Negotiator (YARN): Hadoop version 2
Architectural differences:
- Job tracker and Task tracker replaced by Resource Manager, Application Master, and Node Manager
- Mappers and Reducers written the same way
- Interfaces allow things to be swapped out with minimal impact.
Image from [8].
Hadoop administrators are intimately concerned with the differences between classic and YARN installations.

20 How MapReduce works
Schedulers: when will MY job be run?
Different types of schedulers:
- FIFO: default in Hadoop ver. 1; first come, first served
- Fair: also available; jobs placed in user pool, one job per pool is scheduled
- Capacity: default in Hadoop ver. 2; similar to Fair, but adds priorities and relationships between pools
Image from [13].
Different schedulers do things differently.
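
The difference between FIFO and Fair ordering can be seen with four jobs from two users. This is a deliberately simplified sketch: the `fifo_order` and `fair_order` helpers and the job names are hypothetical, and a real scheduler shares cluster capacity among running jobs rather than producing a single run order.

```python
from collections import deque


def fifo_order(jobs):
    """jobs is a list of (user, job_name) pairs in submission order."""
    return [name for _, name in jobs]


def fair_order(jobs):
    """Round-robin over per-user pools, taking one job per pool on each pass."""
    pools = {}
    for user, name in jobs:
        pools.setdefault(user, deque()).append(name)
    order = []
    while pools:
        for user in list(pools):
            order.append(pools[user].popleft())
            if not pools[user]:
                del pools[user]
    return order


# Hypothetical workload: alice submits three jobs before bob submits one.
submitted = [("alice", "a1"), ("alice", "a2"), ("alice", "a3"), ("bob", "b1")]
fifo = fifo_order(submitted)
fair = fair_order(submitted)
```

Under FIFO, bob waits behind all of alice's jobs; with per-user pools, his job moves up to second place.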

21 How MapReduce works
A scheduler example.
The best scheduler is the one that serves you best.

22 MapReduce Features
Counters
Hadoop has its own counters, and supports global user-defined counters.
- Applications can access and increment counters
- Counters are global, across all Mappers and Reducers, so incrementing them can be expensive
- Detail: counters are defined by a Java enum
Hadoop supplies counters, you can create counters, you can use counters.
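
The idea of enum-defined counters that are summed across tasks can be sketched without Hadoop. This is an analogy, not the Hadoop counter API: the `Quality` enum, the `run_task` function, and the sample inputs are hypothetical, and a Python `Enum` stands in for the Java enum.

```python
from collections import Counter
from enum import Enum


class Quality(Enum):
    """Hypothetical counter group, mirroring how a Java enum names counters."""
    GOOD = "good"
    MALFORMED = "malformed"


def run_task(lines):
    """Count records locally inside one task; blank lines count as malformed."""
    local = Counter()
    for line in lines:
        key = Quality.GOOD if line.strip() else Quality.MALFORMED
        local[key] += 1
    return local


# Two hypothetical tasks, each with its own input records.
task_inputs = [["a", "", "b"], ["c", " "]]

job_totals = Counter()
for lines in task_inputs:
    job_totals.update(run_task(lines))  # aggregation across tasks
```

Keeping counts local and merging them once per task is one way to picture why a globally visible counter costs more than a local variable.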

23 MapReduce Features
Sorting
Sorting is based on the Key. Keys can be:
- Simple: RawComparator() will work
- Compound: comparator() and partitioner() need to work on the correct part of the key
Image from [2].
Sorting can be very complex, depending on your application.
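
The compound-key case can be illustrated with (year, temperature) keys: the partitioner looks only at the year, while the sort uses both parts. This is a sketch in plain Python, not Hadoop's comparator or partitioner classes; the data, `partitioner`, and `sort_key` are made up for the example.

```python
from itertools import groupby

# Hypothetical compound keys: (year, temperature).
records = [(1901, 22), (1900, 35), (1901, 41), (1900, 12), (1902, 7)]
NUM_REDUCERS = 2


def partitioner(key):
    """Route by the year only, so a year never spans two reducers."""
    year, _ = key
    return year % NUM_REDUCERS


def sort_key(key):
    """Order by year ascending, then temperature descending."""
    year, temp = key
    return (year, -temp)


partitions = {}
for key in records:
    partitions.setdefault(partitioner(key), []).append(key)

max_by_year = {}
for part in partitions.values():
    part.sort(key=sort_key)
    # Group on the year part; the first key in each group has the highest temperature.
    for year, group in groupby(part, key=lambda k: k[0]):
        max_by_year[year] = next(group)[1]
```

If the partitioner used the whole key instead of the year, keys for one year could land on different reducers and the grouping step would give wrong answers.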

24 MapReduce Features
Joins
These are exactly the same as traditional SQL database joins. Depending on your application, joins can happen in the:
- Mapper: inputs have to be strictly partitioned
- Reducer: inputs have to be tagged to be processed correctly
Image from [2].
"MapReduce can perform joins between large datasets, but writing the code to do joins from scratch is fairly involved. Rather than writing MapReduce programs, you might consider using a higher-level framework such as Pig, Hive, or Cascading, in which join operations are a core part of the implementation." [12]
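
A reducer-side join with tagged inputs can be sketched in a few lines. This is an in-memory illustration, not a MapReduce job: the station and reading data, the "S" and "R" tags, and the `map_tagged` and `reduce_join` helpers are all hypothetical.

```python
from collections import defaultdict

# Two hypothetical datasets that share a station id as the join key.
stations = [("s1", "North"), ("s2", "South")]
readings = [("s2", 111), ("s1", 0), ("s1", 22)]


def map_tagged(tag, rows):
    """Map step: tag every value with the dataset it came from."""
    for key, value in rows:
        yield key, (tag, value)


def reduce_join(tagged):
    """Reduce step: group by key, then pair station names with readings."""
    groups = defaultdict(lambda: {"S": [], "R": []})
    for key, (tag, value) in tagged:
        groups[key][tag].append(value)
    for key in sorted(groups):
        for name in groups[key]["S"]:
            for reading in groups[key]["R"]:
                yield key, name, reading


shuffled = list(map_tagged("S", stations)) + list(map_tagged("R", readings))
joined = list(reduce_join(shuffled))
```

Without the tags, the reduce step could not tell a station name from a reading once the two datasets were mixed under the same key.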

25 Break time.
Take about 10 minutes.

26 An inverted word list
Looking at where words are used.
A simply stated problem: where are certain words used?
- Undergrad students: which lines have the word "loue"
- Grad students: which lines have the word "loue", which have the word "course", and which have both
Interested in the line numbers and the line itself. An example:
1408: And wonne thy loue, doing thee iniuries:
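
The shape of the problem can be sketched as a map step and a reduce step over numbered lines. This is a small in-memory illustration, not the assignment's required solution: the `mapper` and `reducer` functions are hypothetical, and lines 1409 and 1410 are invented test data (only line 1408 comes from the slide).

```python
import re

TARGETS = ("loue", "course")


def mapper(numbered_lines, targets=TARGETS):
    """Emit (word, (line_number, line)) for each target word found in a line."""
    for number, line in numbered_lines:
        words = set(re.findall(r"[a-z]+", line.lower()))
        for target in targets:
            if target in words:
                yield target, (number, line)


def reducer(pairs):
    """Collect every hit for a word under that word."""
    index = {}
    for word, hit in pairs:
        index.setdefault(word, []).append(hit)
    return index


lines = [
    (1408, "And wonne thy loue, doing thee iniuries:"),
    (1409, "An invented line that mentions loue and course together"),
    (1410, "An invented line with neither word"),
]
index = reducer(mapper(lines))
loue_lines = {number for number, _ in index["loue"]}
course_lines = {number for number, _ in index["course"]}
both = sorted(loue_lines & course_lines)
```

The "both" case falls out as a set intersection of the line numbers collected for each word.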

27 Mechanics of the exam
- Closed book
- No cheat sheets
- Anything from the lectures (and supporting material) is fair game
- Anything that was discussed in class is fair game
- Anything that should have been experienced or encountered in the assignments is fair game
- Each question will have two parts
  1. An undergrad part
  2. A graduate part
Undergrads can attempt the graduate part without penalty.

28 What have we covered?
- Spent time on the HDFS, identifying its strengths and weaknesses
- Discussed importance of HDFS name and data nodes
- Went over RPCs and their strengths and weaknesses
- Talked about Hadoop file I/O and compression
- Talked about Hadoop ver. 1 and ver. 2
- Talked about assignment #2
- Talked about the exam
- Chapter 11 will NOT be on the exam
Next lecture: Hadoop book, Chapter 11 and exam

29 References
[1] Andrew D. Birrell and Bruce Jay Nelson, Implementing remote procedure calls, ACM Transactions on Computer Systems (TOCS) 2 (1984), no. 1.
[2] Iván de Prado Alonso, MapReduce & Hadoop API revised.
[3] Brian Fung, Google is serious about taking on telecom, wp/2015/02/06/google-is-serious-about-taking-ontelecom-heres-why-itll-win/.
[4] Pramod Kumar Gampa, HDFS architecture.
[5] Paul Hastings and Mathew Gibson, In visit to FTC, president outlines broad privacy agenda, offers scant details, 73df90aa-bcfa-4b c372f
[6] Jan Newmarch, Web services, internetdevices/webservices/tutorial.html.
[7] Sreenivas Pasam, Apache Hadoop, wordpress.com/category/cloud-computing/.
[8] Tavish Srivastava, Hadoop beyond traditional MapReduce simplified, /11/hadoop-mapreduce/.
[9] BBC staff, Not in front of the telly: Warning over listening TV.
[10] Samsung staff, Samsung global privacy policy - SmartTV supplement, html?cid=afl-hq-mul.
[11] White House staff, Big data: Seizing opportunities, preserving values interim progress report, Tech. report, White House.
[12] Tom White, Hadoop: The definitive guide, 3rd edition, O'Reilly Media, Inc.
[13] Tom White, Hadoop: The definitive guide, 4th edition, O'Reilly Media, Inc.
