TP1-2: Analyzing Hadoop Logs


Shadi Ibrahim
January 26th, 2017

MapReduce has emerged as a leading programming model for data-intensive computing. It was originally proposed by Google to simplify the development of web-search applications on large numbers of machines. Hadoop is an open-source Java implementation of MapReduce sponsored by Yahoo!. The Hadoop project is a collection of subprojects for reliable, scalable distributed computing. The two fundamental subprojects are the Hadoop MapReduce framework and HDFS.

HDFS is a distributed file system that provides high-throughput access to application data. It is inspired by the Google File System (GFS). HDFS has a master/slave architecture. The master server, called the NameNode, splits files into blocks and distributes them across the cluster with replication for fault tolerance. It holds all metadata about the stored files. The HDFS slaves, called DataNodes, actually store the data blocks; they serve read/write requests from clients and perform replication as directed by the NameNode.

Hadoop MapReduce is a software framework for the distributed processing of large data sets on compute clusters. It runs on top of HDFS, so data processing is collocated with data storage. It also has a master/slave architecture. The master, called the JobTracker (JT), is responsible for: (a) querying the NameNode for block locations; (b) scheduling tasks on the slaves, called TaskTrackers (TT), based on the information retrieved from the NameNode; and (c) monitoring the success and failure of tasks.

The goal of this TP is to study the operation of the Hadoop platform by exploring the logs of the execution of MapReduce applications. We will examine these logs by scanning their content, by looking at Hadoop's web GUI, and by using another log-file visualization tool.
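The map/shuffle/reduce structure described above can be illustrated with a minimal, single-machine sketch. This is purely illustrative Python, not the Hadoop API (which is Java); it only shows the data flow that Hadoop distributes across a cluster, using Wordcount as the example:

```python
from itertools import groupby
from operator import itemgetter

# Map phase: emit an intermediate (word, 1) pair for every word.
def map_wordcount(line):
    for word in line.split():
        yield (word, 1)

# Shuffle phase: group intermediate pairs by key. In Hadoop this
# happens between the map and reduce phases, across the network.
def shuffle(pairs):
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [v for _, v in group]

# Reduce phase: sum the counts collected for each word.
def reduce_wordcount(key, values):
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog"]
intermediate = [p for line in lines for p in map_wordcount(line)]
result = dict(reduce_wordcount(k, vs) for k, vs in shuffle(intermediate))
print(result)  # {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

In Hadoop, each map task would process one input split of an HDFS file, and the JobTracker would schedule those tasks on the TaskTrackers.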

Exercise 1: Accessing Hadoop's Logs

Hadoop keeps several logs of the execution of your programs. They are located in the logs sub-directory of the Hadoop directory. There are two important files for each job: the configuration file and the log file.

    job_201410292315_0002_1414658868472_username_word+count
    job_201410292315_0002_conf.xml

The file job_201410292315_0002_conf.xml contains all the configuration properties of the Hadoop platform at the time the application ran. The file job_201410292315_0002_1414658868472_username_word+count contains the start and end times of all the tasks that ran during the execution of our Hadoop program. It contains several different types of lines:

Lines starting with Job list information about the job (priority, submit time, configuration, number of map tasks, number of reduce tasks, etc.):

    Job JOBID = job_201410292315_0002 JOBNAME = word count USER = xxxxxx SUBMIT_TIME = 1414658868472 JOBCONF = hdfs://localhost:8020/Users/$Hadoop_Home/hdfs/tmp/mapred/staging/xxxxx/staging/job_201410292315_0002/job.xml VIEW_JOB = * MODIFY_JOB = * JOB_QUEUE = default WORKFLOW_ID = WORKFLOW_NAME = WORKFLOW_NODE_NAME = WORKFLOW_ADJACENCIES = WORKFLOW_TAGS =
    Job JOBID = job_201410292315_0002 JOB_PRIORITY = NORMAL
    Job JOBID = job_201410292315_0002 LAUNCH_TIME = 1414658869065 TOTAL_MAPS = 115 TOTAL_REDUCES = 1 JOB_STATUS = PREP

Lines starting with Task refer to the creation or completion of map or reduce tasks, indicating which host they start on and which split they work on (i.e., which replica). On completion, all the counters associated with the task are listed:

    Task TASKID = task_201410292315_0002_m_000000 TASK_TYPE = MAP START_TIME = 1414658870533 SPLITS = /default-rack/shadis-mbp

Lines starting with MapAttempt mostly report status updates, except when they contain the keywords SUCCESS and/or FINISH_TIME, which indicate that the task has completed. The time at which the task finished is included in such a line.
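These history lines can also be scanned programmatically. The small Python sketch below extracts the KEY = value pairs from a line of the form shown above. It assumes the whitespace-separated "KEY = value" layout of this transcription; the quoting in actual Hadoop history files may differ slightly, so treat the parsing as illustrative:

```python
import re

def parse_history_line(line):
    """Split one job-history line into (record_type, {key: value})."""
    record_type, _, rest = line.partition(" ")
    # Capture UPPER_CASE keys followed by ' = value', where a value
    # runs until the next key (lookahead) or the end of the line.
    fields = dict(re.findall(r"([A-Z_]+) = (.*?)(?= [A-Z_]+ = |$)", rest))
    return record_type, fields

line = ("Job JOBID = job_201410292315_0002 LAUNCH_TIME = 1414658869065 "
        "TOTAL_MAPS = 115 TOTAL_REDUCES = 1 JOB_STATUS = PREP")
rtype, fields = parse_history_line(line)
print(rtype, fields["TOTAL_MAPS"], fields["JOB_STATUS"])  # Job 115 PREP
```

A parser like this makes it easy to answer the counting questions in Exercise 3 with a few lines of code instead of reading the log by eye.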

Figure 1

Lines starting with ReduceAttempt, similar to the MapAttempt lines, report on the intermediate status of the tasks; when the keyword SUCCESS is included, the finish times of the sort and shuffle phases are also reported.

Figure 2

Question 1.1
Run the Wordcount application using the 75MB data set and the default block size (64MB). Check the configuration file and find all the properties that were discussed in TP2. Check the number of map tasks, the number of reduce tasks, and the split sizes and locations in your log files.

Question 1.2
Using the 75MB data set, run both benchmarks with different block sizes (i.e., 5, 10, and 20MB). Check the configuration and log files: what can you observe?

Exercise 2: Accessing Hadoop's Web GUI

The same logs and configuration files are also available from the Hadoop web GUI: http://localhost:50030/
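For Question 1.2 above, the number of map tasks can be estimated before looking at any log: Hadoop creates roughly one map task per input split, and for a single file the number of splits is about ceil(file size / block size). A quick back-of-the-envelope check with the sizes from the exercise (an approximation; the exact split count also depends on file boundaries and the input format):

```python
import math

def num_splits(file_size_mb, block_size_mb):
    # Roughly one input split (hence one map task) per block;
    # the last split may be smaller than the block size.
    return math.ceil(file_size_mb / block_size_mb)

for block in (64, 20, 10, 5):
    print(f"{block:2d}MB blocks -> {num_splits(75, block)} map tasks")
```

Comparing these estimates with the TOTAL_MAPS values in your log files is a good way to verify your understanding of the split mechanism.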

Figure 3: Hadoop main page

Figure 4: All logs history

Figure 5: The log file of one application: Overview

Figure 6: The log file of one application: Details

Figure 7: The log file of one application: Task executions 1

Figure 8: The log file of one application: Task executions 2

Figure 9: The log file of one application: Map task execution 1

Question 2.1
Check the configuration and log files of your previous jobs through the Hadoop GUI: what can you observe?

Exercise 3: Accessing the VisHadoop GUI

We have developed a new visualization tool to show the progress of the different tasks during the execution of a MapReduce application. Check this site: http://hadoop-log.irisa.fr

Figure 10: An example of VisHadoop

Question 3.1
Check the pattern of the three jobs (waves). Can you explain it?

Question 3.2
Check the two log files in the resources and identify the values of the following properties, using any of the three log-analysis methods:

- Total map tasks
- Total reduce tasks
- Total local map tasks
- Replication factor
- Total killed tasks

How can you distinguish a speculated task from a normal one?
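When answering the questions above, note that all the timestamps in these logs (SUBMIT_TIME, LAUNCH_TIME, START_TIME, FINISH_TIME) are Unix epoch times in milliseconds, so task and job durations are obtained by simple subtraction. A sketch using the submit and launch times from the sample Job lines in Exercise 1:

```python
from datetime import datetime, timezone

def ms_to_utc(ms):
    """Convert an epoch-milliseconds log timestamp to a UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

submit_time = 1414658868472  # SUBMIT_TIME from the sample Job line
launch_time = 1414658869065  # LAUNCH_TIME from the sample Job line

print("submitted at:", ms_to_utc(submit_time))
print("setup delay :", launch_time - submit_time, "ms")  # 593 ms
```

Plotting the (start, finish) intervals of all map and reduce attempts computed this way reproduces the "waves" pattern that VisHadoop visualizes.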