Big Data: Architectures and Data Analytics

June 26, 2018

Student ID ______________   First Name ______________   Last Name ______________

The exam is open book and lasts 2 hours.

Part I

Answer the following questions. There is only one right answer for each question.

1. (2 points) Consider the HDFS folder logsfolder, which contains three files: log.txt, errors.txt, and warning.txt. The size of log.txt is 512MB, the size of errors.txt is 518MB, while the size of warning.txt is 250MB. Suppose that you are using a Hadoop cluster that can potentially run up to 9 mappers in parallel, and suppose you execute a Map-only MapReduce-based program that selects only the lines of the input files in logsfolder containing the words ERROR or WARNING. The HDFS block size is 256MB. How many mappers are instantiated by Hadoop when you execute the application by specifying the folder logsfolder as input?

a) 9
b) 6
c) 5
d) 3

2. (2 points) Consider the HDFS folder inputdata containing the following two files:

   Filename    Size       Content of the file
   Wind1.txt   18 bytes   11.00  19.00  40.00
   Wind2.txt   18 bytes   40.00  70.00  90.00

   The two files are stored in the following HDFS blocks:

   Block ID    Content of the block
   B1          11.00  19.00
   B2          40.00
   B3          40.00  70.00
   B4          90.00

Suppose that you are using a Hadoop cluster that can potentially run up to 10 mappers in parallel and suppose that the HDFS block size is 12 bytes.

Suppose that the following MapReduce program is executed by providing the folder inputdata as input folder and the folder results as output folder.

/* Driver */
import ...;

public class DriverBigData extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = this.getConf();
        Job job = Job.getInstance(conf);
        job.setJobName("2018/06/26 - Theory");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setJarByClass(DriverBigData.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setMapperClass(MapperBigData.class);
        job.setMapOutputKeyClass(DoubleWritable.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setNumReduceTasks(0);

        if (job.waitForCompletion(true) == true)
            return 0;
        else
            return 1;
    }

    public static void main(String args[]) throws Exception {
        int res = ToolRunner.run(new Configuration(), new DriverBigData(), args);
        System.exit(res);
    }
}

/* Mapper */
import ...;

class MapperBigData extends Mapper<LongWritable, Text, DoubleWritable, NullWritable> {

    double sumWind;

    protected void setup(Context context) {
        sumWind = 0;
    }

    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        double wind = Double.parseDouble(value.toString());
        sumWind = sumWind + wind;
    }

    protected void cleanup(Context context) throws IOException, InterruptedException {
        // emit the content of sumWind if it is greater than 45
        if (sumWind > 45)
            context.write(new DoubleWritable(sumWind), NullWritable.get());
    }
}

What is the output generated by the execution of the application reported above?

a) The output folder contains four part files: 110, 90, three empty files
b) The output folder contains four part files: 30, 40, 110, 90
c) The output folder contains four part files: 110, 90, two empty files
d) The output folder contains four part files: 240, three empty files
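One way to reason about the two questions, under the standard behaviour of TextInputFormat (one input split, and hence one mapper, per HDFS block of a splittable text file, and one output part file per mapper in a map-only job):

Question 1: number of mappers = ceil(512/256) + ceil(518/256) + ceil(250/256) = 2 + 3 + 1 = 6.

Question 2: the four blocks B1-B4 produce four mappers, whose partial sums are 11.00 + 19.00 = 30, 40, 40.00 + 70.00 = 110, and 90; since cleanup emits a sum only when it is greater than 45, two part files contain 110 and 90 and the other two are empty.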

Part II

PoliClimate is an intergovernmental agency focused on climate change. It has millions of sensors located around the world for monitoring precipitations and wind speed. To identify critical behaviors, PoliClimate computes a set of statistics about the climate conditions based on the following input file.

ClimateData.txt

o ClimateData.txt is a text file containing the historical information about precipitations and wind speed gathered by the sensors of PoliClimate. For each sensor, a new line is inserted in ClimateData.txt every 10 minutes. The number of managed sensors is more than 100,000 and ClimateData.txt contains the historical data about the last 5 years (i.e., ClimateData.txt contains more than 52 billion lines).

o Each line of ClimateData.txt has the following format:

  SID,Timestamp,Precipitation_mm,WindSpeed

  where SID is the sensor identifier, Timestamp is the timestamp at which precipitation and wind speed are collected, Precipitation_mm is the measured precipitation (in mm), and WindSpeed is the measured wind speed (in km/h). For example, the following line

  Sens2,2018/03/01,15:40,0.10,8.5

  means that sensor Sens2, at 15:40 of March 1, 2018, measured a precipitation equal to 0.10 mm and a wind speed equal to 8.5 km/h.

Exercise 1 - MapReduce and Hadoop (8 points)

The managers of PoliClimate are interested in selecting the sensors that were frequently associated with a low wind speed in the morning (i.e., from 8:00 to 11:59) in April 2018.

Design a single application, based on MapReduce and Hadoop, and write the corresponding Java code, to address the following point:

A. Low wind speed in the morning of April 2018. Considering only the measurements (i.e., lines of ClimateData.txt) associated with the time slot from 8:00 to 11:59 of April 2018, the application must select the identifiers (SID) of the sensors that were associated at least 1000 times with a WindSpeed value less than 1.0 km/h. Store the result of the analysis in an HDFS folder. The output file contains one line for each of the selected sensors; specifically, each line of the output file contains the SID of one of the selected sensors. The name of the output folder is one argument of the application. The other argument is the path of the input file ClimateData.txt. Note that ClimateData.txt contains the measurements collected in the last 5 years, but the analysis is focused only on April 2018 and the time slot from 8:00 to 11:59.

Fill out the provided template for the Driver of this exercise. Use your sheets of paper for the other parts (Mapper and Reducer).

Exercise 2 - Spark and RDDs (19 points)

The managers of PoliClimate are interested in performing some analyses about the average value of precipitations and wind speed for each sensor during the last two months (from April 2018 to May 2018) to identify critical hours. They are also interested in identifying the sensors that measured an unbalanced daily wind speed during the same two months. Specifically, given a date and a sensor, the sensor measured an unbalanced daily wind speed during that date if, in that specific date, the sensor is characterized by at least 5 hours for which the minimum value of WindSpeed is less than 1 km/h and at least 5 hours for which the minimum value of WindSpeed is greater than 40 km/h.

The managers of PoliClimate asked you to develop one single application to address all the analyses they are interested in. The application has five arguments: the precipitation threshold (PrecThr), the wind speed threshold (WindThr), the path of the input file ClimateData.txt, and two output folders (associated with the outputs of the following points A and B, respectively). Specifically, design a single application, based on Spark, and write the corresponding Java code, to address the following points:

A. (8 points) Critical (SID, hour) pairs in April - May 2018. Considering only the measurements collected in the last two months (from April 2018 to May 2018), the application must select only the pairs (SID, hour) for which the average of Precipitation_mm for that pair is less than the specified threshold PrecThr and the average of WindSpeed for that pair is less than the specified threshold WindThr (PrecThr and WindThr are two arguments of the application). The application stores in the first HDFS output folder the information SID-hour for the selected pairs (SID, hour), one pair per line.

B. (11 points) Pairs (SID, date) characterized by an unbalanced daily wind speed during the last two months. Considering only the measurements collected in the last two months (from April 2018 to May 2018), the application must select the pairs (SID, date) that are characterized by an unbalanced daily wind speed, as defined above. The application stores in the second HDFS output folder the information SID-date for the selected pairs (SID, date), one pair per line.
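A minimal sketch of one possible RDD-based solution for points A and B is shown below. It is only an illustration: it assumes the field layout of the example line (SID, date, hour:minute, Precipitation_mm, WindSpeed separated by commas) and dates written as in 2018/03/01; the class name Exercise2SparkSketch and all variable names are illustrative.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class Exercise2SparkSketch {
    public static void main(String[] args) {
        double precThr = Double.parseDouble(args[0]);   // precipitation threshold
        double windThr = Double.parseDouble(args[1]);   // wind speed threshold
        String inputPath = args[2];                     // ClimateData.txt
        String outputPathA = args[3];                   // output folder of point A
        String outputPathB = args[4];                   // output folder of point B

        SparkConf conf = new SparkConf().setAppName("Exercise 2");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Keep only the measurements of April and May 2018
        JavaRDD<String> aprilMay = sc.textFile(inputPath)
                .filter(line -> {
                    String date = line.split(",")[1];
                    return date.startsWith("2018/04") || date.startsWith("2018/05");
                });

        // Point A: for each (SID, hour) pair, sum precipitations, sum wind speeds, count measurements
        JavaPairRDD<String, double[]> sumsPerSidHour = aprilMay.mapToPair(line -> {
            String[] f = line.split(",");
            String sidHour = f[0] + "-" + f[2].split(":")[0];   // SID-hour
            return new Tuple2<>(sidHour,
                    new double[]{Double.parseDouble(f[3]), Double.parseDouble(f[4]), 1.0});
        }).reduceByKey((a, b) -> new double[]{a[0] + b[0], a[1] + b[1], a[2] + b[2]});

        // Keep only the pairs whose averages are below both thresholds and store SID-hour
        sumsPerSidHour
                .filter(p -> p._2()[0] / p._2()[2] < precThr && p._2()[1] / p._2()[2] < windThr)
                .keys()
                .saveAsTextFile(outputPathA);

        // Point B: minimum wind speed for each (SID, date, hour) triple
        JavaPairRDD<String, Double> minWindPerHour = aprilMay.mapToPair(line -> {
            String[] f = line.split(",");
            String sidDateHour = f[0] + "-" + f[1] + "-" + f[2].split(":")[0];
            return new Tuple2<>(sidDateHour, Double.parseDouble(f[4]));
        }).reduceByKey((a, b) -> Math.min(a, b));

        // For each (SID, date), count the hours with minimum below 1 km/h and above 40 km/h
        minWindPerHour.mapToPair(p -> {
            int sep = p._1().lastIndexOf('-');              // drop the hour part of the key
            String sidDate = p._1().substring(0, sep);      // SID-date
            int low = p._2() < 1.0 ? 1 : 0;
            int high = p._2() > 40.0 ? 1 : 0;
            return new Tuple2<>(sidDate, new int[]{low, high});
        }).reduceByKey((a, b) -> new int[]{a[0] + b[0], a[1] + b[1]})
                .filter(p -> p._2()[0] >= 5 && p._2()[1] >= 5)
                .keys()
                .saveAsTextFile(outputPathB);

        sc.close();
    }
}

The sketch relies on reduceByKey to accumulate per-key sums and minima, so the averages of point A and the hour-level minima of point B are each computed with a single pass over the filtered data.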

Use the following template for the Driver of Exercise 1. Fill in the missing parts. You can strike through the second job if you do not need it.

import ...;

/* Driver class */
public class DriverBigData extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        int exitCode;

        Path inputPath = new Path(args[0]);
        Path outputDir = new Path(args[1]);
        Configuration conf = this.getConf();

        // First job
        Job job1 = Job.getInstance(conf);
        job1.setJobName("Exercise 1 - Job 1");

        // Job 1 - Input path
        FileInputFormat.addInputPath(job1,                );

        // Job 1 - Output path
        FileOutputFormat.setOutputPath(job1,              );

        // Job 1 - Driver class
        job1.setJarByClass(DriverBigData.class);

        // Job 1 - Input format
        job1.setInputFormatClass(                );

        // Job 1 - Output format
        job1.setOutputFormatClass(               );

        // Job 1 - Mapper class
        job1.setMapperClass(Mapper1BigData.class);

        // Job 1 - Mapper: output key and output value data types/classes
        job1.setMapOutputKeyClass(               );
        job1.setMapOutputValueClass(             );

        // Job 1 - Reducer class
        job1.setReducerClass(Reducer1BigData.class);

        // Job 1 - Reducer: output key and output value data types/classes
        job1.setOutputKeyClass(                  );
        job1.setOutputValueClass(                );

        // Job 1 - Number of reducers
        job1.setNumReduceTasks( 0 [ _ ]  or  1 [ _ ]  or  >=1 [ _ ] );   /* Select only one of the three options */

        // Execute the first job and wait for completion
        if (job1.waitForCompletion(true) == true) {

            // Second job
            Job job2 = Job.getInstance(conf);
            job2.setJobName("Exercise 1 - Job 2");

            // Set path of the input folder of the second job
            FileInputFormat.addInputPath(job2,                );

            // Set path of the output folder for the second job
            FileOutputFormat.setOutputPath(job2,              );

            // Class of the Driver for this job
            job2.setJarByClass(DriverBigData.class);

            // Set input format
            job2.setInputFormatClass(                );

            // Set output format
            job2.setOutputFormatClass(               );

            // Set map class
            job2.setMapperClass(Mapper2BigData.class);

            // Set map output key and value classes
            job2.setMapOutputKeyClass(               );
            job2.setMapOutputValueClass(             );

            // Set reduce class
            job2.setReducerClass(Reducer2BigData.class);

            // Set reduce output key and value classes
            job2.setOutputKeyClass(                  );
            job2.setOutputValueClass(                );

            // Set number of reducers of the second job
            job2.setNumReduceTasks( 0 [ _ ]  or  1 [ _ ]  or  >=1 [ _ ] );   /* Select only one of the three options */

            // Execute the second job and wait for completion
            if (job2.waitForCompletion(true) == true)
                exitCode = 0;
            else
                exitCode = 1;
        } else {
            exitCode = 1;
        }

        return exitCode;
    }

    /* Main of the driver */
    public static void main(String args[]) throws Exception {
        int res = ToolRunner.run(new Configuration(), new DriverBigData(), args);
        System.exit(res);
    }
}
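A minimal sketch of one possible Mapper and Reducer for Exercise 1 is shown below. It assumes the field layout of the example line (SID, date, hour:minute, Precipitation_mm, WindSpeed separated by commas) and zero-padded hours; the class names match the ones used in the Driver template. With this design a single job is sufficient, with at least one reducer.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class Mapper1BigData extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed layout: SID,date,hour:minute,Precipitation_mm,WindSpeed
        String[] fields = value.toString().split(",");
        String date = fields[1];                    // e.g., 2018/04/03
        String hour = fields[2].split(":")[0];      // e.g., 09 (assumed zero-padded)
        double windSpeed = Double.parseDouble(fields[4]);

        // Keep only April 2018, time slot 8:00-11:59, wind speed below 1.0 km/h
        if (date.startsWith("2018/04")
                && hour.compareTo("08") >= 0 && hour.compareTo("11") <= 0
                && windSpeed < 1.0) {
            context.write(new Text(fields[0]), ONE);
        }
    }
}

class Reducer1BigData extends Reducer<Text, IntWritable, Text, NullWritable> {

    @Override
    protected void reduce(Text sid, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Count how many times this sensor measured a wind speed below 1.0 km/h
        int count = 0;
        for (IntWritable v : values)
            count += v.get();

        // Emit the sensor identifier only if the condition occurred at least 1000 times
        if (count >= 1000)
            context.write(sid, NullWritable.get());
    }
}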