
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;

public class SparkDriver {

    public static void main(String[] args) {
        String inputPathPM10Readings;
        String outputPathMonthlyStatistics;
        String outputPathCriticalPeriods;
        double PM10threshold;

        inputPathPM10Readings = args[0];
        PM10threshold = Double.parseDouble(args[1]);
        outputPathMonthlyStatistics = args[2];
        outputPathCriticalPeriods = args[3];

        // Create a configuration object and set the name of the application
        SparkConf conf = new SparkConf().setAppName("Spark Exam 2 - Exercise #2");

        // Create a Spark Context object
        JavaSparkContext sc = new JavaSparkContext(conf);

        // *****************************************
        // Exercise 2 - Part A
        // *****************************************
        // Compute, for each pair (sensorid, month), the number of days with a
        // critical PM10 value and store the result in an HDFS folder.
        // Each line of the output file contains a pair
        // (sensorid_month, number of critical days for that sensor and month).

        // Read the content of PM10Readings.txt
        JavaRDD<String> PM10readingsRDD = sc.textFile(inputPathPM10Readings);

        // Select the critical readings/lines
        JavaRDD<String> criticalPM10ReadingsRDD =
                PM10readingsRDD.filter(new CriticalPM10(PM10threshold)).cache();

        // Count the number of critical days (lines) for each pair (sensorid, month).
        // Define a JavaPairRDD with sensorid_month as key and 1 as value
        JavaPairRDD<String, Integer> sensorMonthDatesCriticalPM10 =
                criticalPM10ReadingsRDD.mapToPair(new SensorMonthOne());

        // Use reduceByKey to compute the number of critical days for each pair (sensorid, month)
        JavaPairRDD<String, Integer> sensorMonthDatesNumDays =
                sensorMonthDatesCriticalPM10.reduceByKey(new Sum());

        sensorMonthDatesNumDays.saveAsTextFile(outputPathMonthlyStatistics);

        // *****************************************
        // Exercise 2 - Part B
        // *****************************************
        // Select, for each sensor, the dates associated with critical PM10 values.
        // The critical readings are already available in criticalPM10ReadingsRDD.
        // Each line associated with a critical PM10 value can be part of
        // three critical time periods (composed of three consecutive dates).
        // Suppose dateCurrent is the date of the current line.
        // The three potential critical time periods containing dateCurrent are:
        // - dateCurrent, dateCurrent+1, dateCurrent+2
        // - dateCurrent-1, dateCurrent, dateCurrent+1
        // - dateCurrent-2, dateCurrent-1, dateCurrent
        // For each line emit three pairs, one per potential period, keyed by the
        // first date of that period:
        // - dateCurrent_sensorid, 1
        // - dateCurrent-1_sensorid, 1
        // - dateCurrent-2_sensorid, 1
        // The sensorid is needed because the same date can be associated with many sensors
        JavaPairRDD<String, Integer> sensorDateInitialSequenceRDD =
                criticalPM10ReadingsRDD.flatMapToPair(new SensorDateInitialSequenceOne());

        // Count the number of ones for each key. If the sum is 3, it means that
        // the dates key, key+1, and key+2, for the sensor sensorid, are all associated
        // with a critical PM10 value. Hence, [key, key+2] is a critical time period
        // for that sensor
        JavaPairRDD<String, Integer> dateInitialSequenceCountRDD =
                sensorDateInitialSequenceRDD.reduceByKey(new Sum());

        // Select the critical time periods (i.e., the ones with count = 3)
        JavaPairRDD<String, Integer> dateInitialCriticalSequenceCountRDD =
                dateInitialSequenceCountRDD.filter(new ThreeValues());

        // Get only the keys (date_sensorid) and store them
        JavaRDD<String> dateInitialCriticalSequenceRDD =
                dateInitialCriticalSequenceCountRDD.keys();
        dateInitialCriticalSequenceRDD.saveAsTextFile(outputPathCriticalPeriods);

        // Close the Spark context
        sc.close();
    }
}
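For quick local testing before submitting to a cluster, the same pipeline can be run on a small in-memory sample. The class below is a hypothetical smoke test, not part of the exam solution: the class name, the local[*] master, the threshold of 50, and the sample readings are all illustrative, and it assumes the Spark 1.x Java API used above (PairFlatMapFunction returning an Iterable).

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;

public class LocalSmokeTest {
    public static void main(String[] args) {
        // Run locally, using all available cores
        SparkConf conf = new SparkConf().setAppName("Local smoke test").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Three consecutive critical dates for sensor#1 and one
        // non-critical reading for sensor#2 (threshold = 50)
        JavaRDD<String> readings = sc.parallelize(Arrays.asList(
                "sensor#1,01-10-2014,60.5",
                "sensor#1,01-11-2014,75.0",
                "sensor#1,01-12-2014,55.2",
                "sensor#2,01-10-2014,20.1"));

        JavaRDD<String> critical = readings.filter(new CriticalPM10(50.0)).cache();

        // Part A: expected [(sensor#1_01-2014,3)]
        System.out.println(critical.mapToPair(new SensorMonthOne())
                .reduceByKey(new Sum()).collect());

        // Part B: expected [01-10-2014_sensor#1]
        System.out.println(critical.flatMapToPair(new SensorDateInitialSequenceOne())
                .reduceByKey(new Sum()).filter(new ThreeValues()).keys().collect());

        sc.close();
    }
}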

import org.apache.spark.api.java.function.Function;

public class CriticalPM10 implements Function<String, Boolean> {
    private double PM10threshold;

    public CriticalPM10(double PM10th) {
        PM10threshold = PM10th;
    }

    public Boolean call(String line) {
        // fields[2] = PM10 value
        String[] fields = line.split(",");

        if (Double.parseDouble(fields[2]) > PM10threshold)
            return true;
        else
            return false;
    }
}

import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

public class SensorMonthOne implements PairFunction<String, String, Integer> {
    @Override
    public Tuple2<String, Integer> call(String PM10reading) {
        // e.g., sensor#1,1-10-2014,15.3
        String sensorid;
        String month;
        String year;

        // fields[0] = sensorid
        // fields[1] = date (MM-dd-yyyy)
        String[] fields = PM10reading.split(",");
        sensorid = fields[0];
        month = fields[1].split("-")[0];
        year = fields[1].split("-")[2];

        return new Tuple2<String, Integer>(sensorid + "_" + month + "-" + year, new Integer(1));
    }
}
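For example, the reading sensor#1,1-10-2014,15.3 is mapped to the pair (sensor#1_1-2014, 1): the key combines the sensor identifier with the month and year extracted from the date, so that critical days are counted separately per sensor and per month.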

import org.apache.spark.api.java.function.Function2;

public class Sum implements Function2<Integer, Integer, Integer> {
    public Integer call(Integer num1, Integer num2) {
        return num1 + num2;
    }
}
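Since Function2 has a single abstract method, Java 8 code could pass a lambda directly, e.g. reduceByKey((num1, num2) -> num1 + num2), instead of defining this class; the named class is kept here to match the style of the exam solution.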

import java.util.ArrayList;
import org.apache.spark.api.java.function.PairFlatMapFunction;
import scala.Tuple2;

public class SensorDateInitialSequenceOne implements PairFlatMapFunction<String, String, Integer> {
    public Iterable<Tuple2<String, Integer>> call(String PM10reading) {
        ArrayList<Tuple2<String, Integer>> initialDates = new ArrayList<Tuple2<String, Integer>>();

        // e.g., sensor#1,1-10-2014,15.3
        String[] fields = PM10reading.split(",");
        // fields[0] = sensorid
        // fields[1] = date

        // Return
        // - date_sensorid, 1
        // - date-1_sensorid, 1
        // - date-2_sensorid, 1
        String sensorid = fields[0];
        String date = fields[1];

        initialDates.add(new Tuple2<String, Integer>(date + "_" + sensorid, new Integer(1)));
        initialDates.add(new Tuple2<String, Integer>(DateTool.dateOffset(date, -1) + "_" + sensorid, new Integer(1)));
        initialDates.add(new Tuple2<String, Integer>(DateTool.dateOffset(date, -2) + "_" + sensorid, new Integer(1)));

        return initialDates;
    }
}
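For example (dates illustrative, and assuming at most one reading per sensor per day), the critical reading sensor#1,01-12-2014,75.0 emits the pairs (01-12-2014_sensor#1, 1), (01-11-2014_sensor#1, 1), and (01-10-2014_sensor#1, 1). If 01-10-2014, 01-11-2014, and 01-12-2014 are all critical for sensor#1, the key 01-10-2014_sensor#1 receives exactly three ones, and the subsequent reduceByKey/filter steps select it as the start of a critical period.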

import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

public class ThreeValues implements Function<Tuple2<String, Integer>, Boolean> {
    public Boolean call(Tuple2<String, Integer> initialDateCount) {
        if (initialDateCount._2() == 3)
            return true;
        else
            return false;
    }
}

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

public class DateTool {
    public static String dateOffset(String date, int deltaDays) {
        String newDate;
        Date d = new Date();
        SimpleDateFormat format = new SimpleDateFormat("MM-dd-yyyy");

        try {
            d = format.parse(date);
        } catch (ParseException e) {
            e.printStackTrace();
            // Debug: show the fallback date used when parsing fails
            System.out.println(d);
        }

        Calendar cal = Calendar.getInstance();
        cal.setTime(d);
        cal.add(Calendar.DATE, deltaDays);
        newDate = format.format(cal.getTime());

        return newDate;
    }
}
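A quick sanity check of the offset logic (the demo class and dates are illustrative):

public class DateToolDemo {
    public static void main(String[] args) {
        // Calendar handles month and year rollover automatically
        System.out.println(DateTool.dateOffset("01-10-2014", -1)); // 01-09-2014
        System.out.println(DateTool.dateOffset("01-01-2014", -1)); // 12-31-2013
    }
}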