FINAL PROJECT REPORT


NYC TAXI DATA ANALYSIS

Reshmi Padavala

Project Summary:

For my final project, I decided to showcase my big data analysis skills by working with a large dataset, one in which patterns are difficult to identify and visualize with ordinary tools but which becomes tractable with techniques such as MapReduce that I learned during this course. After researching and weighing my options, I chose the New York City (NYC) taxi data to produce an in-depth analysis of taxi ride patterns and behaviors. These datasets were made accessible by the NYC Taxi and Limousine Commission (TLC). The NYC taxi dataset contains trip records which include fields like "pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts" [1]. The dataset is larger than I expected, containing millions of records, so I narrowed the scope of the project to a set of core analyses. In this paper, I analyze trends in NYC taxi rides from different boroughs during the period of

Dataset:
NYC taxi dataset:
NYC geostats:

Analysis:

Analysis 1: Data Cleansing

The dataset I chose is large and raw, so I had to cleanse the date fields, drop null records, and keep only the records I needed. I used the Simple Filtering pattern for this.

Output:

Analysis 2: Statistical Analysis

This analysis computes daily statistics such as total rides, revenue, maximum toll charge, and maximum tip amount for every day in the dataset. I used the date as the key and generated the values using a Custom Writable object. Because the reducer only aggregates (sums and maxima), I reused it as a combiner to improve performance.
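Reusing a reducer as a combiner is only safe when its merge step gives the same result no matter how the records are grouped, which holds for sums and maxima. The following is a minimal plain-Java sketch of that idea, without Hadoop; the class name `DailyStats`, the row layout `{date, amount, toll, tip}`, and the sample values are assumptions made for illustration and are not taken from the project code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-day summary; a plain-Java stand-in for a Custom Writable value. */
class DailyStats {
    final long rides;
    final double revenue;
    final double maxToll;
    final double maxTip;

    DailyStats(long rides, double revenue, double maxToll, double maxTip) {
        this.rides = rides;
        this.revenue = revenue;
        this.maxToll = maxToll;
        this.maxTip = maxTip;
    }

    /** Associative and commutative merge: counts and sums add, maxima keep the larger value. */
    DailyStats merge(DailyStats other) {
        return new DailyStats(rides + other.rides, revenue + other.revenue,
                Math.max(maxToll, other.maxToll), Math.max(maxTip, other.maxTip));
    }

    /** Folds rows of {date, amount, toll, tip} by date, as a combiner or a reducer would. */
    static Map<String, DailyStats> aggregate(List<String[]> rows) {
        Map<String, DailyStats> byDay = new TreeMap<>();
        for (String[] r : rows) {
            DailyStats one = new DailyStats(1, Double.parseDouble(r[1]),
                    Double.parseDouble(r[2]), Double.parseDouble(r[3]));
            byDay.merge(r[0], one, DailyStats::merge);
        }
        return byDay;
    }

    public static void main(String[] args) {
        // Made-up sample rows, not real TLC records.
        List<String[]> rows = Arrays.asList(
                new String[] {"2000-01-01", "10.0", "0.0", "2.0"},
                new String[] {"2000-01-01", "20.5", "5.5", "1.0"},
                new String[] {"2000-01-02", "7.0", "0.0", "0.0"});
        aggregate(rows).forEach((day, s) -> System.out.println(
                day + "\t" + s.rides + "\t" + s.revenue + "\t" + s.maxToll + "\t" + s.maxTip));
    }
}
```

Because `merge` does not depend on grouping order, partial results produced on each mapper can be merged again on the reducer without changing the totals.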

Output:

Analysis 3: Peak Hour Analysis

This analysis identifies the peak hours in a day and the amount earned during those hours. The motivation is that rides are not evenly distributed across the hours of a day: most rides are expected during office commute hours, in the morning or in the evening. Using {Date, Hours} as a Custom Writable key, I performed Secondary Sorting and generated {total rides, total amount} as the value. This output is chained into a second Secondary Sorting job in which {Date, total rides} is the key and {Hours, total amount} is the value, which yields the peak hours in each day.
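The two chained steps can be sketched in plain Java, without Hadoop, as an aggregation per {date, hour} followed by a sort on {date, rides descending}. The names `PeakHours` and `HourBucket` and the row layout `{date, "HH:mm:ss", amount}` are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PeakHours {
    /** One {date, hour} bucket with its ride count and total amount. */
    static final class HourBucket {
        final String date;
        final int hour;
        int rides;
        double amount;

        HourBucket(String date, int hour) {
            this.date = date;
            this.hour = hour;
        }
    }

    /** Step 1: total rides and total amount per {date, hour}. */
    static List<HourBucket> aggregate(List<String[]> rows) {
        Map<String, HourBucket> buckets = new HashMap<>();
        for (String[] r : rows) {
            int hour = Integer.parseInt(r[1].split(":")[0]);
            HourBucket b = buckets.computeIfAbsent(r[0] + "#" + hour, k -> new HourBucket(r[0], hour));
            b.rides++;
            b.amount += Double.parseDouble(r[2]);
        }
        return new ArrayList<>(buckets.values());
    }

    /** Step 2: sort by date, then rides descending; the first bucket seen per date is its peak hour. */
    static Map<String, HourBucket> peakPerDay(List<HourBucket> buckets) {
        List<HourBucket> sorted = new ArrayList<>(buckets);
        sorted.sort((a, b) -> {
            int byDate = a.date.compareTo(b.date);
            return byDate != 0 ? byDate : Integer.compare(b.rides, a.rides);
        });
        Map<String, HourBucket> peaks = new LinkedHashMap<>();
        for (HourBucket b : sorted) {
            peaks.putIfAbsent(b.date, b);
        }
        return peaks;
    }

    public static void main(String[] args) {
        // Made-up sample rows, not real TLC records.
        List<String[]> rows = Arrays.asList(
                new String[] {"2000-01-01", "08:15:00", "10.0"},
                new String[] {"2000-01-01", "08:45:00", "5.0"},
                new String[] {"2000-01-01", "17:05:00", "7.5"});
        peakPerDay(aggregate(rows)).forEach((day, b) -> System.out.println(
                day + "\t" + b.hour + "\t" + b.rides + "\t" + b.amount));
    }
}
```

In the MapReduce version, the sort in step 2 is done by the framework through the composite key's `compareTo`, so the reducer only has to emit records in the order it receives them.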


Output:

Analysis 4: Day Based Surcharge Analysis

The surcharges applied to a ride depend on the day of the week. Because rider demand is high on weekends, Saturday and Sunday are expected to have more applicable surcharges than the rest of the week.


To perform this analysis, I used the Partitioning pattern and divided the data into 7 partitions, one for each day of the week. Because the data is partitioned, further analysis can be run on any single partition without the cost of running MapReduce over the entire dataset. After partitioning, I calculated the total surcharge for each day of the week from its partition. The results are in line with that expectation.

Output:

Analysis 5: Boroughs with Most Number of Riders

Every ride is tied to the neighborhood where the pickup happens, and passengers in some locations are more likely to commute by taxi. This analysis finds how many passengers from each borough have commuted by NYC taxi.


To perform this analysis, I did an inner join between the taxi rides data and the location data, so that the joined output carries the pickup borough. The output is then grouped by borough to find the total passengers.

Output:

Analysis 6: Identifying Distinct Neighborhoods

Each borough has multiple zones. I wanted to know how many unique zones or neighborhoods exist in total, so I used the Distinct pattern to filter the unique zones out of the NYC taxi dataset.
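The join-then-group step of Analysis 5 and the distinct step of Analysis 6 can be sketched in plain Java as follows. This is only an illustration under assumptions: it joins on a hypothetical location id held in an in-memory lookup table (a replicated-join style), and the row layouts `{locationId, passengers}` and `{locationId, borough, zone}` are invented for the example rather than taken from the project.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class BoroughJoin {
    /** Inner-joins rides with zones on location id and sums passengers per borough. */
    static Map<String, Integer> passengersPerBorough(List<String[]> rides, List<String[]> zones) {
        Map<String, String> boroughById = new HashMap<>();
        for (String[] z : zones) {
            boroughById.put(z[0], z[1]);
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (String[] r : rides) {
            String borough = boroughById.get(r[0]);
            if (borough == null) {
                continue; // inner join: rides without a matching zone are dropped
            }
            totals.merge(borough, Integer.parseInt(r[1]), Integer::sum);
        }
        return totals;
    }

    /** Distinct pattern: a set keeps each zone name exactly once. */
    static Set<String> distinctZones(List<String[]> zones) {
        Set<String> names = new TreeSet<>();
        for (String[] z : zones) {
            names.add(z[2]);
        }
        return names;
    }

    public static void main(String[] args) {
        // Made-up sample rows: zones are {locationId, borough, zone}, rides are {locationId, passengers}.
        List<String[]> zones = Arrays.asList(
                new String[] {"1", "Manhattan", "Midtown"},
                new String[] {"2", "Queens", "Astoria"});
        List<String[]> rides = Arrays.asList(
                new String[] {"1", "2"},
                new String[] {"2", "1"},
                new String[] {"99", "4"});
        System.out.println(passengersPerBorough(rides, zones));
        System.out.println(distinctZones(zones));
    }
}
```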

Analysis 7: Top 10 Pickup Zones

This analysis identifies the most frequent pickup zones. I first computed how pickups are distributed across the zones of New York City and then, based on the total rides per zone, emitted the top ten zones using the Top Ten pattern.
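One common way to implement a top-N selection is a bounded min-heap that never holds more than N candidates. The sketch below shows that approach in plain Java; it is an assumption about how such a step could look, not the project's code, and the class name `TopZones` and the sample counts are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopZones {
    /** Returns the n zones with the most rides, highest first. */
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> ridesPerZone, int n) {
        // Min-heap on ride count: the weakest candidate is always at the head.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Integer> e : ridesPerZone.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) {
                heap.poll(); // evict the current smallest so only n entries remain
            }
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return result;
    }

    public static void main(String[] args) {
        // Made-up ride counts per zone.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("A", 40);
        counts.put("B", 10);
        counts.put("C", 50);
        for (Map.Entry<String, Integer> e : topN(counts, 2)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Keeping only N entries per task means each mapper can emit its local top N and a single reducer can merge those short lists.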


Output:

Analysis 8: Fare Analysis on Top 5 Zones

Based on the previous result, I concentrated on the 5 zones with the most riders. Since I knew in advance which zones I was looking for, I used a Bloom Filter to filter out the zones that are not in the top 5 list. On the filtered data, I computed the median fare for each of these zones, and I used a combiner for optimization.
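The two ideas in this analysis can be sketched in plain Java. A Bloom filter answers "definitely not in the set" or "probably in the set", and a median cannot be combined by averaging partial medians, so one possible approach (an assumption here, since the report does not show this code) is for the combiner to emit a fare-to-count map from which the reducer finds the middle value. The class names, the simple hash mixing, and the row layout `{zone, fare}` are hypothetical.

```java
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class TopZoneMedian {
    /** Minimal Bloom filter over strings; a production filter would use stronger hash functions. */
    static final class Bloom {
        private final BitSet bits;
        private final int size;
        private final int k;

        Bloom(int size, int k) {
            this.bits = new BitSet(size);
            this.size = size;
            this.k = k;
        }

        /** i-th bit position for s, derived from two hash values (double hashing). */
        private int index(String s, int i) {
            int h1 = s.hashCode();
            int h2 = (31 * h1 + 17) | 1;
            return Math.floorMod(h1 + i * h2, size);
        }

        void add(String s) {
            for (int i = 0; i < k; i++) {
                bits.set(index(s, i));
            }
        }

        /** False means definitely absent; true means probably present. */
        boolean mightContain(String s) {
            for (int i = 0; i < k; i++) {
                if (!bits.get(index(s, i))) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Median from a value-to-count map, the compact form a combiner could emit. */
    static double median(SortedMap<Double, Long> counts) {
        long total = 0;
        for (long c : counts.values()) {
            total += c;
        }
        if (total == 0) {
            throw new IllegalArgumentException("no values");
        }
        long lowRank = (total - 1) / 2; // 0-based positions of the middle element(s)
        long highRank = total / 2;
        long seen = 0;
        Double low = null;
        for (Map.Entry<Double, Long> e : counts.entrySet()) {
            seen += e.getValue();
            if (low == null && seen > lowRank) {
                low = e.getKey();
            }
            if (seen > highRank) {
                return (low + e.getKey()) / 2.0;
            }
        }
        throw new IllegalStateException("counts changed during iteration");
    }

    /** Keeps rides {zone, fare} whose zone passes the filter and returns the median fare per zone. */
    static Map<String, Double> medianFarePerZone(List<String[]> rides, Bloom topZones) {
        Map<String, SortedMap<Double, Long>> perZone = new TreeMap<>();
        for (String[] r : rides) {
            if (!topZones.mightContain(r[0])) {
                continue;
            }
            perZone.computeIfAbsent(r[0], z -> new TreeMap<>())
                    .merge(Double.parseDouble(r[1]), 1L, Long::sum);
        }
        Map<String, Double> medians = new TreeMap<>();
        for (Map.Entry<String, SortedMap<Double, Long>> e : perZone.entrySet()) {
            medians.put(e.getKey(), median(e.getValue()));
        }
        return medians;
    }
}
```

Since a Bloom filter can return false positives, a zone outside the top 5 may occasionally pass; a final exact check against the list removes such records if that matters.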

Output:

Analysis 9: Calculate the longest rides

I used a Pig Latin script for this analysis. I LOADed the taxi data and zone data into relations, performed a JOIN on the pickup ID to get the pickup zone name, and then joined the result with the zone data on the drop-off ID to get the drop-off zone. I then ORDERed the result by the total distance covered in the ride and LIMITed the output to the 20 longest rides.


Analysis 10: Calculate total rides from each borough

I used a Pig Latin script for this analysis. I GROUPed the loaded data by pickup borough and calculated the total number of rides in each group.

[Pie chart: Total rides from each borough. Legend: EWR, Bronx, Queens, Unknown, Brooklyn, Manhattan, Staten Island]

Output:

Analysis 11: Find the number of riders from each borough every day

I used Hive for this analysis. After loading the data, I grouped it by pickup date and pickup borough and generated the number of riders.

Programming Code:

Analysis 1:

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */
package analysis3;

import java.io.IOException;
import java.text.DecimalFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author reshmip
 */
public class Analysis3 {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "filtering data");
        job.setJarByClass(Analysis3.class);
        job.setMapperClass(FilteringMapper.class);
        job.setMapOutputValueClass(CustomWritable.class);
        job.setMapOutputKeyClass(NullWritable.class);
        //job.setOutputKeyClass(NullWritable.class);
        //job.setOutputValueClass(CustomWritable.class);
        job.setNumReduceTasks(1);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    public static class FilteringMapper extends Mapper<Object, Text, NullWritable, CustomWritable> {

        private CustomWritable customwritable = new CustomWritable();
        // The date pattern was lost in the source; "yyyy-MM-dd" is an assumption
        // based on the date part of the pickup timestamp.
        private final static SimpleDateFormat frmt = new SimpleDateFormat("yyyy-MM-dd");

        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String[] line_values = line.split(",");
            Calendar cal = Calendar.getInstance();
            DecimalFormat numberformat = new DecimalFormat("#.0000");
            try {
                // Keep only complete rows; skip the header row and null pickup timestamps.
                if (line_values.length == 21) {
                    if (!line_values[1].equals("lpep_pickup_datetime") && line_values[1] != null
                            && !line_values[1].equals("null")) {
                        String[] pickupdatestring = line_values[1].split(" ");
                        String pickupdate = pickupdatestring[0];
                        String pickuptime = pickupdatestring[1];
                        // Parsing validates the date; malformed dates raise ParseException.
                        cal.setTime(frmt.parse(pickupdate));
                        String pick_date = cal.getTime().toString();
                        // String[] new_date = pick_date.split(" ");
                        // String new_pickdate = new StringBuilder().append(new_date[0])
                        //         .append(" ").append(new_date[1]).append(" ")
                        //         .append(new_date[2]).append(" ").append(new_date[5]).toString();
                        customwritable.setRide_pickup_date(pickupdate);
                        customwritable.setRide_pickup_time(pickuptime);
                        // String[] dropoffdatestring = line_values[2].split(" ");
                        // String dropoffdate = dropoffdatestring[0];
                        // String dropofftime = dropoffdatestring[1];
                        // customwritable.setRide_dropoff_date(dropoffdate);
                        // customwritable.setRide_dropoff_time(dropofftime);
                        // customwritable.setRatecodeid(Integer.parseInt(line_values[4]));
                        String pickup_longitude = "", drop_longitude = "";
                        String pickup_latitude = "", drop_latitiude = "";
                        // Truncate coordinates to a fixed precision when they are long enough.
                        // The closing brace of this block is missing in the source; its position is assumed.
                        if (line_values[5].length() > 8 && line_values[6].length() > 7
                                && line_values[7].length() > 8 && line_values[8].length() > 7) {
                            pickup_longitude = line_values[5].substring(0, 8);
                            pickup_latitude = line_values[6].substring(0, 7);
                            drop_longitude = line_values[7].substring(0, 8);
                            drop_latitiude = line_values[8].substring(0, 7);
                        }
                        //System.err.println("coordinates:"+longitude);
                        customwritable.setPick_longitude(pickup_longitude);
                        customwritable.setPickup_latitude(pickup_latitude);
                        // customwritable.setDropoff_longitude(drop_longitude);
                        // customwritable.setDropoff_latitude(drop_latitiude);
                        customwritable.setPassengers(Integer.parseInt(line_values[9]));
                        customwritable.setTrip_distance(Double.parseDouble(line_values[10]));
                        customwritable.setFare_amount(Double.parseDouble(line_values[11]));
                        customwritable.setExta(Double.parseDouble(line_values[12]));
                        customwritable.setMta_tax(Double.parseDouble(line_values[13]));
                        customwritable.setTip_amount(Double.parseDouble(line_values[14]));
                        customwritable.setTotal_amount(Double.parseDouble(line_values[18]));
                        customwritable.setPayment_type(Integer.parseInt(line_values[19]));
                        context.write(NullWritable.get(), customwritable);
                    }
                }
            } catch (NullPointerException ex) {
                ex.getMessage();
            } catch (ParseException ex) {
                Logger.getLogger(Analysis3.class.getName()).log(Level.SEVERE, null, ex);
            }
        }
    }
}

Custom Writable.java

package analysis3;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableUtils;

/**

 * @author reshmip
 */
public class CustomWritable implements Writable {

    private String ride_pickup_date;
    private String ride_pickup_time;
    private String ride_dropoff_date;
    private String ride_dropoff_time;
    private int ratecodeid;
    private String pick_longitude;
    private String pickup_latitude;
    private String dropoff_longitude;
    private String dropoff_latitude;
    private int passengers;
    private Double trip_distance;
    private Double fare_amount;
    private Double exta;
    private Double mta_tax;
    private Double tip_amount;
    private Double total_amount;
    private int payment_type;

    public String getRide_pickup_date() { return ride_pickup_date; }
    public void setRide_pickup_date(String ride_pickup_date) { this.ride_pickup_date = ride_pickup_date; }
    public String getRide_pickup_time() { return ride_pickup_time; }
    public void setRide_pickup_time(String ride_pickup_time) { this.ride_pickup_time = ride_pickup_time; }
    public String getRide_dropoff_date() { return ride_dropoff_date; }
    public void setRide_dropoff_date(String ride_dropoff_date) { this.ride_dropoff_date = ride_dropoff_date; }
    public String getRide_dropoff_time() { return ride_dropoff_time; }
    public void setRide_dropoff_time(String ride_dropoff_time) { this.ride_dropoff_time = ride_dropoff_time; }
    public int getRatecodeid() { return ratecodeid; }
    public void setRatecodeid(int ratecodeid) { this.ratecodeid = ratecodeid; }
    public String getPick_longitude() { return pick_longitude; }
    public void setPick_longitude(String pick_longitude) { this.pick_longitude = pick_longitude; }
    public String getPickup_latitude() { return pickup_latitude; }
    public void setPickup_latitude(String pickup_latitude) { this.pickup_latitude = pickup_latitude; }
    public String getDropoff_longitude() { return dropoff_longitude; }
    public void setDropoff_longitude(String dropoff_longitude) { this.dropoff_longitude = dropoff_longitude; }
    public String getDropoff_latitude() { return dropoff_latitude; }
    public void setDropoff_latitude(String dropoff_latitude) { this.dropoff_latitude = dropoff_latitude; }
    public int getPassengers() { return passengers; }
    public void setPassengers(int passengers) { this.passengers = passengers; }
    public Double getTrip_distance() { return trip_distance; }
    public void setTrip_distance(Double trip_distance) { this.trip_distance = trip_distance; }
    public Double getFare_amount() { return fare_amount; }
    public void setFare_amount(Double fare_amount) { this.fare_amount = fare_amount; }
    public Double getExta() { return exta; }
    public void setExta(Double exta) { this.exta = exta; }
    public Double getMta_tax() { return mta_tax; }
    public void setMta_tax(Double mta_tax) { this.mta_tax = mta_tax; }
    public Double getTip_amount() { return tip_amount; }
    public void setTip_amount(Double tip_amount) { this.tip_amount = tip_amount; }
    public Double getTotal_amount() { return total_amount; }
    public void setTotal_amount(Double total_amount) { this.total_amount = total_amount; }
    public int getPayment_type() { return payment_type; }
    public void setPayment_type(int payment_type) { this.payment_type = payment_type; }

    // write() and readFields() must handle the same fields in the same order.
    @Override
    public void write(DataOutput d) throws IOException {
        WritableUtils.writeString(d, ride_pickup_date);
        WritableUtils.writeString(d, ride_pickup_time);
        //WritableUtils.writeString(d, ride_dropoff_date);
        //WritableUtils.writeString(d, ride_dropoff_time);
        d.writeInt(ratecodeid);
        d.writeInt(passengers);
        d.writeInt(payment_type);
        WritableUtils.writeString(d, pick_longitude);
        WritableUtils.writeString(d, pickup_latitude);
        //WritableUtils.writeString(d, dropoff_longitude);
        //WritableUtils.writeString(d, dropoff_latitude);
        d.writeDouble(trip_distance);
        d.writeDouble(fare_amount);
        d.writeDouble(exta);
        d.writeDouble(mta_tax);
        d.writeDouble(tip_amount);
        d.writeDouble(total_amount);
    }

    @Override
    public void readFields(DataInput di) throws IOException {
        ride_pickup_date = WritableUtils.readString(di);
        ride_pickup_time = WritableUtils.readString(di);
        //ride_dropoff_date = WritableUtils.readString(di);
        //ride_dropoff_time = WritableUtils.readString(di);
        ratecodeid = di.readInt();
        passengers = di.readInt();
        payment_type = di.readInt();
        pick_longitude = WritableUtils.readString(di);
        pickup_latitude = WritableUtils.readString(di);
        //dropoff_latitude = WritableUtils.readString(di);
        //dropoff_longitude = WritableUtils.readString(di);
        trip_distance = di.readDouble();
        fare_amount = di.readDouble();
        exta = di.readDouble();
        mta_tax = di.readDouble();
        tip_amount = di.readDouble();
        total_amount = di.readDouble();
    }

    // Tab-separated text form; this is the 13-field layout read by the later jobs.
    @Override
    public String toString() {
        return (new StringBuilder().append(ride_pickup_date).
                append("\t").append(ride_pickup_time).
                //append("\t").append(ride_dropoff_date).
                //append("\t").append(ride_dropoff_time).
                append("\t").append(ratecodeid).
                append("\t").append(pick_longitude).
                append("\t").append(pickup_latitude).
                //append("\t").append(dropoff_longitude).
                //append("\t").append(dropoff_latitude).
                append("\t").append(passengers).
                append("\t").append(trip_distance).
                append("\t").append(fare_amount).
                append("\t").append(exta).
                append("\t").append(mta_tax).
                append("\t").append(tip_amount).
                append("\t").append(total_amount).
                append("\t").append(payment_type).
                toString());
    }
}

Analysis 2:

Driver class:

package analysis1;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author reshmip
 */
public class Analysis1 {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "summarize trips");
        job.setJarByClass(Analysis1.class);
        job.setMapperClass(Analysis1_Mapper.class);
        job.setMapOutputValueClass(CustomWritable.class);
        // The reducer only sums and takes maxima, so it doubles as the combiner.
        job.setCombinerClass(Analysis1_Reducer.class);
        job.setReducerClass(Analysis1_Reducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(CustomWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Mapper:

package analysis1;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * @author reshmip
 */
public class Analysis1_Mapper extends Mapper<Object, Text, Text, CustomWritable> {

    private CustomWritable customwritable = new CustomWritable();
    //private IntWritable ride;
    private Text tripdate = new Text();

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        double distance = 0;
        double fare = 0;
        String[] line_values = line.split(",");
        try {
            if (line_values.length == 21) {
                // Strings are compared with equals(); != would compare references.
                if (line_values[1] != null && !line_values[1].equals("pickup_date")
                        && !line_values[1].equals("") && !line_values[1].equals("na")) {
                    // Key: the date part of the pickup timestamp.
                    tripdate.set(line_values[1].split(" ")[0]);
                    customwritable.setTrip_distance(Double.parseDouble(line_values[10]));
                    customwritable.setTrip_fare(Double.parseDouble(line_values[11]));
                    customwritable.setMax_tip(Double.parseDouble(line_values[14]));
                    customwritable.setMax_toll(Double.parseDouble(line_values[15]));
                    context.write(tripdate, customwritable);
                }
            }
        } catch (NumberFormatException ex) {
            ex.getMessage();
        } catch (NullPointerException ex) {
            ex.getMessage();
        }
    }
}

Reducer:

package analysis1;

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * @author reshmip
 */
public class Analysis1_Reducer extends Reducer<Text, CustomWritable, Text, CustomWritable> {

    private CustomWritable result = new CustomWritable();

    @Override
    protected void reduce(Text key, Iterable<CustomWritable> values, Context context) throws IOException, InterruptedException {
        double sumtrip = 0;
        double sumfare = 0;
        Double max_tip = 0.0;
        Double max_toll = 0.0;
        result.setMax_tip(0.0);
        result.setMax_toll(0.0);
        result.setTrip_distance(0.0);
        result.setTrip_fare(0.0);
        for (CustomWritable val : values) {
            sumtrip += val.getTrip_distance();
            sumfare += val.getTrip_fare();
            max_tip = val.getMax_tip();
            max_toll = val.getMax_toll();
            if (result.getMax_tip() == null || max_tip.compareTo(result.getMax_tip()) > 0) {
                result.setMax_tip(max_tip);
            }
            if (result.getMax_toll() == null || max_toll.compareTo(result.getMax_toll()) > 0) {
                result.setMax_toll(max_toll);
            }
        }
        result.setTrip_distance(sumtrip);
        result.setTrip_fare(sumfare);
        context.write(key, result);
    }
}

Custom Writable:

package analysis1;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

/**
 * @author reshmip
 */
public class CustomWritable implements Writable {

    private Double trip_distance;
    private Double trip_fare;
    private Double max_tip;
    private Double max_toll;

    public Double getTrip_distance() { return trip_distance; }
    public void setTrip_distance(Double trip_distance) { this.trip_distance = trip_distance; }
    public Double getTrip_fare() { return trip_fare; }
    public void setTrip_fare(Double trip_fare) { this.trip_fare = trip_fare; }
    public Double getMax_tip() { return max_tip; }
    public void setMax_tip(Double max_tip) { this.max_tip = max_tip; }
    public Double getMax_toll() { return max_toll; }
    public void setMax_toll(Double max_toll) { this.max_toll = max_toll; }

    @Override
    public void write(DataOutput d) throws IOException {
        d.writeDouble(trip_distance);
        d.writeDouble(trip_fare);
        d.writeDouble(max_tip);
        d.writeDouble(max_toll);
    }

    @Override
    public void readFields(DataInput di) throws IOException {
        trip_distance = di.readDouble();
        trip_fare = di.readDouble();
        max_tip = di.readDouble();
        max_toll = di.readDouble();
    }

    @Override
    public String toString() {
        return (new StringBuilder().append(trip_distance).append("\t").append(trip_fare).append("\t").append(max_tip).append("\t").append(max_toll).toString());
    }
}

Analysis 3:

Driver class:

package analysis2;

import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/**
 * @author reshmip
 */
public class Analysis2 {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // Job 1: total rides and total amount per {date, hour}.
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "first secondary sorting by date and hours");
        job.setJarByClass(Analysis2.class);
        job.setMapperClass(SecondarySortMapper.class);
        job.setMapOutputKeyClass(CompositeKeyWritable.class);
        job.setMapOutputValueClass(CustomValueWritable.class);
        //job.setGroupingComparatorClass(GroupingComparator.class);
        //job.setNumReduceTasks(0);
        job.setReducerClass(SecondarySortReducer.class);
        job.setOutputKeyClass(CompositeKeyWritable.class);
        job.setOutputValueClass(CustomValueWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        boolean complete = job.waitForCompletion(true);

        // Job 2: reads the output of job 1 and sorts on {date, total rides}.
        Configuration conf2 = new Configuration();
        Job job2 = Job.getInstance(conf2, "second secondary sorting on date and rides");
        if (complete) {
            job2.setJarByClass(Analysis2.class);
            job2.setMapperClass(PeakAnalysisMapper.class);
            job2.setMapOutputKeyClass(PeakAnalysisWritable.class);
            job2.setMapOutputValueClass(PeakAnalysisValueWritable.class);
            job2.setReducerClass(PeakAnalysisReducer.class);
            job2.setOutputKeyClass(PeakAnalysisWritable.class);
            job2.setOutputValueClass(PeakAnalysisValueWritable.class);
            FileInputFormat.addInputPath(job2, new Path(args[1]));
            FileOutputFormat.setOutputPath(job2, new Path(args[2]));
            System.exit(job2.waitForCompletion(true) ? 0 : 1);
        }
    }
}

Composite Key Writable:

package analysis2;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableUtils;

/**

 * @author reshmip
 */
public class CompositeKeyWritable implements WritableComparable<CompositeKeyWritable> {

    private String ride_date;
    private String ride_time;

    public String getRide_date() { return ride_date; }
    public void setRide_date(String ride_date) { this.ride_date = ride_date; }
    public String getRide_time() { return ride_time; }
    public void setRide_time(String ride_time) { this.ride_time = ride_time; }

    @Override
    public void write(DataOutput d) throws IOException {
        WritableUtils.writeString(d, ride_date);
        WritableUtils.writeString(d, ride_time);
    }

    @Override
    public void readFields(DataInput di) throws IOException {
        ride_date = WritableUtils.readString(di);
        ride_time = WritableUtils.readString(di);
    }

    // The body of this method is missing in the source; it is reconstructed from
    // the four tab-separated fields that PeakAnalysisMapper reads (date, hour, amount, count).
    @Override
    public String toString() {
        return (new StringBuilder().append(ride_date).append("\t").append(ride_time).toString());
    }

    // Sort by date, then by hour; the sign flip gives descending order.
    @Override
    public int compareTo(CompositeKeyWritable o) {
        int result = ride_date.compareTo(o.ride_date);
        if (result == 0) {
            result = ride_time.compareTo(o.ride_time);
        }
        return (-1) * result;
    }
}

Composite Value Writable:

package analysis2;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

/**
 * @author reshmip
 */
public class CustomValueWritable implements Writable {

    private Double ride_amount;
    private int count_rides;

    public Double getRide_amount() { return ride_amount; }
    public void setRide_amount(Double ride_amount) { this.ride_amount = ride_amount; }
    public int getCount_rides() { return count_rides; }
    public void setCount_rides(int count_rides) { this.count_rides = count_rides; }

    @Override
    public void write(DataOutput d) throws IOException {
        d.writeInt(count_rides);
        d.writeDouble(ride_amount);
    }

    @Override
    public void readFields(DataInput di) throws IOException {
        count_rides = di.readInt();
        ride_amount = di.readDouble();
    }

    @Override
    public String toString() {
        return (new StringBuilder().append(ride_amount).append("\t").append(count_rides).toString());
    }
}

Grouping Comparator:

package analysis2;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * @author reshmip
 */
public class GroupingComparator extends WritableComparator {

    protected GroupingComparator() {
        super(CompositeKeyWritable.class, true);
    }

    // Groups keys by date only, ignoring the hour.
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        CompositeKeyWritable cw1 = (CompositeKeyWritable) w1;
        CompositeKeyWritable cw2 = (CompositeKeyWritable) w2;
        return cw1.getRide_date().compareTo(cw2.getRide_date());
    }
}

Secondary Sort Mapper:

package analysis2;

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * @author reshmip
 */

public class SecondarySortMapper extends Mapper<Object, Text, CompositeKeyWritable, CustomValueWritable> {

    private DoubleWritable total_amount = new DoubleWritable();
    private CompositeKeyWritable cw = new CompositeKeyWritable();
    private CustomValueWritable customval = new CustomValueWritable();

    public void map(Object key, Text value, Context context) {
        String values[] = value.toString().split("\\t");
        cw.setRide_date("");
        cw.setRide_time("");
        customval.setCount_rides(0);
        customval.setRide_amount(0.0);
        try {
            // Input is the 13-field tab-separated output of the cleansing step.
            if (values.length == 13) {
                String date = values[0];
                String hours = values[1].split(":")[0];
                // total_amount is field 11 of the cleansing output; field 12 is payment_type.
                Double amount = Double.parseDouble(values[11]);
                //cw = new CompositeKeyWritable(date,hours);
                cw.setRide_date(date);
                cw.setRide_time(hours);
                //total_amount.set(amount);
                customval.setCount_rides(1);
                customval.setRide_amount(amount);
                context.write(cw, customval);
            }
        } catch (IOException | InterruptedException ex) {
            System.out.println("Error Message:" + ex.getMessage());
        }
    }
}

Secondary Sort Reducer:

package analysis2;

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * @author reshmip
 */
public class SecondarySortReducer extends Reducer<CompositeKeyWritable, CustomValueWritable, CompositeKeyWritable, CustomValueWritable> {

    //Double totalamt = 0.0;
    CustomValueWritable customval = new CustomValueWritable();
    private DoubleWritable total_amount = new DoubleWritable();

    @Override
    protected void reduce(CompositeKeyWritable key, Iterable<CustomValueWritable> values, Context context) throws IOException, InterruptedException {
        double sumamount = 0;
        int totalrides = 0;
        // Sum the amount and the ride count for one {date, hour} key.
        for (CustomValueWritable val : values) {
            sumamount += val.getRide_amount();
            totalrides += val.getCount_rides();
        }
        customval.setRide_amount(sumamount);
        customval.setCount_rides(totalrides);
        context.write(key, customval);
    }
}

Peak Analysis Writable:

50 package analysis2; import java.io.datainput; import java.io.dataoutput; import java.io.ioexception; import org.apache.hadoop.io.writable; import org.apache.hadoop.io.writablecomparable; import org.apache.hadoop.io.writableutils; /** * reshmip public class PeakAnalysisWritable implements Writable,WritableComparable<PeakAnalysisWritable>{ private String ride_date; private Integer count_rides; public String getride_date() { return ride_date; public void setride_date(string ride_date) { this.ride_date = ride_date; public Integer getcount_rides() {

51 return count_rides; public void setcount_rides(integer count_rides) { this.count_rides = public void write(dataoutput d) throws IOException { WritableUtils.writeString(d, ride_date); public void readfields(datainput di) throws IOException { ride_date = WritableUtils.readString(di); count_rides = public int compareto(peakanalysiswritable o) { int result = ride_date.compareto(o.ride_date); if(result == 0){

52 result = count_rides.compareto(o.count_rides); return (-1)*result; public String tostring(){ return (new StringBuilder().append(ride_date).append("\t").append(count_rides).toString()); Peak Analysis Value Writable: /* * To change this license header, choose License Headers in Project Properties. * To change this template file, choose Tools Templates * and open the template in the editor. package analysis2; import java.io.datainput; import java.io.dataoutput; import java.io.ioexception; import org.apache.hadoop.io.writable; import org.apache.hadoop.io.writableutils; /**

53 * reshmip public class PeakAnalysisValueWritable implements Writable{ private Double ride_amount; private String ride_time; public Double getride_amount() { return ride_amount; public void setride_amount(double ride_amount) { this.ride_amount = ride_amount; public String getride_time() { return ride_time; public void setride_time(string ride_time) { this.ride_time = public void write(dataoutput d) throws IOException {

        WritableUtils.writeString(d, ride_time);
        // Assumed: this call is missing from the extracted text; readFields reads a double here.
        d.writeDouble(ride_amount);
    }

    @Override
    public void readFields(DataInput di) throws IOException {
        ride_time = WritableUtils.readString(di);
        ride_amount = di.readDouble();
    }

    @Override
    public String toString() {
        return (new StringBuilder().append(ride_time).append("\t").append(ride_amount).toString());
    }
}

Peak Analysis Mapper:

package analysis2;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * @author reshmip
 */
public class PeakAnalysisMapper extends Mapper<Object, Text, PeakAnalysisWritable, PeakAnalysisValueWritable> {

    private PeakAnalysisWritable cw = new PeakAnalysisWritable();
    private PeakAnalysisValueWritable customval = new PeakAnalysisValueWritable();

    public void map(Object key, Text value, Context context) {
        String values[] = value.toString().split("\\t");
        cw.setride_date("");
        cw.setcount_rides(0);
        customval.setride_time("");
        customval.setride_amount(0.0);
        try {
            if (values.length == 4) {
                String date = values[0];
                String hours = values[1];
                Double amount = Double.parseDouble(values[2]);
                int count = Integer.parseInt(values[3]);
                //cw = new CompositeKeyWritable(date,hours);
                cw.setride_date(date);
                cw.setcount_rides(count);

                //total_amount.set(amount);
                customval.setride_time(hours);
                customval.setride_amount(amount);
                context.write(cw, customval);
            }
        } catch (IOException | InterruptedException ex) {
            System.out.println("Error Message:" + ex.getMessage());
        }
    }
}

Peak Analysis Reducer:

package analysis2;

import java.io.IOException;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * @author reshmip
 */

public class PeakAnalysisReducer
        extends Reducer<PeakAnalysisWritable, PeakAnalysisValueWritable, PeakAnalysisWritable, PeakAnalysisValueWritable> {

    //Double totalamt = 0.0;
    private PeakAnalysisWritable customkey = new PeakAnalysisWritable();
    private PeakAnalysisValueWritable customvalue = new PeakAnalysisValueWritable();

    @Override
    protected void reduce(PeakAnalysisWritable key, Iterable<PeakAnalysisValueWritable> values, Context context)
            throws IOException, InterruptedException {
        // Pass every value through; the ordering comes from the composite key's compareTo.
        for (PeakAnalysisValueWritable val : values) {
            context.write(key, val);
        }
    }
}

Analysis 4:

Driver class:

package analysis4;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/**
 * @author reshmip
 */
public class Analysis4 {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // TODO code application logic here
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "partitioning pattern");

        job.setJarByClass(Analysis4.class);
        job.setMapperClass(Analysis4_Mapper.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(FloatWritable.class);
        // MultipleOutputs.addNamedOutput(job, "bins", TextOutputFormat.class, Text.class, NullWritable.class);
        // MultipleOutputs.setCountersEnabled(job, true);
        job.setPartitionerClass(GroupByDayPartitioner.class);
        job.setCombinerClass(Analysis4_Reducer.class);
        // One reduce task per day of the week.
        job.setNumReduceTasks(7);
        //job.setNumReduceTasks(0);
        job.setReducerClass(Analysis4_Reducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(FloatWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        boolean complete = job.waitForCompletion(true);

        Configuration conf2 = new Configuration();
        Job job2 = Job.getInstance(conf2, "Borough Rides");
        if (complete) {
            job2.setJarByClass(Analysis4.class);
            job2.setMapperClass(IdentitiyMapper.class);

            job2.setMapOutputKeyClass(NullWritable.class);
            job2.setMapOutputValueClass(Text.class);
            job2.setReducerClass(IdentityReducer.class);
            job2.setOutputKeyClass(NullWritable.class);
            job2.setOutputValueClass(Text.class);
            // The second job reads the first job's output directory.
            FileInputFormat.addInputPath(job2, new Path(args[1]));
            FileOutputFormat.setOutputPath(job2, new Path(args[2]));
            System.exit(job2.waitForCompletion(true) ? 0 : 1);
        }
    }
}

Custom Writable:

package analysis4;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableUtils;

/**
 * @author reshmip
 */
public class CustomWritable implements Writable {

    private String ride_date;
    private Double ride_amount;

    public String getride_date() {
        return ride_date;
    }

    public void setride_date(String ride_date) {
        this.ride_date = ride_date;
    }

    public Double getride_amount() {
        return ride_amount;
    }

    public void setride_amount(Double ride_amount) {
        this.ride_amount = ride_amount;
    }

    @Override
    public void write(DataOutput d) throws IOException {
        WritableUtils.writeString(d, ride_date);
        // Assumed: this call is missing from the extracted text; readFields reads a double here.
        d.writeDouble(ride_amount);
    }

    @Override
    public void readFields(DataInput di) throws IOException {
        ride_date = WritableUtils.readString(di);
        ride_amount = di.readDouble();
    }

    @Override
    public String toString() {
        return (new StringBuilder().append(ride_date).append("\t").append(ride_amount).toString());
        //.append("\t").append(ride_amount).toString());
    }
}

Group by Partitioner:

package analysis4;

import org.apache.hadoop.io.FloatWritable;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * @author reshmip
 */
public class GroupByDayPartitioner extends Partitioner<IntWritable, FloatWritable> {

    @Override
    public int getPartition(IntWritable key, FloatWritable value, int i) {
        // The key is a day of the week (1 to 7); i is the number of reduce tasks.
        return (key.get() % i);
    }
}

Mapper:

package analysis4;

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.hadoop.io.FloatWritable;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

/**
 * @author reshmip
 */
public class Analysis4_Mapper extends Mapper<Object, Text, IntWritable, FloatWritable> {

    // private MultipleOutputs<Text, NullWritable> mos = null;
    // "MM" is the month field; a lowercase "mm" would be parsed as minutes.
    private final static SimpleDateFormat frmt = new SimpleDateFormat("yyyy-MM-dd");
    private CustomWritable tuple = new CustomWritable();

    // protected void setup(Context context) throws IOException, InterruptedException {
    //     mos = new MultipleOutputs(context);
    // }

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        Calendar cal = Calendar.getInstance();
        String[] row = value.toString().split("\\t");
        String pickupdate = row[0];
        int day = 0;
        float surcharge = 0;
        try {
            cal.setTime(frmt.parse(pickupdate));
            day = cal.get(Calendar.DAY_OF_WEEK);
            tuple.setride_date(pickupdate);
            tuple.setride_amount(Double.parseDouble(row[11]));
            // Surcharge is column 11 minus column 7.
            surcharge = Float.parseFloat(row[11]) - Float.parseFloat(row[7]);
            context.write(new IntWritable(day), new FloatWritable(surcharge));
        } catch (ParseException ex) {
            Logger.getLogger(Analysis4_Mapper.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}

Reducer:

package analysis4;

import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * @author reshmip
 */
public class Analysis4_Reducer extends Reducer<IntWritable, FloatWritable, IntWritable, FloatWritable> {

    private CustomWritable result = new CustomWritable();

    // protected void reduce(IntWritable key, Iterable<FloatWritable> values, Context context) throws IOException, InterruptedException {
    //     float total_amount = 0;
    //     for (FloatWritable t : values) {
    //         total_amount += t.get();
    //     }
    //
    //     amount.set(total_amount);

    //     context.write(key, amount);
    // }

    @Override
    protected void reduce(IntWritable key, Iterable<FloatWritable> values, Context context)
            throws IOException, InterruptedException {
        float amount = 0;
        String date = "";
        // Sum the surcharges for this day of the week.
        for (FloatWritable val : values) {
            //date = val.getride_date();
            amount += val.get();
        }
        //result.setride_amount(amount);
        result.setride_date(date);
        context.write(key, new FloatWritable(amount));
    }
}

Identity Mapper:

package analysis4;

import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * @author reshmip
 */
public class IdentitiyMapper extends Mapper<Object, Text, NullWritable, Text> {

    private Text outkey = new Text();

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Emit each input line unchanged under a null key.
        context.write(NullWritable.get(), value);
    }
}

Identity Reducer:

package analysis4;

import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * @author reshmip
 */
public class IdentityReducer extends Reducer<NullWritable, Text, NullWritable, Text> {

    @Override
    protected void reduce(NullWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            context.write(key, value);
        }
    }
}
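The day-of-week grouping used in Analysis 4 can be tried outside Hadoop. The following is a minimal stand-alone sketch, not part of the project code: the class name DayPartitionSketch, its helper methods, and the sample dates are all hypothetical. It assumes pickup dates in yyyy-MM-dd form and applies the same modulo arithmetic as GroupByDayPartitioner.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;

// Hypothetical stand-alone sketch; names and sample dates are illustrative only.
class DayPartitionSketch {

    // Returns Calendar.DAY_OF_WEEK for a date string: 1 = Sunday ... 7 = Saturday.
    static int dayOfWeek(String date) {
        // "MM" is the month field; lowercase "mm" would be parsed as minutes.
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        Calendar cal = Calendar.getInstance();
        try {
            cal.setTime(fmt.parse(date));
        } catch (ParseException ex) {
            throw new IllegalArgumentException("Unparseable date: " + date, ex);
        }
        return cal.get(Calendar.DAY_OF_WEEK);
    }

    // Same arithmetic as the partitioner: key modulo the number of reduce tasks.
    static int partition(int day, int numPartitions) {
        return day % numPartitions;
    }

    public static void main(String[] args) {
        String[] sampleDates = {"2016-06-05", "2016-06-06", "2016-06-11"};
        for (String d : sampleDates) {
            int day = dayOfWeek(d);
            System.out.println(d + " -> day " + day + " -> partition " + partition(day, 7));
        }
    }
}
```

With seven reduce tasks, days 1 to 6 map to partitions 1 to 6 and day 7 (Saturday) wraps around to partition 0, so each day of the week lands in its own output file.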

Analysis 5:

Driver:

package innerjoin;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/**
 *

 * @author reshmip
 */
public class InnerJoin {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "inner_join");
        job.setJarByClass(InnerJoin.class);
        // Each input gets its own mapper so records can be tagged by source.
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, InnerJoin_Mapper1.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, InnerJoin_Mapper2.class);
        job.setReducerClass(InnerJoin_Reducer1.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path(args[2]));
        boolean complete = job.waitForCompletion(true);

        Configuration conf2 = new Configuration();
        Job job2 = Job.getInstance(conf2, "Most Passangers");

        if (complete) {
            job2.setJarByClass(InnerJoin.class);
            // The second job reads the join output, so the input path is set on job2 (the original passed job).
            FileInputFormat.addInputPath(job2, new Path(args[2]));
            //MultipleInputs.addInputPath(job2, new Path(args[3]), TextInputFormat.class,JoinMapper4.class);
            job2.setMapperClass(InnerJoin_Mapper3.class);
            job2.setMapOutputKeyClass(Text.class);
            job2.setMapOutputValueClass(IntWritable.class);
            job2.setReducerClass(InnerJoin_Reducer2.class);
            job2.setOutputFormatClass(TextOutputFormat.class);
            TextOutputFormat.setOutputPath(job2, new Path(args[3]));
            job2.setOutputKeyClass(Text.class);
            job2.setOutputValueClass(IntWritable.class);
            System.exit(job2.waitForCompletion(true) ? 0 : 1);
        }
    }
}

Mapper 1:

package innerjoin;

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * @author reshmip
 */
public class InnerJoin_Mapper1 extends Mapper<Object, Text, Text, Text> {

    private Text outkey = new Text();
    private Text outvalue = new Text();

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String[] separatedinput = value.toString().split("\\t");
        //String id = separatedinput[6];
        String pickuploc = separatedinput[6];
        // Skip records with no join key (isEmpty replaces the reference comparison with "").
        if (pickuploc == null || pickuploc.isEmpty()) {
            return;

        }
        outkey.set(pickuploc);
        // Tag the record with "A" so the reducer can tell which input it came from.
        outvalue.set("A" + value);
        context.write(outkey, outvalue);
    }
}

Mapper 2:

package innerjoin;

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * @author reshmip
 */
public class InnerJoin_Mapper2 extends Mapper<Object, Text, Text, Text> {

    private Text outkey = new Text();

    private Text outvalue = new Text();

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] line_values = line.split(",");
        if (line_values.length == 5) {
            String latitude = line_values[4].trim();
            // Skip records with no join key.
            if (latitude == null || latitude.isEmpty()) {
                return;
            }
            outkey.set(latitude);
            // Tag the record with "B" so the reducer can tell which input it came from.
            outvalue.set("B" + value);
            context.write(outkey, outvalue);
        }
    }
}

Mapper 3:

package innerjoin;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * @author reshmip
 */
public class InnerJoin_Mapper3 extends Mapper<Object, Text, Text, IntWritable> {

    private Text outkey = new Text();
    private IntWritable outvalue = new IntWritable();

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String lines = value.toString();
        String[] line = lines.split("\\t");
        if (line.length == 18) {
            String area = line[17];
            // The null check comes first so the later calls cannot throw.
            if (area != null && !area.equalsIgnoreCase("") && !area.equalsIgnoreCase("null"))

            {
                String[] locations = area.split(",");
                // locations[5] is read below, so at least six fields are required (the original checked length > 1).
                if (locations.length > 5) {
                    String borough = locations[0];
                    outkey.set(borough);
                    outvalue.set(Integer.parseInt(locations[5]));
                    context.write(outkey, outvalue);
                }
            }
        }
    }
}

Reducer 1:

package innerjoin;

import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Reducer;

/**
 * @author reshmip
 */
public class InnerJoin_Reducer1 extends Reducer<Text, Text, Text, Text> {

    public static final Text EMPTY_TEXT = new Text();
    private Text tmp = new Text();
    private ArrayList<Text> lista = new ArrayList<Text>();
    private ArrayList<Text> listb = new ArrayList<Text>();
    private String jointype = null;

    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        lista.clear();
        listb.clear();
        // Split the values for this key by their source tag and strip the tag.
        while (values.iterator().hasNext()) {
            tmp = values.iterator().next();
            if (tmp.charAt(0) == 'A') {
                lista.add(new Text(tmp.toString().substring(1)));
            } else if (tmp.charAt(0) == 'B') {
                listb.add(new Text(tmp.toString().substring(1)));

            }
        }
        executejoinlogic(context);
    }

    private void executejoinlogic(Context context) throws IOException, InterruptedException {
        // Inner join: emit the cross product only when both sides have records for the key.
        if (!lista.isEmpty() && !listb.isEmpty()) {
            for (Text A : lista) {
                for (Text B : listb) {
                    context.write(A, B);
                }
            }
        }
    }
}

Reducer 2:

package innerjoin;
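The reduce-side inner join performed by Mapper 1, Mapper 2 and Reducer 1 can be illustrated without a Hadoop cluster. The following is a minimal sketch under stated assumptions, not the project's implementation: the class InnerJoinSketch, the innerJoin method, and the sample keys and payloads are all invented for illustration, and an in-memory map stands in for the shuffle phase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone sketch of a reduce-side inner join; all names and data are illustrative.
class InnerJoinSketch {

    // Each record is {joinKey, payload}. Returns "leftPayload<TAB>rightPayload" for every matching pair.
    static List<String> innerJoin(String[][] left, String[][] right) {
        // Stand-in for the map and shuffle phases: tag every value by source, then group by key.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] r : left) {
            grouped.computeIfAbsent(r[0], k -> new ArrayList<>()).add("A" + r[1]);
        }
        for (String[] r : right) {
            grouped.computeIfAbsent(r[0], k -> new ArrayList<>()).add("B" + r[1]);
        }

        // Stand-in for the reduce phase: split by tag, then emit the cross product per key.
        List<String> joined = new ArrayList<>();
        for (List<String> values : grouped.values()) {
            List<String> listA = new ArrayList<>();
            List<String> listB = new ArrayList<>();
            for (String v : values) {
                if (v.charAt(0) == 'A') {
                    listA.add(v.substring(1));
                } else if (v.charAt(0) == 'B') {
                    listB.add(v.substring(1));
                }
            }
            // If either list is empty the loops emit nothing, which is the inner-join behaviour.
            for (String a : listA) {
                for (String b : listB) {
                    joined.add(a + "\t" + b);
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        String[][] rides = {{"k1", "ride-1"}, {"k1", "ride-2"}, {"k2", "ride-3"}};
        String[][] zones = {{"k1", "zone-X"}, {"k3", "zone-Y"}};
        for (String row : innerJoin(rides, zones)) {
            System.out.println(row);
        }
    }
}
```

Only key k1 appears on both sides in the sample data, so ride-3 and zone-Y are dropped and the two k1 rides are each paired with zone-X.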
