FINAL PROJECT REPORT
NYC TAXI DATA ANALYSIS
Reshmi Padavala

Project Summary:

For my final project, I decided to showcase my big data analysis skills by working with a large dataset, one in which it is difficult to identify and visualize patterns with ordinary tools and which calls for techniques such as MapReduce that I learned during this course. After researching and weighing my options, I chose the New York City (NYC) taxi data to produce an in-depth analysis of taxi ride patterns and behaviors. These datasets were made accessible by the NYC Taxi and Limousine Commission (TLC). The NYC taxi dataset contains trip records which include fields like "pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts" [1]. The dataset is larger than I expected, containing millions of records, so I narrowed the scope of the project to a set of core analysis patterns. In this paper, I analyze trends in taxi rides in NYC from different boroughs during the period of [...]

Dataset:
NYC taxi dataset:
NYC geostats:
Analysis:

Analysis 1: Data Cleansing

The dataset I chose is large and raw, so I had to cleanse the date fields, drop null records, and keep only the records that I needed. For this I used the Simple Filtering Pattern.

Output:

Analysis 2: Statistical Analysis

This analysis computes daily statistics such as total rides, revenue, maximum toll charge, and maximum tip amount for every day in [...]. I used the date as the key and generated the values with a Custom Writable object. Since the reducer only performs aggregation, I optimized performance by reusing the reducer as a combiner.
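The reason a reducer can double as a combiner is that sums and maxima give the same answer whether they are applied once to all records or first to partial groups and then to the partial results. The following is a minimal sketch of that property in plain Java, without any Hadoop dependency. The class `DailyStats` and its fields are hypothetical names chosen for illustration (they are not the report's `CustomWritable`), and the sketch assumes that tips and tolls are never negative.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical per-day summary; names are illustrative, not the report's CustomWritable. */
final class DailyStats {
    long rides;
    double revenue;
    double maxTip;   // assumes tips are never negative
    double maxToll;  // assumes tolls are never negative

    /** Summary of a single trip. */
    static DailyStats ofTrip(double fare, double tip, double toll) {
        DailyStats s = new DailyStats();
        s.rides = 1;
        s.revenue = fare;
        s.maxTip = tip;
        s.maxToll = toll;
        return s;
    }

    /** Merges partial summaries; sums and maxima are associative and commutative. */
    static DailyStats merge(List<DailyStats> parts) {
        DailyStats out = new DailyStats();
        for (DailyStats p : parts) {
            out.rides += p.rides;
            out.revenue += p.revenue;
            out.maxTip = Math.max(out.maxTip, p.maxTip);
            out.maxToll = Math.max(out.maxToll, p.maxToll);
        }
        return out;
    }

    public static void main(String[] args) {
        DailyStats a = ofTrip(10.0, 2.0, 0.0);
        DailyStats b = ofTrip(20.0, 5.0, 5.5);
        DailyStats c = ofTrip(7.5, 1.0, 0.0);
        // "Combiner" step on a and b, then the final "reducer" step with c.
        DailyStats combined = merge(Arrays.asList(merge(Arrays.asList(a, b)), c));
        System.out.println(combined.rides + " rides, revenue " + combined.revenue);
    }
}
```

Merging `a` and `b` first and then adding `c` yields the same totals as merging all three at once, which is the condition a combiner has to satisfy.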
Output:

Analysis 3: Peak Hour Analysis

This analysis identifies the peak hours in a day and the amount earned during those hours. The motivation is that rides are not evenly distributed across the hours of a day: most rides are expected around office hours, either in the morning or in the evening. Using {Date, Hours} as a Custom Writable key, I performed a Secondary Sort that generates {total rides, total amount} as the value. This output is chained to a second Secondary Sort in which {Date, total rides} is the key and {Hours, total amount} is the value, which yields the peak hours of each day.
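The second sorting step can be pictured as ordering the hourly totals by date and then by ride count, so that the busiest hour of each day comes first. Below is a minimal sketch of that idea in plain Java, without Hadoop. The class `PeakHours`, the record type `HourlyTotal`, and the sample values in `main` are hypothetical. The sketch sorts dates in ascending order, which is a simplification and not necessarily the ordering the report's composite key produces.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PeakHours {

    /** One hourly summary, standing in for an output record of the first job (illustrative names). */
    static final class HourlyTotal {
        final String date;
        final String hour;
        final int rides;
        final double amount;

        HourlyTotal(String date, String hour, int rides, double amount) {
            this.date = date;
            this.hour = hour;
            this.rides = rides;
            this.amount = amount;
        }
    }

    /** Sorts by date, then by ride count descending; the first record seen per date is its peak hour. */
    static Map<String, HourlyTotal> peakPerDay(List<HourlyTotal> totals) {
        List<HourlyTotal> sorted = new ArrayList<>(totals);
        sorted.sort((a, b) -> {
            int byDate = a.date.compareTo(b.date);
            return byDate != 0 ? byDate : Integer.compare(b.rides, a.rides);
        });
        Map<String, HourlyTotal> peaks = new LinkedHashMap<>();
        for (HourlyTotal t : sorted) {
            peaks.putIfAbsent(t.date, t);
        }
        return peaks;
    }

    public static void main(String[] args) {
        List<HourlyTotal> totals = new ArrayList<>();
        // Hypothetical sample values, not taken from the dataset.
        totals.add(new HourlyTotal("2016-01-01", "08", 120, 1500.0));
        totals.add(new HourlyTotal("2016-01-01", "18", 200, 2600.0));
        for (HourlyTotal t : peakPerDay(totals).values()) {
            System.out.println(t.date + "\t" + t.hour + "\t" + t.rides + "\t" + t.amount);
        }
    }
}
```

In a MapReduce job the framework performs this ordering during the shuffle, driven by the key's `compareTo`, so no explicit in-memory sort is needed there.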
Output:

Analysis 4: Day Based Surcharge Analysis

The surcharges applied to a ride depend on the day of the week. Because of high rider demand on weekends, Saturday and Sunday are expected to carry more surcharges than the rest of the week.
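An analysis by day of the week first has to map each pickup date to a weekday index. The following is a minimal sketch of such a mapping in plain Java, without Hadoop. The class `DayOfWeekPartitioner`, the method `partitionFor`, and the ISO `yyyy-MM-dd` date format are assumptions made for illustration and are not taken from the report's code.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;

final class DayOfWeekPartitioner {

    /** Returns 0 for Monday through 6 for Sunday, given an ISO date such as "2016-01-02". */
    static int partitionFor(String isoDate) {
        DayOfWeek day = LocalDate.parse(isoDate).getDayOfWeek();
        return day.getValue() - 1; // DayOfWeek.getValue() is 1 (Monday) .. 7 (Sunday)
    }

    public static void main(String[] args) {
        double[] surchargePerDay = new double[7];
        // Hypothetical (date, surcharge) pairs, not taken from the dataset.
        String[] dates = {"2016-01-02", "2016-01-03", "2016-01-04"};
        double[] surcharges = {1.0, 0.5, 0.5};
        for (int i = 0; i < dates.length; i++) {
            surchargePerDay[partitionFor(dates[i])] += surcharges[i];
        }
        for (int d = 0; d < 7; d++) {
            System.out.println(DayOfWeek.of(d + 1) + "\t" + surchargePerDay[d]);
        }
    }
}
```

In a Hadoop job, an index like this would typically be returned from a custom `Partitioner`'s `getPartition` method, with the job configured for 7 reduce tasks so that each weekday lands in its own output file.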
To perform this analysis, I used the Partitioning Pattern and divided the data into 7 partitions, one for each day of the week. Because the data is partitioned, an analysis can be run on any single partition without the cost of running MapReduce over the entire dataset. After partitioning, I calculated the total surcharge for each day of the week from its respective partition. The results are as expected.

Output:

Analysis 5: Boroughs with Most Number of Riders

Every ride is tied to the neighborhood where the pickup occurs, and passengers in some locations are more inclined to commute by taxi. This analysis finds how many passengers from each borough have commuted by NYC taxi.
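Counting passengers per borough means attaching a borough to every ride through its pickup location and then summing per borough. The sketch below shows that join-and-group logic in plain Java, using an in-memory lookup table. The class `BoroughJoin`, the array layout of a ride, and the location IDs in `main` are hypothetical. How the join was implemented in the actual MapReduce job is not shown in this part of the report, so this is only one possible reading.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class BoroughJoin {

    /**
     * rides: each int[] is {pickupLocationId, passengerCount} (hypothetical layout).
     * zoneToBorough: location ID -> borough name.
     */
    static Map<String, Integer> passengersByBorough(int[][] rides, Map<Integer, String> zoneToBorough) {
        Map<String, Integer> totals = new TreeMap<>();
        for (int[] ride : rides) {
            String borough = zoneToBorough.get(ride[0]);
            if (borough == null) {
                continue; // inner join: rides without a matching zone are dropped
            }
            totals.merge(borough, ride[1], Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<Integer, String> zones = new HashMap<>();
        zones.put(101, "Queens"); // hypothetical IDs, not real TLC location IDs
        zones.put(102, "Bronx");
        int[][] rides = {{101, 2}, {102, 1}, {101, 3}, {999, 4}};
        System.out.println(passengersByBorough(rides, zones));
    }
}
```

Keeping the small zone table in memory works when it fits on every mapper; for two large inputs, a reduce-side join keyed on the location ID would be the usual alternative.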
To perform this analysis, I did an inner join between the taxi ride data and the location data, so that the joined output carries the pickup borough. The output is then grouped by borough to find the total number of passengers.

Output:

Analysis 6: Identifying Distinct Neighborhoods

Each borough has multiple zones. I wanted to know how many unique zones or neighborhoods exist in total, so I used the Distinct Pattern to extract the unique zones from the NYC taxi dataset.
Analysis 7: Top 10 Pickup Zones

This analysis identifies the most frequent pickup zones. I first computed how pickups are distributed across the zones of New York City and then, based on the total rides in that output, emitted the top ten zones using the Top Ten Pattern.
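The core of a top-N selection is to keep only N candidates at any time and evict the smallest whenever a larger one arrives. The following is a minimal sketch of that idea in plain Java with a min-heap. The class `TopZones`, the method `topN`, and the sample counts are hypothetical, and the report's own mapper and reducer may organize this step differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class TopZones {

    /** Returns the n zones with the highest ride counts, highest first. */
    static List<String> topN(Map<String, Integer> ridesPerZone, int n) {
        Comparator<Map.Entry<String, Integer>> byCount =
                (a, b) -> Integer.compare(a.getValue(), b.getValue());
        // Min-heap: the entry with the fewest rides sits at the head.
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(byCount);
        for (Map.Entry<String, Integer> e : ridesPerZone.entrySet()) {
            heap.add(e);
            if (heap.size() > n) {
                heap.poll(); // evict the current smallest
            }
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey()); // ascending by count
        }
        Collections.reverse(result); // highest first
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        // Hypothetical zone names and counts.
        counts.put("Zone A", 50);
        counts.put("Zone B", 10);
        counts.put("Zone C", 70);
        counts.put("Zone D", 30);
        System.out.println(topN(counts, 2));
    }
}
```

Because each mapper only needs to forward its local top N, the single reducer that produces the final list receives at most N records per mapper.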
Output:

Analysis 8: Fare Analysis on Top 5 Zones

Based on the result above, I concentrated on the top 5 zones with the most riders. Since I knew in advance which zones I was looking for, I used a Bloom Filter to filter out the zones that are not in the top 5 list. On the filtered data, I computed the median fare charged from these zones. For optimization, I used a combiner.
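A Bloom filter answers "is this zone in my list?" with a small bit array: it never misses a zone that was added, but it may occasionally let through a zone that was not. Below is a minimal sketch in plain Java. The class `ZoneBloomFilter` and its two hash functions are hypothetical and far simpler than a production filter. Hadoop provides its own implementation in the `org.apache.hadoop.util.bloom` package; whether the report used it is not visible in this section.

```java
import java.util.BitSet;

final class ZoneBloomFilter {
    private final BitSet bits;
    private final int size;

    ZoneBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two simple hash functions derived from String.hashCode(); illustrative only.
    private int h1(String s) {
        return Math.floorMod(s.hashCode(), size);
    }

    private int h2(String s) {
        return Math.floorMod(s.hashCode() * 31 + 17, size);
    }

    /** Marks a zone as a member by setting its two bit positions. */
    void add(String zone) {
        bits.set(h1(zone));
        bits.set(h2(zone));
    }

    /** false means definitely absent; true means possibly present. */
    boolean mightContain(String zone) {
        return bits.get(h1(zone)) && bits.get(h2(zone));
    }

    public static void main(String[] args) {
        ZoneBloomFilter filter = new ZoneBloomFilter(1024);
        filter.add("Zone A"); // hypothetical zone names
        filter.add("Zone B");
        System.out.println(filter.mightContain("Zone A"));
    }
}
```

A mapper would call `mightContain` on each record's pickup zone and skip the record when the answer is `false`, so that only rides from the listed zones (plus rare false positives) reach the median calculation.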
Output:

Analysis 9: Calculate the longest rides

I used a Pig Latin script for this analysis. The taxi data and the zone data are loaded with LOAD and joined with JOIN on the pickup ID to get the pickup zone name; the result is then joined with the zone data again on the drop-off ID to get the drop-off zone. Finally, the joined data is sorted with ORDER by the total distance covered in the ride and restricted with LIMIT to the 20 longest rides.
Analysis 10: Calculate total rides from each borough

I used a Pig Latin script for this analysis. The loaded data is grouped with GROUP by pickup borough, and the total number of rides is calculated for each group.

[Figure: pie chart "Total rides from each borough"; legend: EWR, Bronx, Queens, Unknown, Brooklyn, Manhattan, Staten Island; slice labels as extracted, not matched to boroughs: 31%, 0%, 4%, 30%, 35%, 0%]

Output:

Analysis 11: Find the number of riders from each borough every day

I used Hive for this analysis. After loading the data, I grouped it by pickup date and pickup borough and generated the number of riders.
Programming Code:

Analysis 1:

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */
package analysis3;

import java.io.IOException;
import java.text.DecimalFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author reshmip
 */
public class Analysis3 {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // TODO code application logic here
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "filtering data");
        job.setJarByClass(Analysis3.class);
        job.setMapperClass(FilteringMapper.class);
        job.setMapOutputValueClass(CustomWritable.class);
        job.setMapOutputKeyClass(NullWritable.class);
        //job.setOutputKeyClass(NullWritable.class);
        //job.setOutputValueClass(CustomWritable.class);
        job.setNumReduceTasks(1);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    public static class FilteringMapper extends Mapper<Object, Text, NullWritable, CustomWritable> {

        private CustomWritable customwritable = new CustomWritable();
        private final static SimpleDateFormat frmt = new SimpleDateFormat(/* date pattern missing in the source */);

        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String[] line_values = line.split(",");
            Calendar cal = Calendar.getInstance();
            DecimalFormat numberformat = new DecimalFormat("#.0000");
            try {
                if (line_values.length == 21) {
                    // Skip the CSV header row and records without a pickup timestamp.
                    if (!line_values[1].equals("lpep_pickup_datetime") && line_values[1] != (null) && !line_values[1].equals("null")) {
                        String[] pickupdatestring = line_values[1].split(" ");
                        String pickupdate = pickupdatestring[0];
                        String pickuptime = pickupdatestring[1];
                        cal.setTime(frmt.parse(pickupdate));
                        String pick_date = cal.getTime().toString();
                        // String[] new_date = pick_date.split(" ");
                        // String new_pickdate = new StringBuilder().append(new_date[0])
                        //         .append(" ").append(new_date[1]).append(" ")
                        //         .append(new_date[2]).append(" ").append(new_date[5]).toString();
                        customwritable.setride_pickup_date(pickupdate);
                        customwritable.setride_pickup_time(pickuptime);
                        // String[] dropoffdatestring = line_values[2].split(" ");
                        // String dropoffdate = dropoffdatestring[0];
                        // String dropofftime = dropoffdatestring[1];
                        // customwritable.setride_dropoff_date(dropoffdate);
                        // customwritable.setride_dropoff_time(dropofftime);
                        // customwritable.setratecodeid(Integer.parseInt(line_values[4]));
                        String pickup_longitude = "", drop_longitude = "";
                        String pickup_latitude = "", drop_latitiude = "";
                        // Truncate the coordinates to a fixed number of characters.
                        if (line_values[5].length() > 8 && line_values[6].length() > 7
                                && line_values[7].length() > 8 && line_values[8].length() > 7) {
                            pickup_longitude = line_values[5].substring(0, 8);
                            pickup_latitude = line_values[6].substring(0, 7);
                            drop_longitude = line_values[7].substring(0, 8);
                            drop_latitiude = line_values[8].substring(0, 7);
                            //System.err.println("coordinates:"+longitude);
                            customwritable.setpick_longitude((pickup_longitude));
                            customwritable.setpickup_latitude((pickup_latitude));
                            // customwritable.setdropoff_longitude((drop_longitude));
                            // customwritable.setdropoff_latitude((drop_latitiude));
                            customwritable.setpassengers(Integer.parseInt(line_values[9]));
                            customwritable.settrip_distance(Double.parseDouble(line_values[10]));
                            customwritable.setfare_amount(Double.parseDouble(line_values[11]));
                            customwritable.setexta(Double.parseDouble(line_values[12]));
                            customwritable.setmta_tax(Double.parseDouble(line_values[13]));
                            customwritable.settip_amount(Double.parseDouble(line_values[14]));
                            customwritable.settotal_amount(Double.parseDouble(line_values[18]));
                            customwritable.setpayment_type(Integer.parseInt(line_values[19]));
                            context.write(NullWritable.get(), customwritable);
                        }
                    }
                }
            } catch (NullPointerException ex) {
                ex.getMessage();
            } catch (ParseException ex) {
                Logger.getLogger(Analysis3.class.getName()).log(Level.SEVERE, null, ex);
            }
        }
    }
}

Custom Writable.java

package analysis3;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableUtils;

/**
 * @author reshmip
 */
public class CustomWritable implements Writable {

    private String ride_pickup_date;
    private String ride_pickup_time;
    private String ride_dropoff_date;
    private String ride_dropoff_time;
    private int ratecodeid;
    private String pick_longitude;
    private String pickup_latitude;
    private String dropoff_longitude;
    private String dropoff_latitude;
    private int passengers;
    private Double trip_distance;
    private Double fare_amount;
    private Double exta;
    private Double mta_tax;
    private Double tip_amount;
    private Double total_amount;
    private int payment_type;

    public String getride_pickup_date() {
        return ride_pickup_date;
    }

    public void setride_pickup_date(String ride_pickup_date) {
        this.ride_pickup_date = ride_pickup_date;
    }

    public String getride_pickup_time() {
        return ride_pickup_time;
    }

    public void setride_pickup_time(String ride_pickup_time) {
        this.ride_pickup_time = ride_pickup_time;
    }

    public String getride_dropoff_date() {
        return ride_dropoff_date;
    }

    public void setride_dropoff_date(String ride_dropoff_date) {
        this.ride_dropoff_date = ride_dropoff_date;
    }

    public String getride_dropoff_time() {
        return ride_dropoff_time;
    }

    public void setride_dropoff_time(String ride_dropoff_time) {
        this.ride_dropoff_time = ride_dropoff_time;
    }
    public int getratecodeid() {
        return ratecodeid;
    }

    public void setratecodeid(int ratecodeid) {
        this.ratecodeid = ratecodeid;
    }

    public String getpick_longitude() {
        return pick_longitude;
    }

    public void setpick_longitude(String pick_longitude) {
        this.pick_longitude = pick_longitude;
    }

    public String getpickup_latitude() {
        return pickup_latitude;
    }

    public void setpickup_latitude(String pickup_latitude) {
        this.pickup_latitude = pickup_latitude;
    }

    public String getdropoff_longitude() {
        return dropoff_longitude;
    }
    public void setdropoff_longitude(String dropoff_longitude) {
        this.dropoff_longitude = dropoff_longitude;
    }

    public String getdropoff_latitude() {
        return dropoff_latitude;
    }

    public void setdropoff_latitude(String dropoff_latitude) {
        this.dropoff_latitude = dropoff_latitude;
    }

    public int getpassengers() {
        return passengers;
    }

    public void setpassengers(int passengers) {
        this.passengers = passengers;
    }

    public Double gettrip_distance() {
        return trip_distance;
    }

    public void settrip_distance(Double trip_distance) {
        this.trip_distance = trip_distance;
    }
    public Double getfare_amount() {
        return fare_amount;
    }

    public void setfare_amount(Double fare_amount) {
        this.fare_amount = fare_amount;
    }

    public Double getexta() {
        return exta;
    }

    public void setexta(Double exta) {
        this.exta = exta;
    }

    public Double getmta_tax() {
        return mta_tax;
    }

    public void setmta_tax(Double mta_tax) {
        this.mta_tax = mta_tax;
    }

    public Double gettip_amount() {
        return tip_amount;
    }

    public void settip_amount(Double tip_amount) {
        this.tip_amount = tip_amount;
    }

    public Double gettotal_amount() {
        return total_amount;
    }

    public void settotal_amount(Double total_amount) {
        this.total_amount = total_amount;
    }

    public int getpayment_type() {
        return payment_type;
    }

    public void setpayment_type(int payment_type) {
        this.payment_type = payment_type;
    }
    @Override
    public void write(DataOutput d) throws IOException {
        WritableUtils.writeString(d, ride_pickup_date);
        WritableUtils.writeString(d, ride_pickup_time);
        //WritableUtils.writeString(d, ride_dropoff_date);
        //WritableUtils.writeString(d,ride_dropoff_time);
        d.writeInt(ratecodeid);
        d.writeInt(passengers);
        d.writeInt(payment_type);
        WritableUtils.writeString(d, pick_longitude);
        WritableUtils.writeString(d, pickup_latitude);
        //WritableUtils.writeString(d,dropoff_longitude);
        //WritableUtils.writeString(d,dropoff_latitude);
        d.writeDouble(trip_distance);
        d.writeDouble(fare_amount);
        d.writeDouble(exta);
        d.writeDouble(mta_tax);
        d.writeDouble(tip_amount);
        // Fields must be written in the same order in which readFields() reads them.
        d.writeDouble(total_amount);
    }

    @Override
    public void readFields(DataInput di) throws IOException {
        ride_pickup_date = WritableUtils.readString(di);
        ride_pickup_time = WritableUtils.readString(di);
        //ride_dropoff_date = WritableUtils.readString(di);
        //ride_dropoff_time = WritableUtils.readString(di);
        ratecodeid = di.readInt();
        passengers = di.readInt();
        payment_type = di.readInt();
        pick_longitude = WritableUtils.readString(di);
        pickup_latitude = WritableUtils.readString(di);
        //dropoff_latitude = WritableUtils.readString(di);
        //dropoff_longitude = WritableUtils.readString(di);
        trip_distance = di.readDouble();
        fare_amount = di.readDouble();
        exta = di.readDouble();
        mta_tax = di.readDouble();
        tip_amount = di.readDouble();
        total_amount = di.readDouble();
    }

    public String toString() {
        return (new StringBuilder().append(ride_pickup_date).
                append("\t").append(ride_pickup_time).
                //append("\t").append(ride_dropoff_date).
                //append("\t").append(ride_dropoff_time).
                append("\t").append(ratecodeid).
                append("\t").append(pick_longitude).
                append("\t").append(pickup_latitude).
                //append("\t").append(dropoff_longitude).
                //append("\t").append(dropoff_latitude).
                append("\t").append(passengers).
                append("\t").append(trip_distance).
                append("\t").append(fare_amount).
                append("\t").append(exta).
                append("\t").append(mta_tax).
                append("\t").append(tip_amount).
                append("\t").append(total_amount).
                append("\t").append(payment_type).
                toString());
    }
}

Analysis 2:

Driver class:

package analysis1;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author reshmip
 */
public class Analysis1 {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // TODO code application logic here
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "summarize trips");
        job.setJarByClass(Analysis1.class);
        job.setMapperClass(Analysis1_Mapper.class);
        job.setMapOutputValueClass(CustomWritable.class);
        // The reducer only aggregates, so it is reused as the combiner.
        job.setCombinerClass(Analysis1_Reducer.class);
        job.setReducerClass(Analysis1_Reducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(CustomWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Mapper:

package analysis1;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * @author reshmip
 */
public class Analysis1_Mapper extends Mapper<Object, Text, Text, CustomWritable> {

    private CustomWritable customwritable = new CustomWritable();
    //private IntWritable ride;
    private Text tripdate = new Text();

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        double distance = 0;
        double fare = 0;
        String[] line_values = line.split(",");
        try {
            if (line_values.length == 21) {
                // Skip the header row and records with an empty or "na" pickup timestamp.
                // String contents are compared with equals(), not with the != operator.
                if (!line_values[1].equals("pickup_date") && !line_values[1].isEmpty() && !line_values[1].equals("na")) {
                    tripdate.set(line_values[1].split(" ")[0]);
                    customwritable.settrip_distance(Double.parseDouble(line_values[10]));
                    customwritable.settrip_fare(Double.parseDouble(line_values[11]));
                    customwritable.setmax_tip(Double.parseDouble(line_values[14]));
                    customwritable.setmax_toll(Double.parseDouble(line_values[15]));
                    context.write(tripdate, customwritable);
                }
            }
        } catch (NumberFormatException ex) {
            ex.getMessage();
        } catch (NullPointerException ex) {
            ex.getMessage();
        }
    }
}

Reducer:

package analysis1;

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * @author reshmip
 */
public class Analysis1_Reducer extends Reducer<Text, CustomWritable, Text, CustomWritable> {

    private CustomWritable result = new CustomWritable();
    @Override
    protected void reduce(Text key, Iterable<CustomWritable> values, Context context) throws IOException, InterruptedException {
        double sumtrip = 0;
        double sumfare = 0;
        Double max_tip = 0.0;
        Double max_toll = 0.0;
        result.setmax_tip(0.0);
        result.setmax_toll(0.0);
        result.settrip_distance(0.0);
        result.settrip_fare(0.0);
        for (CustomWritable val : values) {
            // Sum distance and fare; track the largest tip and toll seen for this date.
            sumtrip += val.gettrip_distance();
            sumfare += val.gettrip_fare();
            max_tip = val.getmax_tip();
            max_toll = val.getmax_toll();
            if (result.getmax_tip() == null || max_tip.compareTo(result.getmax_tip()) > 0) {
                result.setmax_tip(max_tip);
            }
            if (result.getmax_toll() == null || max_toll.compareTo(result.getmax_toll()) > 0) {
                result.setmax_toll(max_toll);
            }
        }
        result.settrip_distance(sumtrip);
        result.settrip_fare(sumfare);
        context.write(key, result);
    }
}

Custom Writable:

package analysis1;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

/**
 *
 * @author reshmip
 */
public class CustomWritable implements Writable {

    private Double trip_distance;
    private Double trip_fare;
    private Double max_tip;
    private Double max_toll;

    public Double gettrip_distance() {
        return trip_distance;
    }

    public void settrip_distance(Double trip_distance) {
        this.trip_distance = trip_distance;
    }

    public Double gettrip_fare() {
        return trip_fare;
    }

    public void settrip_fare(Double trip_fare) {
        this.trip_fare = trip_fare;
    }

    public Double getmax_tip() {
        return max_tip;
    }
    public void setmax_tip(Double max_tip) {
        this.max_tip = max_tip;
    }

    public Double getmax_toll() {
        return max_toll;
    }

    public void setmax_toll(Double max_toll) {
        this.max_toll = max_toll;
    }

    @Override
    public void write(DataOutput d) throws IOException {
        d.writeDouble(trip_distance);
        d.writeDouble(trip_fare);
        d.writeDouble(max_tip);
        // Written last so that the order matches readFields().
        d.writeDouble(max_toll);
    }

    @Override
    public void readFields(DataInput di) throws IOException {
        trip_distance = di.readDouble();
        trip_fare = di.readDouble();
        max_tip = di.readDouble();
        max_toll = di.readDouble();
    }

    public String toString() {
        return (new StringBuilder().append(trip_distance).append("\t").append(trip_fare).append("\t").append(max_tip).append("\t").append(max_toll).toString());
    }
}

Analysis 3:

Driver class:

package analysis2;

import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/**
 * @author reshmip
 */
public class Analysis2 {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // TODO code application logic here
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "first secondary sorting by date and hours");
        job.setJarByClass(Analysis2.class);
        job.setMapperClass(SecondarySortMapper.class);
        job.setMapOutputKeyClass(CompositeKeyWritable.class);
        job.setMapOutputValueClass(CustomValueWritable.class);
        //job.setGroupingComparatorClass(GroupingComparator.class);
        //job.setNumReduceTasks(0);
        job.setReducerClass(SecondarySortReducer.class);
        job.setOutputKeyClass(CompositeKeyWritable.class);
        job.setOutputValueClass(CustomValueWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        boolean complete = job.waitForCompletion(true);

        // Second job: runs only if the first succeeded and reads the first job's output.
        Configuration conf2 = new Configuration();
        Job job2 = Job.getInstance(conf2, "second secondary sorting on date and rides");
        if (complete) {
            job2.setJarByClass(Analysis2.class);
            job2.setMapperClass(PeakAnalysisMapper.class);
            job2.setMapOutputKeyClass(PeakAnalysisWritable.class);
            job2.setMapOutputValueClass(PeakAnalysisValueWritable.class);
            job2.setReducerClass(PeakAnalysisReducer.class);
            job2.setOutputKeyClass(PeakAnalysisWritable.class);
            job2.setOutputValueClass(PeakAnalysisValueWritable.class);
            FileInputFormat.addInputPath(job2, new Path(args[1]));
            FileOutputFormat.setOutputPath(job2, new Path(args[2]));
            System.exit(job2.waitForCompletion(true) ? 0 : 1);
        }
    }
}

Composite Key Writable:

package analysis2;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableUtils;

/**
 * @author reshmip
 */
public class CompositeKeyWritable implements WritableComparable<CompositeKeyWritable> {

    private String ride_date;
    private String ride_time;

    public String getride_date() {
        return ride_date;
    }

    public void setride_date(String ride_date) {
        this.ride_date = ride_date;
    }

    public String getride_time() {
        return ride_time;
    }

    public void setride_time(String ride_time) {
        this.ride_time = ride_time;
    }

    @Override
    public void write(DataOutput d) throws IOException {
        WritableUtils.writeString(d, ride_date);
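        WritableUtils.writeString(d, ride_time);
    }

    @Override
    public void readFields(DataInput di) throws IOException {
        ride_date = WritableUtils.readString(di);
        ride_time = WritableUtils.readString(di);
    }

    public String toString() {
        // Tab-separated date and hour, the layout that PeakAnalysisMapper parses.
        return (new StringBuilder().append(ride_date).append("\t").append(ride_time).toString());
    }

    @Override
    public int compareTo(CompositeKeyWritable o) {
        // Compare by date first, then by hour; the sign is flipped for descending order.
        int result = ride_date.compareTo(o.ride_date);
        if (result == 0) {
            result = ride_time.compareTo(o.ride_time);
        }
        return (-1) * result;
    }
}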
Composite Value Writable:

package analysis2;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

/**
 * @author reshmip
 */
public class CustomValueWritable implements Writable {

    private Double ride_amount;
    private int count_rides;

    public Double getride_amount() {
        return ride_amount;
    }

    public void setride_amount(Double ride_amount) {
        this.ride_amount = ride_amount;
    }

    public int getcount_rides() {
        return count_rides;
    }

    public void setcount_rides(int count_rides) {
        this.count_rides = count_rides;
    }

    @Override
    public void write(DataOutput d) throws IOException {
        d.writeInt(count_rides);
        // Written second so that the order matches readFields().
        d.writeDouble(ride_amount);
    }

    @Override
    public void readFields(DataInput di) throws IOException {
        count_rides = di.readInt();
        ride_amount = di.readDouble();
    }

    public String toString() {
        return (new StringBuilder().append(ride_amount).append("\t").append(count_rides).toString());
    }
}
Grouping Comparator:

package analysis2;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * @author reshmip
 */
public class GroupingComparator extends WritableComparator {

    protected GroupingComparator() {
        super(CompositeKeyWritable.class, true);
    }
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        // Group keys by date only, ignoring the hour.
        CompositeKeyWritable cw1 = (CompositeKeyWritable) w1;
        CompositeKeyWritable cw2 = (CompositeKeyWritable) w2;
        return cw1.getride_date().compareTo(cw2.getride_date());
    }
}

Secondary Sort Mapper:

package analysis2;

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * @author reshmip
 */
public class SecondarySortMapper extends Mapper<Object, Text, CompositeKeyWritable, CustomValueWritable> {

    private DoubleWritable total_amount = new DoubleWritable();
    private CompositeKeyWritable cw = new CompositeKeyWritable();
    private CustomValueWritable customval = new CustomValueWritable();

    public void map(Object key, Text value, Context context) {
        String values[] = value.toString().split("\\t");
        cw.setride_date("");
        cw.setride_time("");
        customval.setcount_rides(0);
        customval.setride_amount(0.0);
        try {
            if (values.length == 13) {
                String date = values[0];
                String hours = values[1].split(":")[0];
                Double amount = Double.parseDouble(values[12]);
                //cw = new CompositeKeyWritable(date,hours);
                cw.setride_date(date);
                cw.setride_time(hours);
                //total_amount.set(amount);
                customval.setcount_rides(1);
                customval.setride_amount(amount);
                context.write(cw, customval);
            }
        } catch (IOException | InterruptedException ex) {
            System.out.println("Error Message:" + ex.getMessage());
        }
    }
}

Secondary Sort Reducer:

package analysis2;

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * @author reshmip
 */
public class SecondarySortReducer extends Reducer<CompositeKeyWritable, CustomValueWritable, CompositeKeyWritable, CustomValueWritable> {

    //Double totalamt = 0.0;
    CustomValueWritable customval = new CustomValueWritable();
    private DoubleWritable total_amount = new DoubleWritable();

    @Override
    protected void reduce(CompositeKeyWritable key, Iterable<CustomValueWritable> values, Context context) throws IOException, InterruptedException {
        // Total amount and ride count for one {date, hour} key.
        double sumamount = 0;
        int totalrides = 0;
        for (CustomValueWritable val : values) {
            sumamount += val.getride_amount();
            totalrides += val.getcount_rides();
        }
        customval.setride_amount(sumamount);
        customval.setcount_rides(totalrides);
        context.write(key, customval);
    }
}

Peak Analysis Writable:
package analysis2;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableUtils;

/**
 * @author reshmip
 */
public class PeakAnalysisWritable implements Writable, WritableComparable<PeakAnalysisWritable> {

    private String ride_date;
    private Integer count_rides;

    public String getride_date() {
        return ride_date;
    }

    public void setride_date(String ride_date) {
        this.ride_date = ride_date;
    }

    public Integer getcount_rides() {
        return count_rides;
    }

    public void setcount_rides(Integer count_rides) {
        this.count_rides = count_rides;
    }

    @Override
    public void write(DataOutput d) throws IOException {
        WritableUtils.writeString(d, ride_date);
        // Written second so that the order matches readFields().
        d.writeInt(count_rides);
    }

    @Override
    public void readFields(DataInput di) throws IOException {
        ride_date = WritableUtils.readString(di);
        count_rides = di.readInt();
    }

    @Override
    public int compareTo(PeakAnalysisWritable o) {
        int result = ride_date.compareTo(o.ride_date);
        if (result == 0) {
            result = count_rides.compareTo(o.count_rides);
        }
        return (-1) * result;
    }

    public String toString() {
        return (new StringBuilder().append(ride_date).append("\t").append(count_rides).toString());
    }
}

Peak Analysis Value Writable:

package analysis2;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableUtils;

/**
 * @author reshmip
 */
public class PeakAnalysisValueWritable implements Writable {

    private Double ride_amount;
    private String ride_time;

    public Double getride_amount() {
        return ride_amount;
    }

    public void setride_amount(Double ride_amount) {
        this.ride_amount = ride_amount;
    }

    public String getride_time() {
        return ride_time;
    }

    public void setride_time(String ride_time) {
        this.ride_time = ride_time;
    }

    @Override
    public void write(DataOutput d) throws IOException {
        WritableUtils.writeString(d, ride_time);
        // Written second so that the order matches readFields().
        d.writeDouble(ride_amount);
    }

    @Override
    public void readFields(DataInput di) throws IOException {
        ride_time = WritableUtils.readString(di);
        ride_amount = di.readDouble();
    }

    public String toString() {
        return (new StringBuilder().append(ride_time).append("\t").append(ride_amount).toString());
    }
}

Peak Analysis Mapper:

package analysis2;

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 *
 * @author reshmip
 */
public class PeakAnalysisMapper extends Mapper<Object, Text, PeakAnalysisWritable, PeakAnalysisValueWritable> {

    private PeakAnalysisWritable cw = new PeakAnalysisWritable();
    private PeakAnalysisValueWritable customval = new PeakAnalysisValueWritable();

    public void map(Object key, Text value, Context context) {
        // Input columns: date, hour, amount, ride count (tab-separated).
        String values[] = value.toString().split("\\t");
        cw.setride_date("");
        cw.setcount_rides(0);
        customval.setride_time("");
        customval.setride_amount(0.0);
        try {
            if (values.length == 4) {
                String date = values[0];
                String hours = values[1];
                Double amount = Double.parseDouble(values[2]);
                int count = Integer.parseInt(values[3]);
                //cw = new CompositeKeyWritable(date,hours);
                cw.setride_date(date);
                cw.setcount_rides(count);
                //total_amount.set(amount);
                customval.setride_time(hours);
                customval.setride_amount(amount);
                context.write(cw, customval);
            }
        } catch (IOException | InterruptedException ex) {
            System.out.println("Error Message:" + ex.getMessage());
        }
    }
}

Peak Analysis Reducer:

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */
package analysis2;

import java.io.IOException;
import org.apache.hadoop.mapreduce.Reducer;

/**
 *
 * @author reshmip
 */
public class PeakAnalysisReducer extends Reducer<PeakAnalysisWritable, PeakAnalysisValueWritable, PeakAnalysisWritable, PeakAnalysisValueWritable> {

    //Double totalamt = 0.0;
    private PeakAnalysisWritable customkey = new PeakAnalysisWritable();
    private PeakAnalysisValueWritable customvalue = new PeakAnalysisValueWritable();

    @Override
    protected void reduce(PeakAnalysisWritable key, Iterable<PeakAnalysisValueWritable> values, Context context) throws IOException, InterruptedException {
        // Keys arrive already sorted by the composite key, so each value is passed through unchanged.
        for (PeakAnalysisValueWritable val : values) {
            context.write(key, val);
        }
    }
}

Analysis 4:

Driver class:

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */
package analysis4;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/**
 *
 * @author reshmip
 */
public class Analysis4 {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // TODO code application logic here
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "partitioning pattern");
        job.setJarByClass(Analysis4.class);
        job.setMapperClass(Analysis4_Mapper.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(FloatWritable.class);
        // MultipleOutputs.addNamedOutput(job, "bins", TextOutputFormat.class, Text.class, NullWritable.class);
        // MultipleOutputs.setCountersEnabled(job, true);
        job.setPartitionerClass(GroupByDayPartitioner.class);
        job.setCombinerClass(Analysis4_Reducer.class);
        // One reducer per day of the week.
        job.setNumReduceTasks(7);
        //job.setNumReduceTasks(0);
        job.setCombinerClass(Analysis4_Reducer.class);
        job.setReducerClass(Analysis4_Reducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(FloatWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        boolean complete = job.waitForCompletion(true);

        Configuration conf2 = new Configuration();
        Job job2 = Job.getInstance(conf2, "Borough Rides");
        if (complete) {
            job2.setJarByClass(Analysis4.class);
            job2.setMapperClass(IdentitiyMapper.class);
            job2.setMapOutputKeyClass(NullWritable.class);
            job2.setMapOutputValueClass(Text.class);
            job2.setReducerClass(IdentityReducer.class);
            job2.setOutputKeyClass(NullWritable.class);
            job2.setOutputValueClass(Text.class);
            // The second job reads the output directory of the first job.
            FileInputFormat.addInputPath(job2, new Path(args[1]));
            FileOutputFormat.setOutputPath(job2, new Path(args[2]));
            System.exit(job2.waitForCompletion(true) ? 0 : 1);
        }
    }
}

Custom Writable:

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */
package analysis4;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableUtils;
/**
 *
 * @author reshmip
 */
public class CustomWritable implements Writable {

    private String ride_date;
    private Double ride_amount;

    public String getride_date() {
        return ride_date;
    }

    public void setride_date(String ride_date) {
        this.ride_date = ride_date;
    }

    public Double getride_amount() {
        return ride_amount;
    }

    public void setride_amount(Double ride_amount) {
        this.ride_amount = ride_amount;
    }

    @Override
    public void write(DataOutput d) throws IOException {
        WritableUtils.writeString(d, ride_date);
        // Write the amount so that readFields() can read it back in the same order.
        d.writeDouble(ride_amount);
    }

    @Override
    public void readFields(DataInput di) throws IOException {
        ride_date = WritableUtils.readString(di);
        ride_amount = di.readDouble();
    }

    public String toString() {
        return (new StringBuilder().append(ride_date).append("\t").append(ride_amount).toString());
        //.append("\t").append(ride_amount).toString());
    }
}

Group by Partitioner:

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */
package analysis4;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 *
 * @author reshmip
 */
public class GroupByDayPartitioner extends Partitioner<IntWritable, FloatWritable> {

    @Override
    public int getPartition(IntWritable key, FloatWritable value, int i) {
        // The key is the day of the week; i is the number of reduce tasks.
        return (key.get() % i);
    }
}

Mapper:

package analysis4;

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */

/**
 *
 * @author reshmip
 */
public class Analysis4_Mapper extends Mapper<Object, Text, IntWritable, FloatWritable> {

    // private MultipleOutputs<Text, NullWritable> mos = null;
    // "MM" is the month field; a lowercase "mm" would be parsed as minutes.
    private final static SimpleDateFormat frmt = new SimpleDateFormat("yyyy-MM-dd");
    private CustomWritable tuple = new CustomWritable();

    // protected void setup(Context context) throws IOException, InterruptedException {
    //     mos = new MultipleOutputs(context);
    // }

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        Calendar cal = Calendar.getInstance();
        String[] row = value.toString().split("\\t");
        String pickupdate = row[0];
        int day = 0;
        float surcharge = 0;
        try {
            cal.setTime(frmt.parse(pickupdate));
            day = cal.get(Calendar.DAY_OF_WEEK);
            tuple.setride_date(pickupdate);
            tuple.setride_amount(Double.parseDouble(row[11]));
            // Surcharge is the difference between column 11 and column 7 of the cleansed record.
            surcharge = Float.parseFloat(row[11]) - Float.parseFloat(row[7]);
            context.write(new IntWritable(day), new FloatWritable(surcharge));
        } catch (ParseException ex) {
            Logger.getLogger(Analysis4_Mapper.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}

Reducer:

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */
package analysis4;

import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 *
 * @author reshmip
 */
public class Analysis4_Reducer extends Reducer<IntWritable, FloatWritable, IntWritable, FloatWritable> {

    private CustomWritable result = new CustomWritable();

    // protected void reduce(IntWritable key, Iterable<FloatWritable> values, Context context) throws IOException, InterruptedException {
    //     float total_amount = 0;
    //     for (FloatWritable t : values) {
    //         total_amount += t.get();
    //     }
    //
    //     amount.set(total_amount);
    //     context.write(key, amount);
    // }

    @Override
    protected void reduce(IntWritable key, Iterable<FloatWritable> values, Context context) throws IOException, InterruptedException {
        // Sum the surcharges for one day of the week.
        float amount = 0;
        String date = "";
        for (FloatWritable val : values) {
            //date = val.getride_date();
            amount += val.get();
        }
        //result.setride_amount(amount);
        result.setride_date(date);
        context.write(key, new FloatWritable(amount));
    }
}

Identity Mapper:

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */
package analysis4;

import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 *
 * @author reshmip
 */
public class IdentitiyMapper extends Mapper<Object, Text, NullWritable, Text> {

    private Text outkey = new Text();

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        //To change body of generated methods, choose Tools | Templates.
        context.write(NullWritable.get(), value);
    }
}

Identity Reducer:

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */
package analysis4;

import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 *
 * @author reshmip
 */
public class IdentityReducer extends Reducer<NullWritable, Text, NullWritable, Text> {

    @Override
    protected void reduce(NullWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        for (Text value : values) {
            context.write(key, value);
        }
    }
}

Analysis 5:

Driver:

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */
package innerjoin;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/**
 *
 * @author reshmip
 */
public class InnerJoin {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "inner_join");
        job.setJarByClass(InnerJoin.class);
        // Each input dataset gets its own mapper.
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, InnerJoin_Mapper1.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, InnerJoin_Mapper2.class);
        job.setReducerClass(InnerJoin_Reducer1.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path(args[2]));
        boolean complete = job.waitForCompletion(true);

        Configuration conf2 = new Configuration();
        Job job2 = Job.getInstance(conf2, "Most Passangers");
        if (complete) {
            job2.setJarByClass(InnerJoin.class);
            // The second job reads the join output, so the input path is set on job2 rather than job.
            FileInputFormat.addInputPath(job2, new Path(args[2]));
            //MultipleInputs.addInputPath(job2, new Path(args[3]), TextInputFormat.class,JoinMapper4.class);
            job2.setMapperClass(InnerJoin_Mapper3.class);
            job2.setMapOutputKeyClass(Text.class);
            job2.setMapOutputValueClass(IntWritable.class);
            job2.setReducerClass(InnerJoin_Reducer2.class);
            job2.setOutputFormatClass(TextOutputFormat.class);
            TextOutputFormat.setOutputPath(job2, new Path(args[3]));
            job2.setOutputKeyClass(Text.class);
            job2.setOutputValueClass(IntWritable.class);
            System.exit(job2.waitForCompletion(true) ? 0 : 1);
        }
    }
}

Mapper 1:

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */
package innerjoin;

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 *
 * @author reshmip
 */
public class InnerJoin_Mapper1 extends Mapper<Object, Text, Text, Text> {

    private Text outkey = new Text();
    private Text outvalue = new Text();

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String[] separatedinput = value.toString().split("\\t");
        //String id = separatedinput[6];
        String pickuploc = separatedinput[6];
        // Skip records without a join key.
        if (pickuploc == null || pickuploc.isEmpty()) {
            return;
        }
        outkey.set(pickuploc);
        // Tag the record with "A" so the reducer can tell which dataset it came from.
        outvalue.set("A" + value);
        context.write(outkey, outvalue);
    }
}

Mapper 2:

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */
package innerjoin;

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 *
 * @author reshmip
 */
public class InnerJoin_Mapper2 extends Mapper<Object, Text, Text, Text> {

    private Text outkey = new Text();
    private Text outvalue = new Text();

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] line_values = line.split(",");
        if (line_values.length == 5) {
            String latitude = value.toString().split(",")[4].trim();
            // Skip records without a join key.
            if (latitude == null || latitude.isEmpty()) {
                return;
            }
            outkey.set(latitude);
            // Tag the record with "B" so the reducer can tell which dataset it came from.
            outvalue.set("B" + value);
            context.write(outkey, outvalue);
        }
    }
}

Mapper 3:

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */
package innerjoin;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 *
 * @author reshmip
 */
public class InnerJoin_Mapper3 extends Mapper<Object, Text, Text, IntWritable> {

    private Text outkey = new Text();
    private IntWritable outvalue = new IntWritable();

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String lines = value.toString();
        String[] line = lines.split("\\t");
        if (line.length == 18) {
            String area = line[17];
            // Check for null first so the string comparisons cannot throw.
            if (area != null && !area.equalsIgnoreCase("") && !area.equalsIgnoreCase("null"))
            {
                String[] locations = area.split(",");
                if (locations.length > 1) {
                    String borough = locations[0];
                    outkey.set(borough);
                    outvalue.set(Integer.parseInt(locations[5]));
                    context.write(outkey, outvalue);
                }
            }
        }
    }
}

Reducer 1:

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */
package innerjoin;

import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 *
 * @author reshmip
 */
public class InnerJoin_Reducer1 extends Reducer<Text, Text, Text, Text> {

    public static final Text EMPTY_TEXT = new Text();
    private Text tmp = new Text();
    private ArrayList<Text> lista = new ArrayList<Text>();
    private ArrayList<Text> listb = new ArrayList<Text>();
    private String jointype = null;

    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        lista.clear();
        listb.clear();
        // Split the values for this key into two lists by their source tag.
        while (values.iterator().hasNext()) {
            tmp = values.iterator().next();
            if (tmp.charAt(0) == 'A') {
                lista.add(new Text(tmp.toString().substring(1)));
            } else if (tmp.charAt(0) == 'B') {
                listb.add(new Text(tmp.toString().substring(1)));
            }
        }
        executejoinlogic(context);
    }

    private void executejoinlogic(Context context) throws IOException, InterruptedException {
        // Inner join: emit a pair only when both sides have records for the key.
        if (!lista.isEmpty() && !listb.isEmpty()) {
            for (Text A : lista) {
                for (Text B : listb) {
                    context.write(A, B);
                }
            }
        }
    }
}

Reducer 2:

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */
package innerjoin;
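
The join logic of Reducer 1 can be tried out without a Hadoop cluster. The following is a minimal sketch in plain Java, not the project's code: the class name ReduceSideJoinSketch, the method innerJoin, and the sample records are hypothetical, and the sketch assumes that every value for one key starts with a one-character source tag ('A' or 'B') as produced by Mapper 1 and Mapper 2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the reduce step of a reduce-side inner join.
final class ReduceSideJoinSketch {

    // Takes all tagged values that share one join key and returns the joined rows.
    static List<String> innerJoin(List<String> taggedValues) {
        List<String> listA = new ArrayList<>();
        List<String> listB = new ArrayList<>();
        for (String tagged : taggedValues) {
            if (tagged.isEmpty()) {
                continue; // no tag to dispatch on
            }
            String record = tagged.substring(1); // drop the one-character source tag
            if (tagged.charAt(0) == 'A') {
                listA.add(record);
            } else if (tagged.charAt(0) == 'B') {
                listB.add(record);
            }
        }
        // Inner join: the nested loop emits nothing when either side is empty.
        List<String> joined = new ArrayList<>();
        for (String a : listA) {
            for (String b : listB) {
                joined.add(a + "\t" + b);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Made-up sample values for a single key.
        List<String> values = Arrays.asList("Atrip-1", "Bzone-X", "Atrip-2");
        for (String row : innerJoin(values)) {
            System.out.println(row);
        }
    }
}
```

Because both lists are held in memory for each key, this approach assumes that the number of records per join key stays small enough to fit on one reducer.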
0 Using Big Data for the analysis of historic context information Francisco Romero Bueno Technological Specialist. FIWARE data engineer francisco.romerobueno@telefonica.com Big Data: What is it and how
More informationBig Data Analytics CP3620
Big Data Analytics CP3620 Big Data Some facts: 2.7 Zettabytes (2.7 billion TB) of data exists in the digital universe and it s growing. Facebook stores, accesses, and analyzes 30+ Petabytes (1000 TB) of
More informationIntWritable w1 = new IntWritable(163); IntWritable w2 = new IntWritable(67); assertthat(comparator.compare(w1, w2), greaterthan(0));
factory for RawComparator instances (that Writable implementations have registered). For example, to obtain a comparator for IntWritable, we just use: RawComparator comparator = WritableComparator.get(IntWritable.class);
More informationEE657 Spring 2012 HW#4 Zhou Zhao
EE657 Spring 2012 HW#4 Zhou Zhao Problem 6.3 Solution Referencing the sample application of SimpleDB in Amazon Java SDK, a simple domain which includes 5 items is prepared in the code. For instance, the
More informationBig Data Analytics: Insights and Innovations
International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 6, Issue 10 (April 2013), PP. 60-65 Big Data Analytics: Insights and Innovations
More informationLAMPIRAN. public static void runaniteration (String datafile, String clusterfile) {
DAFTAR PUSTAKA [1] Mishra Shweta, Badhe Vivek. (2016), Improved Map Reduce K Means Clustering Algorithm for Hadoop Architectur, International Journal Of Engineering and Computer Science, 2016, IJECS. [2]
More informationCloud Computing. Up until now
Cloud Computing Lecture 9 Map Reduce 2010-2011 Introduction Up until now Definition of Cloud Computing Grid Computing Content Distribution Networks Cycle-Sharing Distributed Scheduling 1 Outline Map Reduce:
More informationHortonworks HDPCD. Hortonworks Data Platform Certified Developer. Download Full Version :
Hortonworks HDPCD Hortonworks Data Platform Certified Developer Download Full Version : https://killexams.com/pass4sure/exam-detail/hdpcd QUESTION: 97 You write MapReduce job to process 100 files in HDFS.
More informationMapReduce-style data processing
MapReduce-style data processing Software Languages Team University of Koblenz-Landau Ralf Lämmel and Andrei Varanovich Related meanings of MapReduce Functional programming with map & reduce An algorithmic
More informationHadoop 2.8 Configuration and First Examples
Hadoop 2.8 Configuration and First Examples Big Data - 29/03/2017 Apache Hadoop & YARN Apache Hadoop (1.X) De facto Big Data open source platform Running for about 5 years in production at hundreds of
More informationTable of Contents. Chapter Topics Page No. 1 Meet Hadoop
Table of Contents Chapter Topics Page No 1 Meet Hadoop - - - - - - - - - - - - - - - - - - - - - - - - - - - 3 2 MapReduce - - - - - - - - - - - - - - - - - - - - - - - - - - - - 10 3 The Hadoop Distributed
More informationComputer Science 572 Exam Prof. Horowitz Tuesday, April 24, 2017, 8:00am 9:00am
Computer Science 572 Exam Prof. Horowitz Tuesday, April 24, 2017, 8:00am 9:00am Name: Student Id Number: 1. This is a closed book exam. 2. Please answer all questions. 3. There are a total of 40 questions.
More informationBig Data Analytics. 4. Map Reduce I. Lars Schmidt-Thieme
Big Data Analytics 4. Map Reduce I Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany original slides by Lucas Rego
More informationIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce Who Am I - Ryan Tabora - Data Developer at Think Big Analytics - Big Data Consulting - Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc. 2 Who Am I -
More informationBig Data and Scripting map reduce in Hadoop
Big Data and Scripting map reduce in Hadoop 1, 2, connecting to last session set up a local map reduce distribution enable execution of map reduce implementations using local file system only all tasks
More informationA Guide to Running Map Reduce Jobs in Java University of Stirling, Computing Science
A Guide to Running Map Reduce Jobs in Java University of Stirling, Computing Science Introduction The Hadoop cluster in Computing Science at Stirling allows users with a valid user account to submit and
More informationCOSC 6397 Big Data Analytics. Data Formats (III) HBase: Java API, HBase in MapReduce and HBase Bulk Loading. Edgar Gabriel Spring 2014.
COSC 6397 Big Data Analytics Data Formats (III) HBase: Java API, HBase in MapReduce and HBase Bulk Loading Edgar Gabriel Spring 2014 Recap on HBase Column-Oriented data store NoSQL DB Data is stored in
More informationAn efficient map-reduce algorithm for spatio-temporal analysis using Spark (GIS Cup)
Rensselaer Polytechnic Institute Universidade Federal de Viçosa An efficient map-reduce algorithm for spatio-temporal analysis using Spark (GIS Cup) Prof. Dr. W Randolph Franklin, RPI Salles Viana Gomes
More informationData-Intensive Computing with MapReduce
Data-Intensive Computing with MapReduce Session 2: Hadoop Nuts and Bolts Jimmy Lin University of Maryland Thursday, January 31, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share
More informationECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing
ECE5610/CSC6220 Introduction to Parallel and Distribution Computing Lecture 6: MapReduce in Parallel Computing 1 MapReduce: Simplified Data Processing Motivation Large-Scale Data Processing on Large Clusters
More informationProcessing big data with modern applications: Hadoop as DWH backend at Pro7. Dr. Kathrin Spreyer Big data engineer
Processing big data with modern applications: Hadoop as DWH backend at Pro7 Dr. Kathrin Spreyer Big data engineer GridKa School Karlsruhe, 02.09.2014 Outline 1. Relational DWH 2. Data integration with
More informationHadoop 3 Configuration and First Examples
Hadoop 3 Configuration and First Examples Big Data - 26/03/2018 Apache Hadoop & YARN Apache Hadoop (1.X) De facto Big Data open source platform Running for about 5 years in production at hundreds of companies
More informationBig Data con MATLAB. Lucas García The MathWorks, Inc. 1
Big Data con MATLAB Lucas García 2015 The MathWorks, Inc. 1 Agenda Introduction Remote Arrays in MATLAB Tall Arrays for Big Data Scaling up Summary 2 Architecture of an analytics system Data from instruments
More informationAbout this exam review
Final Exam Review About this exam review I ve prepared an outline of the material covered in class May not be totally complete! Exam may ask about things that were covered in class but not in this review
More informationInnovatus Technologies
HADOOP 2.X BIGDATA ANALYTICS 1. Java Overview of Java Classes and Objects Garbage Collection and Modifiers Inheritance, Aggregation, Polymorphism Command line argument Abstract class and Interfaces String
More informationVendor: Cloudera. Exam Code: CCD-410. Exam Name: Cloudera Certified Developer for Apache Hadoop. Version: Demo
Vendor: Cloudera Exam Code: CCD-410 Exam Name: Cloudera Certified Developer for Apache Hadoop Version: Demo QUESTION 1 When is the earliest point at which the reduce method of a given Reducer can be called?
More informationTI2736-B Big Data Processing. Claudia Hauff
TI2736-B Big Data Processing Claudia Hauff ti2736b-ewi@tudelft.nl Intro Streams Streams Map Reduce HDFS Pig Pig Design Pattern Hadoop Mix Graphs Giraph Spark Zoo Keeper Spark But first Partitioner & Combiner
More informationCSC 1315! Data Science
CSC 1315! Data Science Data Visualization Based on: Python for Data Analysis: http://hamelg.blogspot.com/2015/ Learning IPython for Interactive Computation and Visualization by C. Rossant Plotting with
More informationMap-Reduce in Various Programming Languages
Map-Reduce in Various Programming Languages 1 Context of Map-Reduce Computing The use of LISP's map and reduce functions to solve computational problems probably dates from the 1960s -- very early in the
More informationToday s topics. FAQs. Modify the way data is loaded on disk. Methods of the InputFormat abstract. Input and Output Patterns --Continued
Spring 2017 3/29/2017 W11.B.1 CS435 BIG DATA Today s topics FAQs /Output Pattern Recommendation systems Collaborative Filtering Item-to-Item Collaborative filtering PART 2. DATA ANALYTICS WITH VOLUMINOUS
More informationOutline. What is Big Data? Hadoop HDFS MapReduce Twitter Analytics and Hadoop
Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School September 2012 This work is licensed
More information1/30/2019 Week 2- B Sangmi Lee Pallickara
Week 2-A-0 1/30/2019 Colorado State University, Spring 2019 Week 2-A-1 CS535 BIG DATA FAQs PART A. BIG DATA TECHNOLOGY 3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING Term project deliverable
More informationAttacking & Protecting Big Data Environments
Attacking & Protecting Big Data Environments Birk Kauer & Matthias Luft {bkauer, mluft}@ernw.de #WhoAreWe Birk Kauer - Security Researcher @ERNW - Mainly Exploit Developer Matthias Luft - Security Researcher
More informationClustering Documents. Document Retrieval. Case Study 2: Document Retrieval
Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox April 16 th, 2015 Emily Fox 2015 1 Document Retrieval n Goal: Retrieve
More informationIT 313 Advanced Application Development Midterm Exam
Page 1 of 9 February 12, 2019 IT 313 Advanced Application Development Midterm Exam Name Part A. Multiple Choice Questions. Circle the letter of the correct answer for each question. Optional: supply a
More informationHigh-Performance Analytics on Large- Scale GPS Taxi Trip Records in NYC
High-Performance Analytics on Large- Scale GPS Taxi Trip Records in NYC Jianting Zhang Department of Computer Science The City College of New York Outline Background and Motivation Parallel Taxi data management
More informationBenchmarking Distributed Stream Processing Platforms for IoT Applications
DISTRIBUTED RESEARCH ON EMERGING APPLICATIONS & MACHINES dream-lab.in Indian Institute of Science, Bangalore DREAM:Lab Benchmarking Distributed Stream Processing Platforms for IoT Applications Anshu Shukla
More informationHadoop Cluster Implementation
Hadoop Cluster Implementation By Aysha Binta Sayed ID:2013-1-60-068 Supervised By Dr. Md. Shamim Akhter Assistant Professor Department of Computer Science and Engineering East West University A project
More informationStudying software design patterns is an effective way to learn from the experience of others
Studying software design patterns is an effective way to learn from the experience of others Design Pattern allows the requester of a particular action to be decoupled from the object that performs the
More informationBig Data landscape Lecture #2
Big Data landscape Lecture #2 Contents 1 1 CORE Technologies 2 3 MapReduce YARN 4 SparK 5 Cassandra Contents 2 16 HBase 72 83 Accumulo memcached 94 Blur 10 5 Sqoop/Flume Contents 3 111 MongoDB 12 2 13
More information