FINAL PROJECT REPORT


NYC TAXI DATA ANALYSIS
Reshmi Padavala

Project Summary: For my final project, I decided to showcase my big data analysis skills by working on a dataset large enough that identifying patterns and visualizing it with ordinary tools is very difficult, and that instead calls for powerful concepts, like MapReduce, that I learned during this course. After doing research and weighing my options, I finally settled on the New York City (NYC) taxi data to produce an in-depth analysis of taxi ride patterns and behaviors. These datasets were made accessible by the NYC Taxi and Limousine Commission (TLC). The NYC taxi dataset contains trip records that include fields such as "pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts" [1]. The dataset is larger than I expected, containing millions of records, so I narrowed the scope of the project to a set of core analysis patterns. In this paper, I analyze trends in taxi rides in NYC from the different boroughs over the period covered by the dataset.

Dataset: NYC taxi dataset (TLC trip records); NYC geostats (taxi zone/borough lookup data).

Analysis:

Analysis 1: Data Cleansing
The dataset I chose is huge and contains raw data, so I had to cleanse the date fields, remove null records, and filter down to only the records I needed. For this I applied the Simple Filtering pattern.

Output:

Analysis 2: Statistical Analysis
This analysis identifies daily statistics such as the total rides, revenue, maximum toll charges, and maximum tip amount for each day in the dataset. I used the date as the key and generated the values with a custom Writable object. Since the work done in the reducer is aggregation, performance is optimized by reusing the reducer as a combiner.

Output:

Analysis 3: Peak Hour Analysis
This analysis identifies the peak hours in a day and the amount earned during those peak hours. The motivation is that rides are not spread evenly across the day: most rides are taken during office hours, either in the morning or in the evening. Using {Date, Hour} as a custom composite key, I performed Secondary Sorting and generated {total rides, total amount} as the value. That output is chained into a second Secondary Sorting job where {Date, total rides} is the key and {Hour, total amount} is the value, which yields the peak hours of each day.

Output:

Analysis 4: Day-Based Surcharge Analysis
The surcharge applied to a ride depends on the day of the week. Due to the high demand for rides on weekends, Saturday and Sunday are expected to accumulate more surcharges than the rest of the week.

To perform this analysis, I used the Partitioning pattern and divided the data into 7 partitions, one for each day of the week. Because the data is partitioned, later analysis can be performed on any particular partition of interest without the cost of running MapReduce over the entire dataset. After partitioning, the total surcharge for each day of the week was calculated from its respective partition, and the results match this expectation.

Output:

Analysis 5: Boroughs with the Most Riders
Every ride is connected to the neighborhood where it was picked up, and passengers in some locations rely on taxis more than others. This analysis finds out how many passengers from each borough have commuted by NYC taxi.

To perform this analysis, I had to do an inner join between the taxi ride data and the location data, so the joined output carries the pickup borough for each ride. That output is then grouped by borough to find the total passengers.

Output:

Analysis 6: Identifying Distinct Neighborhoods
Each borough has multiple zones. I wanted to know how many unique zones, or neighborhoods, exist in total, so I used the Distinct pattern to extract the unique zones from the NYC taxi dataset; a sketch of the pattern follows.
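The report's code listing below stops after Analysis 5, so the Distinct job itself is not shown. The following is a minimal sketch of how such a job could look; the class names, column index, and delimiter are illustrative assumptions rather than the project's actual code.

// Minimal sketch of the Distinct pattern (illustrative names; the zone column index and delimiter are assumptions).
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DistinctZones {

    public static class DistinctZoneMapper extends Mapper<Object, Text, Text, NullWritable> {

        private final Text zone = new Text();

        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            // Assume the zone/neighborhood name sits in a known column of the joined data.
            if (fields.length > 1 && !fields[1].trim().isEmpty()) {
                zone.set(fields[1].trim());
                context.write(zone, NullWritable.get()); // duplicates collapse onto the same reducer key
            }
        }
    }

    public static class DistinctZoneReducer extends Reducer<Text, NullWritable, Text, NullWritable> {

        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
            context.write(key, NullWritable.get()); // each unique zone is written exactly once
        }
    }
}

The reducer could also be registered as a combiner, since emitting each key once per mapper already removes most duplicates before the shuffle.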

Analysis 7: Top 10 Pickup Zones
This analysis identifies the most frequent pickup zones. I first checked how pickups from the different zones are distributed across New York City, and from that output, based on the total rides, I emitted the top ten zones using the Top Ten pattern; a sketch of the pattern follows.
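The Top Ten job is not included in the truncated code listing, so here is a minimal sketch of the pattern, assuming tab-separated "zone, total rides" lines from the previous step; the class and field layout are illustrative assumptions. Each mapper keeps only its local top ten in a TreeMap and emits them in cleanup(), and a single reducer repeats the same trick to produce the global top ten.

// Minimal sketch of the Top Ten pattern (illustrative; assumes tab-separated "zone<TAB>totalRides" input lines).
import java.io.IOException;
import java.util.TreeMap;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TopTenZones {

    public static class TopTenMapper extends Mapper<Object, Text, NullWritable, Text> {

        // Ordered by ride count; only the ten largest entries are kept per mapper.
        private final TreeMap<Long, Text> topTen = new TreeMap<>();

        @Override
        protected void map(Object key, Text value, Context context) {
            String[] parts = value.toString().split("\t");
            if (parts.length == 2) {
                topTen.put(Long.parseLong(parts[1]), new Text(value));
                if (topTen.size() > 10) {
                    topTen.remove(topTen.firstKey()); // drop the current smallest count
                }
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            for (Text record : topTen.values()) {
                context.write(NullWritable.get(), record); // at most ten records leave each mapper
            }
        }
    }

    // A single reducer applies the same logic to all of the local top-ten lists.
    public static class TopTenReducer extends Reducer<NullWritable, Text, NullWritable, Text> {

        private final TreeMap<Long, Text> topTen = new TreeMap<>();

        @Override
        protected void reduce(NullWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            for (Text value : values) {
                String[] parts = value.toString().split("\t");
                topTen.put(Long.parseLong(parts[1]), new Text(value));
                if (topTen.size() > 10) {
                    topTen.remove(topTen.firstKey());
                }
            }
            for (Text record : topTen.descendingMap().values()) {
                context.write(NullWritable.get(), record); // global top ten, largest first
            }
        }
    }
}

The job would be configured with job.setNumReduceTasks(1); zones with identical ride counts would need a composite key in a production version, since the TreeMap keeps only one entry per count.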

Output:

Analysis 8: Fare Analysis on Top 5 Zones
Based on the result above, I wanted to concentrate on the top 5 zones with the most riders. Since I knew exactly which zones I was looking for, I used a Bloom filter to discard the remaining zones that are not in the top 5 list, and on the filtered data I looked for the median fare charged in these zones. For optimization, I again made use of a combiner. A sketch of the filtering side follows.
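The Bloom filter code is not part of the truncated listing, so the mapper below is only a sketch using Hadoop's built-in org.apache.hadoop.util.bloom.BloomFilter; the zone names, column positions, and filter sizing are assumptions. With only five members a plain set would work just as well, but the sketch mirrors the Bloom filter pattern named above; in a larger setting the filter would typically be trained in a separate job and read from HDFS in setup().

// Minimal sketch of Bloom-filter filtering (zone names, column layout, and filter sizing are assumptions).
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class TopZoneFilterMapper extends Mapper<Object, Text, Text, Text> {

    private final BloomFilter zoneFilter = new BloomFilter(1000, 5, Hash.MURMUR_HASH);
    // Hypothetical top-5 zones; in the project these came from the Analysis 7 output.
    private static final String[] TOP_ZONES = {"ZoneA", "ZoneB", "ZoneC", "ZoneD", "ZoneE"};

    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void setup(Context context) {
        for (String zone : TOP_ZONES) {
            zoneFilter.add(new Key(zone.getBytes())); // train the filter with the wanted zones
        }
    }

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        // Assume fields[0] is the pickup zone and fields[1] is the fare amount.
        if (fields.length >= 2 && zoneFilter.membershipTest(new Key(fields[0].getBytes()))) {
            outKey.set(fields[0]);
            outValue.set(fields[1]);
            context.write(outKey, outValue); // only (probable) top-5 zones reach the median-fare reducer
        }
    }
}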

Output:

Analysis 9: Calculate the Longest Rides
I used a Pig Latin script to perform this analysis. I LOADed the taxi data and the zone data into relations, JOINed them on the pickup ID to get the pickup zone name, then joined the result with the zone data again on the drop-off ID to get the drop-off zone. I then ORDERed the result by the total distance covered in the ride and LIMITed the output to the top 20 longest rides; a script along these lines is sketched below.
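The Pig script itself is missing from the truncated listing; the following sketch follows the steps described above, with hypothetical paths, schemas, and field names rather than the project's originals.

-- Sketch of the described flow; paths, schemas, and field names are assumptions, not the original script.
trips = LOAD 'nyc_taxi/trips' USING PigStorage(',')
        AS (pickup_id:int, dropoff_id:int, trip_distance:double, total_amount:double);
zones = LOAD 'nyc_taxi/zones' USING PigStorage(',')
        AS (zone_id:int, zone_name:chararray, borough:chararray);

-- Join on the pickup ID to attach the pickup zone name.
j1 = JOIN trips BY pickup_id, zones BY zone_id;
p1 = FOREACH j1 GENERATE trips::dropoff_id AS dropoff_id,
                         trips::trip_distance AS trip_distance,
                         zones::zone_name AS pickup_zone;

-- Join the result with the zone data again on the drop-off ID.
j2 = JOIN p1 BY dropoff_id, zones BY zone_id;
p2 = FOREACH j2 GENERATE p1::pickup_zone AS pickup_zone,
                         zones::zone_name AS dropoff_zone,
                         p1::trip_distance AS trip_distance;

-- Order by distance covered and keep the 20 longest rides.
ordered = ORDER p2 BY trip_distance DESC;
longest = LIMIT ordered 20;
DUMP longest;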


Analysis 10: Calculate Total Rides from Each Borough
I used a Pig Latin script for this analysis as well. I GROUPed the loaded data by pickup borough and, on each group, calculated the total number of rides.

[Pie chart: Total rides from each borough, split across EWR, Bronx, Queens, Unknown, Brooklyn, Manhattan, and Staten Island]

Output:

Analysis 11: Find the Number of Riders from Each Borough Every Day
To perform this analysis I used Hive. After loading the data, I grouped it by pickup date and pickup borough and generated the number of riders; a query along these lines is sketched below.
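The Hive query is likewise missing from the truncated listing; the sketch below assumes a hypothetical taxi_trips table with pickup_datetime, pickup_borough, and passenger_count columns, which are not the project's actual names.

-- Sketch of the described Hive query (table and column names are assumptions, not the original DDL).
SELECT to_date(pickup_datetime) AS pickup_date,
       pickup_borough,
       SUM(passenger_count)     AS total_riders
FROM   taxi_trips
GROUP BY to_date(pickup_datetime), pickup_borough
ORDER BY pickup_date, pickup_borough;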


Programming Code:

Analysis 1:

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */

package analysis3;

import java.io.IOException;
import java.text.DecimalFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author reshmip
 */
public class Analysis3 {

    /**
     * @param args the command line arguments
     */

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // TODO code application logic here
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "filtering data");
        job.setJarByClass(Analysis3.class);
        job.setMapperClass(FilteringMapper.class);
        job.setMapOutputValueClass(CustomWritable.class);
        job.setMapOutputKeyClass(NullWritable.class);
        //job.setOutputKeyClass(NullWritable.class);
        //job.setOutputValueClass(CustomWritable.class);
        job.setNumReduceTasks(1);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    public static class FilteringMapper extends Mapper<Object, Text, NullWritable, CustomWritable> {

        private CustomWritable customWritable = new CustomWritable();
        // Date pattern assumed ("yyyy-MM-dd"); the original value was cut off in the report.
        private final static SimpleDateFormat frmt = new SimpleDateFormat("yyyy-MM-dd");

        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {

            String line = value.toString();
            String[] line_values = line.split(",");
            Calendar cal = Calendar.getInstance();
            DecimalFormat numberformat = new DecimalFormat("#.0000");
            try {
                if (line_values.length == 21) {
                    if (!line_values[1].equals("lpep_pickup_datetime") && line_values[1] != null && !line_values[1].equals("null")) {
                        String[] pickupdatestring = line_values[1].split(" ");
                        String pickupdate = pickupdatestring[0];
                        String pickuptime = pickupdatestring[1];
                        cal.setTime(frmt.parse(pickupdate));
                        String pick_date = cal.getTime().toString();
                        // String[] new_date = pick_date.split(" ");
                        // String new_pickdate = new StringBuilder().append(new_date[0])
                        //         .append(" ").append(new_date[1]).append(" ")
                        //         .append(new_date[2]).append(" ").append(new_date[5]).toString();
                        customWritable.setRide_pickup_date(pickupdate);
                        customWritable.setRide_pickup_time(pickuptime);
                        // String[] dropoffdatestring = line_values[2].split(" ");
                        // String dropoffdate = dropoffdatestring[0];
                        // String dropofftime = dropoffdatestring[1];
                        // customWritable.setRide_dropoff_date(dropoffdate);
                        // customWritable.setRide_dropoff_time(dropofftime);
                        // customWritable.setRatecodeid(Integer.parseInt(line_values[4]));
                        String pickup_longitude = "", drop_longitude = "";
                        String pickup_latitude = "", drop_latitiude = "";
                        if (line_values[5].length() > 8 && line_values[6].length() > 7 && line_values[7].length() > 8 && line_values[8].length() > 7) {
                            pickup_longitude = line_values[5].substring(0, 8);
                            pickup_latitude = line_values[6].substring(0, 7);
                            drop_longitude = line_values[7].substring(0, 8);
                            drop_latitiude = line_values[8].substring(0, 7);
                            //System.err.println("coordinates:" + longitude);
                            customWritable.setPick_longitude(pickup_longitude);
                            customWritable.setPickup_latitude(pickup_latitude);
                            // customWritable.setDropoff_longitude(drop_longitude);
                            // customWritable.setDropoff_latitude(drop_latitiude);
                            customWritable.setPassengers(Integer.parseInt(line_values[9]));
                            customWritable.setTrip_distance(Double.parseDouble(line_values[10]));
                            customWritable.setFare_amount(Double.parseDouble(line_values[11]));
                            customWritable.setExta(Double.parseDouble(line_values[12]));
                            customWritable.setMta_tax(Double.parseDouble(line_values[13]));
                            customWritable.setTip_amount(Double.parseDouble(line_values[14]));
                            customWritable.setTotal_amount(Double.parseDouble(line_values[18]));
                            customWritable.setPayment_type(Integer.parseInt(line_values[19]));
                            context.write(NullWritable.get(), customWritable);
                        }
                    }
                }

            } catch (NullPointerException ex) {
                ex.getMessage();
            } catch (ParseException ex) {
                Logger.getLogger(Analysis3.class.getName()).log(Level.SEVERE, null, ex);
            }
        }
    }
}

CustomWritable.java

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */
package analysis3;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableUtils;

/**

20 * reshmip public class CustomWritable implements Writable{ private String ride_pickup_date; private String ride_pickup_time; private String ride_dropoff_date; private String ride_dropoff_time; private int ratecodeid; private String pick_longitude; private String pickup_latitude; private String dropoff_longitude; private String dropoff_latitude; private int passengers; private Double trip_distance; private Double fare_amount; private Double exta; private Double mta_tax; private Double tip_amount; private Double total_amount; private int payment_type; public String getride_pickup_date() { return ride_pickup_date; public void setride_pickup_date(string ride_pickup_date) {

21 this.ride_pickup_date = ride_pickup_date; public String getride_pickup_time() { return ride_pickup_time; public void setride_pickup_time(string ride_pickup_time) { this.ride_pickup_time = ride_pickup_time; public String getride_dropoff_date() { return ride_dropoff_date; public void setride_dropoff_date(string ride_dropoff_date) { this.ride_dropoff_date = ride_dropoff_date; public String getride_dropoff_time() { return ride_dropoff_time; public void setride_dropoff_time(string ride_dropoff_time) { this.ride_dropoff_time = ride_dropoff_time;

22 public int getratecodeid() { return ratecodeid; public void setratecodeid(int ratecodeid) { this.ratecodeid = ratecodeid; public String getpick_longitude() { return pick_longitude; public void setpick_longitude(string pick_longitude) { this.pick_longitude = pick_longitude; public String getpickup_latitude() { return pickup_latitude; public void setpickup_latitude(string pickup_latitude) { this.pickup_latitude = pickup_latitude; public String getdropoff_longitude() { return dropoff_longitude;

23 public void setdropoff_longitude(string dropoff_longitude) { this.dropoff_longitude = dropoff_longitude; public String getdropoff_latitude() { return dropoff_latitude; public void setdropoff_latitude(string dropoff_latitude) { this.dropoff_latitude = dropoff_latitude; public int getpassengers() { return passengers; public void setpassengers(int passengers) { this.passengers = passengers; public Double gettrip_distance() { return trip_distance; public void settrip_distance(double trip_distance) { this.trip_distance = trip_distance;

24 public Double getfare_amount() { return fare_amount; public void setfare_amount(double fare_amount) { this.fare_amount = fare_amount; public Double getexta() { return exta; public void setexta(double exta) { this.exta = exta; public Double getmta_tax() { return mta_tax; public void setmta_tax(double mta_tax) { this.mta_tax = mta_tax; public Double gettip_amount() {

25 return tip_amount; public void settip_amount(double tip_amount) { this.tip_amount = tip_amount; public Double gettotal_amount() { return total_amount; public void settotal_amount(double total_amount) { this.total_amount = total_amount; public int getpayment_type() { return payment_type; public void setpayment_type(int payment_type) { this.payment_type = payment_type;

26 @Override public void write(dataoutput d) throws IOException { WritableUtils.writeString(d, ride_pickup_date); WritableUtils.writeString(d,ride_pickup_time); //WritableUtils.writeString(d, ride_dropoff_date); //WritableUtils.writeString(d,ride_dropoff_time); d.writeint(ratecodeid); d.writeint(passengers); d.writeint(payment_type); WritableUtils.writeString(d,pick_longitude); WritableUtils.writeString(d,pickup_latitude); //WritableUtils.writeString(d,dropoff_longitude); //WritableUtils.writeString(d,dropoff_latitude); d.writedouble(trip_distance); d.writedouble(fare_amount); d.writedouble(exta); d.writedouble(mta_tax); d.writedouble(tip_amount); public void readfields(datainput di) throws IOException { ride_pickup_date = WritableUtils.readString(di); ride_pickup_time = WritableUtils.readString(di); //ride_dropoff_date = WritableUtils.readString(di);

27 //ride_dropoff_time = WritableUtils.readString(di); ratecodeid = di.readint(); passengers = di.readint(); payment_type = di.readint(); pick_longitude = WritableUtils.readString(di); pickup_latitude = WritableUtils.readString(di); //dropoff_latitude = WritableUtils.readString(di); //dropoff_longitude = WritableUtils.readString(di); trip_distance = di.readdouble(); fare_amount = di.readdouble(); exta = di.readdouble(); mta_tax = di.readdouble(); tip_amount = di.readdouble(); total_amount = di.readdouble(); public String tostring(){ return (new StringBuilder().append(ride_pickup_date). append("\t").append(ride_pickup_time). //append("\t").append(ride_dropoff_date). //append("\t").append(ride_dropoff_time). append("\t").append(ratecodeid). append("\t").append(pick_longitude). append("\t").append(pickup_latitude). //append("\t").append(dropoff_longitude).

28 //append("\t").append(dropoff_latitude). append("\t").append(passengers). append("\t").append(trip_distance). append("\t").append(fare_amount). append("\t").append(exta). append("\t").append(mta_tax). append("\t").append(tip_amount). append("\t").append(total_amount). append("\t").append(payment_type). tostring()); Analysis 2: Driver class : /* * To change this license header, choose License Headers in Project Properties. * To change this template file, choose Tools Templates * and open the template in the editor. package analysis1; import java.io.ioexception; import org.apache.hadoop.conf.configuration; import org.apache.hadoop.fs.path; import org.apache.hadoop.io.text; import org.apache.hadoop.mapreduce.job;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author reshmip
 */
public class Analysis1 {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // TODO code application logic here
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "summarize trips");
        job.setJarByClass(Analysis1.class);
        job.setMapperClass(Analysis1_Mapper.class);
        job.setMapOutputValueClass(CustomWritable.class);
        job.setCombinerClass(Analysis1_Reducer.class);
        job.setReducerClass(Analysis1_Reducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(CustomWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Mapper:

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */
package analysis1;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * @author reshmip
 */
public class Analysis1_Mapper extends Mapper<Object, Text, Text, CustomWritable> {

    private CustomWritable customWritable = new CustomWritable();

    //private IntWritable ride;
    private Text tripDate = new Text();

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        double distance = 0;
        double fare = 0;
        String[] line_values = line.split(",");
        try {
            if (line_values.length == 21) {
                if (!line_values[1].equals("pickup_date") && !line_values[1].equals("")
                        && line_values[1] != null && !line_values[1].equalsIgnoreCase("na")) {
                    tripDate.set(line_values[1].split(" ")[0]);
                    customWritable.setTrip_distance(Double.parseDouble(line_values[10]));
                    customWritable.setTrip_fare(Double.parseDouble(line_values[11]));
                    customWritable.setMax_tip(Double.parseDouble(line_values[14]));
                    customWritable.setMax_toll(Double.parseDouble(line_values[15]));
                    context.write(tripDate, customWritable);
                }
            }
        } catch (NumberFormatException ex) {
            ex.getMessage();

        } catch (NullPointerException ex) {
            ex.getMessage();
        }
    }
}

Reducer:

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */
package analysis1;

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * @author reshmip
 */
public class Analysis1_Reducer extends Reducer<Text, CustomWritable, Text, CustomWritable> {

    private CustomWritable result = new CustomWritable();

    @Override
    protected void reduce(Text key, Iterable<CustomWritable> values, Context context) throws IOException, InterruptedException {
        double sumTrip = 0;
        double sumFare = 0;
        Double max_tip = 0.0;
        Double max_toll = 0.0;
        result.setMax_tip(0.0);
        result.setMax_toll(0.0);
        result.setTrip_distance(0.0);
        result.setTrip_fare(0.0);
        for (CustomWritable val : values) {
            sumTrip += val.getTrip_distance();
            sumFare += val.getTrip_fare();
            max_tip = val.getMax_tip();
            max_toll = val.getMax_toll();
            if (result.getMax_tip() == null || max_tip.compareTo(result.getMax_tip()) > 0) {
                result.setMax_tip(max_tip);
            }
            if (result.getMax_toll() == null || max_toll.compareTo(result.getMax_toll()) > 0) {
                result.setMax_toll(max_toll);
            }
        }

        result.setTrip_distance(sumTrip);
        result.setTrip_fare(sumFare);
        context.write(key, result);
    }
}

Custom Writable:

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */
package analysis1;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

/**
 *

35 reshmip public class CustomWritable implements Writable{ private Double trip_distance; private Double trip_fare; private Double max_tip; private Double max_toll; public Double gettrip_distance() { return trip_distance; public void settrip_distance(double trip_distance) { this.trip_distance = trip_distance; public Double gettrip_fare() { return trip_fare; public void settrip_fare(double trip_fare) { this.trip_fare = trip_fare; public Double getmax_tip() { return max_tip;

36 public void setmax_tip(double max_tip) { this.max_tip = max_tip; public Double getmax_toll() { return max_toll; public void setmax_toll(double max_toll) { this.max_toll = public void write(dataoutput d) throws IOException { d.writedouble(trip_distance); d.writedouble(trip_fare); d.writedouble(max_tip); public void readfields(datainput di) throws IOException { trip_distance = di.readdouble(); trip_fare = di.readdouble(); max_tip = di.readdouble();

37 max_toll = di.readdouble(); public String tostring(){ return (new StringBuilder().append(trip_distance).append("\t").append(trip_fare).append("\t").append(max_tip).append("\t").append(max_toll).toString()); Analysis 3: Driver class: /* * To change this license header, choose License Headers in Project Properties. * To change this template file, choose Tools Templates * and open the template in the editor. package analysis2; import java.io.ioexception; import java.util.logging.level; import java.util.logging.logger; import org.apache.hadoop.conf.configuration; import org.apache.hadoop.fs.path; import org.apache.hadoop.io.doublewritable; import org.apache.hadoop.io.floatwritable; import org.apache.hadoop.io.intwritable;

38 import org.apache.hadoop.io.nullwritable; import org.apache.hadoop.io.text; import org.apache.hadoop.mapreduce.job; import org.apache.hadoop.mapreduce.mapper; import org.apache.hadoop.mapreduce.reducer; import org.apache.hadoop.mapreduce.lib.input.fileinputformat; import org.apache.hadoop.mapreduce.lib.input.textinputformat; import org.apache.hadoop.mapreduce.lib.output.fileoutputformat; import org.apache.hadoop.mapreduce.lib.output.textoutputformat; /** * reshmip public class Analysis2 { /** args the command line arguments public static void main(string[] args) throws IOException, InterruptedException, ClassNotFoundException { // TODO code application logic here Configuration conf = new Configuration(); Job job = Job.getInstance(conf,"first secondary sorting by date and hours"); job.setjarbyclass(analysis2.class); job.setmapperclass(secondarysortmapper.class);

39 job.setmapoutputkeyclass(compositekeywritable.class); job.setmapoutputvalueclass(customvaluewritable.class); //job.setgroupingcomparatorclass(groupingcomparator.class); //job.setnumreducetasks(0); job.setreducerclass(secondarysortreducer.class); job.setoutputkeyclass(compositekeywritable.class); job.setoutputvalueclass(customvaluewritable.class); job.setinputformatclass(textinputformat.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); boolean complete = job.waitforcompletion(true); Configuration conf2 = new Configuration(); Job job2 = Job.getInstance(conf2, "second secondary sorting on date and rides"); if(complete){ job2.setjarbyclass(analysis2.class); job2.setmapperclass(peakanalysismapper.class); job2.setmapoutputkeyclass(peakanalysiswritable.class); job2.setmapoutputvalueclass(peakanalysisvaluewritable.class); job2.setreducerclass(peakanalysisreducer.class); job2.setoutputkeyclass(peakanalysiswritable.class); job2.setoutputvalueclass(peakanalysisvaluewritable.class);

40 FileInputFormat.addInputPath(job2, new Path (args[1])); FileOutputFormat.setOutputPath(job2, new Path(args[2])); System.exit(job2.waitForCompletion(true)?0:1); Composite Key Writable: /* * To change this license header, choose License Headers in Project Properties. * To change this template file, choose Tools Templates * and open the template in the editor. package analysis2; import java.io.datainput; import java.io.dataoutput; import java.io.ioexception; import org.apache.hadoop.io.writable; import org.apache.hadoop.io.writablecomparable; import org.apache.hadoop.io.writableutils; /**

41 * reshmip public class CompositeKeyWritable implements WritableComparable<CompositeKeyWritable>{ private String ride_date; private String ride_time; public String getride_date() { return ride_date; public void setride_date(string ride_date) { this.ride_date = ride_date; public String getride_time() { return ride_time; public void setride_time(string ride_time) { this.ride_time = public void write(dataoutput d) throws IOException { WritableUtils.writeString(d, ride_date);

42 WritableUtils.writeString(d, public void readfields(datainput di) throws IOException { ride_date = WritableUtils.readString(di); ride_time = WritableUtils.readString(di); public String tostring(){ return (new public int compareto(compositekeywritable o) { int result = ride_date.compareto(o.ride_date); if(result == 0){ result = ride_time.compareto(o.ride_time); return (-1)*result;

43 Composite Value Writable: /* * To change this license header, choose License Headers in Project Properties. * To change this template file, choose Tools Templates * and open the template in the editor. package analysis2; import java.io.datainput; import java.io.dataoutput; import java.io.ioexception; import org.apache.hadoop.io.writable; /** * reshmip public class CustomValueWritable implements Writable{ private Double ride_amount; private int count_rides; public Double getride_amount() { return ride_amount; public void setride_amount(double ride_amount) {

44 this.ride_amount = ride_amount; public int getcount_rides() { return count_rides; public void setcount_rides(int count_rides) { this.count_rides = public void write(dataoutput d) throws IOException { d.writeint(count_rides); public void readfields(datainput di) throws IOException { count_rides = di.readint(); ride_amount = di.readdouble(); public String tostring(){ return (new StringBuilder().append(ride_amount).append("\t").append(count_rides).toString());

45 Grouping Comparator: /* * To change this license header, choose License Headers in Project Properties. * To change this template file, choose Tools Templates * and open the template in the editor. package analysis2; import org.apache.hadoop.io.writablecomparable; import org.apache.hadoop.io.writablecomparator; /** * reshmip public class GroupingComparator extends WritableComparator{ protected GroupingComparator() { super(compositekeywritable.class,true);

46 @Override public int compare(writablecomparable w1, WritableComparable w2){ CompositeKeyWritable cw1 = (CompositeKeyWritable) w1; CompositeKeyWritable cw2 = (CompositeKeyWritable) w2; return cw1.getride_date().compareto(cw2.getride_date()); Secondary Sort Mapper: /* * To change this license header, choose License Headers in Project Properties. * To change this template file, choose Tools Templates * and open the template in the editor. package analysis2; import java.io.ioexception; import org.apache.hadoop.io.doublewritable; import org.apache.hadoop.io.intwritable; import org.apache.hadoop.io.nullwritable; import org.apache.hadoop.io.text; import org.apache.hadoop.mapreduce.mapper; /** * reshmip

47 public class SecondarySortMapper extends Mapper<Object, Text, CompositeKeyWritable,CustomValueWritable>{ private DoubleWritable total_amount = new DoubleWritable(); private CompositeKeyWritable cw = new CompositeKeyWritable(); private CustomValueWritable customval = new CustomValueWritable(); public void map(object key, Text value, Context context){ String values[] = value.tostring().split("\\t"); cw.setride_date(""); cw.setride_time(""); customval.setcount_rides(0); customval.setride_amount(0.0); try{ if(values.length==13){ String date = values[0]; String hours = values[1].split(":")[0]; Double amount = Double.parseDouble(values[12]); //cw = new CompositeKeyWritable(date,hours); cw.setride_date(date); cw.setride_time(hours); //total_amount.set(amount); customval.setcount_rides(1); customval.setride_amount(amount); context.write(cw,customval); catch(ioexception InterruptedException ex){

48 System.out.println("Error Message:" +ex.getmessage()); Secondary Sort Reducer: /* * To change this license header, choose License Headers in Project Properties. * To change this template file, choose Tools Templates * and open the template in the editor. package analysis2; import java.io.ioexception; import org.apache.hadoop.io.doublewritable; import org.apache.hadoop.io.nullwritable; import org.apache.hadoop.io.text; import org.apache.hadoop.mapreduce.reducer; /** * reshmip public class SecondarySortReducer extends Reducer<CompositeKeyWritable,CustomValueWritable,CompositeKeyWritable,CustomValueWr itable>{ //Double totalamt = 0.0; CustomValueWritable customval = new CustomValueWritable();

49 private DoubleWritable total_amount = new protected void reduce(compositekeywritable key, Iterable<CustomValueWritable> values, Context context) throws IOException, InterruptedException { double sumamount = 0; int totalrides = 0; for(customvaluewritable val : values){ sumamount+= val.getride_amount(); totalrides+=val.getcount_rides(); customval.setride_amount(sumamount); customval.setcount_rides(totalrides); context.write(key, customval); Peak Analysis Writable: /* * To change this license header, choose License Headers in Project Properties. * To change this template file, choose Tools Templates * and open the template in the editor.

50 package analysis2; import java.io.datainput; import java.io.dataoutput; import java.io.ioexception; import org.apache.hadoop.io.writable; import org.apache.hadoop.io.writablecomparable; import org.apache.hadoop.io.writableutils; /** * reshmip public class PeakAnalysisWritable implements Writable,WritableComparable<PeakAnalysisWritable>{ private String ride_date; private Integer count_rides; public String getride_date() { return ride_date; public void setride_date(string ride_date) { this.ride_date = ride_date; public Integer getcount_rides() {

51 return count_rides; public void setcount_rides(integer count_rides) { this.count_rides = public void write(dataoutput d) throws IOException { WritableUtils.writeString(d, ride_date); public void readfields(datainput di) throws IOException { ride_date = WritableUtils.readString(di); count_rides = public int compareto(peakanalysiswritable o) { int result = ride_date.compareto(o.ride_date); if(result == 0){

52 result = count_rides.compareto(o.count_rides); return (-1)*result; public String tostring(){ return (new StringBuilder().append(ride_date).append("\t").append(count_rides).toString()); Peak Analysis Value Writable: /* * To change this license header, choose License Headers in Project Properties. * To change this template file, choose Tools Templates * and open the template in the editor. package analysis2; import java.io.datainput; import java.io.dataoutput; import java.io.ioexception; import org.apache.hadoop.io.writable; import org.apache.hadoop.io.writableutils; /**

53 * reshmip public class PeakAnalysisValueWritable implements Writable{ private Double ride_amount; private String ride_time; public Double getride_amount() { return ride_amount; public void setride_amount(double ride_amount) { this.ride_amount = ride_amount; public String getride_time() { return ride_time; public void setride_time(string ride_time) { this.ride_time = public void write(dataoutput d) throws IOException {

54 WritableUtils.writeString(d, ride_time); public void readfields(datainput di) throws IOException { ride_time = WritableUtils.readString(di); ride_amount = di.readdouble(); public String tostring(){ return (new StringBuilder().append(ride_time).append("\t").append(ride_amount).toString()); Peak Analysis Mapper: /* * To change this license header, choose License Headers in Project Properties. * To change this template file, choose Tools Templates * and open the template in the editor. package analysis2; import java.io.ioexception;

55 import org.apache.hadoop.io.text; import org.apache.hadoop.mapreduce.mapper; /** * reshmip public class PeakAnalysisMapper extends Mapper<Object, Text, PeakAnalysisWritable,PeakAnalysisValueWritable>{ private PeakAnalysisWritable cw = new PeakAnalysisWritable(); private PeakAnalysisValueWritable customval = new PeakAnalysisValueWritable(); public void map(object key, Text value, Context context){ String values[] = value.tostring().split("\\t"); cw.setride_date(""); cw.setcount_rides(0); customval.setride_time(""); customval.setride_amount(0.0); try{ if(values.length==4){ String date = values[0]; String hours = values[1]; Double amount = Double.parseDouble(values[2]); int count = Integer.parseInt(values[3]); //cw = new CompositeKeyWritable(date,hours); cw.setride_date(date); cw.setcount_rides(count);

56 //total_amount.set(amount); customval.setride_time(hours); customval.setride_amount(amount); context.write(cw,customval); catch(ioexception InterruptedException ex){ System.out.println("Error Message:" +ex.getmessage()); Peak Analysis Reducer: /* * To change this license header, choose License Headers in Project Properties. * To change this template file, choose Tools Templates * and open the template in the editor. package analysis2; import java.io.ioexception; import org.apache.hadoop.mapreduce.reducer; /** * reshmip

57 public class PeakAnalysisReducer extends Reducer<PeakAnalysisWritable,PeakAnalysisValueWritable,PeakAnalysisWritable,PeakAnalysisV aluewritable>{ //Double totalamt = 0.0; private PeakAnalysisWritable customkey = new PeakAnalysisWritable(); private PeakAnalysisValueWritable customvalue = new protected void reduce(peakanalysiswritable key, Iterable<PeakAnalysisValueWritable> values, Context context) throws IOException, InterruptedException { for(peakanalysisvaluewritable val : values){ context.write(key, val); Analysis 4: Driver class: /* * To change this license header, choose License Headers in Project Properties. * To change this template file, choose Tools Templates * and open the template in the editor. package analysis4; import java.io.ioexception;

58 import org.apache.hadoop.conf.configuration; import org.apache.hadoop.fs.path; import org.apache.hadoop.io.floatwritable; import org.apache.hadoop.io.intwritable; import org.apache.hadoop.io.nullwritable; import org.apache.hadoop.io.text; import org.apache.hadoop.mapreduce.job; import org.apache.hadoop.mapreduce.lib.input.fileinputformat; import org.apache.hadoop.mapreduce.lib.output.fileoutputformat; import org.apache.hadoop.mapreduce.lib.output.multipleoutputs; import org.apache.hadoop.mapreduce.lib.output.textoutputformat; /** * reshmip public class Analysis4 { /** args the command line arguments public static void main(string[] args) throws IOException, InterruptedException, ClassNotFoundException { // TODO code application logic here Configuration conf = new Configuration(); Job job = Job.getInstance(conf, "partitioning pattern");

59 job.setjarbyclass(analysis4.class); job.setmapperclass(analysis4_mapper.class); job.setmapoutputkeyclass(intwritable.class); job.setmapoutputvalueclass(floatwritable.class); // MultipleOutputs.addNamedOutput(job, "bins", TextOutputFormat.class, Text.class, NullWritable.class); // MultipleOutputs.setCountersEnabled(job, true); job.setpartitionerclass(groupbydaypartitioner.class); job.setcombinerclass(analysis4_reducer.class); job.setnumreducetasks(7); //job.setnumreducetasks(0); job.setcombinerclass(analysis4_reducer.class); job.setreducerclass(analysis4_reducer.class); job.setoutputkeyclass(intwritable.class); job.setoutputvalueclass(floatwritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); boolean complete = job.waitforcompletion(true); Configuration conf2 = new Configuration(); Job job2 = Job.getInstance(conf2, "Borough Rides"); if(complete){ job2.setjarbyclass(analysis4.class); job2.setmapperclass(identitiymapper.class);

60 job2.setmapoutputkeyclass(nullwritable.class); job2.setmapoutputvalueclass(text.class); job2.setreducerclass(identityreducer.class); job2.setoutputkeyclass(nullwritable.class); job2.setoutputvalueclass(text.class); FileInputFormat.addInputPath(job2, new Path (args[1])); FileOutputFormat.setOutputPath(job2, new Path(args[2])); System.exit(job2.waitForCompletion(true)?0:1); Custom Writable: /* * To change this license header, choose License Headers in Project Properties. * To change this template file, choose Tools Templates * and open the template in the editor. package analysis4; import java.io.datainput; import java.io.dataoutput; import java.io.ioexception; import org.apache.hadoop.io.writable; import org.apache.hadoop.io.writableutils;

61 /** * reshmip public class CustomWritable implements Writable{ private String ride_date; private Double ride_amount; public String getride_date() { return ride_date; public void setride_date(string ride_date) { this.ride_date = ride_date; public Double getride_amount() { return ride_amount; public void setride_amount(double ride_amount) { this.ride_amount =

62 public void write(dataoutput d) throws IOException { WritableUtils.writeString(d, ride_date); public void readfields(datainput di) throws IOException { ride_date = WritableUtils.readString(di); ride_amount = di.readdouble(); public String tostring(){ return (new StringBuilder().append(ride_date).append("\t").append(ride_amount).toString()); //.append("\t").append(ride_amount).tostring()); Group by Partitioner: /* * To change this license header, choose License Headers in Project Properties. * To change this template file, choose Tools Templates * and open the template in the editor. package analysis4; import org.apache.hadoop.io.floatwritable;

63 import org.apache.hadoop.io.intwritable; import org.apache.hadoop.io.text; import org.apache.hadoop.mapreduce.partitioner; /** * reshmip public class GroupByDayPartitioner extends Partitioner<IntWritable, public int getpartition(intwritable key, FloatWritable value, int i) { return (key.get()%i); Mapper: package analysis4; import java.io.ioexception; import java.text.parseexception; import java.text.simpledateformat; import java.util.calendar; import java.util.logging.level; import java.util.logging.logger; import org.apache.hadoop.io.floatwritable;

64 import org.apache.hadoop.io.intwritable; import org.apache.hadoop.io.nullwritable; import org.apache.hadoop.io.text; import org.apache.hadoop.mapreduce.mapper; import org.apache.hadoop.mapreduce.lib.output.multipleoutputs; /* * To change this license header, choose License Headers in Project Properties. * To change this template file, choose Tools Templates * and open the template in the editor. /** * reshmip public class Analysis4_Mapper extends Mapper<Object, Text, IntWritable, FloatWritable>{ // private MultipleOutputs<Text, NullWritable> mos = null; private final static SimpleDateFormat frmt = new SimpleDateFormat("yyyy-mm-dd"); private CustomWritable tuple = new CustomWritable(); // protected void setup(context context) throws IOException, InterruptedException { // mos = new MultipleOutputs(context); //

65 @Override protected void map(object key, Text value, Context context) throws IOException, InterruptedException { Calendar cal = Calendar.getInstance(); String[] row = value.tostring().split("\\t"); String pickupdate = row[0]; int day=0; float surcharge = 0; try { cal.settime(frmt.parse(pickupdate)); day = cal.get(calendar.day_of_week); tuple.setride_date(pickupdate); tuple.setride_amount(double.parsedouble(row[11])); surcharge = Float.parseFloat(row[11])-Float.parseFloat(row[7]); context.write(new IntWritable(day), new FloatWritable(surcharge)); catch (ParseException ex) { Logger.getLogger(Analysis4_Mapper.class.getName()).log(Level.SEVERE, null, ex); Reducer: /* * To change this license header, choose License Headers in Project Properties. * To change this template file, choose Tools Templates * and open the template in the editor.

66 package analysis4; import java.io.ioexception; import org.apache.hadoop.io.floatwritable; import org.apache.hadoop.io.intwritable; import org.apache.hadoop.io.nullwritable; import org.apache.hadoop.io.text; import org.apache.hadoop.mapreduce.reducer; /** * reshmip public class Analysis4_Reducer extends Reducer<IntWritable, FloatWritable, IntWritable, FloatWritable>{ private CustomWritable result = new CustomWritable(); // protected void reduce(intwritable key, Iterable<FloatWritable> values, Context context) throws IOException, InterruptedException { // float total_amount = 0; // for(floatwritable t : values){ // total_amount += t.get(); // // // amount.set(total_amount);

67 // context.write(key,amount); protected void reduce(intwritable key, Iterable<FloatWritable> values, Context context) throws IOException, InterruptedException { float amount = 0; String date=""; for(floatwritable val : values){ //date = val.getride_date(); amount += val.get(); //result.setride_amount(amount); result.setride_date(date); context.write(key, new FloatWritable(amount)); Identity Mapper: /* * To change this license header, choose License Headers in Project Properties. * To change this template file, choose Tools Templates * and open the template in the editor.

68 package analysis4; import java.io.ioexception; import org.apache.hadoop.io.floatwritable; import org.apache.hadoop.io.intwritable; import org.apache.hadoop.io.nullwritable; import org.apache.hadoop.io.text; import org.apache.hadoop.mapreduce.mapper; /** * reshmip public class IdentitiyMapper extends Mapper<Object, Text, NullWritable,Text>{ private Text outkey = new protected void map(object key, Text value, Context context) throws IOException, InterruptedException { //To change body of generated methods, choose Tools Templates. context.write(nullwritable.get(),value);

69 Identity Reducer: /* * To change this license header, choose License Headers in Project Properties. * To change this template file, choose Tools Templates * and open the template in the editor. package analysis4; import java.io.ioexception; import org.apache.hadoop.io.floatwritable; import org.apache.hadoop.io.intwritable; import org.apache.hadoop.io.nullwritable; import org.apache.hadoop.io.text; import org.apache.hadoop.mapreduce.reducer; /** * reshmip public class IdentityReducer extends protected void reduce(nullwritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException { for(text value : values){ context.write(key,value);

70 Analysis 5: Driver: /* * To change this license header, choose License Headers in Project Properties. * To change this template file, choose Tools Templates * and open the template in the editor. package innerjoin; import java.io.ioexception; import org.apache.hadoop.conf.configuration; import org.apache.hadoop.fs.path; import org.apache.hadoop.io.intwritable; import org.apache.hadoop.io.text; import org.apache.hadoop.mapreduce.job; import org.apache.hadoop.mapreduce.lib.input.fileinputformat; import org.apache.hadoop.mapreduce.lib.input.multipleinputs; import org.apache.hadoop.mapreduce.lib.input.textinputformat; import org.apache.hadoop.mapreduce.lib.output.textoutputformat; /** *

71 reshmip public class InnerJoin { /** args the command line arguments public static void main(string[] args) throws IOException, InterruptedException, ClassNotFoundException { Configuration conf = new Configuration(); Job job = Job.getInstance(conf, "inner_join"); job.setjarbyclass(innerjoin.class); MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, InnerJoin_Mapper1.class); MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, InnerJoin_Mapper2.class); job.setreducerclass(innerjoin_reducer1.class); job.setoutputkeyclass(text.class); job.setoutputvalueclass(text.class); job.setoutputformatclass(textoutputformat.class); TextOutputFormat.setOutputPath(job, new Path(args[2])); boolean complete = job.waitforcompletion(true); Configuration conf2 = new Configuration(); Job job2 = Job.getInstance(conf2, "Most Passangers");

72 if(complete){ job2.setjarbyclass(innerjoin.class); FileInputFormat.addInputPath(job, new Path(args[2])); //MultipleInputs.addInputPath(job2, new Path(args[3]), TextInputFormat.class,JoinMapper4.class); job2.setmapperclass(innerjoin_mapper3.class); job2.setmapoutputkeyclass(text.class); job2.setmapoutputvalueclass(intwritable.class); job2.setreducerclass(innerjoin_reducer2.class); job2.setoutputformatclass(textoutputformat.class); TextOutputFormat.setOutputPath(job2, new Path(args[3])); job2.setoutputkeyclass(text.class); job2.setoutputvalueclass(intwritable.class); System.exit(job2.waitForCompletion(true)? 0 : 1); Mapper 1: /* * To change this license header, choose License Headers in Project Properties.

73 * To change this template file, choose Tools Templates * and open the template in the editor. package innerjoin; import java.io.ioexception; import org.apache.hadoop.io.text; import org.apache.hadoop.mapreduce.mapper; /** * reshmip public class InnerJoin_Mapper1 extends Mapper<Object, Text, Text, Text> { private Text outkey = new Text(); private Text outvalue = new protected void map(object key, Text value, Context context) throws IOException, InterruptedException { String[] separatedinput = value.tostring().split("\\t"); //String id = separatedinput[6]; String pickuploc = separatedinput[6]; if(pickuploc==null pickuploc == "" pickuploc.equalsignorecase("")){ return;

74 outkey.set(pickuploc); outvalue.set("a" + value); context.write(outkey, outvalue); Mapper 2: /* * To change this license header, choose License Headers in Project Properties. * To change this template file, choose Tools Templates * and open the template in the editor. package innerjoin; import java.io.ioexception; import org.apache.hadoop.io.text; import org.apache.hadoop.mapreduce.mapper; /** * reshmip public class InnerJoin_Mapper2 extends Mapper<Object, Text, Text, Text> { private Text outkey = new Text();

75 private Text outvalue = new protected void map(object key, Text value, Context context) throws IOException, InterruptedException { String line = value.tostring(); String[] line_values = line.split(","); if(line_values.length==5){ String latitude = value.tostring().split(",")[4].trim(); if(latitude==null latitude =="" latitude.equalsignorecase("")){ return; outkey.set(latitude); outvalue.set("b" + value); context.write(outkey, outvalue); Mapper 3: /* * To change this license header, choose License Headers in Project Properties. * To change this template file, choose Tools Templates * and open the template in the editor.

76 package innerjoin; import java.io.ioexception; import org.apache.hadoop.io.intwritable; import org.apache.hadoop.io.text; import org.apache.hadoop.mapreduce.mapper; /** * reshmip public class InnerJoin_Mapper3 extends Mapper<Object, Text, Text, IntWritable> { private Text outkey = new Text(); private IntWritable outvalue = new protected void map(object key, Text value, Context context) throws IOException, InterruptedException { String lines = value.tostring(); String[] line = lines.split("\\t"); if(line.length==18){ String area = line[17]; if(!area.equalsignorecase("") area!=null!area.equalsignorecase("null"))

77 { String[] locations = area.split(","); if(locations.length>1){ String borough = locations[0]; outkey.set(borough); outvalue.set(integer.parseint(locations[5])); context.write(outkey,outvalue); Reducer 1: /* * To change this license header, choose License Headers in Project Properties. * To change this template file, choose Tools Templates * and open the template in the editor. package innerjoin; import java.io.ioexception; import java.util.arraylist; import org.apache.hadoop.io.text;

78 import org.apache.hadoop.mapreduce.reducer; /** * reshmip public class InnerJoin_Reducer1 extends Reducer<Text, Text, Text, Text> { public static final Text EMPTY_TEXT = new Text(); private Text tmp = new Text(); private ArrayList<Text> lista = new ArrayList<Text>(); private ArrayList<Text> listb = new ArrayList<Text>(); private String jointype = null; protected void reduce(text key, Iterable<Text> values, Context context) throws IOException, InterruptedException { lista.clear(); listb.clear(); while (values.iterator().hasnext()) { tmp = values.iterator().next(); if (tmp.charat(0) == 'A') { lista.add(new Text(tmp.toString().substring(1))); else if (tmp.charat(0) == 'B') { listb.add(new Text(tmp.toString().substring(1)));

79 executejoinlogic(context); private void executejoinlogic(context context) throws IOException, InterruptedException { if (!lista.isempty() &&!listb.isempty()) { for (Text A : lista) { for (Text B : listb) { context.write(a, B); Reducer 2: /* * To change this license header, choose License Headers in Project Properties. * To change this template file, choose Tools Templates * and open the template in the editor. package innerjoin;


Ghislain Fourny. Big Data 6. Massive Parallel Processing (MapReduce) Ghislain Fourny Big Data 6. Massive Parallel Processing (MapReduce) So far, we have... Storage as file system (HDFS) 13 So far, we have... Storage as tables (HBase) Storage as file system (HDFS) 14 Data

More information

Experiences with a new Hadoop cluster: deployment, teaching and research. Andre Barczak February 2018

Experiences with a new Hadoop cluster: deployment, teaching and research. Andre Barczak February 2018 Experiences with a new Hadoop cluster: deployment, teaching and research Andre Barczak February 2018 abstract In 2017 the Machine Learning research group got funding for a new Hadoop cluster. However,

More information

Map Reduce. MCSN - N. Tonellotto - Distributed Enabling Platforms

Map Reduce. MCSN - N. Tonellotto - Distributed Enabling Platforms Map Reduce 1 MapReduce inside Google Googlers' hammer for 80% of our data crunching Large-scale web search indexing Clustering problems for Google News Produce reports for popular queries, e.g. Google

More information

Ghislain Fourny. Big Data Fall Massive Parallel Processing (MapReduce)

Ghislain Fourny. Big Data Fall Massive Parallel Processing (MapReduce) Ghislain Fourny Big Data Fall 2018 6. Massive Parallel Processing (MapReduce) Let's begin with a field experiment 2 400+ Pokemons, 10 different 3 How many of each??????????? 4 400 distributed to many volunteers

More information

Map-Reduce Applications: Counting, Graph Shortest Paths

Map-Reduce Applications: Counting, Graph Shortest Paths Map-Reduce Applications: Counting, Graph Shortest Paths Adapted from UMD Jimmy Lin s slides, which is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/

More information

Parallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014

Parallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014 Parallel Data Processing with Hadoop/MapReduce CS140 Tao Yang, 2014 Overview What is MapReduce? Example with word counting Parallel data processing with MapReduce Hadoop file system More application example

More information

MAPREDUCE - PARTITIONER

MAPREDUCE - PARTITIONER MAPREDUCE - PARTITIONER http://www.tutorialspoint.com/map_reduce/map_reduce_partitioner.htm Copyright tutorialspoint.com A partitioner works like a condition in processing an input dataset. The partition

More information

Map-Reduce Applications: Counting, Graph Shortest Paths

Map-Reduce Applications: Counting, Graph Shortest Paths Map-Reduce Applications: Counting, Graph Shortest Paths Adapted from UMD Jimmy Lin s slides, which is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/

More information

Department of Information Technology Software Laboratory-V Assignment No: 1 Title of the Assignment:

Department of Information Technology Software Laboratory-V Assignment No: 1 Title of the Assignment: Department of Information Technology Software Laboratory-V --------------------------------------------------------------------------------------------------------------------- Assignment No: 1 ---------------------------------------------------------------------------------------------------------------------

More information

QUERY OPTIMIZATION IN BIG DATA USING HADOOP, HIVE AND NEO4J

QUERY OPTIMIZATION IN BIG DATA USING HADOOP, HIVE AND NEO4J QUERY OPTIMIZATION IN BIG DATA USING HADOOP, HIVE AND NEO4J SUMMER INTERNSHIP PROJECT REPORT Submitted by M. ARUN(2016103010) S. BEN STEWART(2016103513) P. SANJAY(2016103580) COLLEGE OF ENGINEERING, GUINDY

More information

PIGFARM - LAS Sponsored Computer Science Senior Design Class Project Spring Carson Cumbee - LAS

PIGFARM - LAS Sponsored Computer Science Senior Design Class Project Spring Carson Cumbee - LAS PIGFARM - LAS Sponsored Computer Science Senior Design Class Project Spring 2017 Carson Cumbee - LAS What is Big Data? Big Data is data that is too large to fit into a single server. It necessitates the

More information

MRUnit testing framework is based on JUnit and it can test Map Reduce programs written on 0.20, 0.23.x, 1.0.x, 2.x version of Hadoop.

MRUnit testing framework is based on JUnit and it can test Map Reduce programs written on 0.20, 0.23.x, 1.0.x, 2.x version of Hadoop. MRUnit Tutorial Setup development environment 1. Download the latest version of MRUnit jar from Apache website: https://repository.apache.org/content/repositories/releases/org/apache/ mrunit/mrunit/. For

More information

An Introduction to Apache Spark

An Introduction to Apache Spark An Introduction to Apache Spark Amir H. Payberah amir@sics.se SICS Swedish ICT Amir H. Payberah (SICS) Apache Spark Feb. 2, 2016 1 / 67 Big Data small data big data Amir H. Payberah (SICS) Apache Spark

More information

Processing Distributed Data Using MapReduce, Part I

Processing Distributed Data Using MapReduce, Part I Processing Distributed Data Using MapReduce, Part I Computer Science E-66 Harvard University David G. Sullivan, Ph.D. MapReduce A framework for computation on large data sets that are fragmented and replicated

More information

Recommended Literature

Recommended Literature COSC 6339 Big Data Analytics Introduction to Map Reduce (I) Edgar Gabriel Fall 2018 Recommended Literature Original MapReduce paper by google http://research.google.com/archive/mapreduce-osdi04.pdf Fantastic

More information

CS435 Introduction to Big Data Spring 2018 Colorado State University. 2/12/2018 Week 5-A Sangmi Lee Pallickara

CS435 Introduction to Big Data Spring 2018 Colorado State University. 2/12/2018 Week 5-A Sangmi Lee Pallickara W5.A.0.0 CS435 Introduction to Big Data W5.A.1 FAQs PA1 has been posted Feb. 21, 5:00PM via Canvas Individual submission (No team submission) Source code of examples in lectures: https://github.com/adamjshook/mapreducepatterns

More information

MapReduce. Arend Hintze

MapReduce. Arend Hintze MapReduce Arend Hintze Distributed Word Count Example Input data files cat * key-value pairs (0, This is a cat!) (14, cat is ok) (24, walk the dog) Mapper map() function key-value pairs (this, 1) (is,

More information

CSE6331: Cloud Computing

CSE6331: Cloud Computing CSE6331: Cloud Computing Leonidas Fegaras University of Texas at Arlington c 2017 by Leonidas Fegaras Map-Reduce Fundamentals Based on: J. Simeon: Introduction to MapReduce P. Michiardi: Tutorial on MapReduce

More information

// Create a configuration object and set the name of the application SparkConf conf=new SparkConf().setAppName("Spark Exam 2 - Exercise

// Create a configuration object and set the name of the application SparkConf conf=new SparkConf().setAppName(Spark Exam 2 - Exercise import org.apache.spark.api.java.*; import org.apache.spark.sparkconf; public class SparkDriver { public static void main(string[] args) { String inputpathpm10readings; String outputpathmonthlystatistics;

More information

Java & Inheritance. Inheritance - Scenario

Java & Inheritance. Inheritance - Scenario Java & Inheritance ITNPBD7 Cluster Computing David Cairns Inheritance - Scenario Inheritance is a core feature of Object Oriented languages. A class hierarchy can be defined where the class at the top

More information

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [MAPREDUCE & HADOOP] Does Shrideep write the poems on these title slides? Yes, he does. These musing are resolutely on track For obscurity shores, from whence

More information

Implementing Algorithmic Skeletons over Hadoop

Implementing Algorithmic Skeletons over Hadoop Implementing Algorithmic Skeletons over Hadoop Dimitrios Mouzopoulos E H U N I V E R S I T Y T O H F R G E D I N B U Master of Science Computer Science School of Informatics University of Edinburgh 2011

More information

Big Data Analysis using Hadoop Lecture 3

Big Data Analysis using Hadoop Lecture 3 Big Data Analysis using Hadoop Lecture 3 Last Week - Recap Driver Class Mapper Class Reducer Class Create our first MR process Ran on Hadoop Monitored on webpages Checked outputs using HDFS command line

More information

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April, 2017 Sham Kakade 2017 1 Document Retrieval n Goal: Retrieve

More information

Cloud Programming on Java EE Platforms. mgr inż. Piotr Nowak

Cloud Programming on Java EE Platforms. mgr inż. Piotr Nowak Cloud Programming on Java EE Platforms mgr inż. Piotr Nowak dsh distributed shell commands execution -c concurrent --show-machine-names -M --group cluster -g cluster /etc/dsh/groups/cluster needs passwordless

More information

Clustering Documents. Case Study 2: Document Retrieval

Clustering Documents. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 21 th, 2015 Sham Kakade 2016 1 Document Retrieval Goal: Retrieve

More information

Using Big Data for the analysis of historic context information

Using Big Data for the analysis of historic context information 0 Using Big Data for the analysis of historic context information Francisco Romero Bueno Technological Specialist. FIWARE data engineer francisco.romerobueno@telefonica.com Big Data: What is it and how

More information

Big Data Analytics CP3620

Big Data Analytics CP3620 Big Data Analytics CP3620 Big Data Some facts: 2.7 Zettabytes (2.7 billion TB) of data exists in the digital universe and it s growing. Facebook stores, accesses, and analyzes 30+ Petabytes (1000 TB) of

More information

IntWritable w1 = new IntWritable(163); IntWritable w2 = new IntWritable(67); assertthat(comparator.compare(w1, w2), greaterthan(0));

IntWritable w1 = new IntWritable(163); IntWritable w2 = new IntWritable(67); assertthat(comparator.compare(w1, w2), greaterthan(0)); factory for RawComparator instances (that Writable implementations have registered). For example, to obtain a comparator for IntWritable, we just use: RawComparator comparator = WritableComparator.get(IntWritable.class);

More information

EE657 Spring 2012 HW#4 Zhou Zhao

EE657 Spring 2012 HW#4 Zhou Zhao EE657 Spring 2012 HW#4 Zhou Zhao Problem 6.3 Solution Referencing the sample application of SimpleDB in Amazon Java SDK, a simple domain which includes 5 items is prepared in the code. For instance, the

More information

Big Data Analytics: Insights and Innovations

Big Data Analytics: Insights and Innovations International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 6, Issue 10 (April 2013), PP. 60-65 Big Data Analytics: Insights and Innovations

More information

LAMPIRAN. public static void runaniteration (String datafile, String clusterfile) {

LAMPIRAN. public static void runaniteration (String datafile, String clusterfile) { DAFTAR PUSTAKA [1] Mishra Shweta, Badhe Vivek. (2016), Improved Map Reduce K Means Clustering Algorithm for Hadoop Architectur, International Journal Of Engineering and Computer Science, 2016, IJECS. [2]

More information

Cloud Computing. Up until now

Cloud Computing. Up until now Cloud Computing Lecture 9 Map Reduce 2010-2011 Introduction Up until now Definition of Cloud Computing Grid Computing Content Distribution Networks Cycle-Sharing Distributed Scheduling 1 Outline Map Reduce:

More information

Hortonworks HDPCD. Hortonworks Data Platform Certified Developer. Download Full Version :

Hortonworks HDPCD. Hortonworks Data Platform Certified Developer. Download Full Version : Hortonworks HDPCD Hortonworks Data Platform Certified Developer Download Full Version : https://killexams.com/pass4sure/exam-detail/hdpcd QUESTION: 97 You write MapReduce job to process 100 files in HDFS.

More information

MapReduce-style data processing

MapReduce-style data processing MapReduce-style data processing Software Languages Team University of Koblenz-Landau Ralf Lämmel and Andrei Varanovich Related meanings of MapReduce Functional programming with map & reduce An algorithmic

More information

Hadoop 2.8 Configuration and First Examples

Hadoop 2.8 Configuration and First Examples Hadoop 2.8 Configuration and First Examples Big Data - 29/03/2017 Apache Hadoop & YARN Apache Hadoop (1.X) De facto Big Data open source platform Running for about 5 years in production at hundreds of

More information

Table of Contents. Chapter Topics Page No. 1 Meet Hadoop

Table of Contents. Chapter Topics Page No. 1 Meet Hadoop Table of Contents Chapter Topics Page No 1 Meet Hadoop - - - - - - - - - - - - - - - - - - - - - - - - - - - 3 2 MapReduce - - - - - - - - - - - - - - - - - - - - - - - - - - - - 10 3 The Hadoop Distributed

More information

Computer Science 572 Exam Prof. Horowitz Tuesday, April 24, 2017, 8:00am 9:00am

Computer Science 572 Exam Prof. Horowitz Tuesday, April 24, 2017, 8:00am 9:00am Computer Science 572 Exam Prof. Horowitz Tuesday, April 24, 2017, 8:00am 9:00am Name: Student Id Number: 1. This is a closed book exam. 2. Please answer all questions. 3. There are a total of 40 questions.

More information

Big Data Analytics. 4. Map Reduce I. Lars Schmidt-Thieme

Big Data Analytics. 4. Map Reduce I. Lars Schmidt-Thieme Big Data Analytics 4. Map Reduce I Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany original slides by Lucas Rego

More information

Introduction to HDFS and MapReduce

Introduction to HDFS and MapReduce Introduction to HDFS and MapReduce Who Am I - Ryan Tabora - Data Developer at Think Big Analytics - Big Data Consulting - Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc. 2 Who Am I -

More information

Big Data and Scripting map reduce in Hadoop

Big Data and Scripting map reduce in Hadoop Big Data and Scripting map reduce in Hadoop 1, 2, connecting to last session set up a local map reduce distribution enable execution of map reduce implementations using local file system only all tasks

More information

A Guide to Running Map Reduce Jobs in Java University of Stirling, Computing Science

A Guide to Running Map Reduce Jobs in Java University of Stirling, Computing Science A Guide to Running Map Reduce Jobs in Java University of Stirling, Computing Science Introduction The Hadoop cluster in Computing Science at Stirling allows users with a valid user account to submit and

More information

COSC 6397 Big Data Analytics. Data Formats (III) HBase: Java API, HBase in MapReduce and HBase Bulk Loading. Edgar Gabriel Spring 2014.

COSC 6397 Big Data Analytics. Data Formats (III) HBase: Java API, HBase in MapReduce and HBase Bulk Loading. Edgar Gabriel Spring 2014. COSC 6397 Big Data Analytics Data Formats (III) HBase: Java API, HBase in MapReduce and HBase Bulk Loading Edgar Gabriel Spring 2014 Recap on HBase Column-Oriented data store NoSQL DB Data is stored in

More information

An efficient map-reduce algorithm for spatio-temporal analysis using Spark (GIS Cup)

An efficient map-reduce algorithm for spatio-temporal analysis using Spark (GIS Cup) Rensselaer Polytechnic Institute Universidade Federal de Viçosa An efficient map-reduce algorithm for spatio-temporal analysis using Spark (GIS Cup) Prof. Dr. W Randolph Franklin, RPI Salles Viana Gomes

More information

Data-Intensive Computing with MapReduce

Data-Intensive Computing with MapReduce Data-Intensive Computing with MapReduce Session 2: Hadoop Nuts and Bolts Jimmy Lin University of Maryland Thursday, January 31, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing ECE5610/CSC6220 Introduction to Parallel and Distribution Computing Lecture 6: MapReduce in Parallel Computing 1 MapReduce: Simplified Data Processing Motivation Large-Scale Data Processing on Large Clusters

More information

Processing big data with modern applications: Hadoop as DWH backend at Pro7. Dr. Kathrin Spreyer Big data engineer

Processing big data with modern applications: Hadoop as DWH backend at Pro7. Dr. Kathrin Spreyer Big data engineer Processing big data with modern applications: Hadoop as DWH backend at Pro7 Dr. Kathrin Spreyer Big data engineer GridKa School Karlsruhe, 02.09.2014 Outline 1. Relational DWH 2. Data integration with

More information

Hadoop 3 Configuration and First Examples

Hadoop 3 Configuration and First Examples Hadoop 3 Configuration and First Examples Big Data - 26/03/2018 Apache Hadoop & YARN Apache Hadoop (1.X) De facto Big Data open source platform Running for about 5 years in production at hundreds of companies

More information

Big Data con MATLAB. Lucas García The MathWorks, Inc. 1

Big Data con MATLAB. Lucas García The MathWorks, Inc. 1 Big Data con MATLAB Lucas García 2015 The MathWorks, Inc. 1 Agenda Introduction Remote Arrays in MATLAB Tall Arrays for Big Data Scaling up Summary 2 Architecture of an analytics system Data from instruments

More information

About this exam review

About this exam review Final Exam Review About this exam review I ve prepared an outline of the material covered in class May not be totally complete! Exam may ask about things that were covered in class but not in this review

More information

Innovatus Technologies

Innovatus Technologies HADOOP 2.X BIGDATA ANALYTICS 1. Java Overview of Java Classes and Objects Garbage Collection and Modifiers Inheritance, Aggregation, Polymorphism Command line argument Abstract class and Interfaces String

More information

Vendor: Cloudera. Exam Code: CCD-410. Exam Name: Cloudera Certified Developer for Apache Hadoop. Version: Demo

Vendor: Cloudera. Exam Code: CCD-410. Exam Name: Cloudera Certified Developer for Apache Hadoop. Version: Demo Vendor: Cloudera Exam Code: CCD-410 Exam Name: Cloudera Certified Developer for Apache Hadoop Version: Demo QUESTION 1 When is the earliest point at which the reduce method of a given Reducer can be called?

More information

TI2736-B Big Data Processing. Claudia Hauff

TI2736-B Big Data Processing. Claudia Hauff TI2736-B Big Data Processing Claudia Hauff ti2736b-ewi@tudelft.nl Intro Streams Streams Map Reduce HDFS Pig Pig Design Pattern Hadoop Mix Graphs Giraph Spark Zoo Keeper Spark But first Partitioner & Combiner

More information

CSC 1315! Data Science

CSC 1315! Data Science CSC 1315! Data Science Data Visualization Based on: Python for Data Analysis: http://hamelg.blogspot.com/2015/ Learning IPython for Interactive Computation and Visualization by C. Rossant Plotting with

More information

Map-Reduce in Various Programming Languages

Map-Reduce in Various Programming Languages Map-Reduce in Various Programming Languages 1 Context of Map-Reduce Computing The use of LISP's map and reduce functions to solve computational problems probably dates from the 1960s -- very early in the

More information

Today s topics. FAQs. Modify the way data is loaded on disk. Methods of the InputFormat abstract. Input and Output Patterns --Continued

Today s topics. FAQs. Modify the way data is loaded on disk. Methods of the InputFormat abstract. Input and Output Patterns --Continued Spring 2017 3/29/2017 W11.B.1 CS435 BIG DATA Today s topics FAQs /Output Pattern Recommendation systems Collaborative Filtering Item-to-Item Collaborative filtering PART 2. DATA ANALYTICS WITH VOLUMINOUS

More information

Outline. What is Big Data? Hadoop HDFS MapReduce Twitter Analytics and Hadoop

Outline. What is Big Data? Hadoop HDFS MapReduce Twitter Analytics and Hadoop Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School September 2012 This work is licensed

More information

1/30/2019 Week 2- B Sangmi Lee Pallickara

1/30/2019 Week 2- B Sangmi Lee Pallickara Week 2-A-0 1/30/2019 Colorado State University, Spring 2019 Week 2-A-1 CS535 BIG DATA FAQs PART A. BIG DATA TECHNOLOGY 3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING Term project deliverable

More information

Attacking & Protecting Big Data Environments

Attacking & Protecting Big Data Environments Attacking & Protecting Big Data Environments Birk Kauer & Matthias Luft {bkauer, mluft}@ernw.de #WhoAreWe Birk Kauer - Security Researcher @ERNW - Mainly Exploit Developer Matthias Luft - Security Researcher

More information

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox April 16 th, 2015 Emily Fox 2015 1 Document Retrieval n Goal: Retrieve

More information

IT 313 Advanced Application Development Midterm Exam

IT 313 Advanced Application Development Midterm Exam Page 1 of 9 February 12, 2019 IT 313 Advanced Application Development Midterm Exam Name Part A. Multiple Choice Questions. Circle the letter of the correct answer for each question. Optional: supply a

More information

High-Performance Analytics on Large- Scale GPS Taxi Trip Records in NYC

High-Performance Analytics on Large- Scale GPS Taxi Trip Records in NYC High-Performance Analytics on Large- Scale GPS Taxi Trip Records in NYC Jianting Zhang Department of Computer Science The City College of New York Outline Background and Motivation Parallel Taxi data management

More information

Benchmarking Distributed Stream Processing Platforms for IoT Applications

Benchmarking Distributed Stream Processing Platforms for IoT Applications DISTRIBUTED RESEARCH ON EMERGING APPLICATIONS & MACHINES dream-lab.in Indian Institute of Science, Bangalore DREAM:Lab Benchmarking Distributed Stream Processing Platforms for IoT Applications Anshu Shukla

More information

Hadoop Cluster Implementation

Hadoop Cluster Implementation Hadoop Cluster Implementation By Aysha Binta Sayed ID:2013-1-60-068 Supervised By Dr. Md. Shamim Akhter Assistant Professor Department of Computer Science and Engineering East West University A project

More information

Studying software design patterns is an effective way to learn from the experience of others

Studying software design patterns is an effective way to learn from the experience of others Studying software design patterns is an effective way to learn from the experience of others Design Pattern allows the requester of a particular action to be decoupled from the object that performs the

More information

Big Data landscape Lecture #2

Big Data landscape Lecture #2 Big Data landscape Lecture #2 Contents 1 1 CORE Technologies 2 3 MapReduce YARN 4 SparK 5 Cassandra Contents 2 16 HBase 72 83 Accumulo memcached 94 Blur 10 5 Sqoop/Flume Contents 3 111 MongoDB 12 2 13

More information