COSC 6339 Big Data Analytics
NoSQL (III): HBase in Hadoop MapReduce, 3rd Homework Assignment
Edgar Gabriel, Spring 2017

Recap on HBase
- Column-oriented data store, NoSQL DB
- Data is stored in tables
- Tables contain rows
- Rows are made of columns, which are grouped into column families
- Data is stored in cells
  - identified by row, column family, and column
- Cell values are versioned
  - Value = Table + RowKey + Family + Column + Timestamp
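Since cell values are versioned, a single coordinate can hold several timestamped values. Below is a minimal sketch (not from the original slides) of reading multiple versions through the same old-style HTable client API used in the examples that follow; the table "HBaseSamples" and family "test" are taken from the put example below, while the column choice and the assumption that the cell was written more than once are hypothetical.

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionsExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable htable = new HTable(conf, "HBaseSamples");

        // By default a Get returns only the newest version;
        // ask for up to 3 versions of the test:col1 cell.
        Get get = new Get(Bytes.toBytes("row1"));
        get.addColumn(Bytes.toBytes("test"), Bytes.toBytes("col1"));
        get.setMaxVersions(3);

        Result result = htable.get(get);
        List<KeyValue> kvs = result.list(); // null if the row does not exist
        if (kvs != null) {
            // Each KeyValue carries the full coordinate:
            // table + row key + family + column + timestamp
            for (KeyValue kv : kvs) {
                System.out.println(kv.getTimestamp() + " -> "
                        + Bytes.toString(kv.getValue()));
            }
        }
        htable.close();
    }
}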
Recap on HBase
- Internally, a table is made of regions
- Region: a range of rows stored together
- Region Server: serves one or more regions; a region is served by only one Region Server
- Master Server: daemon responsible for managing the HBase cluster, i.e., the Region Servers

Java API example: put

import static org.apache.hadoop.hbase.util.Bytes.*;

public class PutExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable htable = new HTable(conf, "HBaseSamples");

        Put put1 = new Put(toBytes("row1"));
        //       column family    column           value
        put1.add(toBytes("test"), toBytes("col1"), toBytes("val1"));
        put1.add(toBytes("test"), toBytes("col2"), toBytes("val2"));
        htable.put(put1);
        htable.close();
    }
}

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Java API example: get

public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable htable = new HTable(conf, "HBaseSamples");

    // Get the entire row
    Get get = new Get(toBytes("row1"));
    Result result = htable.get(get);
    print(result);

    // Select a single column
    get.addColumn(toBytes("test"), toBytes("col2"));
    result = htable.get(get);
    print(result);

    htable.close();
}

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/

Create and Initialize Scan
Construction options:
- new Scan() - will scan through the entire table
- new Scan(startRow) - begin the scan at the provided row, scan to the end of the table
- new Scan(startRow, stopRow) - begin the scan at the provided startRow, stop the scan when a row id is equal to or greater than the provided stopRow
- new Scan(startRow, filter) - begin the scan at the provided row, scan to the end of the table, apply the provided filter

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Create and Initialize Scan
Once the Scan is constructed, you can further narrow it down (very similar to Get):
- scan.addFamily(family)
- scan.addColumn(family, column)
- scan.setTimeRange(minStamp, maxStamp)
- scan.setMaxVersions(maxVersions)
- scan.setFilter(filter)

For example:

    Scan scan = new Scan(toBytes(startRow), toBytes(stopRow));
    scan.addColumn(toBytes("metrics"), toBytes("counter"));
    scan.addFamily(toBytes("info"));
    ResultScanner scanner = htable.getScanner(scan);
    for (Result result : scanner) {
        // do stuff with result
    }
    scanner.close();

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/

Using HBase in a MapReduce job
TableInputFormat
- Converts data in an HTable to a format consumable by MapReduce
- Split: the rows in one HBase region (a provided Scan may narrow down the result)
- Record: a row; the returned columns are controlled by the provided Scan
- Key: ImmutableBytesWritable
- Value: Result (HBase class)
TableOutputFormat
- Saves data into an HTable
- The reducer output key is ignored
- The reducer output value must be an HBase Put or Delete object
Using HBase in a MapReduce job
- The mapper class needs to extend TableMapper
- The reducer class needs to extend TableReducer

static class Mapper extends TableMapper<ImmutableBytesWritable, DoubleWritable> {

    public void map(ImmutableBytesWritable row, Result values, Context context)
            throws IOException, InterruptedException {
        // family and qualifier arguments were elided on the original slide
        byte[] results = values.getValue( ... );
        // keyName is a placeholder from the slide
        ImmutableBytesWritable userKey = new ImmutableBytesWritable(keyName);
        context.write(userKey, new DoubleWritable(Bytes.toDouble(results)));
    }
}
public static class Reducer extends
        TableReducer<ImmutableBytesWritable, DoubleWritable, ImmutableBytesWritable> {

    public void reduce(ImmutableBytesWritable key, Iterable<DoubleWritable> values,
            Context context) throws IOException, InterruptedException {
        // sum and count were not shown on the slide; compute the average
        double sum = 0;
        long count = 0;
        for (DoubleWritable value : values) {
            sum += value.get();
            count++;
        }
        Put put = new Put(key.get());
        put.add(Bytes.toBytes("data"), Bytes.toBytes("average"),
                Bytes.toBytes(sum / count));
        context.write(key, put);
    }
}

public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "AverageGINByCountryCalculator");
    job.setJarByClass(AverageGINByCountryCalculator.class);

    Scan scan = new Scan();
    scan.addFamily("bycountry".getBytes());
    scan.setCaching(500);       // 1 is the default in Scan, which
                                // will be bad for MapReduce jobs
    scan.setCacheBlocks(false); // don't set to true for MR jobs

    TableMapReduceUtil.initTableMapperJob(
        "HDI",                        // input table
        scan,                         // scan instance
        Mapper.class,                 // mapper class
        ImmutableBytesWritable.class, // mapper output key
        DoubleWritable.class,         // mapper output value
        job);
    TableMapReduceUtil.initTableReducerJob(
        "HDIResult",   // output table
        Reducer.class, // reducer class
        job);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

Using HBase in MapReduce jobs (III)

public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "AirPollution");
    job.setJarByClass(AirPollution.class);

    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("location"));
    scan.addColumn(Bytes.toBytes("data"), Bytes.toBytes("value"));

    // keep only rows whose location:region column equals "12"
    FilterList li = new FilterList(FilterList.Operator.MUST_PASS_ALL);
    SingleColumnValueFilter filter = new SingleColumnValueFilter(
        Bytes.toBytes("location"), Bytes.toBytes("region"),
        CompareOp.EQUAL, Bytes.toBytes("12"));
    li.addFilter(filter);
    scan.setFilter(li);
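    // The slide ends after installing the filter. What follows is a minimal
    // sketch (not on the original slide) of how the filtered scan could be
    // wired into the job, by analogy with the previous example; the table
    // names "AirQuality" and "AirPollutionResult" and the Mapper/Reducer
    // classes are hypothetical stand-ins.
    TableMapReduceUtil.initTableMapperJob(
        "AirQuality",                 // input table (hypothetical)
        scan,                         // the filtered scan from above
        Mapper.class,                 // mapper class
        ImmutableBytesWritable.class, // mapper output key
        DoubleWritable.class,         // mapper output value
        job);
    TableMapReduceUtil.initTableReducerJob(
        "AirPollutionResult",         // output table (hypothetical)
        Reducer.class,
        job);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}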
3rd Homework Rules
Each student should deliver:
- Source code (.py files), compressed to a zip or tar.gz file
  - source code has to use Python 2.7 and Spark 2.0.2
- Documentation (.pdf, .doc, or .txt file)
  - explanations of the code
  - answers to questions
- Deliver electronically on Blackboard
- Expected by Wednesday, May 3rd, 11.59pm - absolutely no extensions possible!
- In case of questions: ask, ask, ask!

Given: a data set containing all flights in the US between 2004 and 2008
- ~7 million flights per year, ~3.5 GB of data
- each line in the input file is one flight, with information as listed on the next pages
- Directory in HDFS: /cosc6339_s17/flightdata-full/
- a small file for code development with ~17,000 flights is available in HDFS as well: /cosc6339_s17/flightdata-short/
Part 1: Develop a PySpark code which converts the csv file into a) a parquet file, b) a sequence file, and c) a json file. Compare the size of the generated files to the original input file.

Part 2: Develop a separate Spark code for each of the four input formats (csv, parquet, sequence file, json) which determines the percentage of delayed flights per origin airport. Compare the execution for each input format with the large dataset for 5, 10, and 15 executors.

Description of the input file
- Comma-separated list of data; the elements are explained on the next page
- more information available at http://stat-computing.org/dataexpo/2009/the-data.html

2008,1,3,4,NA,905,NA,1025,WN,469,,NA,80,NA,NA,NA,LAX,SFO,337,NA,NA,1,A,0,NA,NA,NA,NA,NA
2008,1,3,4,1417,1345,1717,1645,WN,2524,N458WN,120,120,105,32,32,MDW,MHT,838,4,11,0,,0,28,0,0,0,4
2008,1,3,4,852,855,959,1015,WN,3602,N737JW,67,80,57,-16,-3,ONT,SMF,389,4,6,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,1726,1725,1932,1940,WN,563,N285WN,306,315,291,-8,1,RDU,LAS,2027,5,10,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,2014,1935,2129,2045,WN,1662,N461WN,75,70,47,44,39,SLC,BOI,291,3,25,0,,0,0,0,6,0,38
2008,1,4,5,1617,1610,1813,1810,WN,2374,N344SW,56,60,46,3,7,ABQ,MAF,332,3,7,0,,0,NA,NA,NA,NA,NA
2008,1,4,5,839,820,1019,1010,WN,535,N761RR,100,110,82,9,19,BWI,IND,515,5,13,0,,0,NA,NA,NA,NA,NA
2008,1,4,5,814,810,930,930,WN,502,N641SW,76,80,62,0,4,ELP,PHX,347,3,11,0,,0,NA,NA,NA,NA,NA

Some values can be numeric or NA, and some values are missing entirely (i.e., there are two consecutive commas). For example, in the second sample line DepTime is 1417 and CRSDepTime is 1345, giving a DepDelay of 32 minutes.
Variable descriptions

Name                Description
Year                1987-2008
Month               1-12
DayofMonth          1-31
DayOfWeek           1 (Monday) - 7 (Sunday)
DepTime             actual departure time (local, hhmm)
CRSDepTime          scheduled departure time (local, hhmm)
ArrTime             actual arrival time (local, hhmm)
CRSArrTime          scheduled arrival time (local, hhmm)
UniqueCarrier       unique carrier code
FlightNum           flight number
TailNum             plane tail number
ActualElapsedTime   in minutes
CRSElapsedTime      in minutes
AirTime             in minutes
ArrDelay            arrival delay, in minutes
DepDelay            departure delay, in minutes
Origin              origin IATA airport code
Dest                destination IATA airport code
Distance            in miles
TaxiIn              taxi in time, in minutes
TaxiOut             taxi out time, in minutes
Cancelled           was the flight cancelled?
CancellationCode    reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
Diverted            1 = yes, 0 = no
CarrierDelay        in minutes
WeatherDelay        in minutes
NASDelay            in minutes
SecurityDelay       in minutes
LateAircraftDelay   in minutes
Documentation
The documentation should contain:
- (Brief) problem description
- Solution strategy
- Results section
  - description of resources used
  - description of measurements performed
  - results (graphs/tables + findings)
The documentation should NOT contain:
- Replication of the entire source code - that's why you have to deliver the sources
- Screenshots of every single measurement you made
  - actually, no screenshots at all, and no output files!!