COSC 6339 Big Data Analytics. NoSQL (III): HBase in Hadoop MapReduce, 3rd homework assignment. Edgar Gabriel, Spring 2017.

Recap on HBase

- Column-oriented data store, NoSQL DB
- Data is stored in tables; tables contain rows
- Rows are made of columns, which are grouped in column families
- Data is stored in cells, identified by row, column family, and column
- Cells' values are versioned
- Value = Table + RowKey + Family + Column + Timestamp

Recap on HBase

- Internally, a table is made of regions
- Region: a range of rows stored together
- Region Server: serves one or more regions; a region is served by only one Region Server
- Master Server: daemon responsible for managing the HBase cluster, i.e. the Region Servers

Java API example: put

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import static org.apache.hadoop.hbase.util.Bytes.*;

    public class PutExample {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable htable = new HTable(conf, "HBaseSamples");

            Put put1 = new Put(toBytes("row1"));
            // add(column family, column, value)
            put1.add(toBytes("test"), toBytes("col1"), toBytes("val1"));
            put1.add(toBytes("test"), toBytes("col2"), toBytes("val2"));
            htable.put(put1);

            htable.close();
        }
    }

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/

Java API example: get

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable htable = new HTable(conf, "HBaseSamples");

        Get get = new Get(toBytes("row1"));
        Result result = htable.get(get);    // get the entire row
        print(result);

        get.addColumn(toBytes("test"), toBytes("col2"));   // select a single column
        result = htable.get(get);
        print(result);

        htable.close();
    }

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/

Create and Initialize Scan

Construction options:
- new Scan() - scans through the entire table
- new Scan(startRow) - begins the scan at the provided row, scans to the end of the table
- new Scan(startRow, stopRow) - begins the scan at the provided startRow, stops when a row id is equal to or greater than the provided stopRow
- new Scan(startRow, filter) - begins the scan at the provided row, scans to the end of the table, applies the provided filter

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/

Create and Initialize Scan

Once the Scan is constructed you can further narrow it down (very similar to Get):
- scan.addFamily(family)
- scan.addColumn(family, column)
- scan.setTimeRange(minStamp, maxStamp)
- scan.setMaxVersions(maxVersions)
- scan.setFilter(filter)

For example:

    Scan scan = new Scan(toBytes(startRow), toBytes(stopRow));
    scan.addColumn(toBytes("metrics"), toBytes("counter"));
    scan.addFamily(toBytes("info"));
    ResultScanner scanner = htable.getScanner(scan);
    for (Result result : scanner) {
        // do stuff with result
    }

Using HBase in a MapReduce job

TableInputFormat
- Converts data in an HTable to a format consumable by MapReduce
- Split: rows in one HBase region (a provided Scan may narrow down the result)
- Record: row; the returned columns are controlled by the provided Scan
- Key: ImmutableBytesWritable
- Value: Result (HBase class)

TableOutputFormat
- Saves data into an HTable
- Reducer output key is ignored
- Reducer output value must be HBase's Put or Delete objects

Using HBase in a MapReduce job

- Mapper class needs to extend TableMapper
- Reducer class needs to extend TableReducer

    static class Mapper extends TableMapper<ImmutableBytesWritable, DoubleWritable> {

        public void map(ImmutableBytesWritable row, Result values, Context context)
                throws IOException, InterruptedException {
            byte[] results = values.getValue( /* column family, column - omitted on the slide */ );
            ImmutableBytesWritable userKey =
                new ImmutableBytesWritable( /* output key bytes - omitted on the slide */ );
            context.write(userKey, new DoubleWritable(Bytes.toDouble(results)));
        }
    }

    public static class Reducer extends
        TableReducer<ImmutableBytesWritable, DoubleWritable, ImmutableBytesWritable> {

        public void reduce(ImmutableBytesWritable key, Iterable<DoubleWritable> values,
                           Context context) throws IOException, InterruptedException {
            // sum up the incoming values and count them (elided on the original slide)
            double sum = 0;
            long count = 0;
            for (DoubleWritable value : values) {
                sum += value.get();
                count++;
            }
            Put put = new Put(key.get());
            put.add(Bytes.toBytes("data"), Bytes.toBytes("average"),
                    Bytes.toBytes(sum / count));
            context.write(key, put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "AverageGINByCountryCalculator");
        job.setJarByClass(AverageGINByCountryCalculator.class);

        Scan scan = new Scan();
        scan.addFamily("bycountry".getBytes());
        scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
        scan.setCacheBlocks(false);  // don't set to true for MR jobs

        TableMapReduceUtil.initTableMapperJob(
            "HDI",                          // input table
            scan,                           // scan instance
            Mapper.class,                   // mapper class
            ImmutableBytesWritable.class,   // mapper output key
            DoubleWritable.class,           // mapper output value
            job);

        TableMapReduceUtil.initTableReducerJob(
            "HDIResult",     // output table
            Reducer.class,   // reducer class
            job);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

Using HBase in MapReduce jobs (III)

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "AirPollution");
        job.setJarByClass(AirPollution.class);

        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("location"));
        scan.addColumn(Bytes.toBytes("data"), Bytes.toBytes("value"));

        FilterList li = new FilterList(FilterList.Operator.MUST_PASS_ALL);
        SingleColumnValueFilter filter = new SingleColumnValueFilter(
            Bytes.toBytes("location"), Bytes.toBytes("region"),
            CompareOp.EQUAL, Bytes.toBytes("12"));
        li.addFilter(filter);
        scan.setFilter(li);

3rd Homework - Rules

Each student should deliver:
- Source code (.py files), compressed to a zip or tar.gz file; the source code has to use Python 2.7 and Spark 2.0.2
- Documentation (.pdf, .doc, or .txt file) with explanations of the code and answers to the questions

Deliver electronically on Blackboard:
- Expected by Wednesday, May 3rd, 11.59pm - absolutely no extensions possible!
- In case of questions: ask, ask, ask!

The dataset:
- A data set containing all flights in the US between 2004 and 2008: ~7 million flights per year, ~3.5 GB of data
- Each line in the input file is one flight, with the information listed on the next pages
- Directory in HDFS: /cosc6339_s17/flightdata-full/
- A small file for code development with ~17,000 flights is available in HDFS as well: /cosc6339_s17/flightdata-short/

Part 1: Develop a PySpark code that converts the csv file into a) a Parquet file, b) a sequence file, and c) a JSON file. Compare the size of the generated files to the original input file. (A minimal conversion sketch is shown below.)

Part 2: Develop a separate Spark code for each of the four input formats (csv, Parquet, sequence file, JSON) which determines the percentage of delayed flights per origin airport. Compare the execution for each input format with the large dataset for 5, 10, and 15 executors. (A sketch follows the variable descriptions below.)

Description of the input file

Comma-separated list of data; the individual elements are explained on the next page. More information is available at http://stat-computing.org/dataexpo/2009/the-data.html

2008,1,3,4,NA,905,NA,1025,WN,469,,NA,80,NA,NA,NA,LAX,SFO,337,NA,NA,1,A,0,NA,NA,NA,NA,NA
2008,1,3,4,1417,1345,1717,1645,WN,2524,N458WN,120,120,105,32,32,MDW,MHT,838,4,11,0,,0,28,0,0,0,4
2008,1,3,4,852,855,959,1015,WN,3602,N737JW,67,80,57,-16,-3,ONT,SMF,389,4,6,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,1726,1725,1932,1940,WN,563,N285WN,306,315,291,-8,1,RDU,LAS,2027,5,10,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,2014,1935,2129,2045,WN,1662,N461WN,75,70,47,44,39,SLC,BOI,291,3,25,0,,0,0,0,6,0,38
2008,1,4,5,1617,1610,1813,1810,WN,2374,N344SW,56,60,46,3,7,ABQ,MAF,332,3,7,0,,0,NA,NA,NA,NA,NA
2008,1,4,5,839,820,1019,1010,WN,535,N761RR,100,110,82,9,19,BWI,IND,515,5,13,0,,0,NA,NA,NA,NA,NA
2008,1,4,5,814,810,930,930,WN,502,N641SW,76,80,62,0,4,ELP,PHX,347,3,11,0,,0,NA,NA,NA,NA,NA

Some values can be numeric or NA, and some values are missing (i.e. there are two commas ",," in a row).
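For Part 1, a possible starting point is sketched below. This is a minimal, non-authoritative sketch assuming Spark 2.0.2 with the DataFrame API; the output paths and the choice of the line number as the sequence-file key are illustrative assumptions, not part of the assignment.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("FlightFormatConversion").getOrCreate()

    # Read the comma-separated flight data (using the small development dataset here)
    df = spark.read.csv("/cosc6339_s17/flightdata-short/")

    # a) Parquet and c) JSON can be written directly from the DataFrame
    #    (output paths below are placeholders in the user's HDFS home directory)
    df.write.parquet("flightdata-parquet")
    df.write.json("flightdata-json")

    # b) A sequence file stores (key, value) pairs; one simple choice is to use
    #    the line number as key and the raw csv line as value
    lines = spark.sparkContext.textFile("/cosc6339_s17/flightdata-short/")
    lines.zipWithIndex() \
         .map(lambda pair: (pair[1], pair[0])) \
         .saveAsSequenceFile("flightdata-seq")

The sizes of the generated output directories can then be compared to the original input, e.g. with hdfs dfs -du -h.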

Variable descriptions

Name                Description
Year                1987-2008
Month               1-12
DayofMonth          1-31
DayOfWeek           1 (Monday) - 7 (Sunday)
DepTime             actual departure time (local, hhmm)
CRSDepTime          scheduled departure time (local, hhmm)
ArrTime             actual arrival time (local, hhmm)
CRSArrTime          scheduled arrival time (local, hhmm)
UniqueCarrier       unique carrier code
FlightNum           flight number
TailNum             plane tail number
ActualElapsedTime
CRSElapsedTime
AirTime
ArrDelay            arrival delay
DepDelay            departure delay
Origin              origin IATA airport code
Dest                destination IATA airport code
Distance            in miles
TaxiIn              taxi in time
TaxiOut             taxi out time
Cancelled           was the flight cancelled?
CancellationCode    reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
Diverted            1 = yes, 0 = no
CarrierDelay
WeatherDelay
NASDelay
SecurityDelay
LateAircraftDelay
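For Part 2, the sketch below illustrates one way to compute the percentage of delayed flights per origin airport from the csv input. It is only a sketch: it assumes a flight counts as delayed when DepDelay > 0 (the assignment does not fix a threshold), uses the 0-based column positions implied by the variable list above (DepDelay at index 15, Origin at index 16), and skips rows with a missing or NA delay value.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("DelayedFlightsPerOrigin").getOrCreate()

    lines = spark.sparkContext.textFile("/cosc6339_s17/flightdata-short/")

    def to_pair(line):
        f = line.split(",")
        # (Origin, (delayed flag, 1)); DepDelay = f[15], Origin = f[16] (assumed indices)
        delayed = 1 if f[15] not in ("", "NA") and float(f[15]) > 0 else 0
        return (f[16], (delayed, 1))

    counts = lines.map(to_pair) \
                  .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))

    percentages = counts.mapValues(lambda c: 100.0 * c[0] / c[1])
    for origin, pct in percentages.collect():
        print("%s %.2f" % (origin, pct))

The same computation would be repeated for the other formats, e.g. reading with spark.read.parquet(...) / spark.read.json(...) or sparkContext.sequenceFile(...), and the executor count for the comparison can be varied at submission time (e.g. spark-submit --num-executors 5).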

Documentation

The documentation should contain:
- (Brief) problem description
- Solution strategy
- Results section
  - Description of resources used
  - Description of measurements performed
  - Results (graphs/tables + findings)

The document should not contain:
- Replication of the entire source code - that's why you have to deliver the sources
- Screenshots of every single measurement you made; actually, no screenshots at all. No output files!!