CS435 Introduction to Big Data Spring 2018 Colorado State University. 2/5/2018 Week 4-A Sangmi Lee Pallickara. FAQs. Total Order Sorting Pattern


W4.A.0.0 CS435 Introduction to Big Data
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs435

W4.A.1 FAQs
- PA0 submission is open. Due Feb. 6, 5:00 PM via Canvas. Individual submission (no team submission). If you have not been assigned the port range, please contact the GTA immediately.
- PA1 has been posted. Due Feb. 21, 5:00 PM via Canvas. Individual submission (no team submission).

W4.A.2 Topics
- MapReduce Design Pattern III. Data Organization Patterns
- MapReduce Design Pattern IV. Join Patterns

W4.A.3 Part 1. Large Scale Data Analytics: Design Pattern 3, Data Organization Patterns

W4.A.4 Total Order Sorting Pattern

W4.A.5 3. Total Order Sorting Pattern
- Sorts your data.
- e.g., sorting 1 TB of numeric values
- e.g., sorting comments by user ID when you have a million users

W4.A.6 Structure of the Total Order Sorting Pattern
Two phases:
- Analysis phase: determines the ranges.
- Sorting phase: actually sorts the data.

W4.A.7 Structure of the Total Order Sorting Pattern: Analysis phase
- The mapper performs a simple random sampling and generates outputs with the sort key as its output key, so the data will show up sorted at the reducer.
- Sampling rate? Assume that the number of records in the entire dataset is known (or can be estimated). If you plan on running the sort with a thousand reducers, sampling about a hundred thousand records will be enough.
- Only one reducer is used. It collects the sort keys together into a list; the list of keys is then sliced into the data range boundaries.

W4.A.8 Structure of the Total Order Sorting Pattern: Sorting phase
- The mapper extracts the sort key and stores the record as the value: <sort key, record>.

W4.A.9 Custom partitioner
- Use TotalOrderPartitioner (Hadoop API). It takes the data ranges from the partition file and decides which reducer to send each record to. Dynamic and load balanced.
- Reducer: the number of reducers needs to be equal to the number of partitions.
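Putting W4.A.6-W4.A.9 together, here is a minimal driver sketch using Hadoop's built-in InputSampler and TotalOrderPartitioner. Two assumptions not on the slides: the input is text with the sort key as the tab-separated first field, and the analysis phase runs client-side via InputSampler rather than as the one-reducer MapReduce job described in W4.A.7. Paths, reducer count, and sampling parameters are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalOrderSortDriver {
    public static void main(String[] args) throws Exception {
        // args[0]: input dir (lines of "sortkey<TAB>record"),
        // args[1]: output dir, args[2]: partition file path
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "TotalOrderSort");
        job.setJarByClass(TotalOrderSortDriver.class);
        // Identity map and reduce; the shuffle does the per-reducer sorting
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(10);  // reducers == partitions (W4.A.9)
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Route each record to a reducer by the range boundaries (W4.A.9)
        job.setPartitionerClass(TotalOrderPartitioner.class);
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                new Path(args[2]));
        // Analysis phase (W4.A.7): randomly sample the input keys and write
        // the range boundaries to the partition file
        InputSampler.writePartitionFile(job,
                new InputSampler.RandomSampler<Text, Text>(0.001, 100000));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}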

Part 1. Large Scale Data Analytics: Design Pattern 4, Join Patterns

W4.A.10 Join Patterns
- Data is all over the place. Joins allow users to create a smaller reference set, or to filter out or select data, to discover interesting relationships across datasets.
- Joining a terabyte of data onto another terabyte dataset could require up to two terabytes of bandwidth. That is before any actual join logic can be done!
1. Reduce Side Join Pattern
2. Replicated Join Pattern
3. Composite Join Pattern
4. Cartesian Product Pattern

W4.A.11 A Refresher on Joins
- A join is an operation that combines records from two or more datasets based on a field or set of fields: the foreign key.
- The foreign key is the field in a relational table that matches the column of another table. It is used as a means to cross-reference between tables.

W4.A.12 Example
- Dataset A (users): UserID, Reputation, Location; e.g., UserID 5, Reputation 17556, "San Diego, CA".
- Dataset B (comments): UserID, PostID, Text; e.g., UserID 3, PostID 35314, "Not sure why this is getting downvoted".

W4.A.13 Inner Join
- The default join style: records from both A and B that contain identical values for the given foreign key are brought together.
[Slide table: inner join of A+B on UserID, with columns A.UserID, A.Reputation, A.Location, B.UserID, B.PostID, B.Text; only rows whose UserID appears in both datasets survive]

W4.A.14 Outer Join
- Records whose foreign key is not present in both tables will also be in the final table.
- Left outer: unmatched records in the left table will be in the final table, with null values in the columns of the right table that did not match.
- Right outer: the right table's records are kept, and the left table's values are null where appropriate.
- A full outer join contains all unmatched records from both tables.

W4.A.15 Left Outer Join
[Slide table: left outer join of A+B on UserID; matched rows carry values from both tables, while A rows with no matching B record get null in the B.UserID, B.PostID, and B.Text columns]

W4.A.16 Anti Join
- An anti join is the full outer join minus the inner join.
[Slide table: anti join of A+B on UserID; only unmatched rows survive, e.g., the B row (UserID 8, PostID 48775, "HTML is not a subset of XML!") with nulls in all A columns]
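To make the join semantics above concrete, a toy in-memory illustration of W4.A.13-W4.A.16 on the Users/Comments example. The class name is hypothetical and user 3's details are stand-ins; only user 5, comment 35314, and comment 48775 appear on the slides.

import java.util.LinkedHashMap;
import java.util.Map;

public class JoinSemanticsDemo {
    public static void main(String[] args) {
        // Dataset A: UserID -> user info
        Map<Integer, String> users = new LinkedHashMap<>();
        users.put(3, "reputation=?, location=?");  // hypothetical values
        users.put(5, "reputation=17556, San Diego, CA");
        // Dataset B: UserID -> comment text
        Map<Integer, String> comments = new LinkedHashMap<>();
        comments.put(3, "Not sure why this is getting downvoted");  // PostID 35314
        comments.put(8, "HTML is not a subset of XML!");            // PostID 48775

        // Inner join: UserIDs present in both datasets (here, only 3).
        // Left outer join: every A row; B columns are null when unmatched (5).
        for (Map.Entry<Integer, String> a : users.entrySet()) {
            String b = comments.get(a.getKey());
            System.out.println("left outer: " + a.getKey() + " | "
                    + a.getValue() + " | " + (b != null ? b : "null"));
        }
        // Anti join: full outer minus inner, i.e. UserIDs in exactly one dataset.
        for (Integer id : users.keySet())
            if (!comments.containsKey(id))
                System.out.println("anti: " + id + " | null");  // user 5
        for (Integer id : comments.keySet())
            if (!users.containsKey(id))
                System.out.println("anti: null | " + id);       // user 8
    }
}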

W4.A.17 MapReduce Design Patterns IV: Join Patterns. 1. Reduce Side Join Pattern

W4.A.18 Reduce Side Join Pattern
- The most straightforward implementation of a join in MapReduce.
- Requires a large amount of network bandwidth: the bulk of the data is sent to the reduce phase.
- If you have the resources available, this is a possible solution.

W4.A.19 Structure of the reduce side join pattern
[Slide diagram: mappers for Dataset A and Dataset B each emit (foreign key, record) pairs, e.g., (bob, "teacher"), (bob, 1092847), (bob, 3849273); the shuffle and sort brings records sharing a key to the same reducer, which joins them and writes the output parts (Part A, Part B)]

W4.A.20 Performance analysis
- The reduce side join puts a lot of strain on the cluster's network.
- Only the foreign key and output record of each input record are extracted in the map phase. No data can be filtered ahead of time, so almost all of the data will be sent to the shuffle and sort step.
- Reduce side joins will typically utilize relatively more reducers than your typical analytics job.

W4.A.21 Driver code

// Use MultipleInputs to set which input uses what mapper.
// This will keep the parsing of each data set separate from a logical standpoint.
// The first two elements of the args array are the two inputs.
MultipleInputs.addInputPath(job, new Path(args[0]),
        TextInputFormat.class, UserJoinMapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]),
        TextInputFormat.class, CommentJoinMapper.class);
job.getConfiguration().set("join.type", args[2]);

W4.A.22 User mapper code

public static class UserJoinMapper extends Mapper<Object, Text, Text, Text> {
    private Text outkey = new Text();
    private Text outvalue = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Parse the input string into a nice map
        Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());
        String userId = parsed.get("Id");
        // The foreign join key is the user ID
        outkey.set(userId);
        // Flag this record for the reducer and then output
        outvalue.set("A" + value.toString());
        context.write(outkey, outvalue);
    }
}

W4.A.23 Comment mapper code

public static class CommentJoinMapper extends Mapper<Object, Text, Text, Text> {
    private Text outkey = new Text();
    private Text outvalue = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());
        // The foreign join key is the user ID
        outkey.set(parsed.get("UserId"));
        // Flag this record for the reducer and then output
        outvalue.set("B" + value.toString());
        context.write(outkey, outvalue);
    }
}

W4.A.24 Reducer code

public static class UserJoinReducer extends Reducer<Text, Text, Text, Text> {
    private static final Text EMPTY_TEXT = new Text("");
    private ArrayList<Text> listA = new ArrayList<Text>();
    private ArrayList<Text> listB = new ArrayList<Text>();
    private String joinType = null;

    public void setup(Context context) {
        // Get the type of join from our configuration
        joinType = context.getConfiguration().get("join.type");
    }

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Clear our lists
        listA.clear();

W4.A.25 Reducer code (continued)

        listB.clear();
        // Iterate through all our values, binning each record based on what
        // it was tagged with. Make sure to remove the tag!
        for (Text tmp : values) {
            if (tmp.charAt(0) == 'A') {
                listA.add(new Text(tmp.toString().substring(1)));
            } else if (tmp.charAt(0) == 'B') {
                listB.add(new Text(tmp.toString().substring(1)));
            }
        }
        // Execute our join logic now that the lists are filled
        executeJoinLogic(context);
    }

    private void executeJoinLogic(Context context)
            throws IOException, InterruptedException {

W4.A.26 Inner join code

        if (joinType.equalsIgnoreCase("inner")) {
            // If both lists are not empty, join A with B
            if (!listA.isEmpty() && !listB.isEmpty()) {
                for (Text A : listA) {
                    for (Text B : listB) {
                        context.write(A, B);
                    }
                }
            }
        }

W4.A.27 Left outer join code

        else if (joinType.equalsIgnoreCase("leftouter")) {
            // For each entry in A,
            for (Text A : listA) {
                // If list B is not empty, join A and B
                if (!listB.isEmpty()) {
                    for (Text B : listB) {
                        context.write(A, B);
                    }
                } else {
                    // Else, output A by itself
                    context.write(A, EMPTY_TEXT);
                }
            }
        }

W4.A.28 Right outer join code

        else if (joinType.equalsIgnoreCase("rightouter")) {
            // For each entry in B,
            for (Text B : listB) {
                // If list A is not empty, join A and B
                if (!listA.isEmpty()) {
                    for (Text A : listA) {
                        context.write(A, B);
                    }
                } else {
                    // Else, output B by itself
                    context.write(EMPTY_TEXT, B);
                }
            }
        }
    }
}
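The slides stop at the right outer case. The full outer and anti joins defined in W4.A.14 and W4.A.16 fit the same executeJoinLogic chain; the two else-if branches below, inserted before the method's closing brace, are a hedged sketch in the same tagging scheme, not code shown in the lecture.

        else if (joinType.equalsIgnoreCase("fullouter")) {
            // If list A is not empty, join each A with each B,
            // or output A alone when there is no matching B
            if (!listA.isEmpty()) {
                for (Text A : listA) {
                    if (!listB.isEmpty()) {
                        for (Text B : listB) {
                            context.write(A, B);
                        }
                    } else {
                        context.write(A, EMPTY_TEXT);
                    }
                }
            } else {
                // List A is empty: output each B with an empty left side
                for (Text B : listB) {
                    context.write(EMPTY_TEXT, B);
                }
            }
        } else if (joinType.equalsIgnoreCase("anti")) {
            // Anti join = full outer minus inner (W4.A.16): output records
            // only when exactly one of the two lists is empty
            if (listA.isEmpty() ^ listB.isEmpty()) {
                for (Text A : listA) {
                    context.write(A, EMPTY_TEXT);
                }
                for (Text B : listB) {
                    context.write(EMPTY_TEXT, B);
                }
            }
        }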

W4.A.29 MapReduce Design Patterns IV: Join Patterns. 2. Replicated Join

W4.A.30 Replicated Join
- A special type of join between one large dataset and (many) small dataset(s) that can be performed on the map side.
- The mapper reads all files from the distributed cache during the setup phase, storing them into in-memory lookup tables, and performs the join while processing the data.
- If the foreign key is not found in the in-memory structure, the record is either omitted or output, based on the join type.
- No combiner/partitioner/reducer is needed.

W4.A.31 Structure of the replicated join pattern
[Slide diagram: Dataset B is placed in the Distributed Cache and read by every replicated join mapper; each mapper joins its split of Dataset A against the cached table and writes the output parts (Part A, Part B) directly, with no shuffle or reduce phase]

W4.A.32 Hadoop DistributedCache
- Provided by the Hadoop MapReduce framework.
- Caches read-only text files, archives, jar files, etc.
- Once a file is cached for a job using the distributed cache, the data will be available on each data node where map/reduce tasks are running.

W4.A.33 Working with DistributedCache
- Make sure your file is available and accessible via http:// or hdfs://.
- Set up the application's JobConf in your Driver class:
  DistributedCache.addFileToClassPath(new Path("/usr/datafile/xyz"), conf);

W4.A.34 Size of DistributedCache in Hadoop
- Size: the default size of the Hadoop distributed cache is 10 GB, configurable in mapred-site.xml.
- Data consistency: the Hadoop distributed cache tracks the modification timestamps of the cache files.
- Overhead: object serialization.

W4.A.35 Using DistributedCache for the replicated join
- A small file is pushed to all map tasks using the DistributedCache.
- Useful for a join between a small set and a large set of data, e.g., user information vs. transaction records, or user information vs. comment history.
- Code: in the setup phase, user data is read from the DistributedCache and stored in memory; (userId, record) pairs are stored in a HashMap for data retrieval during the map process. In the map phase, for each input record (from the large dataset), the user information is retrieved from the HashMap and a joined record is assembled.
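A hypothetical driver for W4.A.35, sketching how the small user file would be pushed to every map task with DistributedCache.addCacheFile. The "join.type" property and ReplicatedJoinMapper mirror the mapper code that follows; the argument layout and class name are assumptions, not from the slides.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReplicatedJoinDriver {
    public static void main(String[] args) throws Exception {
        // args[0]: large comment dataset, args[1]: output dir,
        // args[2]: small gzipped user file on HDFS, args[3]: join type
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "ReplicatedJoin");
        job.setJarByClass(ReplicatedJoinDriver.class);
        job.setMapperClass(ReplicatedJoinMapper.class);
        job.setNumReduceTasks(0);  // map-only: no combiner/partitioner/reducer (W4.A.30)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.getConfiguration().set("join.type", args[3]);  // "inner" or "leftouter"
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Push the small user file to every map task (W4.A.35)
        DistributedCache.addCacheFile(new URI(args[2]), job.getConfiguration());
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}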

W4.A.36 Mapper code [1/3]: join (comment, user, on userId)

public static class ReplicatedJoinMapper extends Mapper<Object, Text, Text, Text> {
    private static final Text EMPTY_TEXT = new Text("");
    private HashMap<String, String> userIdToInfo = new HashMap<String, String>();
    private Text outvalue = new Text();
    private String joinType = null;

    public void setup(Context context)
            throws IOException, InterruptedException {
        Path[] files = DistributedCache.getLocalCacheFiles(
                context.getConfiguration());

W4.A.37 Mapper code [2/3]

        // Read all files in the DistributedCache
        for (Path p : files) {
            BufferedReader rdr = new BufferedReader(
                    new InputStreamReader(
                            new GZIPInputStream(
                                    new FileInputStream(new File(p.toString())))));
            String line = null;
            // For each record in the user file
            while ((line = rdr.readLine()) != null) {
                // Get the user ID for this record
                Map<String, String> parsed = MRDPUtils.transformXmlToMap(line);
                String userId = parsed.get("Id");
                // Map the user ID to the record
                userIdToInfo.put(userId, line);
            }
            rdr.close();
        }
        // Get the join type from the configuration
        joinType = context.getConfiguration().get("join.type");
    }

W4.A.38 Mapper code [3/3]

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());
        String userId = parsed.get("UserId");
        String userInformation = userIdToInfo.get(userId);
        // If the user information is not null, then output
        if (userInformation != null) {
            outvalue.set(userInformation);
            context.write(value, outvalue);
        } else if (joinType.equalsIgnoreCase("leftouter")) {
            // If we are doing a left outer join,
            // output the record with an empty value
            context.write(value, EMPTY_TEXT);
        }
    }
}

W4.A.39 MapReduce Design Patterns IV: Join Patterns. 3. Composite Join

W4.A.40 Composite Join
- Joins very large datasets together on the map side. If the datasets are already sorted by the foreign key, no shuffle and sort is needed.
- Each input dataset must be partitioned and sorted in a specific way, and divided into the same number of partitions.

W4.A.41 Structure of the composite join pattern
[Slide diagram: Dataset A and Dataset B are each divided into five partitions by hash(foreign key) % 5, partitions 0 through 4; within each partition the records are sorted by foreign key (Adam, Bradley, Chris, Denis, Donald, Frank, Fred, James, Nicholas, Peter, Stella, Wade, William, ...), so a mapper can walk the matching partition pair from A and B and join them directly]
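Hadoop ships a CompositeInputFormat (in the older org.apache.hadoop.mapred API) that implements the W4.A.40-W4.A.41 structure. The sketch below is a hedged illustration, not lecture code: it assumes both inputs are KeyValueTextInputFormat files already partitioned and sorted identically on the foreign key, and the class names are hypothetical.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapred.join.TupleWritable;

public class CompositeJoinDriver {
    // The input format hands the mapper one TupleWritable per foreign key,
    // holding the matching record from each input in input order.
    public static class CompositeJoinMapper extends MapReduceBase
            implements Mapper<Text, TupleWritable, Text, Text> {
        public void map(Text key, TupleWritable value,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            output.collect((Text) value.get(0), (Text) value.get(1));
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0], args[1]: pre-partitioned, pre-sorted inputs; args[2]: output
        JobConf conf = new JobConf(CompositeJoinDriver.class);
        conf.setJobName("CompositeJoin");
        conf.setMapperClass(CompositeJoinMapper.class);
        conf.setNumReduceTasks(0);  // map-only: no shuffle and sort (W4.A.40)
        // Compose the two inputs into a single map-side inner join
        conf.setInputFormat(CompositeInputFormat.class);
        conf.set("mapred.join.expr", CompositeInputFormat.compose("inner",
                KeyValueTextInputFormat.class,
                new Path(args[0]), new Path(args[1])));
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(conf, new Path(args[2]));
        JobClient.runJob(conf);
    }
}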

W4.A.42 Questions?