EE657 Spring 2012 HW#4 Zhou Zhao



Problem 6.3 Solution

Following the SimpleDB sample application in the Amazon Java SDK, the code prepares a simple domain that contains 5 items. For instance, the first item has 7 attributes, namely Category, Subcategory, Name, etc.

    sampleData.add(new ReplaceableItem("Item_01").withAttributes(
        new ReplaceableAttribute("Category", "Clothes", true),
        new ReplaceableAttribute("Subcategory", "Sweater", true),
        new ReplaceableAttribute("Name", "Cathair Sweater", true),
        new ReplaceableAttribute("Color", "Siamese", true),
        new ReplaceableAttribute("Size", "Small", true),
        new ReplaceableAttribute("Size", "Medium", true),
        new ReplaceableAttribute("Size", "Large", true)));

A variety of database operations are implemented, as listed below:

1. create a domain
2. list existing domains
3. put data into one of the domains
4. select data from a domain
5. delete values from an attribute
6. delete an attribute
7. replace an attribute
8. delete an item and a domain

The code shown below corresponds to creating a domain, listing existing domains, and putting data into a domain, respectively.

    // Create a domain
    String myDomain = "MyStore2";
    System.out.println("Creating domain called " + myDomain + ".\n");
    sdb.createDomain(new CreateDomainRequest(myDomain));

    // List domains
    System.out.println("Listing all domains in your account:\n");
    for (String domainName : sdb.listDomains().getDomainNames()) {
        System.out.println(" " + domainName);
    }
    System.out.println();

    // Put data into a domain
    System.out.println("Putting data into " + myDomain + " domain.\n");
    sdb.batchPutAttributes(new BatchPutAttributesRequest(myDomain, createSampleData()));

The execution output on the terminal is shown in Fig. 1.

Fig. 1 Execution output of SimpleDB application on AWS.

Problem 6.4 Solution

The MapReduce code for matrix multiplication is adapted from an external reference. Assume the matrix multiplication is A*B=C, in which A, B, and C are all N*N integer matrices. Each matrix is divided into blocks of almost equal size, one for each node in the cluster. For instance, an N*N matrix is divided into 2*2 blocks for a cluster that has 4 nodes. Mapper nodes partition the input matrices, while Reducer nodes perform the actual matrix multiplication. Four implementation strategies for the Reducer nodes are presented below. Strategies one to three need to submit two jobs, while strategy four needs to submit only one job to the cluster.

1. Each reducer does just one block multiplication.
2. Each reducer multiplies a single A block with a whole row of B blocks.
3. Each reducer multiplies a single B block with a whole column of A blocks.
4. Each reducer computes the final blocks of the product matrix C.

The experiment is conducted in Java on Hadoop clusters with 4 and 16 nodes on EC2, respectively. The Hadoop cluster is configured by the following steps:

1. Use the Apache Whirr script to automatically provision the cluster on EC2.
2. Set up a Proxy VM instance to submit jobs to the JobTracker in the cluster.
3. Upload the source code to the Proxy VM instance through FileZilla, or download the files from an S3 bucket using s3cmd.
4. Submit the matrix multiplication job to the cluster and record the execution time.

The provisioned Hadoop cluster, which has 16 nodes, is shown in Fig. 2, and the code execution on the Hadoop cluster is shown in Fig. 3.

Table 1 Measured execution time, in seconds, for 10000*10000 integer matrices.

    # of nodes in cluster   Strategy 1   Strategy 2   Strategy 3   Strategy 4
    4 nodes                 228s         227s         198s         118s
    16 nodes                370s         385s         412s         199s
    62 nodes

Note: I have submitted a request to AWS to lift the limit of 20 provisioned instances. Now I can provision up to 1024 VM instances.

Fig. 2 Provisioned 16-node Hadoop cluster on EC2.

Fig. 3 Execution of MapReduce job on Hadoop cluster on EC2.
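The block decomposition behind strategy four can be illustrated without Hadoop. The following is a minimal single-process sketch, not the code used in the experiment: the class name BlockMatMul and its methods are hypothetical, and each (bi, bj) pair in the outer loops stands in for one reducer that accumulates a final block of C from the matching row of A blocks and column of B blocks.

```java
import java.util.Arrays;

// Hypothetical illustration only: a local stand-in for the block-wise product,
// not the Hadoop job measured in Table 1.
public class BlockMatMul {

    /** Plain triple-loop product, used as a reference result. */
    public static long[][] multiply(long[][] a, long[][] b) {
        int n = a.length;
        long[][] c = new long[n][n];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }

    /** Computes C one block at a time; blocks is the number of blocks per dimension. */
    public static long[][] blockMultiply(long[][] a, long[][] b, int blocks) {
        int n = a.length;
        int size = (n + blocks - 1) / blocks; // block edge length, rounded up
        long[][] c = new long[n][n];
        for (int bi = 0; bi < blocks; bi++)          // block row of C
            for (int bj = 0; bj < blocks; bj++)      // block column of C
                for (int bk = 0; bk < blocks; bk++)  // walk A's block row and B's block column
                    multiplyBlock(a, b, c, bi * size, bj * size, bk * size, size);
        return c;
    }

    /** Adds A[r0.., k0..] * B[k0.., c0..] into C[r0.., c0..], clipped at the matrix edge. */
    private static void multiplyBlock(long[][] a, long[][] b, long[][] c,
                                      int r0, int c0, int k0, int size) {
        int n = a.length;
        int rEnd = Math.min(r0 + size, n);
        int cEnd = Math.min(c0 + size, n);
        int kEnd = Math.min(k0 + size, n);
        for (int i = r0; i < rEnd; i++)
            for (int k = k0; k < kEnd; k++)
                for (int j = c0; j < cEnd; j++)
                    c[i][j] += a[i][k] * b[k][j];
    }

    public static void main(String[] args) {
        long[][] a = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
        long[][] b = {{9, 8, 7}, {6, 5, 4}, {3, 2, 1}};
        long[][] c = blockMultiply(a, b, 2); // 2*2 blocks, as for a 4-node cluster
        System.out.println(Arrays.deepToString(c));
        System.out.println(Arrays.deepEquals(c, multiply(a, b)));
    }
}
```

Because every block of C depends only on one block row of A and one block column of B, the (bi, bj) iterations are independent of each other, which is what allows them to be spread over separate reducers.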

Problem 6.5 Solution

S3 on AWS is for simple file storage. The code below prepares a text file named aws-java-sdk-.txt, which contains 256 characters. The text file is then uploaded to S3 and downloaded from S3.

    private static File createSampleFile() throws IOException {
        File file = File.createTempFile("aws-java-sdk-", ".txt");
        file.deleteOnExit();
        Writer writer = new OutputStreamWriter(new FileOutputStream(file));
        writer.write("abcdefghijklmnopqrstuvwxyz\n");
        writer.write(" \n");
        writer.write("!@#$%^&*()-=[]{};':',.<>/?\n");
        writer.write(" \n");
        writer.write("abcdefghijklmnopqrstuvwxyz\n");
        writer.close();
        return file;
    }

The operations implemented on the S3 file system are:

1. create a bucket
2. list the buckets in one account
3. upload an object into the bucket
4. download an object from the bucket
5. list the objects in one bucket
6. delete a bucket

The code shown below is the segment that uploads an object to the S3 bucket. The execution output is shown in Fig. 4.

    System.out.println("Uploading a new object to S3 from a file\n");
    s3.putObject(new PutObjectRequest(bucketName, key, createSampleFile()));

Fig. 4 Execution output of S3 application on AWS.

Problem 6.15 Solution

The MapReduce programming model has simplified the implementation of many data-parallel applications. Its programming model is based on a bipartite graph. However, it is limited in the kinds of applications it can support. Twister provides enhancements including:

1. distinction between static and variable data
2. configurable long-running map/reduce tasks
3. message-based communication
4. support for iterative MapReduce computations
5. a combine phase to collect all outputs
6. data access via local disk
7. lightweight

Problem 6.16 Solution

The original code in the textbook contains many errors. Thus, the code is modified as shown below to clear the compilation errors. The class OSCountMapper is a subclass of MapReduceBase and implements the interface Mapper. The input key-value pair is <LongWritable, Text> and the output key-value pair is <Text, IntWritable>. The map function finds the last substring and searches for the last character before ')'. Finally, it adds the generated key-value pair to the collector.

    public class OSCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            Text UserInfo = new Text();
            Text OSversion = new Text();
            int StartIndex = 0;
            int EndIndex = 0;
            int i = 0;
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            // Advance through the tokens of the line, stopping after the eighth
            while (tokenizer.hasMoreTokens() && i != 8) {
                i++;
                UserInfo.set(tokenizer.nextToken());
            }
            i = 0;
            // Scan the token up to the first ';'
            while (UserInfo.charAt(i) != ';') {
                if (UserInfo.charAt(i) != '(') {
                    StartIndex = i;
                }
                i++;
            }
            EndIndex = i;
            OSversion.set(UserInfo.toString().substring(StartIndex, EndIndex));
            output.collect(OSversion, new IntWritable(1));
        }
    }

The class OSCountReducer is also modified as shown below to clear the compilation errors. The reducer iterates over all the values corresponding to one key and accumulates them. The result is stored in the variable sum. The main class OSCount is also modified. During initialization, the job configures the mapper, combiner, and reducer classes. The input and output file directories are also specified.

    public class OSCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            // Add up every partial count received for this key
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public class OSCount {
        /** args */
        public static void main(String[] args) throws IOException {
            // TODO Auto-generated method stub
            JobConf conf = new JobConf(new Configuration(), OSCount.class);
            conf.setJobName("oscount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(OSCountMapper.class);
            conf.setCombinerClass(OSCountReducer.class);
            conf.setReducerClass(OSCountReducer.class);
            // conf.setInputFormat(TextInputFormat.class);
            // conf.setOutputFormat(TextOutputFormat.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }

Since the problem does not provide an input data set, the execution of the program on the Hadoop cluster is shown in Fig. 5.
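The job above registers the same class as both combiner and reducer. That only works because summing partial counts gives the same totals as summing the raw (key, 1) pairs. The following is a minimal in-memory sketch of that property, assuming each input record has already been reduced to an OS name; the class OSCountSketch, its methods, and the sample records are hypothetical and do not use the Hadoop API.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: simulates map, combine, and reduce in one process.
public class OSCountSketch {

    /** Map step: emit (os, 1) for every record of one input split. */
    public static List<Map.Entry<String, Integer>> map(List<String> records) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String os : records) {
            out.add(new AbstractMap.SimpleEntry<>(os, 1));
        }
        return out;
    }

    /** Reduce step, also usable as the combiner: sum the counts per key. */
    public static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            totals.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return totals;
    }

    /** Runs the whole pipeline over several splits, with or without a combiner. */
    public static Map<String, Integer> run(List<List<String>> splits, boolean useCombiner) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (List<String> split : splits) {
            List<Map.Entry<String, Integer>> mapped = map(split);
            if (useCombiner) {
                // Pre-aggregate on the "mapper side" before the shuffle
                shuffled.addAll(sumByKey(mapped).entrySet());
            } else {
                shuffled.addAll(mapped);
            }
        }
        return sumByKey(shuffled);
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("Windows", "Linux", "Windows"),
                Arrays.asList("Linux", "Mac", "Windows"));
        System.out.println(run(splits, false));
        System.out.println(run(splits, true));
    }
}
```

With the combiner enabled, fewer pairs reach the final reduce step, but the totals are unchanged. A reducer that counted values instead of summing them would not be safe to reuse as a combiner.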

Fig. 5 Execution output of sample application on the terminal.

Problem 9.3 Solution

There are three types of RFID tags. Active RFID tags contain a battery and transmit signals autonomously. Passive RFID tags do NOT have a battery and require an external source to provoke communication. Battery-assisted passive RFID tags require an external source to wake up the battery.

1. Active and semi-active tags have a battery and can transmit over 30 to 100 meters. They are more costly than passive RFID tags.
2. Passive RFID tags have no battery and can only transmit up to 20 feet. However, they are cheap and disposable.

Similarly, there are two types of GPS tracking systems, namely passive and active.

1. A passive GPS device is just a receiver and is primarily used for data recording. It stores GPS location data in its internal memory and is cheaper than an active GPS device.
2. An active GPS device can transmit its data to a remote server through cellular communication. It can send the data at regular time intervals in real time.

Problem 9.6 Solution

The IoT (Internet of Things) refers to the network interconnection of everyday objects, tools, devices, and computers, while the traditional Internet connects computers. With the development of RFID and GPS technology, all things in our daily life can be tagged and connected, no matter where and when the object is. The IoT has an event-driven architecture, as shown in the figure in the textbook. The top layer is formed by the driven applications, which include retailing and supply-chain management, logistics services, smart grids and buildings, etc. The bottom layer represents various types of sensor devices, namely RFID tags, ZigBee, GPS navigators, etc. These sensors are widely connected and collect real-time information. The cloud computing platform in the middle processes the collected information and generates intelligence for decision-making.

Many technologies can be applied to build the IoT infrastructure. They are divided into two categories: enabling and synergistic technologies. Toward 2020, the IoT will be deployed at global scale and will significantly upgrade national economies and quality of life.


More information

Chapter 3. Distributed Algorithms based on MapReduce

Chapter 3. Distributed Algorithms based on MapReduce Chapter 3 Distributed Algorithms based on MapReduce 1 Acknowledgements Hadoop: The Definitive Guide. Tome White. O Reilly. Hadoop in Action. Chuck Lam, Manning Publications. MapReduce: Simplified Data

More information

Homework 3: Wikipedia Clustering Cliff Engle & Antonio Lupher CS 294-1

Homework 3: Wikipedia Clustering Cliff Engle & Antonio Lupher CS 294-1 Introduction: Homework 3: Wikipedia Clustering Cliff Engle & Antonio Lupher CS 294-1 Clustering is an important machine learning task that tackles the problem of classifying data into distinct groups based

More information

Parallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014

Parallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014 Parallel Data Processing with Hadoop/MapReduce CS140 Tao Yang, 2014 Overview What is MapReduce? Example with word counting Parallel data processing with MapReduce Hadoop file system More application example

More information

ILO3:Algorithms and Programming Patterns for Cloud Applications (Hadoop)

ILO3:Algorithms and Programming Patterns for Cloud Applications (Hadoop) DISTRIBUTED RESEARCH ON EMERGING APPLICATIONS & MACHINES Indian Institute of Science, Bangalore DREAM:Lab SE252:Lecture 13-14, Feb 24/25 ILO3:Algorithms and Programming Patterns for Cloud

More information

September 2013 Alberto Abelló & Oscar Romero 1

September 2013 Alberto Abelló & Oscar Romero 1 duce-i duce-i September 2013 Alberto Abelló & Oscar Romero 1 Knowledge objectives 1. Enumerate several use cases of duce 2. Describe what the duce environment is 3. Explain 6 benefits of using duce 4.

More information

A Guide to Running Map Reduce Jobs in Java University of Stirling, Computing Science

A Guide to Running Map Reduce Jobs in Java University of Stirling, Computing Science A Guide to Running Map Reduce Jobs in Java University of Stirling, Computing Science Introduction The Hadoop cluster in Computing Science at Stirling allows users with a valid user account to submit and

More information

PageRank Implementa.on in MapReduce. TA: Kun Li

PageRank Implementa.on in MapReduce. TA: Kun Li PageRank Implementa.on in MapReduce TA: Kun Li Hadoop version Your code will be tested under EMR AMI version 2.4.2 You can develop and test your code using Hadoop 1.0.3, which is corresponding

More information

COSC 6397 Big Data Analytics. Data Formats (III) HBase: Java API, HBase in MapReduce and HBase Bulk Loading. Edgar Gabriel Spring 2014.

COSC 6397 Big Data Analytics. Data Formats (III) HBase: Java API, HBase in MapReduce and HBase Bulk Loading. Edgar Gabriel Spring 2014. COSC 6397 Big Data Analytics Data Formats (III) HBase: Java API, HBase in MapReduce and HBase Bulk Loading Edgar Gabriel Spring 2014 Recap on HBase Column-Oriented data store NoSQL DB Data is stored in

More information

Local MapReduce debugging

Local MapReduce debugging Local MapReduce debugging Tools, tips, and tricks Aaron Kimball Cloudera Inc. July 21, 2009 urce: Wikipedia Japanese rock garden Common sense debugging tips Build incrementally Build compositionally Use

More information

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 16. Big Data Management VI (MapReduce Programming)

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 16. Big Data Management VI (MapReduce Programming) Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 16 Big Data Management VI (MapReduce Programming) Credits: Pietro Michiardi (Eurecom): Scalable Algorithm

More information

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing ECE5610/CSC6220 Introduction to Parallel and Distribution Computing Lecture 6: MapReduce in Parallel Computing 1 MapReduce: Simplified Data Processing Motivation Large-Scale Data Processing on Large Clusters

More information

Big Data: Architectures and Data Analytics

Big Data: Architectures and Data Analytics Big Data: Architectures and Data Analytics June 26, 2018 Student ID First Name Last Name The exam is open book and lasts 2 hours. Part I Answer to the following questions. There is only one right answer

More information

Steps: First install hadoop (if not installed yet) by,

Steps: First install hadoop (if not installed yet) by, SL-V BE IT EXP 7 Aim: Design and develop a distributed application to find the coolest/hottest year from the available weather data. Use weather data from the Internet and process it using MapReduce. Steps:

More information


PROGRAMMING FUNDAMENTALS PROGRAMMING FUNDAMENTALS Q1. Name any two Object Oriented Programming languages? Q2. Why is java called a platform independent language? Q3. Elaborate the java Compilation process. Q4. Why do we write

More information

At Course Completion Prepares you as per certification requirements for AWS Developer Associate.

At Course Completion Prepares you as per certification requirements for AWS Developer Associate. [AWS-DAW]: AWS Cloud Developer Associate Workshop Length Delivery Method : 4 days : Instructor-led (Classroom) At Course Completion Prepares you as per certification requirements for AWS Developer Associate.

More information

Altus Data Engineering

Altus Data Engineering Altus Data Engineering Important Notice 2010-2018 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, and any other product or service names or slogans contained in this document are trademarks

More information

Big Data: Tremendous challenges, great solutions

Big Data: Tremendous challenges, great solutions Big Data: Tremendous challenges, great solutions Luc Bougé ENS Rennes Alexandru Costan INSA Rennes Gabriel Antoniu INRIA Rennes Survive the data deluge! Équipe KerData 1 Big Data? 2 Big Picture The digital

More information

MapReduce programming model

MapReduce programming model MapReduce programming model technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC @JordiTorresBCN Warning! Slides are only for presenta8on guide We will discuss+debate

More information

Processing Distributed Data Using MapReduce, Part I

Processing Distributed Data Using MapReduce, Part I Processing Distributed Data Using MapReduce, Part I Computer Science E-66 Harvard University David G. Sullivan, Ph.D. MapReduce A framework for computation on large data sets that are fragmented and replicated

More information

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [MAPREDUCE & HADOOP] Does Shrideep write the poems on these title slides? Yes, he does. These musing are resolutely on track For obscurity shores, from whence

More information


IN ACTION. Chuck Lam SAMPLE CHAPTER MANNING IN ACTION Chuck Lam SAMPLE CHAPTER MANNING Hadoop in Action by Chuck Lam Chapter 1 Copyright 2010 Manning Publications brief contents PART I HADOOP A DISTRIBUTED PROGRAMMING FRAMEWORK... 1 1 Introducing

More information

What is Cloud Computing? What are the Private and Public Clouds? What are IaaS, PaaS, and SaaS? What is the Amazon Web Services (AWS)?

What is Cloud Computing? What are the Private and Public Clouds? What are IaaS, PaaS, and SaaS? What is the Amazon Web Services (AWS)? What is Cloud Computing? What are the Private and Public Clouds? What are IaaS, PaaS, and SaaS? What is the Amazon Web Services (AWS)? What is Amazon Machine Image (AMI)? Amazon Elastic Compute Cloud (EC2)?

More information

Big Data: Architectures and Data Analytics

Big Data: Architectures and Data Analytics Big Data: Architectures and Data Analytics June 26, 2018 Student ID First Name Last Name The exam is open book and lasts 2 hours. Part I Answer to the following questions. There is only one right answer

More information

Exercise 4: Loops, Arrays and Files

Exercise 4: Loops, Arrays and Files Exercise 4: Loops, Arrays and Files worth 24% of the final mark November 4, 2004 Instructions Submit your programs in a floppy disk. Deliver the disk to Michele Zito at the 12noon lecture on Tuesday November

More information

Introducing Hadoop. This chapter covers

Introducing Hadoop. This chapter covers 1 Introducing Hadoop This chapter covers The basics of writing a scalable, distributed data-intensive program Understanding Hadoop and MapReduce Writing and running a basic MapReduce program Today, we

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Vendor: Hortonworks. Exam Code: HDPCD. Exam Name: Hortonworks Data Platform Certified Developer. Version: Demo

Vendor: Hortonworks. Exam Code: HDPCD. Exam Name: Hortonworks Data Platform Certified Developer. Version: Demo Vendor: Hortonworks Exam Code: HDPCD Exam Name: Hortonworks Data Platform Certified Developer Version: Demo QUESTION 1 Workflows expressed in Oozie can contain: A. Sequences of MapReduce and Pig. These

More information

Session 1 Big Data and Hadoop - Overview. - Dr. M. R. Sanghavi

Session 1 Big Data and Hadoop - Overview. - Dr. M. R. Sanghavi Session 1 Big Data and Hadoop - Overview - Dr. M. R. Sanghavi Acknowledgement Prof. Kainjan M. Sanghavi For preparing this prsentation This presentation is available on my blog

More information

(A) 99 (B) 100 (C) 101 (D) 100 initial integers plus any additional integers required during program execution

(A) 99 (B) 100 (C) 101 (D) 100 initial integers plus any additional integers required during program execution Ch 5 Arrays Multiple Choice 01. An array is a (A) (B) (C) (D) data structure with one, or more, elements of the same type. data structure with LIFO access. data structure, which allows transfer between

More information