EE657 Spring 2012 HW#4 Zhou Zhao


Problem 6.3 Solution

Referencing the sample SimpleDB application in the AWS SDK for Java, a simple domain containing 5 items is prepared in the code. For instance, the first item has 7 attributes (Category, Subcategory, Name, Color, and three Size values):

sampleData.add(new ReplaceableItem("Item_01").withAttributes(
    new ReplaceableAttribute("Category", "Clothes", true),
    new ReplaceableAttribute("Subcategory", "Sweater", true),
    new ReplaceableAttribute("Name", "Cathair Sweater", true),
    new ReplaceableAttribute("Color", "Siamese", true),
    new ReplaceableAttribute("Size", "Small", true),
    new ReplaceableAttribute("Size", "Medium", true),
    new ReplaceableAttribute("Size", "Large", true)));

A variety of database operations are implemented, as listed below:
1. create a domain
2. list existing domains
3. put data into one of the domains
4. select data from a domain
5. delete values from an attribute
6. delete an attribute
7. replace an attribute
8. delete an item and a domain

The code shown below corresponds to the operations of creating a domain, listing existing domains, and putting data into the domain, respectively.

// Create a domain
String myDomain = "MyStore2";
System.out.println("Creating domain called " + myDomain + ".\n");
sdb.createDomain(new CreateDomainRequest(myDomain));

// List domains
System.out.println("Listing all domains in your account:\n");
for (String domainName : sdb.listDomains().getDomainNames()) {
    System.out.println("  " + domainName);
}
System.out.println();

// Put data into a domain
System.out.println("Putting data into " + myDomain + " domain.\n");
sdb.batchPutAttributes(new BatchPutAttributesRequest(myDomain, createSampleData()));

The execution output on the terminal is shown in Fig. 1.

Fig. 1 Execution output of SimpleDB application on AWS.
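The remaining operations (select, delete values, delete an attribute, replace an attribute, and delete the item and domain) are not reproduced in this report. A minimal sketch of how they might look with the same AWS SDK for Java SimpleDB client (sdb) is shown below; the domain, item, and attribute names are carried over from the sample above, the select expression and the replacement value are assumed examples, and the statements are illustrative rather than copied from the submitted code (they assume the usual java.util and com.amazonaws.services.simpledb.model imports).

// Select data from the domain (assumed example query)
String selectExpression = "select * from `" + myDomain + "` where Category = 'Clothes'";
for (Item item : sdb.select(new SelectRequest(selectExpression)).getItems()) {
    System.out.println("  Item: " + item.getName());
}

// Delete a single value of the multi-valued Size attribute
sdb.deleteAttributes(new DeleteAttributesRequest(myDomain, "Item_01")
        .withAttributes(new Attribute("Size", "Small")));

// Delete an attribute entirely (all of its values) by giving only its name
sdb.deleteAttributes(new DeleteAttributesRequest(myDomain, "Item_01")
        .withAttributes(new Attribute().withName("Color")));

// Replace an attribute value (the 'true' replace flag overwrites the old value)
sdb.putAttributes(new PutAttributesRequest(myDomain, "Item_01",
        Arrays.asList(new ReplaceableAttribute("Name", "Cathair Sweater v2", true))));

// Delete the whole domain, which also removes the remaining items
sdb.deleteDomain(new DeleteDomainRequest(myDomain));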

Problem 6.4 Solution

The MapReduce code for matrix multiplication references the implementation at http://www.norstad.org/matrixmultiply/index.html. Assume the matrix multiplication is A*B=C, in which A, B, and C are all N*N integer matrices. Each matrix is divided into nearly equal blocks, one group of blocks per node in the cluster; for instance, an N*N matrix is divided into 2*2 blocks for a cluster with 4 nodes. Mapper nodes partition the input matrices, while reducer nodes perform the actual matrix multiplication. Four implementation strategies for the reducer nodes are listed below; strategies one through three require submitting two jobs, while strategy four submits only one job to the cluster.
1. Each reducer performs just one block multiplication.
2. Each reducer multiplies a single A block with a whole row of B blocks.
3. Each reducer multiplies a single B block with a whole column of A blocks.
4. Each reducer computes the final blocks of the product matrix C.

The experiment is conducted in Java on Hadoop clusters with 4 and 16 nodes on EC2, respectively. The Hadoop cluster is configured by the following steps:
1. Use the Apache Whirr script to automatically provision the cluster on EC2.
2. Set up a proxy VM instance to submit jobs to the JobTracker in the cluster.
3. Upload the source code to the proxy VM instance through FileZilla, or download the files from an S3 bucket using s3cmd.
4. Submit the matrix multiplication job to the cluster and record the execution time.

The provisioned Hadoop cluster with 16 nodes is shown in Fig. 2, and the code execution on the Hadoop cluster is shown in Fig. 3.

Table 1 Measured execution time in seconds for 10000*10000 integer matrices.
# of nodes in cluster | Strategy 1 | Strategy 2 | Strategy 3 | Strategy 4
4 nodes               | 228s       | 227s       | 198s       | 118s
16 nodes              | 370s       | 385s       | 412s       | 199s
62 nodes              |            |            |            |

Note: I have submitted a request to AWS to lift the limit of 20 provisioned instances. Now, I can provision up to 1024 VM instances.

Fig. 2 Provisioned 16-node Hadoop cluster on EC2.
Fig. 3 Execution of MapReduce job on the Hadoop cluster on EC2.
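The partitioning and per-block multiplication code is not reproduced in the report. As a rough, standalone illustration of the block decomposition described above (not taken from the submitted code; all class and method names here are hypothetical), the sketch below splits an N*N matrix into a nearly equal grid of blocks and multiply-accumulates block pairs, which is essentially the per-key work a reducer performs in strategy 4.

// Standalone Java sketch of block decomposition and per-block multiply-accumulate.
public class BlockMatrixSketch {

    // Start index of each block along one dimension, plus n as a sentinel,
    // so that block b covers rows/columns [bounds[b], bounds[b+1]).
    static int[] blockBounds(int n, int blocksPerDim) {
        int[] bounds = new int[blocksPerDim + 1];
        for (int b = 0; b <= blocksPerDim; b++) {
            bounds[b] = (int) ((long) b * n / blocksPerDim);  // nearly equal block sizes
        }
        return bounds;
    }

    // C(iBlock, jBlock) += A(iBlock, kBlock) * B(kBlock, jBlock), working directly
    // on the full matrices via the block boundaries computed above.
    static void multiplyAccumulate(int[][] a, int[][] b, int[][] c, int[] bounds,
                                   int iBlock, int jBlock, int kBlock) {
        for (int i = bounds[iBlock]; i < bounds[iBlock + 1]; i++) {
            for (int k = bounds[kBlock]; k < bounds[kBlock + 1]; k++) {
                int aik = a[i][k];
                for (int j = bounds[jBlock]; j < bounds[jBlock + 1]; j++) {
                    c[i][j] += aik * b[k][j];
                }
            }
        }
    }

    public static void main(String[] args) {
        int n = 4, blocksPerDim = 2;   // a 4*4 matrix split into a 2*2 block grid
        int[][] a = new int[n][n], b = new int[n][n], c = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                a[i][j] = i + j;
                b[i][j] = (i == j) ? 1 : 0;   // B = identity, so C should equal A
            }
        }
        int[] bounds = blockBounds(n, blocksPerDim);
        // Accumulate every A(i,k)*B(k,j) block product into C(i,j),
        // which is the work one reducer key covers under strategy 4.
        for (int ib = 0; ib < blocksPerDim; ib++)
            for (int jb = 0; jb < blocksPerDim; jb++)
                for (int kb = 0; kb < blocksPerDim; kb++)
                    multiplyAccumulate(a, b, c, bounds, ib, jb, kb);
        System.out.println("C[1][2] = " + c[1][2]);   // prints 3, equal to A[1][2]
    }
}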

Problem 6.5 Solution

S3 on AWS provides simple file (object) storage. The code below prepares a temporary text file whose name starts with aws-java-sdk- and which contains several lines of sample characters. The text file is then uploaded to S3 and later downloaded from S3.

private static File createSampleFile() throws IOException {
    File file = File.createTempFile("aws-java-sdk-", ".txt");
    file.deleteOnExit();

    Writer writer = new OutputStreamWriter(new FileOutputStream(file));
    writer.write("abcdefghijklmnopqrstuvwxyz\n");
    writer.write("01234567890112345678901234\n");
    writer.write("!@#$%^&*()-=[]{;':',.<>/?\n");
    writer.write("01234567890112345678901234\n");
    writer.write("abcdefghijklmnopqrstuvwxyz\n");
    writer.close();

    return file;
}

The implemented operations on the S3 file system are:
1. create a bucket
2. list the buckets in one account
3. upload an object into the bucket
4. download an object from the bucket
5. list the objects in one bucket
6. delete the bucket

The code segment shown below uploads an object to the S3 bucket. The execution output is shown in Fig. 4.

System.out.println("Uploading a new object to S3 from a file\n");
s3.putObject(new PutObjectRequest(bucketName, key, createSampleFile()));

Fig. 4 Execution output of S3 application on AWS.
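The bucket listing, download, object listing, and cleanup operations are not reproduced in the report. A minimal sketch of how they might look with the same AWS SDK for Java S3 client (s3) is shown below; bucketName and key are the variables used above, and the statements are illustrative rather than copied from the submitted code (they assume the usual com.amazonaws.services.s3.model imports).

// List all buckets in the account
for (Bucket bucket : s3.listBuckets()) {
    System.out.println("  " + bucket.getName());
}

// Download the object that was just uploaded
S3Object object = s3.getObject(new GetObjectRequest(bucketName, key));
System.out.println("Content-Type: " + object.getObjectMetadata().getContentType());

// List the objects stored in the bucket
ObjectListing listing = s3.listObjects(new ListObjectsRequest().withBucketName(bucketName));
for (S3ObjectSummary summary : listing.getObjectSummaries()) {
    System.out.println("  " + summary.getKey() + " (size = " + summary.getSize() + ")");
}

// Delete the object and then the (now empty) bucket
s3.deleteObject(bucketName, key);
s3.deleteBucket(bucketName);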

Problem 6.15 Solution

The MapReduce programming model has simplified the implementation of many data-parallel applications. Its programming model is based on a bipartite graph. However, it has limitations when applied to certain kinds of applications. Twister provides enhancements, including:
1. a distinction between static and variable data
2. configurable long-running map/reduce tasks
3. message-based communication
4. support for iterative MapReduce computations
5. a combine phase to collect all outputs
6. data access via the local disk
7. a lightweight runtime

Problem 6.16 Solution

The original code in the textbook has many semantic errors, so it is modified as shown below to clear the compilation errors. The OSCountMapper class is a subclass of MapReduceBase and implements the Mapper interface. The input key-value pair is <LongWritable, Text> and the output key-value pair is <Text, IntWritable>. The map function locates the user-agent token, extracts the substring between '(' and the following ';', and adds the generated key-value pair to the output collector.

public class OSCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        Text userInfo = new Text();
        Text osVersion = new Text();
        int startIndex = 0;
        int endIndex = 0;
        int i = 0;

        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);

        // Advance to the token that carries the OS information
        while (tokenizer.hasMoreTokens() && i != 8) {
            i++;
            userInfo.set(tokenizer.nextToken());
        }

        // Scan the token: record the position just after '(' and stop at ';'
        i = 0;
        while (userInfo.charAt(i) != ';') {
            if (userInfo.charAt(i) == '(') {
                startIndex = i + 1;
            }
            i++;
            endIndex = i;
        }

        osVersion.set(userInfo.toString().substring(startIndex, endIndex));
        output.collect(osVersion, new IntWritable(1));
    }
}

The OSCountReducer and main OSCount classes are also modified, as shown below, to clear the compilation errors. The reducer iterates over all values associated with one key and counts them; the count is stored in the variable sum. In the main class, the job configuration sets the mapper, combiner, and reducer classes during initialization, and the file input and output directories are also specified.

public class OSCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

public class OSCount {

    /**
     * @param args
     */
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(new Configuration(), OSCount.class);
        conf.setJobName("OSCount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(OSCountMapper.class);
        conf.setCombinerClass(OSCountReducer.class);
        conf.setReducerClass(OSCountReducer.class);

        // conf.setInputFormat(TextInputFormat.class);
        // conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

Since the problem does not provide an input data set, only the execution of the program on the Hadoop cluster is shown in Fig. 5.

Fig. 5 Execution output of sample application on the terminal.
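As a side note on the mapper's extraction step, the tiny standalone snippet below (not part of the submitted solution; the token is hypothetical) shows the same substring extraction on a single user-agent token, using indexOf instead of the character loop.

// Illustrative only: extract the text between '(' and ';' from one token.
public class OSExtractDemo {
    public static void main(String[] args) {
        String userInfo = "(Windows;";                 // assumed 9th token of a log line
        int startIndex = userInfo.indexOf('(') + 1;    // first character after '('
        int endIndex = userInfo.indexOf(';');          // stop before ';'
        System.out.println(userInfo.substring(startIndex, endIndex));  // prints "Windows"
    }
}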

Problem 9.3 Solution

There are three types of RFID tags: active RFID tags, which contain a battery and transmit signals autonomously; passive RFID tags, which do NOT have a battery and require an external source to provoke communication; and battery-assisted passive RFID tags, which require an external source to wake up the battery.
1. Active and semi-active tags have a battery and can transmit over 30 to 100 meters. They are more costly than passive RFID tags.
2. Passive RFID tags have no battery source and can only transmit up to about 20 feet. However, they are cheap and disposable.

Similarly, there are two types of GPS tracking systems, namely passive and active.
1. A passive GPS tracker is just a receiver and is primarily used for data recording. Passive GPS devices store GPS location data in their internal memory and are cheaper than active GPS devices.
2. An active GPS device can transmit its data in real time, at regular intervals, through cellular or satellite communication.

Problem 9.6 Solution

The IoT (Internet of Things) refers to the network interconnection of everyday objects, tools, devices, and computers, whereas the traditional Internet connects computers. With the development of RFID and GPS technology, all things in our daily life can be tagged and connected, no matter where and when the object is. The IoT has an event-driven architecture, as shown in Fig. 9.15 in the textbook. The top layer is formed by the driven applications, which include retailing and supply-chain management, logistics services, the smart grid, smart buildings, etc. The bottom layer represents various types of sensing devices, namely RFID tags, ZigBee devices, GPS navigators, etc. These sensors are widely connected and collect real-time information. The cloud computing platform in the middle processes the collected information and generates intelligence for decision-making. Many technologies can be applied to build the IoT infrastructure; they are divided into two categories, enabling and synergistic technologies. Toward 2020, the IoT will be deployed on a global scale and will significantly upgrade national economies and quality of life.