EE657 Spring 2012 HW#4 Zhou Zhao

Problem 6.3 Solution

Referencing the sample SimpleDB application in the Amazon Java SDK, a simple domain containing 5 items is prepared in the code. For instance, the first item has 7 attributes, namely Category, Subcategory, Name, etc.

sampleData.add(new ReplaceableItem("Item_01").withAttributes(
    new ReplaceableAttribute("Category", "Clothes", true),
    new ReplaceableAttribute("Subcategory", "Sweater", true),
    new ReplaceableAttribute("Name", "Cathair Sweater", true),
    new ReplaceableAttribute("Color", "Siamese", true),
    new ReplaceableAttribute("Size", "Small", true),
    new ReplaceableAttribute("Size", "Medium", true),
    new ReplaceableAttribute("Size", "Large", true)));

A variety of database operations are implemented, as listed below:
1. create a domain
2. list existing domains
3. put data into one of the domains
4. select data from a domain
5. delete values from an attribute
6. delete an attribute
7. replace an attribute
8. delete an item and a domain

The code shown below corresponds to the operations of creating a domain, listing existing domains, and putting data into a domain, respectively.

// Create a domain
String myDomain = "MyStore2";
System.out.println("Creating domain called " + myDomain + ".\n");
sdb.createDomain(new CreateDomainRequest(myDomain));

// List domains
System.out.println("Listing all domains in your account:\n");
for (String domainName : sdb.listDomains().getDomainNames()) {
    System.out.println("  " + domainName);
}
System.out.println();

// Put data into a domain
System.out.println("Putting data into " + myDomain + " domain.\n");
sdb.batchPutAttributes(new BatchPutAttributesRequest(myDomain, createSampleData()));

The execution output on the terminal is shown in Fig. 1.
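Operation 4 (selecting data from a domain) is driven by a SimpleDB select expression, which is just a SQL-like string handed to a SelectRequest. A minimal sketch of building such an expression; the helper name buildSelect and the filter values are my own illustrative assumptions:

```java
public class SelectExpressionDemo {
    // Builds a SimpleDB select expression; in the real application the
    // resulting string would be passed to new SelectRequest(expression)
    // and executed with sdb.select(...).
    static String buildSelect(String domain, String attribute, String value) {
        return "select * from `" + domain + "` where " + attribute
                + " = '" + value + "'";
    }

    public static void main(String[] args) {
        // Query the sample domain for all clothing items
        System.out.println(buildSelect("MyStore2", "Category", "Clothes"));
        // prints: select * from `MyStore2` where Category = 'Clothes'
    }
}
```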
Fig. 1 Execution output of SimpleDB application on AWS.

Problem 6.4 Solution

The MapReduce implementation of matrix multiplication references http://www.norstad.org/matrixmultiply/index.html. Assume the matrix multiplication is A*B=C, in which A, B, and C are all N*N integer matrices. Each matrix is divided into nearly equal blocks, one per node in the cluster; for instance, an N*N matrix is divided into 2*2 blocks for a cluster with 4 nodes. Mapper nodes partition the input matrices, while reducer nodes perform the actual matrix multiplication. Four implementation strategies for the reducer nodes are presented below. Strategies one to three need to submit two jobs, while strategy four only needs to submit one job to the cluster.
1. Each reducer does just one block multiplication.
2. Each reducer multiplies a single A block with an entire row of B blocks.
3. Each reducer multiplies a single B block with an entire column of A blocks.
4. Each reducer computes a final block of the product matrix C.

The experiment is conducted in Java on Hadoop clusters with 4 and 16 nodes on EC2, respectively. The Hadoop cluster is configured by the following steps:
1. Use the Apache Whirr script to automatically provision the cluster on EC2.
2. Set up a proxy VM instance to submit jobs to the JobTracker in the cluster.
3. Upload the source code to the proxy VM instance through FileZilla, or download the files from an S3 bucket using s3cmd.
4. Submit the matrix multiplication job to the cluster and record the execution time.

The provisioned 16-node Hadoop cluster is shown in Fig. 2, and the code execution on the Hadoop cluster is shown in Fig. 3.

Table 1 Measured execution time in seconds for 10000*10000 integer matrices.

# of nodes in cluster | Strategy 1 | Strategy 2 | Strategy 3 | Strategy 4
4 nodes               | 228s       | 227s       | 198s       | 118s
16 nodes              | 370s       | 385s       | 412s       | 199s
62 nodes              |            |            |            |

Note: I have submitted a request to AWS to lift the limit of 20 provisioned instances; now I can provision up to 1024 VM instances.

Fig. 2 Provisioned 16-node Hadoop cluster on EC2.
Fig. 3 Execution of a MapReduce job on a Hadoop cluster on EC2.
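The block decomposition shared by all four strategies can be illustrated with a toy, in-memory Java version (class and method names are my own; the real jobs move blocks between mappers and reducers as key-value pairs). Block (I,J) of C accumulates the products A(I,K)*B(K,J) over the block index K; the innermost triple loop below is the single block multiplication that a strategy-1 reducer performs.

```java
public class BlockMatMul {
    // Multiplies two n x n matrices using a blocks x blocks decomposition,
    // accumulating block C[I][J] += A[I][K] * B[K][J] over block index K.
    static int[][] multiply(int[][] a, int[][] b, int blocks) {
        int n = a.length;
        int s = n / blocks;              // block side length (assume it divides n)
        int[][] c = new int[n][n];
        for (int bi = 0; bi < blocks; bi++)
            for (int bj = 0; bj < blocks; bj++)
                for (int bk = 0; bk < blocks; bk++)
                    // one "reducer task" in strategy 1: one block multiplication
                    for (int i = bi * s; i < (bi + 1) * s; i++)
                        for (int j = bj * s; j < (bj + 1) * s; j++)
                            for (int k = bk * s; k < (bk + 1) * s; k++)
                                c[i][j] += a[i][k] * b[k][j];
        return c;
    }

    public static void main(String[] args) {
        int[][] a = {{1, 2}, {3, 4}};
        int[][] b = {{5, 6}, {7, 8}};
        int[][] c = multiply(a, b, 2);   // 2*2 blocks, each of size 1
        System.out.println(c[0][0] + " " + c[0][1]); // 19 22
        System.out.println(c[1][0] + " " + c[1][1]); // 43 50
    }
}
```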
Problem 6.5 Solution

S3 on AWS provides simple file storage. The code below prepares a temporary text file named aws-java-sdk-*.txt containing a few lines of sample text. The text file is then uploaded to S3 and downloaded back from S3.

private static File createSampleFile() throws IOException {
    File file = File.createTempFile("aws-java-sdk-", ".txt");
    file.deleteOnExit();
    Writer writer = new OutputStreamWriter(new FileOutputStream(file));
    writer.write("abcdefghijklmnopqrstuvwxyz\n");
    writer.write("01234567890112345678901234\n");
    writer.write("!@#$%^&*()-=[]{;':',.<>/?\n");
    writer.write("01234567890112345678901234\n");
    writer.write("abcdefghijklmnopqrstuvwxyz\n");
    writer.close();
    return file;
}

The operations implemented on the S3 file system are:
1. create a bucket
2. list the buckets in one account
3. upload an object into the bucket
4. download an object from the bucket
5. list the objects in one bucket
6. delete the bucket

The code shown below is the segment that uploads an object to the S3 bucket. The execution output is shown in Fig. 4.

System.out.println("Uploading a new object to S3 from a file\n");
s3.putObject(new PutObjectRequest(bucketName, key, createSampleFile()));
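One way to check the upload/download round trip is to compare a local MD5 digest with the ETag S3 reports for the object (for a simple, non-multipart PUT the ETag is the hex MD5 of the content). A minimal JDK-only sketch; the helper name md5Hex is my own:

```java
import java.security.MessageDigest;

public class Md5Check {
    // Returns the lowercase hex MD5 of the given bytes, comparable to the
    // ETag that S3 reports for a simple (non-multipart) PUT.
    static String md5Hex(byte[] data) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Well-known test vector: MD5("abc")
        System.out.println(md5Hex("abc".getBytes("UTF-8")));
        // prints: 900150983cd24fb0d6963f7d28e17f72
    }
}
```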
Fig. 4 Execution output of S3 application on AWS.

Problem 6.15 Solution

The MapReduce programming model has simplified the implementation of many data-parallel applications. Its programming model is based on a bipartite graph. However, it has limitations when applied to certain kinds of applications. Twister provides enhanced features, including:
1. a distinction between static and variable data
2. configurable long-running map/reduce tasks
3. message-based communication
4. support for iterative MapReduce computations
5. a combine phase to collect all outputs
6. data access via local disk
7. a lightweight design

Problem 6.16 Solution

The original code in the textbook has many semantic errors. Thus, the code is modified as shown below to clear the compile errors. The class OSCountMapper is a subclass of MapReduceBase that implements the Mapper interface. The input key-value pair is <LongWritable, Text> and the output key-value pair is <Text, IntWritable>. The map function locates the token holding the OS name and extracts the substring between '(' and the first ';'. Finally, the generated key-value pair is added to the collector.

public class OSCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        Text userInfo = new Text();
        Text osVersion = new Text();
        int startIndex = 0;
        int endIndex = 0;
        int i = 0;
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        // Advance to the token that carries the OS portion of the user agent
        while (tokenizer.hasMoreTokens() && i != 8) {
            i++;
            userInfo.set(tokenizer.nextToken());
        }
        i = 0;
        // Scan up to the first ';', remembering the position just after '('
        while (i < userInfo.getLength() && userInfo.charAt(i) != ';') {
            if (userInfo.charAt(i) == '(') {
                startIndex = i + 1;
            }
            i++;
        }
        endIndex = i;
        osVersion.set(userInfo.toString().substring(startIndex, endIndex));
        output.collect(osVersion, new IntWritable(1));
    }
}

The class OSCountReducer is also modified as shown below to clear the compile errors. The reducer iterates over all the values corresponding to one key and counts the number of values.
The count result is stored in the variable sum. The main class OSCount is also modified. During initialization, the job configures the mapper, combiner, and reducer classes. The file input and output directories are also specified.
public class OSCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

public class OSCount {
    /**
     * @param args input path and output path
     */
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(new Configuration(), OSCount.class);
        conf.setJobName("oscount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(OSCountMapper.class);
        conf.setCombinerClass(OSCountReducer.class);
        conf.setReducerClass(OSCountReducer.class);
        // conf.setInputFormat(TextInputFormat.class);
        // conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

Since the problem lacks an input data set, only the execution of the program on the Hadoop cluster is shown in Fig. 5.
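The substring extraction performed in the map function can be exercised outside Hadoop. A small pure-Java sketch of the same parsing idea (the sample tokens and the helper name extractOS are hypothetical):

```java
public class OSParseDemo {
    // Extracts the text between '(' and the first following ';' (or ')'),
    // mirroring the scan done in OSCountMapper over a user-agent-style token.
    static String extractOS(String token) {
        int start = token.indexOf('(') + 1;
        int end = token.indexOf(';', start);
        if (end < 0) end = token.indexOf(')', start);
        if (end < 0) end = token.length();
        return token.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(extractOS("(Windows;"));   // prints: Windows
        System.out.println(extractOS("(X11;"));       // prints: X11
        System.out.println(extractOS("(Macintosh)")); // prints: Macintosh
    }
}
```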
Fig. 5 Execution output of the sample application on the terminal.

Problem 9.3 Solution

There are three types of RFID tags: active RFID tags, which contain a battery and transmit signals autonomously; passive RFID tags, which do NOT have a battery and require an external source to provoke communication; and battery-assisted passive RFID tags, which require an external source to wake up the battery.
1. Active and semi-active tags have a battery and can transmit over 30 to 100 meters. They are more costly than passive RFID tags.
2. Passive RFID tags have no battery source and can only transmit up to 20 feet. However, they are cheap and disposable.

Similarly, there are two types of GPS tracking systems, namely passive and active.
1. A passive GPS device is just a receiver and is primarily used for data recording. Passive GPS devices store GPS location data in their internal memory and are cheaper than active GPS devices.
2. An active GPS device can transmit data back through satellite or cellular communication. An active GPS device can send its data at regular time intervals in real time.

Problem 9.6 Solution

The IoT (Internet of Things) refers to the network interconnection of everyday objects, tools, devices, and computers, while the traditional Internet connects computers. With the development of RFID and GPS technology, all things in our daily life can be tagged and connected no matter where and when the object is. The IoT has an event-driven architecture, as shown in Fig. 9.15 in the textbook. The top layer is formed by driving applications, which include retailing and supply-chain management, logistics services, the smart grid, smart buildings, etc. The bottom layer consists of various types of sensor devices, namely RFID tags, ZigBee devices, GPS navigators, etc. These sensors are widely connected and collect real-time information. The cloud computing platform in the middle processes the collected information and generates intelligence for decision-making.
Many technologies can be applied to build the IoT infrastructure; they are divided into two categories, enabling and synergistic technologies. Toward 2020, the IoT will be deployed on a global scale and will significantly upgrade national economies and quality of life.