Cloud Programming on Java EE Platforms. mgr inż. Piotr Nowak


1 Cloud Programming on Java EE Platforms mgr inż. Piotr Nowak

2 dsh
- distributed shell commands execution
- -c: concurrent
- --show-machine-names, -M
- --group cluster, -g cluster: /etc/dsh/groups/cluster
- needs passwordless ssh:
    ssh-copy-id -i ~/.ssh/id_rsa.pub

3 Vagrant
- how to download a box before vagrant init?
    vagrant box add ubuntu/trusty64 vagrant-disk1.box
- initialize environment
    vagrant init ubuntu/trusty64
    vagrant up
- stop machine
    vagrant halt
- destroy
    vagrant destroy
  or
    vagrant destroy -f
- access with ssh
    vagrant ssh
- every command accepts machine name
    vagrant destroy -f mymachinename

4 Vagrantfile produced with vagrant init

    # -*- mode: ruby -*-
    # vi: set ft=ruby :

    # All Vagrant configuration is done below. The "2" in Vagrant.configure
    # configures the configuration version (we support older styles for
    # backwards compatibility). Please don't change it unless you know what
    # you're doing.
    Vagrant.configure("2") do |config|
      # The most common configuration options are documented and commented below.
      # For a complete reference, please see the online documentation at
      #

      # Every Vagrant development environment requires a box. You can search for
      # boxes at
      config.vm.box = "base"

      # Disable automatic box update checking. If you disable this, then
      # boxes will only be checked for updates when the user runs
      # `vagrant box outdated`. This is not recommended.
      # config.vm.box_check_update = false

      # Create a forwarded port mapping which allows access to a specific port
      # within the machine from a port on the host machine. In the example below,
      # accessing "localhost:8080" will access port 80 on the guest machine.
      # config.vm.network "forwarded_port", guest: 80, host: 8080

      # Create a private network, which allows host-only access to the machine
      # using a specific IP.
      # config.vm.network "private_network", ip: " "

      # Create a public network, which generally matched to bridged network.
      # Bridged networks make the machine appear as another physical device on
      # your network.
      # config.vm.network "public_network"

      # Share an additional folder to the guest VM. The first argument is
      # the path on the host to the actual folder. The second argument is
      # the path on the guest to mount the folder. And the optional third
      # argument is a set of non-required options.
      # config.vm.synced_folder "../data", "/vagrant_data"

      # Provider-specific configuration so you can fine-tune various
      # backing providers for Vagrant. These expose provider-specific options.
      # Example for VirtualBox:
      #
      # config.vm.provider "virtualbox" do |vb|
      #   # Display the VirtualBox GUI when booting the machine
      #   vb.gui = true
      #
      #   # Customize the amount of memory on the VM:
      #   vb.memory = "1024"
      # end
      #
      # View the documentation for the provider you are using for more
      # information on available options.

      # Define a Vagrant Push strategy for pushing to Atlas. Other push strategies
      # such as FTP and Heroku are also available. See the documentation at
      # for more information.
      # config.push.define "atlas" do |push|
      #   push.app = "YOUR_ATLAS_USERNAME/YOUR_APPLICATION_NAME"
      # end

      # Enable provisioning with a shell script. Additional provisioners such as
      # Puppet, Chef, Ansible, Salt, and Docker are also available. Please see the
      # documentation for more information about their specific syntax and use.
      # config.vm.provision "shell", inline: <<-SHELL
      #   apt-get update
      #   apt-get install -y apache2
      # SHELL
    end

5 Vagrantfile

    # -*- mode: ruby -*-
    # vi: set ft=ruby :

    # One hash per machine; the :eth1 address is blank in the source slide.
    boxes = [
      {
        :name => "mymachinename",
        :eth1 => " ",
        :mem => "4096",
        :cpu => "2",
        :box => "ubuntu/trusty64",
        :files => [
          { :src => "files/file.tar", :dst => "files/file.tar" },
          { :src => "files/dir", :dst => "files/dir" }
        ],
        :shells => [ "shells/apt-update.sh" ],
        :ports_forward => [
          { :guest => "56789", :host => "56789" }
        ]
      }
    ]

    Vagrant.configure(2) do |config|
      boxes.each do |opts|
        config.vm.define opts[:name] do |config|
          config.vbguest.auto_update = false
          config.vm.provider "virtualbox" do |v, override|
            override.vm.box = opts[:box]
          end
          config.vm.hostname = opts[:name]
          config.vm.provider "virtualbox" do |v|
            v.customize ["modifyvm", :id, "--memory", opts[:mem]]
            v.customize ["modifyvm", :id, "--cpus", opts[:cpu]]
          end
          config.vm.network :private_network, ip: opts[:eth1]
          opts[:ports_forward].each do |item|
            config.vm.network "forwarded_port", guest: item[:guest], host: item[:host]
          end
          # Provisioners stay inside the define block, where opts is in scope.
          opts[:files].each do |fileitem|
            config.vm.provision :file do |file|
              file.source = fileitem[:src]
              file.destination = fileitem[:dst]
            end
          end
          opts[:shells].each do |item|
            config.vm.provision :shell, path: item
          end
        end
      end
    end

8 Hadoop
Limits: if Hadoop fails unexpectedly on your job, run the job again and use htop to monitor system resources and verify that there is enough RAM. 2048 MB of RAM may not be enough.

9 Hadoop HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files.

10 HDFS Architecture The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop Core project. The project URL is

11 HDFS : Large Data Sets Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.

12 Moving Computation is Cheaper than Moving Data A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimises network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.

13 Namenode The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.

14 DataNode Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

15 Replication HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.
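
The block and replication arithmetic described on this slide can be sketched in plain Java. This is only an illustration under stated assumptions: the 300 MB file, the 128 MB block size and the replication factor of 3 are example values that the slides do not give, and `BlockMath` is a hypothetical helper, not part of the Hadoop API.

```java
// Hypothetical helper illustrating HDFS block arithmetic; not a Hadoop class.
final class BlockMath {
    /** Number of blocks for a file: every block is full except possibly the last. */
    static long blockCount(long fileBytes, long blockBytes) {
        if (fileBytes == 0) return 0;
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    /** Size of the last block, which may be smaller than the others. */
    static long lastBlockBytes(long fileBytes, long blockBytes) {
        if (fileBytes == 0) return 0;
        long rem = fileBytes % blockBytes;
        return rem == 0 ? blockBytes : rem;
    }

    /** Raw bytes stored in the cluster when every block is kept `replication` times. */
    static long rawStorageBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long file = 300 * mb;   // assumed file size
        long block = 128 * mb;  // assumed block size
        System.out.println(blockCount(file, block));          // 3
        System.out.println(lastBlockBytes(file, block) / mb); // 44
        System.out.println(rawStorageBytes(file, 3) / mb);    // 900
    }
}
```

With these assumed values the file occupies two full blocks plus one 44 MB block, and three replicas consume 900 MB of raw storage.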

16 YARN YARN is the component responsible for allocating containers to run tasks, coordinating the execution of those tasks, and restarting them in case of failure, among other housekeeping. Just like HDFS, it has two main components: a ResourceManager, which keeps track of the cluster resources, and NodeManagers, one on each node, which communicate with the ResourceManager and set up containers for the execution of tasks.

17 Hadoop MapReduce - API
- input: a directory with a set of files
- output: a directory with a set of result files
- most important methods are:
    map()
    reduce()

18 Hadoop MapReduce - Stages
- Map
- Shuffle
- Reduce

19 Hadoop MapReduce - API

    public class WordCount {

      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          // body omitted on this slide
        }
      }

      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          // body omitted on this slide
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

20 Hadoop MapReduce
- a typical implementation uses the Mapper and Reducer base classes
- most important implemented methods are:
    map()
    reduce()
- class implementing Mapper
- class implementing Shuffling
- class implementing Reducer

21 Map stage
- parsing function
- maps input key/value pairs to a set of intermediate key/value pairs
- one map for each InputSplit generated by InputFormat
- docs/stable/api/org/apache/hadoop/mapreduce/Mapper.html

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit <token, 1> for every whitespace-separated token of the line.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }

For the given sample input the first map emits:
    < Hello, 1>
    < World, 1>
    < Bye, 1>
    < World, 1>
The second map emits:
    < Hello, 1>
    < Hadoop, 1>
    < Goodbye, 1>
    < Hadoop, 1>

22 Map stage How many Maps? depending on number of input items generally maps per-node

23 Stage Between Map and Reduce Phases
- Shuffle & Sort
- groups the output from mappers
- Shuffle can be customized to use encryption with the HTTPS protocol
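
The grouping performed between the two phases can be imitated in memory. The following is a minimal sketch, not Hadoop code: `ShuffleSketch`, `mapWords` and `shuffle` are hypothetical names, and a `TreeMap` stands in for the sort that the framework performs on keys.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of shuffle and sort; not Hadoop code.
final class ShuffleSketch {
    /** Stand-in for a mapper: emits a (word, 1) pair for every token. */
    static List<Map.Entry<String, Integer>> mapWords(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    /** Groups the pairs by key; the TreeMap keeps the keys in sorted order. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // prints {Bye=[1], Hello=[1], World=[1, 1]}
        System.out.println(shuffle(mapWords("Hello World Bye World")));
    }
}
```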

24 Shuffle stage
Combiner:
- Mapper output is combined according to defined rules
- merges duplicates
- executes local aggregation

    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      private IntWritable result = new IntWritable();

      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        // Sum all counts seen for this key and emit a single pair.
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }

The output of the first map:
    < Hello, 1>
    < World, 1>
    < Bye, 1>
    < World, 1>
The output of the second map:
    < Hello, 1>
    < Hadoop, 1>
    < Goodbye, 1>
    < Hadoop, 1>
The combined output of the first map:
    < Bye, 1>
    < Hello, 1>
    < World, 2>
The combined output of the second map:
    < Goodbye, 1>
    < Hadoop, 2>
    < Hello, 1>
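
The local aggregation can be reproduced without a cluster. Below is a small sketch in plain Java, assuming the two sample inputs are the lines "Hello World Bye World" and "Hello Hadoop Goodbye Hadoop" (inferred from the map outputs listed on the slide); `CombinerSketch` and `mapAndCombine` are hypothetical names, not Hadoop API.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of a combiner applied to one mapper's output; not Hadoop code.
final class CombinerSketch {
    /** Counts each token once per occurrence and sums the counts locally, sorted by word. */
    static Map<String, Integer> mapAndCombine(String line) {
        Map<String, Integer> combined = new TreeMap<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) combined.merge(word, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        System.out.println(mapAndCombine("Hello World Bye World"));       // {Bye=1, Hello=1, World=2}
        System.out.println(mapAndCombine("Hello Hadoop Goodbye Hadoop")); // {Goodbye=1, Hadoop=2, Hello=1}
    }
}
```

Because the combiner runs on each mapper's output separately, "Hello" still appears once in each combined output; only the reducer sees both.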

25 Reducer stage
- input data:
  - uses Mapper output as input if there is no class defined to process data after the Mapper
  - otherwise uses data processed by the implemented classes that do the processing after the Mapper stage
- computes final result

    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      private IntWritable result = new IntWritable();

      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }

The reducers receive output of combined maps:
    < Bye, <1>>
    < Hello, <1,1>>
    < World, <2>>
    < Goodbye, <1>>
    < Hadoop, <2>>
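
To see what the reducer computes from that grouped input, the same call shape can be mimicked with ordinary collections. This is a sketch only: `ReduceSketch` is a hypothetical class, `Integer` replaces `IntWritable`, and the return value replaces `context.write`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for IntSumReducer.reduce; not Hadoop code.
final class ReduceSketch {
    /** Sums all values that the shuffle grouped under one key. */
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Grouped input as listed on the slide.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("Bye", Arrays.asList(1));
        grouped.put("Hello", Arrays.asList(1, 1));
        grouped.put("World", Arrays.asList(2));
        grouped.put("Goodbye", Arrays.asList(1));
        grouped.put("Hadoop", Arrays.asList(2));
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
        }
    }
}
```

Under these inputs the totals are Bye 1, Goodbye 1, Hadoop 2, Hello 2 and World 2.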

26 After Map and Reduce
- secondary sort may be used for custom grouping
- Comparator
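
The idea behind secondary sort can be illustrated with ordinary Java comparators. This sketch assumes records of the form (key, value): a sort comparator orders them by key and then by value, and grouping then uses the key alone, so each group receives its values already sorted. `SecondarySortSketch`, `rec` and `sortAndGroup` are hypothetical names; in a real job the equivalent roles are played by the comparators configured on the Job.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of secondary sort; not Hadoop code.
final class SecondarySortSketch {
    /** Sort comparator: natural key first, then value. */
    static final Comparator<Map.Entry<String, Integer>> SORT =
        Map.Entry.<String, Integer>comparingByKey()
            .thenComparing(Map.Entry.<String, Integer>comparingByValue());

    static Map.Entry<String, Integer> rec(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    /** Sorts with SORT, then groups by key only; values arrive in sorted order. */
    static Map<String, List<Integer>> sortAndGroup(List<Map.Entry<String, Integer>> records) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(records);
        sorted.sort(SORT);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> r : sorted) {
            groups.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        return groups;
    }

    public static void main(String[] args) {
        // prints {a=[3], b=[1, 2]}
        System.out.println(sortAndGroup(Arrays.asList(rec("b", 2), rec("a", 3), rec("b", 1))));
    }
}
```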

27 Job
- primary interface for user-job interaction with ResourceManager
- MapReduce job configuration
- setters for Mapper, Combiner, Reducer

defined Job:

    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);

28 Job Control
- complex tasks can be arranged into a chain of jobs
- Job.submit(): submit the job to the cluster and return immediately
- Job.waitForCompletion(boolean): submit the job and wait for it to finish
- the ToolRunner class is useful for remote job execution

29 Job Input
InputFormat:
- validates the input for the job
- splits up input files into InputSplit instances, each assigned to a separate Mapper
- the block size of the FileSystem is the upper bound for splits
- the block size can be defined by the user
- it is recommended to implement RecordReader for InputSplit
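
How the block size bounds the split size can be sketched as follows. The sketch assumes the rule max(minSize, min(maxSize, blockSize)), which is how Hadoop's FileInputFormat.computeSplitSize derives the split size; `SplitSketch` itself is a hypothetical class, and `splitCount` is a simplification that ignores the small slack Hadoop allows on the last split.

```java
// Hypothetical helper illustrating input split sizing; not a Hadoop class.
final class SplitSketch {
    /** Split size bounded by the block size unless minSize or maxSize override it. */
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Simplified number of splits (and therefore map tasks) for one file. */
    static long splitCount(long fileLength, long splitSize) {
        if (fileLength == 0) return 0;
        return (fileLength + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long size = splitSize(128 * mb, 1, Long.MAX_VALUE); // assumed defaults
        System.out.println(size / mb);                      // 128
        System.out.println(splitCount(300 * mb, size));     // 3
    }
}
```

With the assumed defaults the split size equals the block size, so a 300 MB file yields three map tasks in this simplified model.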

30 Job Output
OutputFormat:
- validates the output of the job, for example by checking that the output directory doesn't already exist
- provides a RecordWriter to write the job result files

31 Example code

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Emits <word, 1> for every token of every input line.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Sums the counts for each word; also used as the combiner.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

32 Example code, multiple jobs
Source: f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/Grep.java

    private Grep() {}                               // singleton

    public int run(String[] args) throws Exception {
      if (args.length < 3) {
        System.out.println("Grep <indir> <outdir> <regex> [<group>]");
        ToolRunner.printGenericCommandUsage(System.out);
        return 2;
      }

      // The first job writes its result to a temporary directory.
      Path tempDir =
          new Path("grep-temp-" +
              Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

      Configuration conf = getConf();
      conf.set(RegexMapper.PATTERN, args[2]);
      if (args.length == 4)
        conf.set(RegexMapper.GROUP, args[3]);

      Job grepJob = new Job(conf);

      try {
        // Job 1: count the matches of the regex.
        grepJob.setJobName("grep-search");
        FileInputFormat.setInputPaths(grepJob, args[0]);
        grepJob.setMapperClass(RegexMapper.class);
        grepJob.setCombinerClass(LongSumReducer.class);
        grepJob.setReducerClass(LongSumReducer.class);
        FileOutputFormat.setOutputPath(grepJob, tempDir);
        grepJob.setOutputFormatClass(SequenceFileOutputFormat.class);
        grepJob.setOutputKeyClass(Text.class);
        grepJob.setOutputValueClass(LongWritable.class);
        grepJob.waitForCompletion(true);

        // Job 2: sort the counts, reading the output of job 1.
        Job sortJob = new Job(conf);
        sortJob.setJobName("grep-sort");
        FileInputFormat.setInputPaths(sortJob, tempDir);
        sortJob.setInputFormatClass(SequenceFileInputFormat.class);
        sortJob.setMapperClass(InverseMapper.class);
        sortJob.setNumReduceTasks(1);                 // write a single file
        FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
        sortJob.setSortComparatorClass(               // sort by decreasing freq
            LongWritable.DecreasingComparator.class);
        sortJob.waitForCompletion(true);
      } finally {
        // The slide cuts the listing off here; the upstream example
        // removes the temporary directory before returning.
        FileSystem.get(conf).delete(tempDir, true);
      }
      return 0;
    }

33 Run example

34 Links
- MapReduceTutorial.html
- research.google.com/en//archive/mapreduceosdi04.pdf

35 Samples
- Hadoop examples source: hadoop-mapreduce-project/hadoop-mapreduce-examples

36 Build a customized sample
- generate project, use
    mvn archetype:generate -DgroupId=com.mycompany.app -DartifactId=my-app -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
- copy the example source to the generated project, add the MapReduce app (this is from the cloned hadoop repo in /vagrant)
    cp /vagrant/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/WordCount.java /vagrant/my-app/src/main/java/com/mycompany/app/
- add missing packages to the generated project, follow poms from the hadoop repo
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.6.4</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.6.4</version>
    </dependency>
- build package, use
    mvn package
- start hadoop: run scripts/hadoop_start.sh from hadoop-multi-node
    vagrant@hadoop-master:~/scripts$ sh hadoop_start.sh
- WordCount needs:
    sh hdfs_mkdir_input.sh; sh hdfs_put_input.sh
- start your example
    vagrant@hadoop-master:~/scripts$ hadoop jar target/my-app-1.0-SNAPSHOT.jar com.mycompany.app.WordCount input/hadoop output

37 Connect with hadoop
- Configuration class
- Example: f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/HDFSConcat.java

    private final static String def_uri = "hdfs://localhost:9000";

    Configuration conf = new Configuration();
    String uri = conf.get("fs.default.name", def_uri);
    Path path = new Path(uri);
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(path.toUri(), conf);

38 Connect with hadoop
- Configuration class
- Example: b6f66b0da1cc77f4e a008b4bd7e1a752/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/WordStandardDeviation.java

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path(path, "part-r-00000");
    if (!fs.exists(file))
      throw new IOException("Output not found!");

39 Connect to the hadoop WebHDFS REST API
- Example request (the host part of the URL is missing in the source):
    curl -i -L ":50070/webhdfs/v1/user/vagrant/input/hadoop/?op=LISTSTATUS&namenoderpcaddress=hadoop-master:9000"
- when hdfs-site.xml is configured as follows:

40 Connect to the hadoop WebHDFS REST API
- Example request when hdfs-site.xml contains dfs.namenode.rpc-address:
    <property>
      <name>dfs.namenode.rpc-address</name>
      <value>hadoop-master:9000</value>
    </property>

    curl -i -L :50070/webhdfs/v1/user/vagrant/input/hadoop/?op=LISTSTATUS
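
The same request URL can be assembled from Java. This is a minimal sketch that only builds the URI and sends nothing: `WebHdfsUrlSketch` is a hypothetical class, and using hadoop-master as the HTTP host is an assumption, since the host part of the URL is missing on the slides (only port 50070 and the path are given).

```java
import java.net.URI;

// Hypothetical helper that builds a WebHDFS LISTSTATUS URL; it performs no network I/O.
final class WebHdfsUrlSketch {
    /** hdfsPath must start with '/', for example "/user/vagrant/input/hadoop/". */
    static URI listStatus(String host, int port, String hdfsPath) {
        return URI.create("http://" + host + ":" + port + "/webhdfs/v1" + hdfsPath + "?op=LISTSTATUS");
    }

    public static void main(String[] args) {
        // Assumed host name; port and path are the ones shown on the slide.
        System.out.println(listStatus("hadoop-master", 50070, "/user/vagrant/input/hadoop/"));
    }
}
```

The resulting URI can be passed to curl or to any HTTP client; the response is the JSON directory listing returned by WebHDFS.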
