Cloud Programming on Java EE Platforms. mgr inż. Piotr Nowak
1 Cloud Programming on Java EE Platforms mgr inż. Piotr Nowak
2 dsh
distributed shell commands execution
    -c  concurrent
    --show-machine-names, -M
    --group cluster, -g cluster
    /etc/dsh/groups/cluster
needs passwordless ssh
    ssh-copy-id -i ~/.ssh/id_rsa.pub
3 Vagrant
how to download a box before vagrant init?
    vagrant box add ubuntu/trusty64 vagrant-disk1.box
initialize environment
    vagrant init ubuntu/trusty64
    vagrant up
stop machine
    vagrant halt
destroy
    vagrant destroy
    or
    vagrant destroy -f
access with ssh
    vagrant ssh
every command accepts machine name
    vagrant destroy -f mymachinename
4 Vagrantfile produced with vagrant init
    # -*- mode: ruby -*-
    # vi: set ft=ruby :

    # All Vagrant configuration is done below. The "2" in Vagrant.configure
    # configures the configuration version (we support older styles for
    # backwards compatibility). Please don't change it unless you know what
    # you're doing.
    Vagrant.configure("2") do |config|
      # The most common configuration options are documented and commented below.
      # For a complete reference, please see the online documentation at
      #

      # Every Vagrant development environment requires a box. You can search for
      # boxes at
      config.vm.box = "base"

      # Disable automatic box update checking. If you disable this, then
      # boxes will only be checked for updates when the user runs
      # `vagrant box outdated`. This is not recommended.
      # config.vm.box_check_update = false

      # Create a forwarded port mapping which allows access to a specific port
      # within the machine from a port on the host machine. In the example below,
      # accessing "localhost:8080" will access port 80 on the guest machine.
      # config.vm.network "forwarded_port", guest: 80, host: 8080

      # Create a private network, which allows host-only access to the machine
      # using a specific IP.
      # config.vm.network "private_network", ip: " "

      # Create a public network, which generally matched to bridged network.
      # Bridged networks make the machine appear as another physical device on
      # your network.
      # config.vm.network "public_network"

      # Share an additional folder to the guest VM. The first argument is
      # the path on the host to the actual folder. The second argument is
      # the path on the guest to mount the folder. And the optional third
      # argument is a set of non-required options.
      # config.vm.synced_folder "../data", "/vagrant_data"

      # Provider-specific configuration so you can fine-tune various
      # backing providers for Vagrant. These expose provider-specific options.
      # Example for VirtualBox:
      #
      # config.vm.provider "virtualbox" do |vb|
      #   # Display the VirtualBox GUI when booting the machine
      #   vb.gui = true
      #
      #   # Customize the amount of memory on the VM:
      #   vb.memory = "1024"
      # end
      #
      # View the documentation for the provider you are using for more
      # information on available options.

      # Define a Vagrant Push strategy for pushing to Atlas. Other push strategies
      # such as FTP and Heroku are also available. See the documentation at
      # for more information.
      # config.push.define "atlas" do |push|
      #   push.app = "YOUR_ATLAS_USERNAME/YOUR_APPLICATION_NAME"
      # end

      # Enable provisioning with a shell script. Additional provisioners such as
      # Puppet, Chef, Ansible, Salt, and Docker are also available. Please see the
      # documentation for more information about their specific syntax and use.
      # config.vm.provision "shell", inline: <<-SHELL
      #   apt-get update
      #   apt-get install -y apache2
      # SHELL
    end
5 Vagrantfile
    # -*- mode: ruby -*-
    # vi: set ft=ruby :

    # One hash per virtual machine; every key is read in the loop below.
    boxes = [
      {
        :name => "mymachinename",
        :eth1 => " ",
        :mem => "4096",
        :cpu => "2",
        :box => "ubuntu/trusty64",
        :files => [
          { :src => "files/file.tar", :dst => "files/file.tar" },
          { :src => "files/dir", :dst => "files/dir" }
        ],
        :shells => [ "shells/apt-update.sh" ],
        :ports_forward => [
          { :guest => "56789", :host => "56789" }
        ]
      }
    ]

    Vagrant.configure(2) do |config|
      boxes.each do |opts|
        config.vm.define opts[:name] do |config|
          config.vbguest.auto_update = false
          config.vm.provider "virtualbox" do |v, override|
            override.vm.box = opts[:box]
          end
          config.vm.hostname = opts[:name]
          config.vm.provider "virtualbox" do |v|
            v.customize ["modifyvm", :id, "--memory", opts[:mem]]
            v.customize ["modifyvm", :id, "--cpus", opts[:cpu]]
          end
          config.vm.network :private_network, ip: opts[:eth1]
          opts[:ports_forward].each do |item|
            config.vm.network "forwarded_port", guest: item[:guest], host: item[:host]
          end
          # Copy files and directories into the guest.
          opts[:files].each do |fileitem|
            config.vm.provision :file do |file|
              file.source = fileitem[:src]
              file.destination = fileitem[:dst]
            end
          end
          # Run the provisioning shell scripts.
          opts[:shells].each do |item|
            config.vm.provision :shell, path: item
          end
        end
      end
    end
8 Hadoop Limits: if Hadoop unexpectedly fails while running your job, run the job again and use htop to monitor system resources and verify that there is enough RAM. 2048 MB of RAM may not be enough.
9 Hadoop HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files.
10 HDFS Architecture The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop Core project. The project URL is
11 HDFS : Large Data Sets Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
12 Moving Computation is Cheaper than Moving Data A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimises network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
13 Namenode The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
14 Datanode Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
15 Replication HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.
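The block and replica arithmetic described on this slide can be illustrated with a few lines of plain Java. This is only a sketch: it uses no Hadoop classes, and the 300 MiB file, the 128 MiB block size and the replication factor of 3 are example values assumed here, not values taken from the slides.

```java
// Sketch only: plain arithmetic, no Hadoop classes involved.
final class BlockMath {
    /** Number of blocks needed for a file; only the last block may be shorter. */
    static long blockCount(long fileBytes, long blockBytes) {
        if (fileBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    /** Size of the last block; equals blockBytes when the file divides evenly. */
    static long lastBlockBytes(long fileBytes, long blockBytes) {
        if (fileBytes == 0) {
            return 0;
        }
        long rest = fileBytes % blockBytes;
        return rest == 0 ? blockBytes : rest;
    }

    /** Raw bytes stored in the cluster when every block is kept `replication` times. */
    static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long file = 300 * mib;   // assumed example file size
        long block = 128 * mib;  // assumed block size
        int replication = 3;     // assumed replication factor
        System.out.println("blocks: " + blockCount(file, block));
        System.out.println("last block MiB: " + lastBlockBytes(file, block) / mib);
        System.out.println("raw MiB: " + rawBytes(file, replication) / mib);
    }
}
```

With these assumed values the file occupies three blocks (two full ones and a 44 MiB tail) and 900 MiB of raw storage.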
16 YARN YARN is the component responsible for allocating containers to run tasks, coordinating the execution of those tasks, and restarting them in case of failure, among other housekeeping. Just like HDFS, it has two main components: a ResourceManager, which keeps track of the cluster resources, and NodeManagers on each of the nodes, which communicate with the ResourceManager and set up containers for the execution of tasks.
17 Hadoop MapReduce - API
input
    set of files
    directory
output
    directory
    set of results files
most important methods are:
    map()
    reduce()
18 Hadoop MapReduce - Stages
    Map
    Shuffle
    Reduce
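To make the three stages concrete, here is a minimal in-memory sketch in plain Java. It does not use the Hadoop API; the class `StagesSketch` and its methods are hypothetical names, and the two input lines are assumed to be the sample input behind the word-count slides.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of the three stages; not the Hadoop API.
final class StagesSketch {
    /** Map stage: emit a (word, 1) pair for every token of one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    /** Shuffle stage: group all emitted values by key, keys in sorted order. */
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce stage: sum the grouped values of every key. */
    static TreeMap<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        TreeMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            result.put(entry.getKey(), sum);
        }
        return result;
    }

    /** Runs map over every line, then shuffle, then reduce. */
    static TreeMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        return reduce(shuffle(emitted));
    }

    public static void main(String[] args) {
        // Assumed sample input, one line per map task.
        System.out.println(run(Arrays.asList(
                "Hello World Bye World",
                "Hello Hadoop Goodbye Hadoop")));
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

In a real cluster each stage runs on different machines and the shuffle moves data over the network; the sketch only shows what each stage contributes to the result.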
19 Hadoop MapReduce - API
    public class WordCount {

      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          // ...
        }
      }

      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          // ...
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }
20 Hadoop MapReduce
typical implementation uses Mapper and Reducer interfaces
most important implemented methods are:
    map()
    reduce()
stages:
    class implementing Mapper
    class implementing Shuffling
    class implementing Reducer
21 Map stage
Parsing function
maps input key/value pairs to a set of intermediate key/value pairs
one map for each InputSplit generated by InputFormat
docs/stable/api/org/apache/hadoop/mapreduce/mapper.html
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
For the given sample input the first map emits:
    < Hello, 1>
    < World, 1>
    < Bye, 1>
    < World, 1>
The second map emits:
    < Hello, 1>
    < Hadoop, 1>
    < Goodbye, 1>
    < Hadoop, 1>
22 Map stage
How many Maps?
    depending on number of input items
    generally maps per-node
23 Stage Between Map and Reduce Phases
Shuffle & Sort
    output from mappers
    grouping
Shuffle can be customized to use encryption with HTTPS protocol
24 Shuffle stage
Combiner
Mapper output is combined according to defined rules
    merges duplicates
    executes local aggregation
    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      private IntWritable result = new IntWritable();

      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }
The output of the first map:
    < Hello, 1>
    < World, 1>
    < Bye, 1>
    < World, 1>
The output of the second map:
    < Hello, 1>
    < Hadoop, 1>
    < Goodbye, 1>
    < Hadoop, 1>
The combined output of the first map:
    < Bye, 1>
    < Hello, 1>
    < World, 2>
The combined output of the second map:
    < Goodbye, 1>
    < Hadoop, 2>
    < Hello, 1>
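The local aggregation a combiner performs on one map task's output can be imitated in plain Java. The sketch below is not Hadoop code; `CombinerSketch` is a hypothetical name, and the two word lists are the map outputs listed on the slide.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Imitates what a combiner does to the output of a single map task.
final class CombinerSketch {
    /** Merges duplicate words emitted by one map task by summing their counts. */
    static TreeMap<String, Integer> combine(List<String> emittedWords) {
        TreeMap<String, Integer> local = new TreeMap<>();
        for (String word : emittedWords) {
            local.merge(word, 1, Integer::sum); // every emitted pair carries the value 1
        }
        return local;
    }

    public static void main(String[] args) {
        List<String> firstMap = Arrays.asList("Hello", "World", "Bye", "World");
        List<String> secondMap = Arrays.asList("Hello", "Hadoop", "Goodbye", "Hadoop");
        System.out.println(combine(firstMap));  // {Bye=1, Hello=1, World=2}
        System.out.println(combine(secondMap)); // {Goodbye=1, Hadoop=2, Hello=1}
    }
}
```

Each map task sends three records instead of four to the reducers in this example, which is the point of combining before the shuffle.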
25 Reducer stage
input data
    uses Mapper output as input if there is no class defined to process data after Mapper
    uses data processed by implemented classes that do the processing after the Mapper stage
computes final result
    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      private IntWritable result = new IntWritable();

      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }
The reducers receive output of combined maps:
    < Bye, <1>>
    < Hello, <1,1>>
    < World, <2>>
    < Goodbye, <1>>
    < Hadoop, <2>>
26 After Map and Reduce
Secondary sort may be used for custom grouping
    Comparator
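The idea behind a secondary sort can be sketched without Hadoop: records carry a composite key, a comparator orders them by the natural key and then by a secondary field, and grouping uses only the natural key, so each group sees its values already ordered. Everything below (the `Key` class, the field names, the sample data) is hypothetical and only illustrates the technique.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;

// Plain-Java illustration of sorting by a composite key and grouping by its natural part.
final class SecondarySortSketch {
    /** Composite key: the natural key plus the field the values should be ordered by. */
    static final class Key {
        final String natural;
        final int secondary;

        Key(String natural, int secondary) {
            this.natural = natural;
            this.secondary = secondary;
        }
    }

    /** Sort comparator: natural key first, then the secondary field. */
    static final Comparator<Key> SORT =
            Comparator.comparing((Key k) -> k.natural).thenComparingInt(k -> k.secondary);

    /** Sorts with SORT, then groups by the natural key only, keeping the sorted order. */
    static LinkedHashMap<String, List<Integer>> sortAndGroup(List<Key> keys) {
        List<Key> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        LinkedHashMap<String, List<Integer>> groups = new LinkedHashMap<>();
        for (Key key : sorted) {
            groups.computeIfAbsent(key.natural, n -> new ArrayList<>()).add(key.secondary);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Key> keys = Arrays.asList(
                new Key("b", 3), new Key("a", 2), new Key("b", 1), new Key("a", 1));
        System.out.println(sortAndGroup(keys)); // {a=[1, 2], b=[1, 3]}
    }
}
```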
27 Job
Primary interface for user-job interaction with ResourceManager
MapReduce job configuration
setters for Mapper, Combiner, Reducer
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
defined Job
28 Job Control
Complex tasks can be split into a chain of Jobs
    Job.submit() - submit job to cluster and return immediately
    Job.waitForCompletion(boolean) - submit job and wait for it to finish
ToolRunner class is useful for remote job execution
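A chain of jobs usually means: run a job, wait for it, and start the next one only if the previous one succeeded. The sketch below shows that control flow in plain Java; it is not the Hadoop API. Each `BooleanSupplier` stands in for a blocking call such as `job.waitForCompletion(true)`, and the class name and job names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Control flow of a job chain; each supplier stands in for a blocking job run.
final class JobChainSketch {
    /**
     * Runs the jobs in insertion order and stops at the first failure.
     * Returns 0 when every job succeeded and 1 otherwise.
     */
    static int runChain(Map<String, BooleanSupplier> jobs, List<String> started) {
        for (Map.Entry<String, BooleanSupplier> job : jobs.entrySet()) {
            started.add(job.getKey());
            if (!job.getValue().getAsBoolean()) {
                return 1; // later jobs depend on this one, so do not start them
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        Map<String, BooleanSupplier> jobs = new LinkedHashMap<>();
        jobs.put("grep-search", () -> true); // placeholder for a real job run
        jobs.put("grep-sort", () -> true);   // placeholder for a real job run
        List<String> started = new ArrayList<>();
        int exitCode = runChain(jobs, started);
        System.out.println(started + " exit=" + exitCode);
    }
}
```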
29 Job Input
InputFormat
    validates the input for the job
    splits up input files into InputSplit instances, each assigned to a separate Mapper
    blocksize of FileSystem is the upper bound for splits
    blocksize can be defined by user
    it is recommended to implement RecordReader for InputSplit
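How a file is cut into splits when the block size is the upper bound can be sketched in a few lines of plain Java. This is a simplification under stated assumptions: the real FileInputFormat applies further rules that are not modelled here, and the class name and sizes are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified split computation: fixed upper bound, no other rules.
final class SplitSketch {
    /** Returns {offset, length} pairs; no split is longer than maxSplitBytes. */
    static List<long[]> splits(long fileBytes, long maxSplitBytes) {
        if (maxSplitBytes <= 0) {
            throw new IllegalArgumentException("maxSplitBytes must be positive");
        }
        List<long[]> out = new ArrayList<>();
        for (long offset = 0; offset < fileBytes; offset += maxSplitBytes) {
            long length = Math.min(maxSplitBytes, fileBytes - offset);
            out.add(new long[] { offset, length });
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed sizes in MiB for readability: a 300 MiB file, 128 MiB upper bound.
        for (long[] split : splits(300, 128)) {
            System.out.println("offset=" + split[0] + " length=" + split[1]);
        }
    }
}
```

Each resulting split would be handed to its own Mapper, so the example file would be processed by three map tasks.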
30 Job Output
OutputFormat
    validates the output of the job - for example by checking that the output directory doesn't already exist
    provides RecordWriter to write job results files
31 Example code
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Emits (word, 1) for every token of every input line.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Sums the counts of one word; used both as combiner and as reducer.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      // args[0] is the input path, args[1] the output directory.
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }
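32 Example code, multiple jobs
Source: f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/grep.java
    import java.util.Random;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
    import org.apache.hadoop.mapreduce.lib.map.RegexMapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class Grep extends Configured implements Tool {
      private Grep() {}                                // singleton

      public int run(String[] args) throws Exception {
        if (args.length < 3) {
          System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
          ToolRunner.printGenericCommandUsage(System.out);
          return 2;
        }

        // The first job writes its result here; the second job reads it.
        Path tempDir = new Path("grep-temp-"
            + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

        Configuration conf = getConf();
        conf.set(RegexMapper.PATTERN, args[2]);
        if (args.length == 4)
          conf.set(RegexMapper.GROUP, args[3]);

        Job grepJob = new Job(conf);
        try {
          // Job 1: count the matches of the regular expression.
          grepJob.setJobName("grep-search");
          FileInputFormat.setInputPaths(grepJob, args[0]);
          grepJob.setMapperClass(RegexMapper.class);
          grepJob.setCombinerClass(LongSumReducer.class);
          grepJob.setReducerClass(LongSumReducer.class);
          FileOutputFormat.setOutputPath(grepJob, tempDir);
          grepJob.setOutputFormatClass(SequenceFileOutputFormat.class);
          grepJob.setOutputKeyClass(Text.class);
          grepJob.setOutputValueClass(LongWritable.class);
          grepJob.waitForCompletion(true);

          // Job 2: sort the matches by decreasing frequency.
          Job sortJob = new Job(conf);
          sortJob.setJobName("grep-sort");
          FileInputFormat.setInputPaths(sortJob, tempDir);
          sortJob.setInputFormatClass(SequenceFileInputFormat.class);
          sortJob.setMapperClass(InverseMapper.class);
          sortJob.setNumReduceTasks(1);                // write a single file
          FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
          sortJob.setSortComparatorClass(              // sort by decreasing freq
              LongWritable.DecreasingComparator.class);
          sortJob.waitForCompletion(true);
        } finally {
          // Remove the intermediate directory whether or not the jobs succeeded.
          FileSystem.get(conf).delete(tempDir, true);
        }
        return 0;
      }
    }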
33 Run example
34 Links
    MapReduceTutorial.html
    research.google.com/en//archive/mapreduce-osdi04.pdf
35 Samples
Hadoop examples source
    hadoop-mapreduce-project/hadoop-mapreduce-examples
36 Build a customized sample
generate project, use
    mvn archetype:generate -DgroupId=com.mycompany.app -DartifactId=my-app -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
copy example source to the generated project, add Map Reduce app (this is from cloned hadoop repo in /vagrant); change the package declaration of the copied file to com.mycompany.app
    cp /vagrant/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/WordCount.java /vagrant/my-app/src/main/java/com/mycompany/app/
add missing packages to the generated project, follow poms from hadoop repo
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.6.4</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.6.4</version>
    </dependency>
build package, use
    mvn package
start hadoop, run scripts/hadoop_start.sh from hadoop-multi-node
    vagrant@hadoop-master:~/scripts$ sh hadoop_start.sh
WordCount needs:
    sh hdfs_mkdir_input.sh; sh hdfs_put_input.sh
start your example
    vagrant@hadoop-master:~/scripts$ hadoop jar target/my-app-1.0-SNAPSHOT.jar com.mycompany.app.WordCount input/hadoop output
37 Connect with hadoop
Configuration class
Example: f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/hdfsconcat.java
    private final static String def_uri = "hdfs://localhost:9000";

    Configuration conf = new Configuration();
    // Use the configured file system URI, or def_uri when none is set.
    String uri = conf.get("fs.default.name", def_uri);
    Path path = new Path(uri);
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(path.toUri(), conf);
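The lookup-with-default pattern used by `conf.get("fs.default.name", def_uri)` can be tried without a Hadoop installation. In the sketch below `java.util.Properties` is only a stand-in for Hadoop's `Configuration`, and the class name `DefaultUriSketch` and the host `hadoop-master` are assumptions for illustration.

```java
import java.net.URI;
import java.util.Properties;

// java.util.Properties is a stand-in here; this is not Hadoop's Configuration class.
final class DefaultUriSketch {
    static final String DEF_URI = "hdfs://localhost:9000";

    /** Returns the configured file system URI, or DEF_URI when the key is not set. */
    static URI resolve(Properties conf) {
        return URI.create(conf.getProperty("fs.default.name", DEF_URI));
    }

    public static void main(String[] args) {
        Properties empty = new Properties();
        System.out.println(resolve(empty)); // falls back to hdfs://localhost:9000

        Properties configured = new Properties();
        configured.setProperty("fs.default.name", "hdfs://hadoop-master:9000"); // assumed host
        URI uri = resolve(configured);
        System.out.println(uri.getHost() + ":" + uri.getPort());
    }
}
```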
38 Connect with hadoop
Configuration class
Example: b6f66b0da1cc77f4e a008b4bd7e1a752/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/wordstandarddeviation.java
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path(path, "part-r-00000");
    if (!fs.exists(file))
      throw new IOException("Output not found!");
39 Connect to the hadoop WebHDFS REST API
Example request:
    curl -i -L :50070/webhdfs/v1/user/vagrant/input/hadoop/?op=LISTSTATUS&namenoderpcaddress=hadoop-master:9000"
when hdfs-site.xml is configured as follows:
40 Connect to the hadoop WebHDFS REST API
Example request when hdfs-site.xml contains dfs.namenode.rpc-address:
    <property>
      <name>dfs.namenode.rpc-address</name>
      <value>hadoop-master:9000</value>
    </property>
    curl -i -L :50070/webhdfs/v1/user/vagrant/input/hadoop/?op=LISTSTATUS
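The request URL for a directory listing can also be assembled in Java. The sketch below only builds the URL and sends nothing. The port 50070, the `/webhdfs/v1` prefix and the path come from the slides; the `http` scheme and the host name `hadoop-master` are assumptions, since the beginning of the URL is missing in the transcript, and `WebHdfsUrlSketch` is a hypothetical name.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Builds a WebHDFS directory-listing URL; no request is sent.
final class WebHdfsUrlSketch {
    /** Builds the LISTSTATUS URL for an absolute HDFS path such as /user/vagrant/input/hadoop/. */
    static URI listStatus(String host, int port, String hdfsPath) {
        try {
            return new URI("http", null, host, port,
                    "/webhdfs/v1" + hdfsPath, "op=LISTSTATUS", null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("cannot build WebHDFS URL", e);
        }
    }

    public static void main(String[] args) {
        // Host name is an assumption; port and path follow the slide.
        System.out.println(listStatus("hadoop-master", 50070, "/user/vagrant/input/hadoop/"));
    }
}
```

The resulting string can be passed to curl or to any HTTP client.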
COMP4442. Service and Cloud Computing. Lab 12: MapReduce. Prof. George Baciu PQ838.
COMP4442 Service and Cloud Computing Lab 12: MapReduce www.comp.polyu.edu.hk/~csgeorge/comp4442 Prof. George Baciu csgeorge@comp.polyu.edu.hk PQ838 1 Contents Introduction to MapReduce A WordCount example
More informationJava in MapReduce. Scope
Java in MapReduce Kevin Swingler Scope A specific look at the Java code you might use for performing MapReduce in Hadoop Java program recap The map method The reduce method The whole program Running on
More informationParallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014
Parallel Data Processing with Hadoop/MapReduce CS140 Tao Yang, 2014 Overview What is MapReduce? Example with word counting Parallel data processing with MapReduce Hadoop file system More application example
More informationDepartment of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 16. Big Data Management VI (MapReduce Programming)
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 16 Big Data Management VI (MapReduce Programming) Credits: Pietro Michiardi (Eurecom): Scalable Algorithm
More informationChapter 3. Distributed Algorithms based on MapReduce
Chapter 3 Distributed Algorithms based on MapReduce 1 Acknowledgements Hadoop: The Definitive Guide. Tome White. O Reilly. Hadoop in Action. Chuck Lam, Manning Publications. MapReduce: Simplified Data
More informationExperiences with a new Hadoop cluster: deployment, teaching and research. Andre Barczak February 2018
Experiences with a new Hadoop cluster: deployment, teaching and research Andre Barczak February 2018 abstract In 2017 the Machine Learning research group got funding for a new Hadoop cluster. However,
More informationOutline Introduction Big Data Sources of Big Data Tools HDFS Installation Configuration Starting & Stopping Map Reduc.
D. Praveen Kumar Junior Research Fellow Department of Computer Science & Engineering Indian Institute of Technology (Indian School of Mines) Dhanbad, Jharkhand, India Head of IT & ITES, Skill Subsist Impels
More informationMapReduce and Hadoop. The reference Big Data stack
Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica MapReduce and Hadoop Corso di Sistemi e Architetture per Big Data A.A. 2017/18 Valeria Cardellini The
More informationBig Data Analytics: Insights and Innovations
International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 6, Issue 10 (April 2013), PP. 60-65 Big Data Analytics: Insights and Innovations
More informationECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing
ECE5610/CSC6220 Introduction to Parallel and Distribution Computing Lecture 6: MapReduce in Parallel Computing 1 MapReduce: Simplified Data Processing Motivation Large-Scale Data Processing on Large Clusters
More informationIntroduction to Map/Reduce. Kostas Solomos Computer Science Department University of Crete, Greece
Introduction to Map/Reduce Kostas Solomos Computer Science Department University of Crete, Greece What we will cover What is MapReduce? How does it work? A simple word count example (the Hello World! of
More informationMapReduce Simplified Data Processing on Large Clusters
MapReduce Simplified Data Processing on Large Clusters Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) MapReduce 1393/8/5 1 /
More informationUNIT V PROCESSING YOUR DATA WITH MAPREDUCE Syllabus
UNIT V PROCESSING YOUR DATA WITH MAPREDUCE Syllabus Getting to know MapReduce MapReduce Execution Pipeline Runtime Coordination and Task Management MapReduce Application Hadoop Word Count Implementation.
More informationBig Data Analysis using Hadoop. Map-Reduce An Introduction. Lecture 2
Big Data Analysis using Hadoop Map-Reduce An Introduction Lecture 2 Last Week - Recap 1 In this class Examine the Map-Reduce Framework What work each of the MR stages does Mapper Shuffle and Sort Reducer
More informationParallel Processing - MapReduce and FlumeJava. Amir H. Payberah 14/09/2018
Parallel Processing - MapReduce and FlumeJava Amir H. Payberah payberah@kth.se 14/09/2018 The Course Web Page https://id2221kth.github.io 1 / 83 Where Are We? 2 / 83 What do we do when there is too much
More information2. MapReduce Programming Model
Introduction MapReduce was proposed by Google in a research paper: Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System
More informationRecommended Literature
COSC 6397 Big Data Analytics Introduction to Map Reduce (I) Edgar Gabriel Spring 2017 Recommended Literature Original MapReduce paper by google http://research.google.com/archive/mapreduce-osdi04.pdf Fantastic
More informationLarge-scale Information Processing
Sommer 2013 Large-scale Information Processing Ulf Brefeld Knowledge Mining & Assessment brefeld@kma.informatik.tu-darmstadt.de Anecdotal evidence... I think there is a world market for about five computers,
More informationData-Intensive Computing with MapReduce
Data-Intensive Computing with MapReduce Session 2: Hadoop Nuts and Bolts Jimmy Lin University of Maryland Thursday, January 31, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share
More informationAttacking & Protecting Big Data Environments
Attacking & Protecting Big Data Environments Birk Kauer & Matthias Luft {bkauer, mluft}@ernw.de #WhoAreWe Birk Kauer - Security Researcher @ERNW - Mainly Exploit Developer Matthias Luft - Security Researcher
More informationGhislain Fourny. Big Data 6. Massive Parallel Processing (MapReduce)
Ghislain Fourny Big Data 6. Massive Parallel Processing (MapReduce) So far, we have... Storage as file system (HDFS) 13 So far, we have... Storage as tables (HBase) Storage as file system (HDFS) 14 Data
More informationGhislain Fourny. Big Data Fall Massive Parallel Processing (MapReduce)
Ghislain Fourny Big Data Fall 2018 6. Massive Parallel Processing (MapReduce) Let's begin with a field experiment 2 400+ Pokemons, 10 different 3 How many of each??????????? 4 400 distributed to many volunteers
More informationGuidelines For Hadoop and Spark Cluster Usage
Guidelines For Hadoop and Spark Cluster Usage Procedure to create an account in CSX. If you are taking a CS prefix course, you already have an account; to get an initial password created: 1. Login to https://cs.okstate.edu/pwreset
More informationHadoop. copyright 2011 Trainologic LTD
Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides
More informationA Guide to Running Map Reduce Jobs in Java University of Stirling, Computing Science
A Guide to Running Map Reduce Jobs in Java University of Stirling, Computing Science Introduction The Hadoop cluster in Computing Science at Stirling allows users with a valid user account to submit and
More informationCompile and Run WordCount via Command Line
Aims This exercise aims to get you to: Compile, run, and debug MapReduce tasks via Command Line Compile, run, and debug MapReduce tasks via Eclipse One Tip on Hadoop File System Shell Following are the
More informationSession 1 Big Data and Hadoop - Overview. - Dr. M. R. Sanghavi
Session 1 Big Data and Hadoop - Overview - Dr. M. R. Sanghavi Acknowledgement Prof. Kainjan M. Sanghavi For preparing this prsentation This presentation is available on my blog https://maheshsanghavi.wordpress.com/expert-talk-fdp-workshop/
More informationA brief history on Hadoop
Hadoop Basics A brief history on Hadoop 2003 - Google launches project Nutch to handle billions of searches and indexing millions of web pages. Oct 2003 - Google releases papers with GFS (Google File System)
More informationHadoop 2.X on a cluster environment
Hadoop 2.X on a cluster environment Big Data - 05/04/2017 Hadoop 2 on AMAZON Hadoop 2 on AMAZON Hadoop 2 on AMAZON Regions Hadoop 2 on AMAZON S3 and buckets Hadoop 2 on AMAZON S3 and buckets Hadoop 2 on
More informationDept. Of Computer Science, Colorado State University
CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [HADOOP/HDFS] Trying to have your cake and eat it too Each phase pines for tasks with locality and their numbers on a tether Alas within a phase, you get one,
More informationMap Reduce. MCSN - N. Tonellotto - Distributed Enabling Platforms
Map Reduce 1 MapReduce inside Google Googlers' hammer for 80% of our data crunching Large-scale web search indexing Clustering problems for Google News Produce reports for popular queries, e.g. Google
More informationHadoop 3.X more examples
Hadoop 3.X more examples Big Data - 09/04/2018 Let s start with some examples! http://www.dia.uniroma3.it/~dvr/es2_material.zip Example: LastFM Listeners per Track Consider the following log file UserId
More informationGetting Started with Hadoop
Getting Started with Hadoop May 28, 2018 Michael Völske, Shahbaz Syed Web Technology & Information Systems Bauhaus-Universität Weimar 1 webis 2018 What is Hadoop Started in 2004 by Yahoo Open-Source implementation
More informationSteps: First install hadoop (if not installed yet) by, https://sl6it.wordpress.com/2015/12/04/1-study-and-configure-hadoop-for-big-data/
SL-V BE IT EXP 7 Aim: Design and develop a distributed application to find the coolest/hottest year from the available weather data. Use weather data from the Internet and process it using MapReduce. Steps:
More informationClustering Documents. Document Retrieval. Case Study 2: Document Retrieval
Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April, 2017 Sham Kakade 2017 1 Document Retrieval n Goal: Retrieve
More informationClustering Documents. Case Study 2: Document Retrieval
Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 21 th, 2015 Sham Kakade 2016 1 Document Retrieval Goal: Retrieve
More informationCloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018
Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster
More informationIntroduction to Hadoop. Scott Seighman Systems Engineer Sun Microsystems
Introduction to Hadoop Scott Seighman Systems Engineer Sun Microsystems 1 Agenda Identify the Problem Hadoop Overview Target Workloads Hadoop Architecture Major Components > HDFS > Map/Reduce Demo Resources
More informationKillTest *KIJGT 3WCNKV[ $GVVGT 5GTXKEG Q&A NZZV ]]] QORRZKYZ IUS =K ULLKX LXKK [VJGZK YKX\OIK LUX UTK _KGX
KillTest Q&A Exam : CCD-410 Title : Cloudera Certified Developer for Apache Hadoop (CCDH) Version : DEMO 1 / 4 1.When is the earliest point at which the reduce method of a given Reducer can be called?