Large-scale Information Processing


Large-scale Information Processing, Summer 2013. Ulf Brefeld, Knowledge Mining & Assessment. brefeld@kma.informatik.tu-darmstadt.de

Anecdotal evidence...
"I think there is a world market for about five computers." (Thomas J. Watson, Chairman of the Board of International Business Machines, 1943)
"...it is very possible that... one machine would suffice to solve all the problems that are demanded of it from the whole country." (Sir Charles Darwin, grandson of the naturalist of the same name, head of Britain's National Physical Laboratory, 1946)
"Originally one thought that if there were a half dozen large computers in this country, hidden away in research laboratories, this would take care of all requirements we had throughout the country." (Howard H. Aiken, computer pioneer, IBM Mark I designer, 1952)

Today...


(Figure omitted. Source: IBM)

Query Volumes (Source: Google Trends)

Market Segmentation. A company launches three new campaigns. There are 2 GB of data from customers who bought one of the three products in the past, plus 5 GB of potential-customer data. Who will be interested in which campaign?

Topics

Index Structures. Make data searchable with efficient data structures: inverted index, radix tries, ...
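As an illustration of the first of these (not part of the original slides; the class name and the naive whitespace tokenization are simplifying assumptions), a minimal in-memory inverted index in Java could look like this:

import java.util.*;

// Minimal in-memory inverted index: maps each term to the set of
// document IDs that contain it.
public class InvertedIndex {
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // Tokenize naively on whitespace and record the document for each term.
    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Look up all documents containing the given term.
    public Set<Integer> lookup(String term) {
        return index.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add(1, "the cat and the box");
        idx.add(2, "curiosity kills the cat");
        System.out.println(idx.lookup("cat"));  // prints [1, 2]
    }
}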

Unsupervised Machine Learning. Clustering: find group structures. Nearest neighbors. Duplicate detection. Outlier detection: find anomalies.

Supervised Machine Learning. Linear models for classification, regression, and ranking. Optimization. Feature encodings/scaling, hashing tricks.

Structured Prediction. Variables inter-depend/correlate: sequences, trees, grids, etc. Applications: named entity recognition, parsing, protein secondary structure prediction, image segmentation.

Recommender Systems

Scenarios. Single computer vs. cluster (Hadoop, MapReduce). Static data vs. data streams.

Integrated course ("Integrierte Veranstaltung"), TUCAN 20-00-0706. Sessions: Tuesdays, 15:20-17:00, S105/24; Thursdays, 11:40-13:20, S103/100. 4 SWS, 6 CP. Oral exam/written exam. Slide set plus blackboard... Office hours by arrangement.

Exercises. Tutor: Emmanouil Tzouridis, tzouridis@kma.informatik.tu-darmstadt.de. Held irregularly (roughly every fourth session). Purpose: theoretical understanding and practical deepening. No hand-ins/grading. http://www.kma.informatik.tu-darmstadt.de/teaching/large-scale-information-processing/

Further Readings. Rajaraman & Ullman: Mining of Massive Datasets, http://i.stanford.edu/~ullman/mmds.html

Large-scale Information Processing, Summer 2013: Hadoop MapReduce. Ulf Brefeld, Knowledge Mining & Assessment. brefeld@kma.informatik.tu-darmstadt.de

The Web

The size of the Web: http://www.worldwidewebsize.com

Reading the Web... The average web page in 2011 was about 965 KB.* At 50 billion pages, the Web as a whole runs to tens of petabytes; even 1 billion pages already amount to roughly 900 TB (10^9 x 965 KB). Reading 900 TB at 100 MB/s takes about 9 x 10^6 seconds, i.e. roughly 4 months, on a single computer. We need more computers... (* according to tech.slashdot.org)

Cluster Issues. Reliability problems: computers fail every day. With an expected lifetime of about 3 years (~1000 days), a cluster of 1000 computers loses one machine per day on average. Google had about 900,000 computers in 2011*, so on average about 900 failed every day. Cluster sizes vary. (* according to datacenterknowledge.com)

Hadoop. Open-source Apache project. Runs on commodity-hardware clusters (cheap Linux PCs). Hadoop Core includes: a distributed file system (distributes the data) and Map/Reduce (distributes the application). Written in Java; runs on Linux, Mac OS X, Windows, Solaris, ...

Distributed File System. Optimized for streaming reads of large files. Files are broken into large blocks, typically 128 MB. Each block is stored as 3 replicas by default; clients read from the closest replica, and lost blocks are automatically re-replicated.
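As an aside (not on the original slides), this is what reading a file through the HDFS client API looks like; the path is a hypothetical placeholder. The client sees one logical file while HDFS fetches each block from the closest replica behind the scenes:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS etc. from the cluster configuration files.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/data/example.txt");  // hypothetical path
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}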

Map Reduce Data Flow: Input → Map → Shuffle → Reduce → Output

Features. Java, C++, and text-based APIs; the text-based (streaming) API is great for scripting. Higher-level interfaces: Pig, Hive, Jaql, ... Automatic re-execution on failure: the framework re-executes failed tasks. Locality optimizations: map tasks are scheduled close to their inputs when possible.

What is it good for? Well suited for indexing, log analysis, image manipulation, sorting large-scale data, and data mining (though not optimal). Not suited for real-time processing or for processing-intensive tasks with little data.

Important Functions. setup(): set up the mapper task; load data, connect to a database, ... map()/reduce(): data processing in the mapper, value aggregation in the reducer. cleanup(): free memory, disconnect from the database, ... A sketch of these lifecycle hooks follows below.
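A minimal sketch of where these hooks live in the new Hadoop API (not from the original slides; the class name LifecycleMapper is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LifecycleMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void setup(Context context) {
        // Called once per task, before any map() call:
        // load side data, open database connections, ...
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Called once per input record: the actual data processing.
    }

    @Override
    protected void cleanup(Context context) {
        // Called once per task, after the last map() call:
        // free memory, close connections, ...
    }
}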

Example: Word Count
Hadoop's "Hello World". It is all about key-value pairs, and there are two phases, map and reduce: map reads lines of text and outputs each word with a count of 1; reduce sums up the counts for each word.

            Input (key / value)                Output (key / value)
Mapper      position in file / line of text    word / 1
Reducer     word / set of counts               word / sum

Word Count Data Flow
Input (one split per mapper):
  M1: "the cat and the box"
  M2: "curiosity kills the cat"
  M3: "Erwin and his cat"
Map output:
  M1 emits cat:1 the:1 box:1 the:1 and:1
  M2 emits the:1 cat:1 curiosity:1 kills:1
  M3 emits cat:1 Erwin:1 and:1 his:1
Shuffle: all pairs with the same key are routed to the same reducer; here the keys are partitioned across two reducers, R1 and R2.
Reduce output:
  R1: box:1 cat:3 the:3 curiosity:1
  R2: and:2 kills:1 Erwin:1 his:1

Mapper

// imports needed: java.io.IOException, java.util.StringTokenizer,
// org.apache.hadoop.io.IntWritable, LongWritable, Text
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // key: byte offset of the line in the input file (unused here)
    // value: one line of input text
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);  // emit (word, 1) for every token
    }
}

Reducer

public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    // values: all counts emitted for this word, collected by the shuffle
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    context.write(key, new IntWritable(sum));  // emit (word, total count)
}
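Since summing is associative and commutative, the same class can also act as a combiner that pre-aggregates counts on the map side before the shuffle. This is an optimization not shown on the slides; it takes one extra line in the job setup:

// Optional: pre-aggregate (word, 1) pairs locally in each map task
// to reduce the data volume shuffled across the network.
job.setCombinerClass(Reduce.class);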

Main

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    // Types of the final (reducer) output key-value pairs.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    // Read plain text line by line; write plain text back out.
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    // Input and output paths come from the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
}
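To run the job (not part of the slides; jar and path names are placeholders), compile the classes, package them into a jar, and submit it with the hadoop command:

hadoop jar wordcount.jar WordCount /user/me/input /user/me/output
hadoop fs -cat /user/me/output/part-r-00000

Note that the output directory must not exist beforehand; Hadoop refuses to overwrite it.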

Literature
Book (available online): http://it-ebooks.info/book/635/
Tutorials: http://hadoop.apache.org/, http://developer.yahoo.com/hadoop/tutorial/, http://code.google.com/edu/parallel/index.html
Machine learning with Hadoop: http://mahout.apache.org