Large-scale Information Processing

1 Large-scale Information Processing, Summer 2013. Ulf Brefeld, Knowledge Mining & Assessment

2 Anecdotal evidence...
- "I think there is a world market for about five computers." Thomas J. Watson (Chairman of the Board of International Business Machines)
- "it is very possible that ... one machine would suffice to solve all the problems that are demanded of it from the whole country." Sir Charles Darwin (grandson of the naturalist of the same name, head of Britain's National Physical Laboratory), 1946
- "Originally one thought that if there were a half dozen large computers in this country, hidden away in research laboratories, this would take care of all requirements we had throughout the country." Howard H. Aiken (computer pioneer, IBM Mark I designer)

3 Today...

5 Source: IBM

6 Query Volumes (source: Google Trends)

7 Market Segmentation
- A company launches three new campaigns
- There are 2 GB of data on customers who bought one of the three in the past
- There are 5 GB of data on potential customers
- Who will be interested in which campaign?

8 Topics

9 Index Structures
- Make data searchable
- Efficient data structures
  - Inverted index
  - Radix tries
  - ...
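As an illustration of the inverted index named above, here is a minimal in-memory sketch. The class and method names (`InvertedIndexSketch`, `add`, `lookup`, `lookupAll`) and the simple lower-casing tokenizer are assumptions made for this example, not part of the lecture material.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch: maps each term to the sorted set of ids of the documents containing it.
class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Lower-case the text, split on anything that is not a letter or digit,
    // and record the document id under every term.
    void add(int docId, String text) {
        for (String term : text.toLowerCase().split("[^a-z0-9]+")) {
            if (term.isEmpty()) {
                continue; // split() yields an empty first token if the text starts with a delimiter
            }
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Ids of all documents containing the term (empty set for unseen terms).
    SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySortedSet());
    }

    // Conjunctive query: ids of documents containing both terms.
    SortedSet<Integer> lookupAll(String a, String b) {
        SortedSet<Integer> result = new TreeSet<>(lookup(a));
        result.retainAll(lookup(b));
        return result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "the cat and the box");
        index.add(2, "curiosity kills the cat");
        index.add(3, "Erwin and his cat");
        System.out.println(index.lookup("cat"));           // prints [1, 2, 3]
        System.out.println(index.lookupAll("and", "cat")); // prints [1, 3]
    }
}
```

A query then touches only the posting lists of its terms instead of scanning every document, which is what makes the data searchable at scale.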

10 Unsupervised Machine Learning
- Clustering
  - Find group structures
- Nearest neighbors
  - Duplicate detection
- Outlier detection
  - Find anomalies

11 Supervised Machine Learning
- Linear models
- Classification
- Regression
- Ranking
- Optimization
- Feature encodings/scaling
- Hashing tricks
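The hashing trick listed above can be sketched as follows: tokens are hashed directly into a fixed number of buckets, so no vocabulary has to be stored. This is a minimal sketch under assumptions; the names `HashingTrickSketch`, `hashFeatures` and `score` are hypothetical, and `String.hashCode` stands in for the stronger hash functions that practical systems typically use.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of feature hashing for a linear model.
class HashingTrickSketch {

    // Turn a bag of tokens into a count vector of fixed length numBuckets.
    // Different tokens may collide in one bucket; that is the accepted trade-off.
    static double[] hashFeatures(List<String> tokens, int numBuckets) {
        double[] x = new double[numBuckets];
        for (String token : tokens) {
            // floorMod keeps the index non-negative even for negative hash codes
            int bucket = Math.floorMod(token.hashCode(), numBuckets);
            x[bucket] += 1.0;
        }
        return x;
    }

    // Score of a linear model: inner product of weights and hashed features.
    static double score(double[] w, double[] x) {
        double s = 0.0;
        for (int i = 0; i < x.length; i++) {
            s += w[i] * x[i];
        }
        return s;
    }

    public static void main(String[] args) {
        double[] x = hashFeatures(Arrays.asList("the", "cat", "and", "the", "box"), 8);
        System.out.println(Arrays.toString(x));
    }
}
```

The weight vector has the same fixed length as the feature vector, so the memory footprint of the model no longer depends on the number of distinct tokens in the data.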

12 Structured Prediction
- Variables inter-depend/correlate
- Sequences, trees, grids, etc.
- Named entity recognition, parsing
- Protein secondary structure prediction
- Image segmentation

13 Recommender Systems

14 Scenarios
- Single computer vs. cluster (Hadoop, MapReduce)
- Static data vs. data streams

15 Integrated course
- TUCAN
- Sessions: Tuesdays, 15:20-17:00, S105/24; Thursdays, 11:40-13:20, S103/100
- 4 SWS (weekly semester hours), 6 CP
- Oral or written exam
- Slides plus blackboard...
- Office hours by appointment

16 Exercises
- Tutor: Emmanouil Tzouridis (tzouridis@kma.informatik.tu-darmstadt.de)
- Irregular schedule (roughly every 4th session)
- Purpose: theoretical understanding, hands-on practice
- No submissions, no grading
- teaching/large-scale-information-processing/

17 Further Readings
- Rajaraman & Ullman, Mining of Massive Datasets

18 Large-scale Information Processing, Summer 2013: Hadoop MapReduce. Ulf Brefeld, Knowledge Mining & Assessment

19 The Web

20 The size of the Web

21 Reading the Web...
- Average size of a web page in 2011 was 965 kB
- 50 billion pages times 965 kB is roughly 45 PB; 899 TB is the volume of about 1 billion such pages
- Reading 900 TB at 100 MB/sec takes more than 100 days (roughly 3.5 months) on a single computer
- We need more computers... *
* according to tech.slashdot.org
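The estimate on this slide can be checked with a few lines of arithmetic. The sketch below assumes binary units (1 kB = 1024 bytes, 1 TB = 1024^4 bytes); the class and method names are hypothetical. Under that assumption, the quoted 899 TB matches 1 billion pages of 965 kB, while 50 billion pages come to about 50 times as much.

```java
// Hypothetical back-of-the-envelope calculator for the "reading the Web" estimate.
class WebReadTimeSketch {
    static final double KB = 1024.0;
    static final double MB = KB * 1024.0;
    static final double TB = MB * 1024.0 * 1024.0;

    // Total volume in bytes for a number of pages with a given average size in kB.
    static double totalBytes(double pages, double avgPageKb) {
        return pages * avgPageKb * KB;
    }

    // Days needed to read the given volume sequentially at mbPerSec megabytes per second.
    static double daysToRead(double bytes, double mbPerSec) {
        double seconds = bytes / (mbPerSec * MB);
        return seconds / 86400.0;
    }

    public static void main(String[] args) {
        double oneBillion = totalBytes(1e9, 965.0);
        double fiftyBillion = totalBytes(50e9, 965.0);
        System.out.printf("1 billion pages:  %.0f TB, %.0f days at 100 MB/s%n",
                oneBillion / TB, daysToRead(oneBillion, 100.0));
        System.out.printf("50 billion pages: %.0f TB, %.0f days at 100 MB/s%n",
                fiftyBillion / TB, daysToRead(fiftyBillion, 100.0));
    }
}
```

Either way, a single machine needs months to years just to read the data once, which is the motivation for distributing the work.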

22 Cluster Issues
- Reliability problems
  - Computers fail every day
  - Expected lifetime is about 3 years (1000 days)
  - If you had 1000 computers, one would fail per day
- Google had about 900,000 computers in 2011
  - On average 900 failed every day
- Cluster size varies *
* according to datacenterknowledge.com
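The failure numbers above follow from dividing the cluster size by the expected lifetime. The sketch below also estimates the probability that at least one machine fails on a given day; it assumes machines fail independently and with a constant daily rate of 1/lifetime, which is a simplification introduced here, as are the class and method names.

```java
// Hypothetical sketch: expected daily failures in a cluster of identical machines.
class ClusterFailureSketch {

    // Expected number of failures per day, assuming each machine fails
    // with probability 1 / lifetimeDays on any given day.
    static double expectedFailuresPerDay(long machines, double lifetimeDays) {
        return machines / lifetimeDays;
    }

    // Probability that at least one machine fails on a given day,
    // assuming independent failures.
    static double probAtLeastOneFailure(long machines, double lifetimeDays) {
        return 1.0 - Math.pow(1.0 - 1.0 / lifetimeDays, machines);
    }

    public static void main(String[] args) {
        System.out.println(expectedFailuresPerDay(1000, 1000.0));    // prints 1.0
        System.out.println(expectedFailuresPerDay(900_000, 1000.0)); // prints 900.0
        System.out.println(probAtLeastOneFailure(1000, 1000.0));     // roughly 0.63
    }
}
```

Even with only 1000 machines, a failure-free day is the exception rather than the rule, so the framework has to treat failures as normal operation.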

23 Hadoop
- Open-source Apache project
- Runs on clusters of commodity hardware (cheap Linux PCs)
- Hadoop Core includes:
  - Distributed File System (distributes data)
  - Map/Reduce (distributes application)
- Written in Java
- Runs on Linux, Mac OS X, Windows, Solaris, ...

24 Distributed File System
- Optimized for streaming reads of large files
- Files are broken into large blocks
- Blocks are typically 128 MB
- Default is 3 replicas for each block
- Clients read from the closest replica
- Automatic re-replication
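To make the block and replica numbers concrete, the following sketch computes how many blocks and block replicas a file occupies under the settings quoted above (128 MB blocks, 3 replicas). It is plain arithmetic, not a call into the HDFS API, and all names are hypothetical.

```java
// Hypothetical sketch: block and replica counts for a file stored with
// a fixed block size and replication factor.
class HdfsBlockSketch {
    static final long MB = 1024L * 1024L;

    // Number of blocks: file size divided by block size, rounded up.
    static long numBlocks(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Number of block replicas stored across the cluster.
    static long numBlockReplicas(long fileBytes, long blockBytes, int replication) {
        return numBlocks(fileBytes, blockBytes) * replication;
    }

    // Raw disk space consumed: every byte is stored once per replica.
    static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * MB;
        System.out.println(numBlocks(oneGb, 128L * MB));           // prints 8
        System.out.println(numBlockReplicas(oneGb, 128L * MB, 3)); // prints 24
        System.out.println(rawBytes(oneGb, 3) / MB);               // prints 3072
    }
}
```

A 1 GB file thus becomes 8 blocks and 24 block replicas, each of which can serve a reader or feed a separate map task.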

25 MapReduce Data Flow
Input -> Map -> Shuffle -> Reduce -> Output

26 Features
- Java, C++, and text-based APIs
  - The text-based (streaming) API is great for scripting
  - Higher-level interfaces: Pig, Hive, Jaql, ...
- Automatic re-execution on failure
  - The framework re-executes failed tasks
- Locality optimizations
  - Map tasks are scheduled close to their inputs when possible

27 What is it good for?
- Well suited for:
  - Indexing
  - Log analysis
  - Image manipulation
  - Sorting large-scale data
  - Data mining (though not optimal)
- Not suited for:
  - Real-time processing
  - Processing-intensive tasks with little data

28 Important Functions
- Setup: set up the mapper task (load data, connect to database, ...)
- Map / Reduce: data processing (Mapper), value aggregation (Reducer)
- Cleanup: free memory, disconnect from database, ...

29-33 Example: Word Count
- Hadoop's "Hello World"
- It's all about key-value pairs...
- Two phases: map and reduce
- Map: input lines of text, output words and their counts
- Reduce: sum up counts for each word

           Input                          Output
           Key           Value            Key    Value
Mapper     pos in file   lines of text    word   1
Reducer    word          set of counts    word   sum

34-38 Word Count Data Flow
Input -> Map -> Shuffle -> Reduce -> Output

Input                        Mapper   Map output
the cat and the box          M1       cat:1 the:1 box:1 the:1 and:1
curiosity kills the cat      M2       the:1 cat:1 curiosity:1 kills:1
Erwin and his cat            M3       cat:1 Erwin:1 and:1 his:1

After the shuffle, the reducers R1 and R2 output: box:1 cat:3 the:3 curiosity:1 and:2 kills:1 Erwin:1 his:1
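The data flow above can be replayed on a single machine without Hadoop. The sketch below is a plain-Java simulation of the three phases and uses no Hadoop classes; the class name and the helper methods `map`, `shuffle`, `reduce` and `wordCount` are hypothetical names chosen for this illustration, and the partitioning of keys across R1 and R2 is left out.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-machine simulation of map, shuffle, and reduce for word count.
class MapReduceWordCountSketch {

    // Map phase: emit one (word, 1) pair per token of the input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key, with keys in sorted order.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum up the counts collected for one word.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Run all three phases over the input lines.
    static SortedMap<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        SortedMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(intermediate).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "the cat and the box", "curiosity kills the cat", "Erwin and his cat");
        // prints {Erwin=1, and=2, box=1, cat=3, curiosity=1, his=1, kills=1, the=3}
        System.out.println(wordCount(lines));
    }
}
```

In Hadoop the map calls run in parallel on different nodes and the shuffle moves data over the network, but the contract between the phases is the same as in this simulation.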

39-47 Mapper

    // Needs java.io.IOException, java.util.StringTokenizer,
    // org.apache.hadoop.io.IntWritable, LongWritable, Text,
    // and org.apache.hadoop.mapreduce.Mapper.
    // Nested inside the driver class; main refers to it as Map.class.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();  // reused for every emitted key

        // key: position in the file, value: one line of text
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);  // emit (word, 1)
            }
        }
    }

48-53 Reducer

    // Needs java.io.IOException, org.apache.hadoop.io.IntWritable, Text,
    // and org.apache.hadoop.mapreduce.Reducer.
    // Nested inside the driver class; main refers to it as Reduce.class.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        // key: a word, values: all counts emitted for that word
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));  // emit (word, sum) once per word
        }
    }

54-61 Main

    // Needs org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.Path,
    // org.apache.hadoop.io.IntWritable, Text, org.apache.hadoop.mapreduce.Job,
    // org.apache.hadoop.mapreduce.lib.input.FileInputFormat, TextInputFormat,
    // and org.apache.hadoop.mapreduce.lib.output.FileOutputFormat, TextOutputFormat.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Later Hadoop releases deprecate this constructor in favor of Job.getInstance(conf, "wordcount")
        Job job = new Job(conf, "wordcount");

        // Types of the (key, value) pairs the job emits
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        // Read and write plain text files
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory

        job.waitForCompletion(true);  // submit the job and block until it finishes
    }

62 Literature
- Book (available online): http://it-ebooks.info/book/635/
- Tutorials: http://hadoop.apache.org/ http://developer.
- Machine learning with Hadoop
