Summer 2013 Large-scale Information Processing Ulf Brefeld Knowledge Mining & Assessment brefeld@kma.informatik.tu-darmstadt.de
Anecdotal evidence... "I think there is a world market for about five computers." (Thomas J. Watson, Chairman of the Board of International Business Machines, 1943) "...it is very possible that... one machine would suffice to solve all the problems that are demanded of it from the whole country." (Sir Charles Darwin, grandson of the naturalist of the same name, head of Britain's National Physical Laboratory, 1946) "Originally one thought that if there were a half dozen large computers in this country, hidden away in research laboratories, this would take care of all requirements we had throughout the country." (Howard H. Aiken, computer pioneer, designer of the IBM Mark I, 1952) 2
Today... 3
4
Source: IBM 5
Query Volumes Source: Google Trends 6
Market Segmentation A company launches three new campaigns There are 2 GB of data from customers who bought one of the three in the past There are 5 GB of potential-customer data Who will be interested in which campaign? 7
Topics 8
Index Structures Make data searchable Efficient data structures Inverted index Radix tries... 9
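The inverted index mentioned above can be sketched in a few lines: for each term, store the set of ids of the documents that contain it, so lookups go term → documents instead of scanning all documents. A minimal sketch (the documents and class name here are made-up illustrations, not from the slides):

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Minimal inverted index: term -> sorted set of document ids containing it.
public class InvertedIndexSketch {
    public static void main(String[] args) {
        String[] docs = {"the cat sat", "the dog barked", "cat and dog"};
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int id = 0; id < docs.length; id++) {
            for (String term : docs[id].split(" ")) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
        // Which documents contain "cat"? No scan needed, just one lookup.
        System.out.println(index.get("cat"));  // prints [0, 2]
    }
}
```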
Unsupervised Machine Learning Clustering Find group structures Nearest neighbors Duplicate detection Outlier detection Find anomalies 10
Supervised Machine Learning Linear models Classification Regression Ranking Optimization Feature encodings/scaling Hashing tricks 11
Structured Prediction Variables inter-depend/correlate Sequences, trees, grids, etc. Named entity recognition, parsing Protein secondary structure prediction Image segmentation 12
Recommender Systems 13
Scenarios Single computer vs. cluster (Hadoop, MapReduce) Static data vs. data streams 14
Integrated lecture with exercises (Integrierte Veranstaltung) TUCaN 20-00-0706 Dates: Tuesdays, 15:20-17:00, S105/24 Thursdays, 11:40-13:20, S103/100 4 SWS, 6 CP Oral or written exam Slides plus blackboard... Office hours by appointment 15
Exercises Tutor: Emmanouil Tzouridis tzouridis@kma.informatik.tu-darmstadt.de Irregular (roughly every fourth session) Purpose: Theoretical understanding Practical deepening No hand-in/grading http://www.kma.informatik.tu-darmstadt.de/teaching/large-scale-information-processing/ 16
Further Readings Rajaraman & Ullman, Mining of Massive Datasets http://i.stanford.edu/~ullman/mmds.html 17
Large-scale Information Processing, Sommer 2013 Hadoop MapReduce Ulf Brefeld Knowledge Mining & Assessment brefeld@kma.informatik.tu-darmstadt.de
The Web 19
The size of the Web http://www.worldwidewebsize.com 20
Reading the Web... The average size of a web page in 2011 was 965 KB* About 1 billion pages times 965 KB is roughly 900 TB Reading 900 TB of data at 100 MB/s takes about 4 months on a single computer We need more computers... * according to tech.slashdot.org 21
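The back-of-envelope arithmetic above can be checked in a few lines (the figures are the slide's own; the class name is made up):

```java
// Sanity check: how long does it take to read ~900 TB at 100 MB/s?
public class ReadTheWeb {
    public static void main(String[] args) {
        double totalBytes = 900e12;       // ~900 TB of page data
        double bytesPerSecond = 100e6;    // 100 MB/s sequential read
        double seconds = totalBytes / bytesPerSecond;
        double days = seconds / (60 * 60 * 24);
        System.out.printf("%.0f days (~%.1f months)%n", days, days / 30);
    }
}
```

This lands at roughly 104 days, i.e. about 3.5 months, matching the "4 months" claim.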
Cluster Issues Reliability problems: computers fail every day Expected lifetime is about 3 years (~1000 days) With 1000 computers, roughly one fails per day Google had about 900,000 computers in 2011* On average about 900 failed every day Cluster size varies * according to datacenterknowledge.com 22
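The failure-rate estimate above is just cluster size divided by expected lifetime in days; a quick sketch with the slide's numbers (the class name is made up):

```java
// Expected machine failures per day = cluster size / expected lifetime (days).
public class ClusterFailures {
    public static void main(String[] args) {
        int lifetimeDays = 1000;                 // ~3-year expected lifetime
        int[] clusterSizes = {1000, 900_000};    // small cluster vs. Google 2011
        for (int n : clusterSizes) {
            double failuresPerDay = (double) n / lifetimeDays;
            System.out.printf("%d machines -> ~%.0f failures/day%n", n, failuresPerDay);
        }
    }
}
```

At Google's 2011 scale this means hardware failure is not an exception but a constant, which is why fault tolerance must be built into the software.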
Hadoop Open Source Apache Project Commodity hardware cluster (cheap Linux PCs) Hadoop Core includes: Distributed File System (distributes data) Map/Reduce (distributes application) Written in Java Runs on Linux, Mac OS X, Windows, Solaris, ... 23
Distributed File System Optimized for streaming reads of large files Files are broken into large blocks Blocks are typically 128 MB Default is 3 replicas for each block Clients read from closest replica Automatic re-replication 24
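The block and replica figures above translate directly into storage layout: a file is split into fixed-size blocks, and each block is stored three times. A small sketch of that arithmetic (the 10 GB file size is a made-up example, not from the slides):

```java
// How many blocks and block replicas does a file occupy in the DFS?
public class DfsBlocks {
    public static void main(String[] args) {
        long fileBytes = 10L * 1024 * 1024 * 1024;   // a hypothetical 10 GB file
        long blockBytes = 128L * 1024 * 1024;        // 128 MB block size
        int replicas = 3;                            // default replication factor
        long blocks = (fileBytes + blockBytes - 1) / blockBytes;  // ceiling division
        System.out.println(blocks + " blocks, "
                + blocks * replicas + " block replicas stored in the cluster");
    }
}
```

With 3 replicas per block, clients can read from the closest copy, and a lost replica can be re-created from the surviving two.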
Map Reduce Data Flow Input Map Shuffle Reduce Output 25
Features Java, C++, and text-based APIs Text-based (streaming) great for scripting Higher level interfaces: Pig, Hive, Jaql,... Automatic re-execution on failure Framework re-executes failed tasks Locality optimizations Map tasks are scheduled close to the inputs when possible 26
What is it good for? Indexing Log analysis Image manipulation Sorting large-scale data Data mining (though not optimal) Not well suited for: real-time processing, or compute-intensive tasks with little data 27
Important Functions Setup: set up the mapper task (load data, connect to database, ...) Map/Reduce: data processing (Mapper), value aggregation (Reducer) Cleanup: free memory, disconnect from database, ... 28
Example: Word Count
Hadoop's "Hello World". It's all about key-value pairs, processed in two phases: map and reduce.
Map: input lines of text, output words and their counts.
Reduce: sum up the counts for each word.

            Input                         Output
            Key          Value            Key    Value
  Mapper    pos in file  line of text     word   1
  Reducer   word         set of counts    word   sum
Word Count Data Flow (Input -> Map -> Shuffle -> Reduce -> Output)

Input lines, one per map task, with the (word, 1) pairs each mapper emits:
  M1: "the cat and the box"      ->  cat:1 the:1 box:1 the:1 and:1
  M2: "curiosity kills the cat"  ->  the:1 cat:1 curiosity:1 kills:1
  M3: "Erwin and his cat"        ->  cat:1 Erwin:1 and:1 his:1

The shuffle routes all pairs with the same key to the same reducer, which sums them:
  R1 ->  box:1 cat:3 the:3 curiosity:1
  R2 ->  and:2 kills:1 Erwin:1 his:1
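The data flow above can be simulated in plain Java before looking at the real Hadoop code: map each line to (word, 1) pairs, "shuffle" by grouping the pairs on their key, then reduce by summing each group (the class name is made up; this is an in-memory sketch, not Hadoop code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory simulation of the word count data flow: map, shuffle, reduce.
public class WordCountFlow {
    public static void main(String[] args) {
        String[] lines = {
            "the cat and the box",
            "curiosity kills the cat",
            "Erwin and his cat"
        };
        // Map + shuffle: emit (word, 1) per token and group the 1s by word.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {                 // one "map task" per line
            for (String word : line.split(" ")) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        // Reduce: sum the grouped counts for each word.
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int count : e.getValue()) sum += count;
            System.out.println(e.getKey() + ":" + sum);
        }
    }
}
```

Running this reproduces the reducer outputs from the slide, e.g. cat:3, the:3, and:2.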
Mapper

private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);    // emit (word, 1) for each token
    }
}
Reducer

public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();            // add up all counts for this word
    }
    context.write(key, new IntWritable(sum));
}
Main

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}
Literature Book (available online) http://it-ebooks.info/book/635/ Tutorials http://hadoop.apache.org/ http://developer.yahoo.com/hadoop/tutorial/ http://code.google.com/edu/parallel/index.html Machine learning with Hadoop http://mahout.apache.org 62