Information Retrieval and Map-Reduce Implementations
Mohammad Amir Sharif
PhD Student, Center for Advanced Computer Studies
mas4108@louisiana.edu
Map-Reduce: Why?
- Need to process 100 TB datasets
- On 1 node: scanning @ 50 MB/s = 23 days; MTBF = 3 years
- On a 1000-node cluster: scanning @ 50 MB/s = 33 min; MTBF = 1 day
- Need a framework for distribution
  - Efficient, reliable, easy to use
Hadoop: How?
- Commodity hardware cluster
- Distributed file system
  - Modeled on GFS
- Distributed processing framework
  - Using the Map/Reduce metaphor
- Open source, Java
  - Apache Lucene subproject
Map-Reduce Execution
[Figure: MapReduce execution overview. (1) The user program forks a master and worker processes. (2) The master assigns map and reduce tasks to workers. (3) Map workers read the input splits (split 0 through split 4) and (4) write intermediate files to local disk. (5) Reduce workers remote-read the intermediate files and (6) write the final output files (output file 0, output file 1).]
Distributed File System
- Single namespace for the entire cluster
  - Managed by a single namenode
  - Hierarchical directories
  - Optimized for streaming reads of large files
- Files are broken into large blocks
  - Typically 64 or 128 MB
  - Replicated to several datanodes for reliability
- Clients can find the location of blocks
- Clients talk to both the namenode and datanodes
  - Data is not sent through the namenode (see the sketch below)
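A minimal sketch of that client-side read path using Hadoop's FileSystem API (the file path is supplied by the caller): opening a file consults the namenode for block locations, while the bytes themselves stream from the datanodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up the cluster's fs settings
        FileSystem fs = FileSystem.get(conf);              // client handle; metadata via the namenode
        FSDataInputStream in = fs.open(new Path(args[0])); // namenode returns block locations only
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) > 0) {                // data streams directly from datanodes
            System.out.write(buffer, 0, n);
        }
        System.out.flush();
        in.close();
    }
}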
Distributed Processing
- The user submits a Map/Reduce job to the JobTracker
- The system:
  - Splits the job into lots of tasks
  - Schedules tasks on nodes close to the data
  - Monitors tasks
  - Kills and restarts tasks if they fail/hang/disappear
- Pluggable file systems for input/output
  - Local file system for testing, debugging, etc.
Map/Reduce Metaphor
- Data is a stream of keys and values
- Mapper
  - Input: key1, value1 pair
  - Output: key2, value2 pairs
- Reducer
  - Called once per key, in sorted order
  - Input: key2, stream of value2
  - Output: key3, value3 pairs
- Launching program
  - Creates a JobConf to define a job
  - Submits the JobConf and waits for completion
MapReduce
- Programmers specify two functions:
  - map (k, v) → <k′, v′>*
  - reduce (k′, v′) → <k′, v′>*
  - All values with the same key are reduced together
- The runtime handles everything else
- Not quite... usually, programmers also specify:
  - partition (k′, number of partitions) → partition for k′
    - Often a simple hash of the key, e.g., hash(k′) mod n (see the sketch below)
    - Divides up the key space for parallel reduce operations
  - combine (k′, v′) → <k′, v′>*
    - Mini-reducers that run in memory after the map phase
    - Used as an optimization to reduce network traffic
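A minimal sketch of such a hash(k′) mod n partitioner against Hadoop's old org.apache.hadoop.mapred API; it mirrors the HashPartitioner that ships with Hadoop as the default.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class ModHashPartitioner<K, V> implements Partitioner<K, V> {
    public void configure(JobConf job) { }  // no per-job setup needed

    public int getPartition(K key, V value, int numPartitions) {
        // Mask off the sign bit so the modulo result is never negative,
        // then assign the key to one of numPartitions reduce partitions.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}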
[Figure: end-to-end dataflow. Four mappers consume (k1,v1) through (k6,v6) and emit pairs such as (a,1), (b,2), (c,3), (c,6), (a,5), (c,2), (b,7), (c,8). Combiners merge map-local pairs, e.g., (c,3) and (c,6) become (c,9). Partitioners route keys to reducers; shuffle and sort aggregates values by key, giving a→{1,5}, b→{2,7}, c→{2,9,8}. Three reducers then produce (r1,s1), (r2,s2), (r3,s3).]
Word Count

Map(String docid, String text):
  for each word w in text:
    Emit(w, 1);

Reduce(String term, Iterator<Int> values):
  int sum = 0;
  for each v in values:
    sum += v;
  Emit(term, sum);
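For reference, a runnable version of this word count against Hadoop's old org.apache.hadoop.mapred API; the driver matches the JobConf launching pattern from the earlier slide, and the reducer is reused as the combiner. Class and job names are illustrative.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, ONE);   // emit (w, 1) for every word in the line
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text term, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();  // add up the partial counts for this term
            }
            output.collect(term, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);          // reuse the reducer as an in-memory combiner
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);                        // submit and wait for completion
    }
}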
MapReduce: Index Construction
- Map over all documents
  - Emit term as key, (docno, tf) as value
  - Emit other information as necessary (e.g., term position)
- Sort/shuffle: group postings by term
- Reduce
  - Gather and sort the postings (e.g., by docno or tf)
  - Write postings to disk
- MapReduce does all the heavy lifting!
Inverted Indexing with MapReduce
[Figure: Doc 1 = "one fish, two fish"; Doc 2 = "red fish, blue fish"; Doc 3 = "cat in the hat".
Map emits: one→(1,1); two→(1,1); fish→(1,2); red→(2,1); blue→(2,1); fish→(2,2); cat→(3,1); hat→(3,1).
Shuffle and sort aggregates values by key.
Reduce produces: cat→(3,1); fish→(1,2),(2,2); one→(1,1); red→(2,1); blue→(2,1); hat→(3,1); two→(1,1).]
Inverted Indexing: Pseudo-Code
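The pseudocode body of this slide did not survive extraction. Based on the baseline design described on the surrounding slides (terms as keys, postings buffered and sorted in the reducer), a sketch in Hadoop's old Java API might look like the following; the "docno:tf" string encoding of postings is an assumption made to keep the sketch self-contained, not something from the original slide.

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class BaselineIndexer {
    // Map over documents, emitting (term, "docno:tf") pairs.
    // With TextInputFormat the key is really a byte offset; it stands in
    // for a docno here to keep the sketch self-contained.
    public static class IndexMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable docno, Text doc,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            HashMap<String, Integer> tf = new HashMap<String, Integer>();
            for (String term : doc.toString().split("\\s+")) {
                Integer c = tf.get(term);
                tf.put(term, c == null ? 1 : c + 1);   // term frequency in this doc
            }
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                output.collect(new Text(e.getKey()),
                               new Text(docno.get() + ":" + e.getValue()));
            }
        }
    }

    // Reduce: buffer ALL postings for a term, sort by docno, then write.
    // This in-memory buffer is the scalability bottleneck discussed below.
    public static class IndexReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text term, Iterator<Text> values,
                           OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            List<long[]> postings = new ArrayList<long[]>();
            while (values.hasNext()) {
                String[] p = values.next().toString().split(":");
                postings.add(new long[] { Long.parseLong(p[0]), Long.parseLong(p[1]) });
            }
            Collections.sort(postings, new Comparator<long[]>() {
                public int compare(long[] a, long[] b) {
                    return Long.compare(a[0], b[0]);   // order postings by docno
                }
            });
            StringBuilder sb = new StringBuilder();
            for (long[] p : postings) {
                sb.append('(').append(p[0]).append(',').append(p[1]).append(") ");
            }
            output.collect(term, new Text(sb.toString().trim()));
        }
    }
}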
Positional Indexes
[Figure: same three documents, but map now also emits position lists: one→(1,1,[1]); two→(1,1,[3]); fish→(1,2,[2,4]); red→(2,1,[1]); blue→(2,1,[3]); fish→(2,2,[2,4]); cat→(3,1,[1]); hat→(3,1,[2]).
Shuffle and sort aggregates values by key.
Reduce produces: cat→(3,1,[1]); fish→(1,2,[2,4]),(2,2,[2,4]); one→(1,1,[1]); red→(2,1,[1]); blue→(2,1,[3]); hat→(3,1,[2]); two→(1,1,[3]).]
Inverted Indexing: Pseudo-Code
What's the problem?
Scalability Bottleneck
- Initial implementation: terms as keys, postings as values
- Reducers must buffer all postings associated with a key (to sort them)
- What if we run out of memory to buffer the postings?
Another Try

(key)   (values)            (keys)      (values)
fish    1, 2, [2,4]         fish 1      [2,4]
        34, 1, [23]         fish 9      [9]
        21, 3, [1,8,22]     fish 21     [1,8,22]
        35, 2, [8,41]       fish 34     [23]
        80, 3, [2,9,76]     fish 35     [8,41]
        9, 1, [9]           fish 80     [2,9,76]

How is this different?
- Let the framework do the sorting
- Term frequency implicitly stored (as the length of the position list)
- Directly write postings to disk!
Another Approach
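This slide's body also did not extract. Consistent with the (term, docno) composite keys shown on the previous slide, the core of the approach is value-to-key conversion: move the docno from the value into the key, partition on the term alone, and let the framework's sort deliver each term's postings to the reducer already ordered by docno. A minimal sketch of the custom partitioner; the tab-separated Text encoding of the composite key is an assumption.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// The mapper is assumed to emit composite keys of the form "term \t docno"
// with the position list as the value. Partitioning on the term alone sends
// every (term, docno) pair for one term to the same reducer, so the reducer
// can stream postings straight to disk without buffering them.
public class TermOnlyPartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf job) { }

    public int getPartition(Text key, Text value, int numPartitions) {
        String term = key.toString().split("\t")[0];   // ignore the docno half of the key
        return (term.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

One practical detail: with a plain-text composite key, docnos would need zero-padding (or a custom binary key type with its own comparator) so the framework's lexicographic sort orders them numerically.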
The indexing problem
- MapReduce it?
- Scalability is paramount
- Must be relatively fast, but need not be real time
- Fundamentally a batch operation
- Incremental updates may or may not be important
- For the web, crawling is a challenge in itself
The retrieval problem
- Must have sub-second response time
- For the web, only need relatively few results
Retrieval with MapReduce?
- MapReduce is fundamentally batch-oriented
  - Optimized for throughput, not latency
  - Startup of mappers and reducers is expensive
- MapReduce is not suitable for real-time queries!
  - Use separate infrastructure for retrieval
Term vs. Document Partitioning
[Figure: the T × D term-document matrix split two ways. Term partitioning slices the matrix horizontally into T1, T2, T3: each node holds the full postings for a subset of the terms. Document partitioning slices it vertically into D1, D2, D3: each node holds a complete index over a subset of the documents.]
Parallel Queries Algorithm
- Assume the standard inner-product formulation:
  score(q, d) = Σ_{t ∈ V} w_{t,q} · w_{t,d}
- Algorithm sketch:
  - Load queries into memory in each mapper
  - Map over postings, compute partial term contributions and store them in accumulators
  - Emit accumulators as intermediate output
  - Reducers merge accumulators to compute final document scores
Parallel Queries: Map
[Figure: query id = 1, query = "blue fish". One mapper maps over the postings for blue, (9,2), (21,1), (35,1), computes that term's score contributions, and emits key = 1, value = { 9:2, 21:1, 35:1 }. Another mapper does the same for the fish postings (1,2), (9,1), (21,3), (34,1), (35,2), (80,3) and emits key = 1, value = { 1:2, 9:1, 21:3, 34:1, 35:2, 80:3 }.]
Parallel Queries: Reduce
- Input: key = 1, value = { 9:2, 21:1, 35:1 } and key = 1, value = { 1:2, 9:1, 21:3, 34:1, 35:2, 80:3 }
- The reducer takes the element-wise sum of the associative arrays:
  key = 1, value = { 1:2, 9:3, 21:4, 34:1, 35:3, 80:3 }
- Sort the accumulators to generate the final ranking for the query "blue fish":
  doc 21, score=4
  doc 9, score=3
  doc 35, score=3
  doc 80, score=3
  doc 1, score=2
  doc 34, score=1
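A standalone sketch of that reduce-side merge, using the toy integer weights from the figures above; tied documents (here docs 9, 35, and 80) may print in any order.

import java.util.*;

public class AccumulatorMerge {
    // Element-wise sum of the partial accumulators emitted by the mappers.
    static Map<Integer, Integer> merge(List<Map<Integer, Integer>> partials) {
        Map<Integer, Integer> merged = new HashMap<Integer, Integer>();
        for (Map<Integer, Integer> acc : partials) {
            for (Map.Entry<Integer, Integer> e : acc.entrySet()) {
                Integer cur = merged.get(e.getKey());   // add this term's contribution
                merged.put(e.getKey(), cur == null ? e.getValue() : cur + e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // The two accumulators from the "blue fish" example:
        Map<Integer, Integer> blue = new HashMap<Integer, Integer>();
        blue.put(9, 2); blue.put(21, 1); blue.put(35, 1);
        Map<Integer, Integer> fish = new HashMap<Integer, Integer>();
        fish.put(1, 2); fish.put(9, 1); fish.put(21, 3);
        fish.put(34, 1); fish.put(35, 2); fish.put(80, 3);

        List<Map<Integer, Integer>> partials = new ArrayList<Map<Integer, Integer>>();
        partials.add(blue);
        partials.add(fish);
        Map<Integer, Integer> scores = merge(partials);

        // Sort accumulators by descending score to produce the final ranking.
        List<Map.Entry<Integer, Integer>> ranked =
                new ArrayList<Map.Entry<Integer, Integer>>(scores.entrySet());
        Collections.sort(ranked, new Comparator<Map.Entry<Integer, Integer>>() {
            public int compare(Map.Entry<Integer, Integer> a, Map.Entry<Integer, Integer> b) {
                return b.getValue() - a.getValue();
            }
        });
        for (Map.Entry<Integer, Integer> e : ranked) {
            System.out.println("doc " + e.getKey() + ", score=" + e.getValue());
        }
    }
}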