Information Retrieval and Map-Reduce Implementations. Mohammad Amir Sharif, PhD Student, Center for Advanced Computer Studies


Information Retrieval and Map-Reduce Implementations. Mohammad Amir Sharif, PhD Student, Center for Advanced Computer Studies. mas4108@louisiana.edu

Map-Reduce: Why?
Need to process 100 TB datasets.
On 1 node: scanning @ 50 MB/s = 23 days; MTBF = 3 years.
On a 1000-node cluster: scanning @ 50 MB/s = 33 min; MTBF = 1 day.
Need a framework for distribution: efficient, reliable, easy to use.
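The back-of-the-envelope numbers above check out; a quick sketch (taking 1 TB as 10^12 bytes and assuming a perfectly even 1000-way split):

```python
TB = 10**12
MB = 10**6

data = 100 * TB   # 100 TB dataset
rate = 50 * MB    # 50 MB/s sequential scan per node

one_node_days = data / rate / 86400          # seconds -> days
cluster_minutes = data / (rate * 1000) / 60  # 1000 nodes scanning in parallel

print(round(one_node_days, 1))    # ~23 days on a single node
print(round(cluster_minutes, 1))  # ~33 minutes on 1000 nodes
```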

Hadoop: How?
Commodity hardware cluster.
Distributed file system, modeled on GFS.
Distributed processing framework using the Map/Reduce metaphor.
Open source, written in Java; an Apache Lucene subproject.

Map-Reduce Execution
[Diagram: the user program forks (1) a master and worker processes. The master assigns (2) map and reduce tasks to workers. Map workers read (3) the input splits (split 0 .. split 4) and write (4) intermediate files to local disk; reduce workers remote-read (5) those files and write (6) the output files (output file 0, output file 1). Overall flow: input files -> map phase -> intermediate files (on local disk) -> reduce phase -> output files.]

Distributed File System
Single namespace for the entire cluster, managed by a single namenode.
Hierarchical directories.
Optimized for streaming reads of large files.
Files are broken into large blocks, typically 64 or 128 MB, replicated to several datanodes for reliability.
Clients can find the location of blocks; a client talks to both the namenode and the datanodes. Data is not sent through the namenode.
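The block splitting described above can be sketched as a toy model (not the HDFS API; the 128 MB block size and 3-way replication used here are the common defaults mentioned in the slide):

```python
def split_into_blocks(file_size, block_size=128 * 2**20):
    """Break a file into fixed-size blocks, HDFS-style (last block may be short)."""
    blocks = []
    offset = 0
    while offset < file_size:
        blocks.append(min(block_size, file_size - offset))
        offset += block_size
    return blocks

# A 1 GB file with 128 MB blocks -> 8 full blocks.
blocks = split_into_blocks(2**30)
print(len(blocks))      # 8
# With 3-way replication, the cluster stores 24 block copies in total.
print(len(blocks) * 3)  # 24
```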

Distributed Processing
The user submits a Map/Reduce job to the JobTracker.
The system splits the job into many tasks, schedules tasks on nodes close to the data, monitors the tasks, and kills and restarts them if they fail, hang, or disappear.
Pluggable file systems for input/output; the local file system can be used for testing, debugging, etc.

Map/Reduce Metaphor
Data is a stream of keys and values.
Mapper: input is a (key1, value1) pair; output is (key2, value2) pairs.
Reducer: called once per key, in sorted order; input is key2 and a stream of value2; output is (key3, value3) pairs.
Launching program: creates a JobConf to define a job, then submits the JobConf and waits for completion.

MapReduce
Programmers specify two functions:
map (k, v) -> <k', v'>*
reduce (k', v') -> <k', v'>*
All values with the same key are reduced together. The runtime handles everything else.
Not quite... usually, programmers also specify:
partition (k', number of partitions) -> partition for k'
Often a simple hash of the key, e.g., hash(k') mod n; divides up the key space for parallel reduce operations.
combine (k', v') -> <k', v'>*
Mini-reducers that run in memory after the map phase; used as an optimization to reduce network traffic.

[Diagram: four mappers consume input pairs (k1, v1) .. (k6, v6) and emit intermediate pairs (a, 1), (b, 2), (c, 3), (c, 6), (a, 5), (c, 2), (b, 7), (c, 8). Per-mapper combiners pre-aggregate (e.g., (c, 3) and (c, 6) become (c, 9)); partitioners assign keys to reducers. Shuffle and sort aggregates values by key: a -> [1, 5], b -> [2, 7], c -> [2, 9, 8]. Three reducers emit (r1, s1), (r2, s2), (r3, s3).]
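The dataflow in the figure can be simulated in a few lines of plain Python (an illustrative sketch, not Hadoop code; the intermediate pairs are grouped into three mappers here for brevity):

```python
from collections import defaultdict

def combine(pairs):
    """Mini-reducer: sum values per key within one mapper's output."""
    acc = defaultdict(int)
    for k, v in pairs:
        acc[k] += v
    return sorted(acc.items())

# Per-mapper intermediate output, using the figure's (letter, count) pairs.
mapper_outputs = [
    [("a", 1), ("b", 2), ("c", 3), ("c", 6)],
    [("a", 5), ("c", 2)],
    [("b", 7), ("c", 8)],
]

combined = [combine(out) for out in mapper_outputs]
# combined[0] == [("a", 1), ("b", 2), ("c", 9)] -- (c, 3) and (c, 6) merged

# Shuffle and sort: aggregate values by key across all mappers.
groups = defaultdict(list)
for out in combined:
    for k, v in out:
        groups[k].append(v)

# Reduce: here, sum the grouped values.
result = {k: sum(vs) for k, vs in sorted(groups.items())}
print(result)  # {'a': 6, 'b': 9, 'c': 19}
```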

Word Count
Map(String docid, String text):
  for each word w in text:
    Emit(w, 1);

Reduce(String term, Iterator<Int> values):
  int sum = 0;
  for each v in values:
    sum += v;
  Emit(term, sum);
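The pseudocode above translates directly to plain Python (a sketch, with an in-memory shuffle standing in for the framework):

```python
from collections import defaultdict

def wc_map(docid, text):
    """Emit (word, 1) for each word, as in Map(docid, text)."""
    return [(w, 1) for w in text.split()]

def wc_reduce(term, values):
    """Sum the counts for one term, as in Reduce(term, values)."""
    return (term, sum(values))

docs = {"d1": "one fish two fish", "d2": "red fish blue fish"}

# Shuffle: group mapper output by key.
groups = defaultdict(list)
for docid, text in docs.items():
    for word, one in wc_map(docid, text):
        groups[word].append(one)

counts = dict(wc_reduce(t, vs) for t, vs in groups.items())
print(counts["fish"])  # 4
```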

MapReduce: Index Construction
Map over all documents:
emit term as key, (docno, tf) as value;
emit other information as necessary (e.g., term position).
Sort/shuffle: group postings by term.
Reduce: gather and sort the postings (e.g., by docno or tf), then write the postings to disk.
MapReduce does all the heavy lifting!

Inverted Indexing with MapReduce
[Diagram: Doc 1 = "one fish, two fish"; Doc 2 = "red fish, blue fish"; Doc 3 = "cat in the hat".
Map emits term -> (docno, tf): one -> (1, 1); two -> (1, 1); fish -> (1, 2); red -> (2, 1); blue -> (2, 1); fish -> (2, 2); cat -> (3, 1); hat -> (3, 1).
Shuffle and sort aggregates values by key.
Reduce writes the postings lists: blue -> (2, 1); cat -> (3, 1); fish -> (1, 2), (2, 2); hat -> (3, 1); one -> (1, 1); red -> (2, 1); two -> (1, 1).]
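The example above can be reproduced with a short sketch (illustrative Python, not the slide's implementation; whitespace tokenization, with the stopwords "in"/"the" dropped as in the figure):

```python
from collections import defaultdict

docs = {1: "one fish, two fish", 2: "red fish, blue fish", 3: "cat in the hat"}
stopwords = {"in", "the"}

# Map: emit (term, (docno, tf)) pairs per document.
emitted = []
for docno, text in docs.items():
    tf = defaultdict(int)
    for term in text.replace(",", "").split():
        if term not in stopwords:
            tf[term] += 1
    emitted.extend((term, (docno, f)) for term, f in tf.items())

# Shuffle/sort + reduce: group postings by term, ordered by docno.
index = defaultdict(list)
for term, posting in sorted(emitted):
    index[term].append(posting)

print(index["fish"])  # [(1, 2), (2, 2)]
```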

Inverted Indexing: Pseudo-Code

Positional Indexes
[Diagram: same three documents, but each value now carries a position list.
Map emits term -> (docno, tf, positions): one -> (1, 1, [1]); two -> (1, 1, [3]); fish -> (1, 2, [2,4]); red -> (2, 1, [1]); blue -> (2, 1, [3]); fish -> (2, 2, [2,4]); cat -> (3, 1, [1]); hat -> (3, 1, [2]).
Shuffle and sort aggregates values by key.
Reduce writes positional postings, e.g., fish -> (1, 2, [2,4]), (2, 2, [2,4]).]

Inverted Indexing: Pseudo-Code. What's the problem?

Scalability Bottleneck
Initial implementation: terms as keys, postings as values.
Reducers must buffer all postings associated with a key (in order to sort them).
What if we run out of memory to buffer the postings?

Another Try
[Table, before: (key) -> (values): fish -> (1, 2, [2,4]), (9, 1, [9]), (21, 3, [1,8,22]), (34, 1, [23]), (35, 2, [8,41]), (80, 3, [2,9,76]).
After: (keys) -> (values): (fish, 1) -> [2,4]; (fish, 9) -> [9]; (fish, 21) -> [1,8,22]; (fish, 34) -> [23]; (fish, 35) -> [8,41]; (fish, 80) -> [2,9,76].]
How is this different? Let the framework do the sorting. Term frequency is implicitly stored. Directly write postings to disk!
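The value-to-key trick above can be shown in miniature (a sketch; a real Hadoop job would also need a custom partitioner so that every (term, docno) key for one term reaches the same reducer):

```python
# Initial implementation: the reducer receives one term with all postings as
# values, and must buffer and sort them in memory itself.
pairs_v1 = [("fish", (35, [8, 41])), ("fish", (1, [2, 4])), ("fish", (21, [1, 8, 22]))]

# Another try: move docno into a composite (term, docno) key. The framework's
# sort phase then streams postings to the reducer already ordered by docno,
# and tf is implicit as len(positions).
pairs_v2 = [((term, docno), positions) for term, (docno, positions) in pairs_v1]

for key, positions in sorted(pairs_v2):
    print(key, positions)  # arrives at the reducer in docno order
```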

Another Approach

The Indexing Problem: MapReduce It?
Scalability is paramount. The indexer must be relatively fast, but need not be real time; it is fundamentally a batch operation. Incremental updates may or may not be important. For the web, crawling is a challenge in itself.
The Retrieval Problem
Retrieval must have sub-second response time. For the web, only relatively few results are needed.

Retrieval with MapReduce?
MapReduce is fundamentally batch-oriented: optimized for throughput, not latency, and the startup of mappers and reducers is expensive.
MapReduce is not suitable for real-time queries! Use separate infrastructure for retrieval.

Term vs. Document Partitioning
[Diagram: the index is a T x D matrix (terms by documents). Term partitioning splits it by rows into T1, T2, T3; document partitioning splits it by columns into D1, D2, D3.]
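The two schemes can be illustrated on a toy index (a sketch with made-up data; each resulting dictionary is what one server would hold):

```python
# Toy inverted index: term -> list of docnos.
index = {"blue": [2], "cat": [3], "fish": [1, 2], "hat": [3]}

# Term partitioning: split the vocabulary across servers (rows of the T x D matrix).
term_parts = [{"blue", "cat"}, {"fish", "hat"}]
by_terms = [{t: index[t] for t in sorted(part)} for part in term_parts]

# Document partitioning: split the collection across servers (columns of the matrix);
# each server indexes only its own documents.
doc_parts = [{1, 2}, {3}]
by_docs = [
    {t: [d for d in ds if d in part] for t, ds in index.items() if any(d in part for d in ds)}
    for part in doc_parts
]

print(by_terms[1])  # {'fish': [1, 2], 'hat': [3]}
print(by_docs[1])   # {'cat': [3], 'hat': [3]}
```

Note the trade-off this makes visible: with term partitioning a multi-term query must contact the server of every query term, while with document partitioning every server evaluates the whole query over its own documents.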

Parallel Queries Algorithm
Assume the standard inner-product formulation:
score(q, d) = Σ_{t ∈ V} w_{t,q} · w_{t,d}
Algorithm sketch: load the queries into memory in each mapper; map over postings, computing partial term contributions and storing them in accumulators; emit the accumulators as intermediate output; reducers merge accumulators to compute the final document scores.

Parallel Queries: Map
Query id = 1, "blue fish".
One mapper scans the postings for "blue": (9, 2), (21, 1), (35, 1). It computes the score contributions for the term and emits key = 1, value = { 9:2, 21:1, 35:1 }.
Another mapper scans the postings for "fish": (1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3), and emits key = 1, value = { 1:2, 9:1, 21:3, 34:1, 35:2, 80:3 }.

Parallel Queries: Reduce
key = 1, value = { 9:2, 21:1, 35:1 }
key = 1, value = { 1:2, 9:1, 21:3, 34:1, 35:2, 80:3 }
The reducer computes the element-wise sum of the associative arrays:
key = 1, value = { 1:2, 9:3, 21:4, 34:1, 35:3, 80:3 }
Sorting the accumulators generates the final ranking for query "blue fish": doc 21, score = 4; doc 9, score = 3; doc 35, score = 3; doc 80, score = 3; doc 1, score = 2; doc 34, score = 1.
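The element-wise sum and final sort can be reproduced directly (a plain-Python sketch of the reducer; Counter addition implements the element-wise sum):

```python
from collections import Counter

# Per-term accumulators emitted by the mappers for query id 1, "blue fish".
acc_blue = Counter({9: 2, 21: 1, 35: 1})
acc_fish = Counter({1: 2, 9: 1, 21: 3, 34: 1, 35: 2, 80: 3})

# Reducer: element-wise sum of the associative arrays (docno -> partial score).
merged = acc_blue + acc_fish

# Sort accumulators (descending score, then docno) for the final ranking.
ranking = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranking[0])  # (21, 4) -- doc 21, score 4
```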