Information Retrieval


https://vvtesh.sarahah.com/ Information Retrieval Venkatesh Vinayakarao Term: Aug - Dec, 2018 Indian Institute of Information Technology, Sri City So much of life, it seems to me, is determined by pure randomness. Sidney Poitier. Love Tarun Venkatesh Vinayakarao (Vv)

Big Data Hadoop, Map Reduce and Pig

Near Duplicates How to detect near duplicate documents?

Near Duplicates How to detect near duplicate documents? Document d1 Tendulkar hits a century. k-shingles {Tendulkar hits, hits a, a century } Document d2 Tendulkar hits a fantastic century. k-shingles {Tendulkar hits, hits a, a fantastic, fantastic century} Apply Jaccard Similarity

3Vs of Big Data 5

An Example of Distributed Platforms 6

HDP: Hortonworks Data Platform 7

Hadoop Installation 8

VirtualBox Installation 9

Cloudera VM Installation 10

VirtualBox Setup 11

Adding Cloudera VM 12

Installation Video https://www.youtube.com/watch?v=p7ucyffwl-c https://www.youtube.com/watch?v=l0lppc5qeyu 13

Advanced MapReduce: Job Chaining How do we find the top k words? Write one MR job for word count, then write another MR job that processes its output to find the top k. Sometimes the output of job 1 is not large; then you do not need an MR job for the second step and can use a shell script or do it manually, for example: hadoop fs -cat /output/of/wordcount/part* | sort -n -k2 -r | head -n20 14

Reduce Optimization How to do top k? Have each reducer output only its k largest results. If you have only one reducer, the problem is already solved; see the sketch below. 15
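A minimal Java sketch of this idea (not from the slides): the class name TopKReducer, the constant K = 20, and the assumption that the input is word/partial-count pairs from the word-count job are all illustrative choices.

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TopKReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private static final int K = 20;                         // how many results this reducer keeps
    private final TreeMap<Integer, String> topK = new TreeMap<>();  // count -> word, sorted ascending

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context) {
        int total = 0;
        for (IntWritable c : counts) total += c.get();       // sum the partial counts for this word
        topK.put(total, word.toString());                    // note: words with equal counts overwrite each other here
        if (topK.size() > K) topK.remove(topK.firstKey());   // keep only the K largest counts
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit the K largest counts; with a single reducer this is the global top k.
        for (Map.Entry<Integer, String> e : topK.descendingMap().entrySet()) {
            context.write(new Text(e.getValue()), new IntWritable(e.getKey()));
        }
    }
}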

Example https://github.com/adamjshook/mapreducepatterns/blob/master/mrdp/src/main/java/mrdp/ch3/toptendriver.java 16

MR Design Patterns A taxonomy of patterns: 1. Summarization (Average, Median); 2. Filtering (Top k, Distinct); 3. Joins (Cartesian Product); 4. Meta patterns (Job Chaining); 5. Data Organization (Partitioning, Binning); 6. External I/O. 17

Pig A scripting language to simplify processing on Hadoop. More information at http://pig.apache.org/docs/r0.14.0/start.html 18

Benefits & Limitations Benefits: roughly 10 lines of Pig Latin = 200 lines of Java, and roughly 15 minutes in Pig Latin = 3 hours in Java; simple, easy, and quick to code. Limitations: slower than hand-written MapReduce. 19

Word Count
Lines = LOAD 'input/hadoop.log' AS (line: chararray);
Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
Groups = GROUP Words BY word;
Counts = FOREACH Groups GENERATE group, COUNT(Words);
Results = ORDER Counts BY $1 DESC;
Top5 = LIMIT Results 5;
STORE Top5 INTO '/output/top5words'; 20

Pig in the Real World Yahoo used it extensively (>70% of its Hadoop jobs); Facebook and Twitter use it to process logs; eBay uses it for data processing for intelligence. 21

UDF in Pig
-- myscript.pig
REGISTER myudfs.jar;
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B; 22

Simple UDF
package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UPPER extends EvalFunc<String>
{
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
Source: https://pig.apache.org/docs/r0.10.0/udf.html 23

Creating the Jar Compile and package the UDF: javac -cp /..path../pig.jar UPPER.java, then jar -cf exampleudf.jar exampleudf. Know where you have placed this jar. In the Pig script: REGISTER <path to jar>; DEFINE SIMPLEUPPER exampleudf.UPPER(); Now you can use this method. 24

https://vvtesh.sarahah.com/ Information Retrieval Venkatesh Vinayakarao Term: Aug - Dec, 2018 Indian Institute of Information Technology, Sri City So much of life, it seems to me, is determined by pure randomness. Sidney Poitier. Love Tarun Venkatesh Vinayakarao (Vv)

The Intuition Which bucket is most likely to lead to a black ball? [figure: three candidate buckets of balls] 26

The Intuition Which bucket is most likely to lead to a black ball? [figure: the same three buckets] 27

The Intuition Which document is most likely to lead to a query? [figure: query terms pointing to Document 1, Document 2, Document 3] Can be modeled using a discriminative model, P(D | R=1, Q), or a generative model, P(Q | R=1, D). 28

A good query Words that are most likely to appear in a relevant document 29

Another Way to Rank Rank documents based on the probability of the document's model generating the query, P(q | M_d), where M_d is the model of the document, which generates the query. 30

A Model to Generate a String What strings can this model generate? This is a finite automaton that can generate: I wish, I wish I wish, I wish I wish I wish, and so on. 31

Key Idea Document → Generative Model → Words with probabilities 32

Example P(STOP | t) = 0.2, i.e., the probability that our automaton stops after emitting any word is 0.2. You are given the probabilities of words appearing in the query: the 0.2, a 0.1, frog 0.01, toad 0.01, said 0.03, likes 0.02, that 0.04. What is the probability of "frog said that toad likes frog" being the query? Answer: (0.01 × 0.03 × 0.04 × 0.01 × 0.02 × 0.01) × (0.8^5 × 0.2) = 0.000000000001573. Uday: It is much easier to take logs and add instead of multiplying these numbers. 33
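A small Java sketch (not from the slides) that reproduces this computation and the log-space variant Uday suggests, using the word probabilities and stop probability given above:

import java.util.Map;

public class QueryProbability {
    public static void main(String[] args) {
        // Unigram probabilities from the slide's table.
        Map<String, Double> p = Map.of(
            "the", 0.2, "a", 0.1, "frog", 0.01, "toad", 0.01,
            "said", 0.03, "likes", 0.02, "that", 0.04);
        double pStop = 0.2;                        // P(STOP | t) from the slide
        String[] query = "frog said that toad likes frog".split(" ");

        double prob = 1.0;
        double logProb = 0.0;
        for (String w : query) {
            prob *= p.get(w);                      // direct product of word probabilities
            logProb += Math.log(p.get(w));         // log space avoids underflow on long queries
        }
        prob *= Math.pow(1 - pStop, query.length - 1) * pStop;
        logProb += (query.length - 1) * Math.log(1 - pStop) + Math.log(pStop);

        System.out.println(prob);                  // ~1.57e-12, matching the slide
        System.out.println(Math.exp(logProb));     // same value recovered from log space
    }
}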

Comparing Models Say we have two models and compare the probability each assigns to the query string. Model 1 has the higher probability of generating the string, so the document behind Model 1 is the better match for this query. 34

A Simple Language Model From a simple model to even simpler ones: the unigram language model and the bigram language model (formulas below). 35
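The slide's formulas were not captured in the transcription; the standard definitions they refer to are:

$P_{\text{uni}}(t_1 t_2 t_3 t_4) = P(t_1)\, P(t_2)\, P(t_3)\, P(t_4)$
$P_{\text{bi}}(t_1 t_2 t_3 t_4) = P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_2)\, P(t_4 \mid t_3)$

The unigram model drops all conditioning on context; the bigram model conditions each term on the previous one.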

Document as a Bag of Words We rank documents by the probability that the document generates the given query. If documents are bags of words: assume P(d) is uniform, hence it can be dropped; P(q) and the multinomial coefficient K_q do not affect the ranking, hence they can be dropped too (see the decomposition below). 36
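A reconstruction of the decomposition being invoked (the standard query-likelihood derivation; the exact formula on the slide was not captured):

$P(d \mid q) = \frac{P(q \mid d)\, P(d)}{P(q)} \propto P(q \mid d)$
$P(q \mid M_d) = K_q \prod_{t \in q} P(t \mid M_d)^{\mathrm{tf}_{t,q}}$

where $K_q$ depends only on the query. Dropping $P(d)$, $P(q)$, and $K_q$ leaves ranking by $\prod_{t \in q} P(t \mid M_d)$.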

Missing Terms What happens to P(d | q) if a query term does not appear in the document? Clue: we approximate P(d | q) for ranking by computing the product of P(t | M_d) over the query terms. 37

Missing Terms & Smoothing Instead of zero, use P(t | M_c), where M_c is the model over the entire collection. Smoothing: we can also mix document and collection probabilities using a linear interpolation coefficient λ (formula below). 38
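The mixture the slide describes is, in its standard form (consistent with the worked example on the next slides):

$\hat{P}(t \mid d) = \lambda\, P(t \mid M_d) + (1 - \lambda)\, P(t \mid M_c), \qquad 0 \le \lambda \le 1$

and documents are then ranked by $P(q \mid d) \propto \prod_{t \in q} \hat{P}(t \mid d)$.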

Wake Up Quiz Is the following the same as the document length? True or False? $\sum_{i=1}^{M} \mathrm{tf}(t_i, d)$, where M is the term vocabulary size and tf(t, d) is the term frequency of term t in document d. 39

Example Suppose: d1: "Tada is a city between Chennai and Tirupati"; d2: "Tada has few restaurants but no good malls". Use the smoothed MLE unigram language model to rank these documents for the query "Tada City". Assume λ = 0.5. 40

Example (continued) With d1, d2, and λ = 0.5 as above: P(q | d1) = [(1/8 + 2/16)/2] × [(1/8 + 1/16)/2] = 1/8 × 3/32 = 3/256; P(q | d2) = [(1/8 + 2/16)/2] × [(0/8 + 1/16)/2] = 1/8 × 1/32 = 1/256. Ranking: d1 > d2. 41
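A small Java sketch (not from the slides) that reproduces this ranking with λ = 0.5; the lowercased, whitespace-split tokenization is an assumption.

import java.util.*;

public class SmoothedUnigramRanking {
    public static void main(String[] args) {
        String[] docs = {
            "Tada is a city between Chennai and Tirupati",   // d1
            "Tada has few restaurants but no good malls"     // d2
        };
        String[] query = "Tada City".toLowerCase().split("\\s+");
        double lambda = 0.5;

        // Per-document and collection term frequencies.
        Map<String, Integer> collectionTf = new HashMap<>();
        int collectionLen = 0;
        List<Map<String, Integer>> docTf = new ArrayList<>();
        List<Integer> docLen = new ArrayList<>();
        for (String d : docs) {
            Map<String, Integer> tf = new HashMap<>();
            String[] tokens = d.toLowerCase().split("\\s+");
            for (String t : tokens) {
                tf.merge(t, 1, Integer::sum);
                collectionTf.merge(t, 1, Integer::sum);
            }
            docTf.add(tf);
            docLen.add(tokens.length);
            collectionLen += tokens.length;
        }

        // Smoothed query likelihood: product over query terms of
        // lambda * P(t | M_d) + (1 - lambda) * P(t | M_c).
        for (int i = 0; i < docs.length; i++) {
            double score = 1.0;
            for (String t : query) {
                double pDoc = docTf.get(i).getOrDefault(t, 0) / (double) docLen.get(i);
                double pCol = collectionTf.getOrDefault(t, 0) / (double) collectionLen;
                score *= lambda * pDoc + (1 - lambda) * pCol;
            }
            System.out.printf("P(q | d%d) = %.6f%n", i + 1, score);  // d1: 3/256, d2: 1/256
        }
    }
}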

Summary Probabilistic Retrieval 42

Summary Odds of Relevance 43

Summary 44

Summary Smoothing 45

Thank You 46