Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

1 Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 4: Analyzing Text (2/2) January 26, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo. These slides are available at … This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See … for details.

2 Source: Search!

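3 Abstract IR Architecture. [Figure: online, a Query passes through a Representation Function to a Query Representation; offline, Documents pass through a Representation Function to a Document Representation stored in an Index; a Comparison Function matches the Query Representation against the Index and returns Hits.]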

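4 Doc 1: one, two. Doc 2: red, blue. Doc 3: cat in the hat. Doc 4: green eggs and ham. [Figure: term-document matrix with one column per document (1 to 4) and one row per term: blue, cat, egg, green, ham, hat, one, red, two.] What goes in each cell? boolean, count, or positions.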

5 Doc 1: one, two. Doc 2: red, blue. Doc 3: cat in the hat. Doc 4: green eggs and ham. [Figure: the same term-document matrix.] Indexing: building this structure. Retrieval: manipulating this structure.

6 Doc 1: one, two. Doc 2: red, blue. Doc 3: cat in the hat. Doc 4: green eggs and ham. [Figure: each row of the term-document matrix rewritten as a postings list of docids: blue → 2; cat → 3; egg → 4; green → 4; ham → 4; hat → 3; one → 1; red → 2; two → 1.]

7 Doc 1: one, two. Doc 2: red, blue. Doc 3: cat in the hat. Doc 4: green eggs and ham. [Figure: the matrix cells hold term frequencies (tf), and each postings list is preceded by the term's document frequency (df).]

8 Doc 1: one, two. Doc 2: red, blue. Doc 3: cat in the hat. Doc 4: green eggs and ham. [Figure: each posting now also carries the positions of the term in the document, e.g., blue → doc 2 at [3]; ham → doc 4 at [3]; two → doc 1 at [3].]

9 Inverted Indexing with MapReduce. [Figure: Map emits (term, posting) pairs for Doc 1 (one, two), Doc 2 (red, blue), and Doc 3 (cat in the hat); Shuffle and Sort aggregates values by keys; Reduce writes one postings list per term (blue, cat, hat, one, red, two).]

10 Inverted Indexing: Pseudo-Code
class Mapper
  method Map(docid n, doc d)
    H ← new AssociativeArray                ▷ histogram to hold term frequencies
    for all term t ∈ doc d do               ▷ processes the doc, e.g., tokenization and stopword removal
      H{t} ← H{t} + 1
    for all term t ∈ H do
      Emit(term t, posting ⟨n, H{t}⟩)       ▷ emits individual postings
class Reducer
  method Reduce(term t, postings [⟨n, f⟩, ...])
    P ← new List
    for all ⟨n, f⟩ ∈ postings [⟨n, f⟩, ...] do
      P.Append(⟨n, f⟩)                      ▷ appends postings unsorted
    P.Sort()                                ▷ sorts for compression
    Emit(term t, postingslist P)
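The mapper and reducer above can be simulated on a single machine. The following is a minimal sketch, not the course's implementation: the function names, the whitespace tokenizer, and the in-memory dictionary standing in for shuffle and sort are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def map_doc(docid, text):
    """Mapper: emit (term, (docid, tf)) once per distinct term in the document."""
    histogram = Counter(text.lower().split())  # assumed tokenizer: whitespace split
    for term, tf in histogram.items():
        yield term, (docid, tf)


def reduce_term(term, postings):
    """Reducer: sort the postings of one term by docid."""
    return term, sorted(postings)


def build_index(docs):
    groups = defaultdict(list)  # stands in for the framework's shuffle and sort
    for docid, text in docs.items():
        for term, posting in map_doc(docid, text):
            groups[term].append(posting)
    return dict(reduce_term(t, p) for t, p in groups.items())


# Toy collection using the document text as it appears in this transcript.
docs = {1: "one two", 2: "red blue", 3: "cat in the hat", 4: "green eggs and ham"}
index = build_index(docs)
```

Each value in `index` is a list of `(docid, tf)` pairs in increasing docid order, which is the order the later compression slides rely on.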

11 Positional Indexes. [Figure: the same Map, Shuffle and Sort (aggregate values by keys), and Reduce flow for Doc 1 (one, two), Doc 2 (red, blue), and Doc 3 (cat in the hat), with each posting additionally carrying a list of term positions.]

12 Positional Indexes
class Mapper
  method Map(docid n, doc d)
    H ← new AssociativeArray                ▷ histogram to hold term frequencies
    for all term t ∈ doc d do               ▷ processes the doc, e.g., tokenization and stopword removal
      H{t} ← H{t} + 1
    for all term t ∈ H do
      Emit(term t, posting ⟨n, H{t}⟩)       ▷ emits individual postings
class Reducer
  method Reduce(term t, postings [⟨n, f⟩, ...])
    P ← new List
    for all ⟨n, f⟩ ∈ postings [⟨n, f⟩, ...] do
      P.Append(⟨n, f⟩)                      ▷ appends postings unsorted
    P.Sort()                                ▷ sorts for compression
    Emit(term t, postingslist P)

13 Another Try. [Figure: the postings of one term shown two ways: on the left, a single key whose values are a list of (docid, positions) postings; on the right, one composite key per posting, each with its own positions as the value.] How is this different? Let the framework do the sorting. Term frequency implicitly stored. Where have we seen this before?

14 Inverted Indexing: Pseudo-Code
class Mapper
  method Map(docid n, doc d)
    H ← new AssociativeArray
    for all term t ∈ doc d do               ▷ builds a histogram of term frequencies
      H{t} ← H{t} + 1
    for all term t ∈ H do
      Emit(tuple ⟨t, n⟩, tf H{t})           ▷ emits individual postings, with a tuple as the key
class Partitioner
  method Partition(tuple ⟨t, n⟩, tf f)
    return Hash(t) mod NumOfReducers        ▷ keys of same term are sent to same reducer
class Reducer
  method Initialize
    t_prev ← ∅
    P ← new PostingsList
  method Reduce(tuple ⟨t, n⟩, tf [f])
    if t ≠ t_prev ∧ t_prev ≠ ∅ then
      Emit(term t_prev, postings P)         ▷ emits postings list of term t_prev
      P.Reset()
    P.Append(⟨n, f⟩)                        ▷ appends postings in sorted order
    t_prev ← t
  method Close
    Emit(term t_prev, postings P)           ▷ emits last postings list from this reducer
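To see how the composite key, the partitioner, and the stateful reducer fit together, here is a small single-process sketch. It is an illustration under assumptions: the `Reducer` class, `run`, and the use of Python's `sorted` in place of the framework's sort are hypothetical stand-ins, not a MapReduce API.

```python
from collections import Counter


def map_doc(docid, text):
    """Mapper: the key is the tuple (term, docid); the value is the tf."""
    for term, tf in Counter(text.split()).items():
        yield (term, docid), tf


def partition(key, num_reducers):
    term, _docid = key
    return hash(term) % num_reducers  # partition on the term only


class Reducer:
    def __init__(self):
        self.prev_term = None
        self.postings = []
        self.output = []

    def reduce(self, key, tfs):
        term, docid = key
        if term != self.prev_term and self.prev_term is not None:
            self.output.append((self.prev_term, self.postings))  # previous term is done
            self.postings = []
        self.postings.append((docid, tfs[0]))  # arrives already sorted by docid
        self.prev_term = term

    def close(self):
        if self.prev_term is not None:
            self.output.append((self.prev_term, self.postings))


def run(docs, num_reducers=2):
    pairs = [kv for docid, text in docs.items() for kv in map_doc(docid, text)]
    reducers = [Reducer() for _ in range(num_reducers)]
    for key, tf in sorted(pairs):  # stands in for the framework's sort on (term, docid)
        reducers[partition(key, num_reducers)].reduce(key, [tf])
    index = {}
    for reducer in reducers:
        reducer.close()
        index.update(reducer.output)
    return index
```

Because the sort is on the whole tuple while the partitioner looks only at the term, every reducer sees each term's postings contiguously and in docid order, so it never has to buffer and sort a full postings list.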

15 Postings Encoding. Conceptually: [Figure: a postings list storing docnos.] In practice: don't encode docnos, encode gaps (or d-gaps) = delta encoding, delta compression, gap compression. But it's not obvious that this saves space.
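A short sketch of d-gap encoding may help here. The function names are made up for illustration, and docids are assumed to be positive and strictly increasing. The gaps only save space once they are fed to a variable-length integer code, since small numbers can then be written in fewer bits.

```python
def to_gaps(docids):
    """Replace each docid by its distance from the previous docid (first gap is from 0)."""
    gaps, prev = [], 0
    for docid in docids:
        gaps.append(docid - prev)
        prev = docid
    return gaps


def from_gaps(gaps):
    """Invert to_gaps with a running sum."""
    docids, total = [], 0
    for gap in gaps:
        total += gap
        docids.append(total)
    return docids
```

For example, the docids 1, 9, 21, 34, 35, 80 become the gaps 1, 8, 12, 13, 1, 45.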

16 Overview of Integer Compression. Byte-aligned technique: VByte. Bit-aligned: unary codes, γ/δ codes, Golomb codes (local Bernoulli model). Word-aligned: Simple family, bit packing family (PForDelta, etc.).

17 VByte. Simple idea: use only as many bytes as needed. Need to reserve one bit per byte as the "continuation bit"; use the remaining bits for encoding the value (one byte: 7 bits; two bytes: 14 bits; three bytes: 21 bits). Works okay, easy to implement. Beware of branch mispredicts!
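The following sketch shows one way VByte can be written. The convention used here (low-order 7 bits first, high bit set when more bytes follow) is an assumption; implementations differ on byte order and on which value of the continuation bit marks the last byte.

```python
def vbyte_encode(n):
    """Encode one non-negative integer, 7 payload bits per byte."""
    out = bytearray()
    while n >= 128:
        out.append((n & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        n >>= 7
    out.append(n)  # final byte, continuation bit clear
    return bytes(out)


def vbyte_decode(data):
    """Decode a concatenation of VByte-encoded integers."""
    values, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:      # more bytes of this integer follow
            shift += 7
        else:                # last byte: emit and reset
            values.append(n)
            n, shift = 0, 0
    return values
```

The branch on the continuation bit inside the decode loop is the data-dependent branch that the slide's warning about mispredicts refers to.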

18 Simple-9. How many different ways can we divide up 28 bits? 28 1-bit numbers, 14 2-bit numbers, 9 3-bit numbers, 7 4-bit numbers, ... (9 total ways), identified by "selectors". Efficient decompression with hard-coded decoders. Simple Family: the general idea applies to 64-bit words, etc. Beware of branch mispredicts?

19 Bit Packing. What's the smallest number of bits we need to code a block (=128) of integers? Efficient decompression with hard-coded decoders. PForDelta: bit packing + separate storage of "overflow" bits. Beware of branch mispredicts?
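As a rough illustration of plain bit packing (without PForDelta's overflow handling), the sketch below packs a block at the width of its largest value. A Python integer stands in for the packed words, and the function names are hypothetical.

```python
def pack_block(values):
    """Pack non-negative integers at a fixed width: the bit length of the largest value."""
    width = max(1, max(v.bit_length() for v in values))
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * width)
    return width, packed


def unpack_block(width, packed, count):
    """Recover `count` integers of `width` bits each."""
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]
```

A single large value forces the whole block to a wide width, which is the motivation for storing such outliers separately as PForDelta does.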

20 Golomb Codes. For $x \ge 1$ and parameter $b$: encode $q + 1$ in unary, where $q = \lfloor (x - 1) / b \rfloor$; encode $r$ in binary, where $r = x - qb - 1$, in $\lfloor \log b \rfloor$ or $\lceil \log b \rceil$ bits. Example: b = 3, r = 0, 1, 2 (0, 10, 11); b = 6, r = 0, 1, 2, 3, 4, 5 (00, 01, 100, 101, 110, 111). x = 9, b = 3: q = 2, r = 2, code = 110:11. x = 9, b = 6: q = 1, r = 2, code = 10:100. Punch line: optimal $b \approx 0.69 \, (N / df)$. Different b for every term!
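The definition above can be checked with a few lines of code. This is a sketch that returns bit strings rather than packed bits; the helper names are made up, and the unary convention (q ones followed by a zero) is the one implied by the slide's examples.

```python
def truncated_binary(r, b):
    """Code remainder r in [0, b) using floor(log2 b) or ceil(log2 b) bits."""
    k = (b - 1).bit_length()   # ceil(log2 b)
    cutoff = (1 << k) - b      # this many remainders get the shorter code
    if r < cutoff:
        value, width = r, k - 1
    else:
        value, width = r + cutoff, k
    return format(value, "b").zfill(width) if width > 0 else ""


def golomb_encode(x, b):
    """Return (unary part, binary part) of the Golomb code of x >= 1 with parameter b >= 1."""
    if x < 1 or b < 1:
        raise ValueError("Golomb codes here require x >= 1 and b >= 1")
    q = (x - 1) // b
    r = x - q * b - 1
    unary = "1" * q + "0"      # q + 1 in unary
    return unary, truncated_binary(r, b)
```

With b = 3 the call `golomb_encode(9, 3)` gives the parts `110` and `11`, and with b = 6 it gives `10` and `100`, matching the two worked examples.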

21 Chicken and Egg? [Figure: (key, value) postings arriving at the reducer, which then writes postings.] But wait! How do we set the Golomb parameter b? Recall: optimal $b \approx 0.69 \, (N / df)$. We need the df to set b. But we don't know the df until we've seen all postings! Sound familiar?

22 Getting the df. In the mapper: emit "special" key-value pairs to keep track of df. In the reducer: make sure "special" key-value pairs come first, and process them to determine df. Remember: proper partitioning!

23 Getting the df: Modified Mapper. Input document: Doc 1 (one, two). [Figure: for each term the mapper emits a normal key-value pair, with the positions as the value, and a "special" key-value pair, keyed by the term and the marker ★ with value [1], to keep track of df.]

24 Getting the df: Modified Reducer. [Figure: (key, value) pairs reaching the reducer for one term: the special ★ pairs first, then the normal postings.] First, compute the df by summing contributions from all "special" key-value pairs. Compute b from df. Then write postings. Important: properly define the sort order to make sure "special" key-value pairs come first! Where have we seen this before?
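The special-key trick can be sketched as follows. Everything named here is an assumption for illustration: `SPECIAL` is a hypothetical marker docid chosen to sort before every real docid, `sorted` stands in for the framework's sort, and rounding 0.69 · N / df to the nearest integer (at least 1) is just one plausible way to turn the estimate into a parameter.

```python
SPECIAL = -1  # hypothetical marker: sorts before every real (positive) docid


def map_doc(docid, term_positions):
    """Emit one special pair (df contribution) and one normal posting per term."""
    for term, positions in term_positions.items():
        yield (term, SPECIAL), 1
        yield (term, docid), positions


def reduce_sorted(pairs, num_docs):
    """Process pairs in (term, docid) order; the df is complete before any posting is seen."""
    result = {}
    for (term, docid), value in sorted(pairs, key=lambda kv: kv[0]):
        entry = result.setdefault(term, {"df": 0, "b": None, "postings": []})
        if docid == SPECIAL:
            entry["df"] += value
        else:
            if entry["b"] is None:  # first real posting: df is final, so fix b now
                entry["b"] = max(1, round(0.69 * num_docs / entry["df"]))
            entry["postings"].append((docid, value))
    return result


docs = {1: {"a": [1, 3], "b": [2]}, 2: {"a": [1]}}
pairs = [kv for docid, tp in docs.items() for kv in map_doc(docid, tp)]
index = reduce_sorted(pairs, num_docs=10)
```

In a real job the postings would be Golomb-encoded and written out as they stream past, which is possible only because b is already known when the first normal posting arrives.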

25 Basic Inverted Indexer: Reducer. [Figure: (key, value) postings arriving at the reducer in sorted order.] Write postings compressed.

26 Inverted Indexing: IP (~Pairs)
class Mapper
  method Map(docid n, doc d)
    H ← new AssociativeArray
    for all term t ∈ doc d do               ▷ builds a histogram of term frequencies
      H{t} ← H{t} + 1
    for all term t ∈ H do
      Emit(tuple ⟨t, n⟩, tf H{t})           ▷ emits individual postings, with a tuple as the key
class Partitioner
  method Partition(tuple ⟨t, n⟩, tf f)
    return Hash(t) mod NumOfReducers        ▷ keys of same term are sent to same reducer
class Reducer
  method Initialize
    t_prev ← ∅
    P ← new PostingsList
  method Reduce(tuple ⟨t, n⟩, tf [f])
    if t ≠ t_prev ∧ t_prev ≠ ∅ then
      Emit(term t_prev, postings P)         ▷ emits postings list of term t_prev
      P.Reset()
    P.Append(⟨n, f⟩)                        ▷ appends postings in sorted order
    t_prev ← t
  method Close
    Emit(term t_prev, postings P)           ▷ emits last postings list from this reducer

27 Merging Postings. Let's define an operation ⊕ on postings lists P that merges two sorted lists into one sorted list, e.g., Postings(..., 39, 54) ⊕ Postings(..., 46) = Postings(..., 39, 46, 54). Then we can rewrite our indexing algorithm! flatMap: emit singleton postings. reduceByKey: ⊕.
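A minimal sketch of the merge operation and of indexing expressed with it is given below. Postings are reduced to bare docids, the names `merge` and `index_by_merging` are hypothetical, and a plain dictionary stands in for the reduce-by-key step.

```python
import heapq
from functools import reduce


def merge(p1, p2):
    """Merge two sorted postings lists into one sorted list (associative and commutative)."""
    return list(heapq.merge(p1, p2))


def index_by_merging(docs):
    """flatMap each document into singleton postings, then combine per term with merge."""
    index = {}
    for docid, text in docs.items():
        for term in set(text.split()):
            index[term] = merge(index.get(term, []), [docid])
    return index
```

Since the empty list is the identity and the order of merging does not matter, partial lists can be combined in any grouping, which is what lets a framework apply the operation in combiners as well as reducers.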

28 What's the issue? Postings ⊕ Postings = Postings M. Solution: apply compression as needed!

29 Inverted Indexing: LP (~Stripes). Slightly less elegant implementation, but uses the same idea.
class Mapper
  method Initialize
    M ← new AssociativeArray                ▷ holds partial lists of postings
  method Map(docid n, doc d)
    H ← new AssociativeArray                ▷ builds a histogram of term frequencies
    for all term t ∈ doc d do
      H{t} ← H{t} + 1
    for all term t ∈ H do
      M{t}.Add(posting ⟨n, H{t}⟩)           ▷ adds a posting to partial postings lists
    if MemoryFull() then
      Flush()
  method Flush                              ▷ flushes partial lists of postings as intermediate output
    for all term t ∈ M do
      P ← SortAndEncodePostings(M{t})
      Emit(term t, postingslist P)
    M.Clear()
  method Close
    Flush()

30 Inverted Indexing: LP (~Stripes)
class Reducer
  method Reduce(term t, postingslists [P1, P2, ...])
    P_f ← new List                          ▷ temporarily stores partial lists of postings
    R ← new List                            ▷ stores merged partial lists of postings
    for all P ∈ postingslists [P1, P2, ...] do
      P_f.Add(P)
      if MemoryNearlyFull() then
        R.Add(MergeLists(P_f))
        P_f.Clear()
    R.Add(MergeLists(P_f))
    Emit(term t, postingslist MergeLists(R)) ▷ emits fully merged postings list of term t

31 LP vs. IP? Experiments on ClueWeb09 collection: segments + 0.8m documents (47 GB compressed,.97 TB uncompressed). [Figure: indexing time (minutes) against number of documents (millions) for the IP algorithm and the LP algorithm, each annotated with an R value.] Results (Alg., Time, Intermediate Pairs, Intermediate Size): IP, 38.5 min, …, … bytes; LP, 9.6 min, …, … bytes. From: Elsayed et al., Brute-Force Approaches to Batch Retrieval: Scalable Indexing with MapReduce, or Why Bother? 2010

32 Another Look at LP. The LP Mapper (slide 29) and Reducer (slide 30) are shown again next to their dataflow counterparts. flatMap: emit singleton postings. reduceByKey: merge postings lists. More generally, aggregateByKey takes RDD[(K, V)] to RDD[(K, U)] with seqOp: (U, V) ⇒ U and combOp: (U, U) ⇒ U.

33 Algorithm design in a nutshell. Exploit associativity and commutativity via commutative monoids (if you can). Exploit framework-based sorting to sequence computations (if you can't). Source: Wikipedia (Walnut)

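34 Abstract IR Architecture. [Figure: online, a Query passes through a Representation Function to a Query Representation; offline, Documents pass through a Representation Function to a Document Representation stored in an Index; a Comparison Function matches the Query Representation against the Index and returns Hits.]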

35 MapReduce it? The indexing problem: scalability is critical; must be relatively fast, but need not be real time; fundamentally a batch operation; incremental updates may or may not be important; for the web, crawling is a challenge in itself. The retrieval problem: must have sub-second response time; for the web, only need relatively few results.

36 Assume everything fits in memory on a single machine (For now)

37 Boolean Retrieval. Users express queries as a Boolean expression: AND, OR, NOT; can be arbitrarily nested. Retrieval is based on the notion of sets: any query divides the collection into two sets, retrieved and not-retrieved. Pure Boolean systems do not define an ordering of the results.

38 Boolean Retrieval. To execute a Boolean query, e.g., (blue AND …) OR ham: build the query syntax tree (OR at the root, with ham and the AND clause over blue as its children); for each clause, look up postings; traverse postings and apply the Boolean operator.
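The three steps can be sketched on sorted docid lists as below. The tuple-based syntax tree, the function names, and the toy index are assumptions made for illustration; the term red is used as a stand-in second operand because the one on the slide is not legible in this transcript, and NOT is left out.

```python
def intersect(p1, p2):
    """AND: linear merge keeping docids present in both sorted lists."""
    out, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def union(p1, p2):
    """OR: linear merge keeping docids present in either sorted list, without duplicates."""
    out, i, j = [], 0, 0
    while i < len(p1) or j < len(p2):
        if j == len(p2) or (i < len(p1) and p1[i] < p2[j]):
            out.append(p1[i])
            i += 1
        elif i == len(p1) or p2[j] < p1[i]:
            out.append(p2[j])
            j += 1
        else:
            out.append(p1[i])
            i += 1
            j += 1
    return out


def evaluate(node, index):
    """Evaluate a syntax tree: a term (str) or a tuple (operator, left, right)."""
    if isinstance(node, str):
        return index.get(node, [])
    op, left, right = node
    combine = intersect if op == "AND" else union
    return combine(evaluate(left, index), evaluate(right, index))


index = {"blue": [2], "red": [2], "cat": [3], "ham": [4]}  # toy postings
hits = evaluate(("OR", ("AND", "blue", "red"), "ham"), index)
```

Both operators run in time linear in the lengths of the two lists, which is why postings are kept sorted by docid.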

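39 Term-at-a-Time. [Figure: the query tree is evaluated one operator at a time over complete postings lists: the AND clause over blue is computed first, and its result is then combined with the postings of ham by OR.] Efficiency analysis? What's RPN?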

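40 Document-at-a-Time. [Figure: the postings lists of the query terms, including blue and ham, are traversed in parallel, and the whole query tree is evaluated for one document at a time.] Tradeoffs? Efficiency analysis?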

41 Boolean Retrieval. Users express queries as a Boolean expression: AND, OR, NOT; can be arbitrarily nested. Retrieval is based on the notion of sets: any query divides the collection into two sets, retrieved and not-retrieved. Pure Boolean systems do not define an ordering of the results.

42 Ranked Retrieval. Order documents by how likely they are to be relevant: estimate relevance(q, d_i), then sort documents by relevance. How do we estimate relevance? Take "similarity" as a proxy for relevance.

43 Vector Space Model. [Figure: document vectors d1 to d5 drawn in a space with term axes t1, t2, t3, with angles θ and φ between vectors.] Assumption: documents that are "close together" in vector space "talk about" the same things. Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ "closeness").

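44 Similarity Metric. Use the angle between the vectors $d_j = [w_{j,1}, w_{j,2}, w_{j,3}, \ldots, w_{j,n}]$ and $d_k = [w_{k,1}, w_{k,2}, w_{k,3}, \ldots, w_{k,n}]$: $\cos\theta = \frac{d_j \cdot d_k}{|d_j|\,|d_k|}$, so $\mathrm{sim}(d_j, d_k) = \frac{d_j \cdot d_k}{|d_j|\,|d_k|} = \frac{\sum_{i=1}^{n} w_{j,i} w_{k,i}}{\sqrt{\sum_{i=1}^{n} w_{j,i}^2}\,\sqrt{\sum_{i=1}^{n} w_{k,i}^2}}$. Or, more generally, inner products: $\mathrm{sim}(d_j, d_k) = d_j \cdot d_k = \sum_{i=1}^{n} w_{j,i} w_{k,i}$.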
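The cosine formula translates directly into code. This is a small sketch over dense weight lists of equal length; returning 0.0 for an all-zero vector is a convention chosen here, not something the slide specifies.

```python
import math


def cosine(dj, dk):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(dj, dk))
    norm = math.sqrt(sum(a * a for a in dj)) * math.sqrt(sum(b * b for b in dk))
    return dot / norm if norm else 0.0
```

If document vectors are normalized to unit length at indexing time, the denominator is 1 and the cosine reduces to the inner product on the last line of the slide.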

45 Term Weighting. Term weights consist of two components. Local: how important is the term in this document? Global: how important is the term in the collection? Here's the intuition: terms that appear often in a document should get high weights; terms that appear in many documents should get low weights. How do we capture this mathematically? Term frequency (local) and inverse document frequency (global).

46 TF.IDF Term Weighting. $w_{i,j} = \mathrm{tf}_{i,j} \cdot \log \frac{N}{n_i}$, where $w_{i,j}$ is the weight assigned to term i in document j, $\mathrm{tf}_{i,j}$ is the number of occurrences of term i in document j, $N$ is the number of documents in the entire collection, and $n_i$ is the number of documents with term i.
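As a worked illustration of the weighting formula, the sketch below computes the weights for a toy collection. The whitespace tokenizer and the natural logarithm are assumptions; the slide does not fix the base of the log.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return {docid: {term: tf * log(N / n_term)}} for a {docid: text} collection."""
    n_docs = len(docs)
    tfs = {docid: Counter(text.split()) for docid, text in docs.items()}
    df = Counter(term for counts in tfs.values() for term in counts)
    return {
        docid: {term: tf * math.log(n_docs / df[term]) for term, tf in counts.items()}
        for docid, counts in tfs.items()
    }


weights = tfidf({1: "a a b", 2: "a c"})
```

Here the term a occurs in every document, so its weight is zero in both, while b and c each occur in one of the two documents and get weight log 2.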

47 Retrieval in a Nutshell. Look up postings lists corresponding to query terms. Traverse postings for each query term. Store partial query-document scores in accumulators. Select top k results to return.

48 Retrieval: Document-at-a-Time. Evaluate documents one at a time (score all query terms). [Figure: postings of the query terms (e.g., blue) feeding accumulators (e.g., a min heap).] Is the document score in the top k? Yes: insert the document score, extract-min if the heap is too large. No: do nothing. Tradeoffs: small memory footprint (good); skipping possible to avoid reading all postings (good); more seeks and irregular data accesses (bad).
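The heap logic above can be sketched as follows. The postings layout (per-term lists of `(docid, weight)` sorted by docid), the additive scoring, and the function name are assumptions made for illustration.

```python
import heapq


def daat_top_k(postings, k):
    """Score one document at a time, keeping only the k best in a min-heap."""
    if k <= 0:
        return []
    cursors = {term: 0 for term in postings}
    heap = []  # min-heap of (score, docid); heap[0] is the weakest of the current top k
    while True:
        heads = [postings[t][cursors[t]][0] for t in postings if cursors[t] < len(postings[t])]
        if not heads:
            break
        docid = min(heads)  # next document in docid order
        score = 0.0
        for term in postings:
            c = cursors[term]
            if c < len(postings[term]) and postings[term][c][0] == docid:
                score += postings[term][c][1]
                cursors[term] += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, docid))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, docid))  # insert, then drop the minimum
    return sorted(heap, reverse=True)


postings = {"blue": [(2, 1.0), (9, 2.0)], "ham": [(4, 3.0), (9, 0.5)]}
top = daat_top_k(postings, 2)
```

Memory use is bounded by k plus one cursor per query term, which is the small footprint listed among the tradeoffs.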

49 Retrieval: Term-at-a-Time. Evaluate documents one query term at a time, usually starting from the rarest term (often with tf-sorted postings). [Figure: postings of one query term (e.g., blue) updating accumulators (e.g., a hash), where Score_{q=x}(doc n) = s.] Tradeoffs: early termination heuristics (good); large memory footprint (bad), but filtering heuristics possible.
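For comparison, a term-at-a-time sketch over per-term lists of `(docid, weight)` postings might look like this. The hash of accumulators is a plain dictionary, rarity is approximated by postings list length, and no early termination or filtering heuristic is implemented; all names are hypothetical.

```python
import heapq
from collections import defaultdict


def taat_top_k(postings, k):
    """Process one term's postings completely before moving to the next term."""
    accumulators = defaultdict(float)  # docid -> partial score
    for term in sorted(postings, key=lambda t: len(postings[t])):  # rarest term first
        for docid, weight in postings[term]:
            accumulators[docid] += weight
    return heapq.nlargest(k, ((score, docid) for docid, score in accumulators.items()))


postings = {"blue": [(2, 1.0), (9, 2.0)], "ham": [(4, 3.0), (9, 0.5)]}
top = taat_top_k(postings, 2)
```

Every document that matches any query term gets an accumulator, which is the large memory footprint mentioned in the tradeoffs.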

50 Assume everything fits in memory on a single machine. Okay, let's relax this assumption now.

51 Important Ideas. Partitioning (for scalability). Replication (for redundancy). Caching (for speed). Routing (for load balancing). The rest is just details!

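52 Term vs. Document Partitioning. [Figure: the term-document matrix (terms T by documents D) split by rows into term partitions T1, T2, T3 (term partitioning) or by columns into document partitions D1, D2, D3 (document partitioning).]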

53 [Figure: serving architecture showing FE, brokers, partitions, replicas, and cache.]

54 [Figure: the same architecture repeated across several datacenters, each with brokers and multiple tiers, every tier holding partitions, replicas, and a cache.]

55 Important Ideas. Partitioning (for scalability). Replication (for redundancy). Caching (for speed). Routing (for load balancing).

56 Questions? Source: Wikipedia (Japanese rock garden)

Data-Intensive Distributed Computing

Data-Intensive Distributed Computing Data-Intensive Distributed Computing CS 45/65 43/63 (Winter 08) Part 3: Analyzing Text (/) January 30, 08 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides are

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 2: MapReduce Algorithm Design (2/2) January 14, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

Brute-Force Approaches to Batch Retrieval: Scalable Indexing with MapReduce, or Why Bother?

Brute-Force Approaches to Batch Retrieval: Scalable Indexing with MapReduce, or Why Bother? Brute-Force Approaches to Batch Retrieval: Scalable Indexing with MapReduce, or Why Bother? Tamer Elsayed, 1 Ferhan Ture, 1 Jimmy Lin 2 1 Department of Computer Science 2 The ischool, College of Information

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 2: MapReduce Algorithm Design (2/2) January 12, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

Data-Intensive Distributed Computing

Data-Intensive Distributed Computing Data-Intensive Distributed Computing CS 451/651 431/631 (Winter 2018) Part 1: MapReduce Algorithm Design (4/4) January 16, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 5: Analyzing Graphs (2/2) February 2, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 9: Data Mining (3/4) March 8, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides

More information

Information Retrieval Processing with MapReduce

Information Retrieval Processing with MapReduce Information Retrieval Processing with MapReduce Based on Jimmy Lin s Tutorial at the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009) This

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 9: Data Mining (3/4) March 7, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides

More information

Big Data Infrastructure

Big Data Infrastructure Big Data Infrastructure Session 4: MapReduce Structured and Unstructured Data Jimmy Lin University of Maryland Monday, February 23, 205 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Inverted Indexing for Text Retrieval

Inverted Indexing for Text Retrieval Chapter 4 Inverted Indexing for Text Retrieval Web search is the quintessential large-data problem. Given an information need expressed as a short query consisting of a few terms, the system s task is

More information

MapReduce Patterns, Algorithms, and Use Cases

MapReduce Patterns, Algorithms, and Use Cases MapReduce Patterns, Algorithms, and Use Cases In this article I digested a number of MapReduce patterns and algorithms to give a systematic view of the different techniques that can be found on the web

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies Informa)on Retrieval and Map- Reduce Implementa)ons Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies mas4108@louisiana.edu Map-Reduce: Why? Need to process 100TB datasets On 1 node:

More information

Query Processing and Alternative Search Structures. Indexing common words

Query Processing and Alternative Search Structures. Indexing common words Query Processing and Alternative Search Structures CS 510 Winter 2007 1 Indexing common words What is the indexing overhead for a common term? I.e., does leaving out stopwords help? Consider a word such

More information

Data-Intensive Computing with MapReduce

Data-Intensive Computing with MapReduce Data-Intensive Computing with MapReduce Session 6: Similar Item Detection Jimmy Lin University of Maryland Thursday, February 28, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 10: Mutable State (1/2) March 14, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These

More information

Data-Intensive Distributed Computing

Data-Intensive Distributed Computing Data-Intensive Distributed Computing CS 451/651 (Fall 2018) Part 4: Analyzing Graphs (1/2) October 4, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides are

More information

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf

More information

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)!

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)! Big Data Processing, 2014/15 Lecture 7: MapReduce design patterns!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 10: Mutable State (1/2) March 15, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These

More information

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze Indexing UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides Addison Wesley, 2008 Table of Content Inverted index with positional information

More information

Data-Intensive Computing with MapReduce

Data-Intensive Computing with MapReduce Data-Intensive Computing with MapReduce Session 5: Graph Processing Jimmy Lin University of Maryland Thursday, February 21, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Information Retrieval

Information Retrieval Information Retrieval Suan Lee - Information Retrieval - 05 Index Compression 1 05 Index Compression - Information Retrieval - 05 Index Compression 2 Last lecture index construction Sort-based indexing

More information

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University CS6200 Information Retrieval David Smith College of Computer and Information Science Northeastern University Indexing Process!2 Indexes Storing document information for faster queries Indexes Index Compression

More information

TI2736-B Big Data Processing. Claudia Hauff

TI2736-B Big Data Processing. Claudia Hauff TI2736-B Big Data Processing Claudia Hauff ti2736b-ewi@tudelft.nl Intro Streams Streams Map Reduce HDFS Pig Pig Design Patterns Hadoop Ctd. Graphs Giraph Spark Zoo Keeper Spark Learning objectives Implement

More information

Algorithms for MapReduce. Combiners Partition and Sort Pairs vs Stripes

Algorithms for MapReduce. Combiners Partition and Sort Pairs vs Stripes Algorithms for MapReduce 1 Assignment 1 released Due 16:00 on 20 October Correctness is not enough! Most marks are for efficiency. 2 Combining, Sorting, and Partitioning... and algorithms exploiting these

More information

Map Reduce. Yerevan.

Map Reduce. Yerevan. Map Reduce Erasmus+ @ Yerevan dacosta@irit.fr Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate

More information

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 10: Mutable State (2/2) March 16, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These

More information

COMP6237 Data Mining Searching and Ranking

COMP6237 Data Mining Searching and Ranking COMP6237 Data Mining Searching and Ranking Jonathon Hare jsh2@ecs.soton.ac.uk Note: portions of these slides are from those by ChengXiang Cheng Zhai at UIUC https://class.coursera.org/textretrieval-001

More information

MapReduce Algorithm Design

MapReduce Algorithm Design MapReduce Algorithm Design Contents Combiner and in mapper combining Complex keys and values Secondary Sorting Combiner and in mapper combining Purpose Carry out local aggregation before shuffle and sort

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 9: Data Mining (4/4) March 9, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides

More information

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488) Efficiency Efficiency: Indexing (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Difficult to analyze sequential IR algorithms: data and query dependency (query selectivity). O(q(cf max )) -- high estimate-

More information

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 3-2. Index construction Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 1 Ch. 4 Index construction How do we construct an index? What strategies can we use with

More information

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 3 Dictionaries and Tolerant retrieval Chapter 4 Index construction Chapter 5 Index compression Content Dictionary data structures

More information

Map-Reduce and Adwords Problem

Map-Reduce and Adwords Problem Map-Reduce and Adwords Problem Map-Reduce and Adwords Problem Miłosz Kadziński Institute of Computing Science Poznan University of Technology, Poland www.cs.put.poznan.pl/mkadzinski/wpi Big Data (1) Big

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable References Bigtable: A Distributed Storage System for Structured Data. Fay Chang et. al. OSDI

More information

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Index construction CE-324: Modern Information Retrieval Sharif University of Technology Index construction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch.

More information

MapReduce Design Patterns

MapReduce Design Patterns MapReduce Design Patterns MapReduce Restrictions Any algorithm that needs to be implemented using MapReduce must be expressed in terms of a small number of rigidly defined components that must fit together

More information

modern database systems lecture 4 : information retrieval

modern database systems lecture 4 : information retrieval modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation

More information
