APPENDICES

Appendix A1: Examples of Build Methods for Distributed Inverted Files

[Fig A1-1. Examples of build methods for distributed inverted files (diagrams). Key: distributed build; local build; a node in the parallel machine; presence of text files on a node; inverted file partition on a node; network connection between nodes.]

Appendix A2: Extra Probabilistic Search Results

[Fig A2-1. BASE [TermId]: search average elapsed time in seconds against leaf nodes (sequential sort: CF distribution).]

Table A2-1. BASE [TermId]: search overheads in % of total time (sequential sort: CF distribution)

Leaf nodes    Title Only          Whole Topic
              NPOS      POS       NPOS      POS
2             63%       4%        55%       34%
3             69%       48%       63%       44%
4             73%       53%       69%       54%
5             75%       55%       72%       57%
6             78%       58%       75%       6%
7             8%        6%        73%       62%

[Fig A2-2. BASE [TermId]: search speedup (sequential sort: CF distribution).]

[Fig A2-3. BASE [TermId]: search parallel efficiency (sequential sort: CF distribution).]

[Fig A2-4. BASE [TermId]: search load imbalance (sequential sort: CF distribution).]

[Fig A2-5. BASE [TermId]: search average elapsed time in seconds (sequential sort: TF distribution).]

[Fig A2-6. BASE [TermId]: search speedup (sequential sort: TF distribution).]

[Fig A2-7. BASE [TermId]: search parallel efficiency (sequential sort: TF distribution).]

[Fig A2-8. BASE [TermId]: search load imbalance (sequential sort: TF distribution).]

Table A2-2. BASE [TermId]: search overheads in % of total time (sequential sort: TF distribution)

Leaf nodes    Title Only          Whole Topic
              NPOS      POS       NPOS      POS
2             64%       4%        53%       34%
3             68%       47%       67%       44%
4             72%       52%       74%       55%
5             7%        56%       73%       54%
6             77%       57%       76%       58%
7             79%       59%       8%        64%

[Fig A2-9. BASE [TermId]: search throughput in queries/hour (CF and TF distributions). Series: NO POS-cf-to, POS-cf-to, NO POS-tf-to, POS-tf-to, NO POS-cf-wt, POS-cf-wt, NO POS-tf-wt, POS-tf-wt.]
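For reference when reading the A2 figures (and the corresponding A3/A4 figures), the following minimal Python sketch computes the reported metrics from raw elapsed times. It assumes the standard definitions (speedup = T_1/T_P, parallel efficiency = speedup/P, load imbalance as the ratio of the slowest node's time to the mean node time); the thesis may normalise load imbalance slightly differently, so treat the last function as an assumption.

# Sketch: performance metrics reported in the Appendix A2-A4 figures.
# Standard textbook definitions are assumed throughout.

def speedup(t_seq: float, t_par: float) -> float:
    """Speedup = sequential elapsed time / parallel elapsed time."""
    return t_seq / t_par

def parallel_efficiency(t_seq: float, t_par: float, p: int) -> float:
    """Efficiency = speedup / number of leaf nodes (1.0 is ideal)."""
    return speedup(t_seq, t_par) / p

def load_imbalance(node_times: list[float]) -> float:
    """Imbalance = slowest node / mean node time (assumed definition:
    1.0 means perfectly balanced; the thesis may subtract 1)."""
    return max(node_times) / (sum(node_times) / len(node_times))

# Example: 4 leaf nodes, sequential run 60 s, parallel run 20 s.
print(speedup(60.0, 20.0))                    # 3.0
print(parallel_efficiency(60.0, 20.0, 4))     # 0.75
print(load_imbalance([5.0, 4.8, 5.2, 4.5]))   # ~1.07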

Appendix A3. Further Retrieval Results for On-the-fly Distribution: Routing/Filtering Task

[Fig A3-1. ZIFF-DAVIS [On-the-fly]: speedup for term selection algorithms against slave nodes (Network). Series: FB ADD, FB A/R, FB RW, CFP ADD, CFP A/R, CFP RW, CSP ADD, CSP A/R, CSP RW.]

[Fig A3-2. ZIFF-DAVIS [On-the-fly]: parallel efficiency for term selection algorithms against slave nodes (Network). Same series as Fig A3-1.]

[Fig A3-3. ZIFF-DAVIS [On-the-fly]: outer iterations to service term selection against slave nodes (Network). Same series as Fig A3-1.]

[Fig A3-4. ZIFF-DAVIS [On-the-fly]: outer iterations to service term selection against slave nodes (AP). Series: FB, CFP, CSP.]

Appendix A4. Further Details of Update and Index Maintenance Experiments

[Fig A4-1. BASE [DocId]: parallel efficiency for update transactions (postings only).]

[Fig A4-2. BASE [TermId]: parallel efficiency for update transactions (postings only).]

[Fig A4-3. BASE [DocId]: parallel efficiency for update transactions (position data).]

[Fig A4-4. BASE [TermId]: parallel efficiency for update transactions (position data).]

[Fig A4-5. BASE [DocId]: parallel efficiency for all transactions (postings only).]

[Fig A4-6. BASE [TermId]: parallel efficiency for all transactions (postings only).]

[Fig A4-7. BASE [DocId]: parallel efficiency for all transactions (position data).]

[Fig A4-8. BASE [TermId]: parallel efficiency for all transactions (position data).]

[Fig A4-9. BASE [DocId]: % increase from normal average transaction elapsed time during index update (postings only).]

[Fig A4-10. BASE [DocId]: % increase from normal average transaction elapsed time during index update (position data).]

[Fig A4-11. BASE [TermId]: % increase from normal average transaction elapsed time during index update (postings only).]

[Fig A4-12. BASE [TermId]: % increase from normal average transaction elapsed time during index update (position data).]

[Fig A4-13. BASE [DocId]: parallel efficiency for index reorganisation (postings only). Series: 4 docs, 8 docs, 2 docs, 4 docs, 5 docs.]

[Fig A4-14. BASE [DocId]: parallel efficiency for index reorganisation (position data). Same series.]

[Fig A4-15. BASE [TermId]: parallel efficiency for index reorganisation (postings only). Same series.]

[Fig A4-16. BASE [TermId]: parallel efficiency for index reorganisation (position data). Same series.]

[Fig A4-17. BASE [DocId]: accumulated total time in seconds for index reorganisation (postings only). Series: 4 docs, 8 docs, 2 docs, 4 docs, 5 docs.]

[Fig A4-18. BASE [DocId]: accumulated total time in seconds for index reorganisation (position data). Same series.]

[Fig A4-19. BASE [TermId]: accumulated total time in seconds for index reorganisation (postings only). Same series.]

[Fig A4-20. BASE [TermId]: accumulated total time in seconds for index reorganisation (position data). Same series.]

Table A4-1. BASE/BASE1 [DocId]: scalability on index reorganisation

Metric                              Collection   4 Docs   8 Docs   2 Docs   4 Docs   5 Docs
Position data: total time (secs)    BASE         65.9     5,53     28       32       -
                                    BASE1        677      74,644   2,466    2,68     -
Position data: scalability                       .973     .993     .6       .3       .2
Postings only: total time (secs)    BASE         26.6     42.5     72.8     23       28
                                    BASE1        235      372      59       97       8
Postings only: scalability                       .3       .4       .23      .36      .6

Appendix A5. Synthetic Models Chapter Appendix

Table A5-1. Load imbalance estimates for the LI[P] variable

P     LI[P]        P     LI[P]
2     1.02         12    1.16
3     1.03         13    1.18
4     1.05         14    1.2
5     1.06         15    1.21
6     1.08         16    1.23
7     1.09         17    1.25
8     1.1          18    1.27
9     1.13         19    1.29
10    1.14         20    1.3

A5.1 SEQUENTIAL MODEL FOR INDEXING

A5.1.1 Analyse Documents

Strip words from documents:    d * n
Insert word into block:        log(n)

  dn log(n) * T_cpu

A5.1.2 Save Intermediate Results

Number of intermediate saves:   dn/BSIZE
Cost per intermediate save:     BSIZE * T_i/o

  (dn/BSIZE) * (BSIZE * T_i/o)

A5.1.3 Merge Phase

Load blocks:    (dn/BSIZE) * (BSIZE * T_i/o)
Write blocks:   (dn/BSIZE) * (BSIZE * T_i/o)
Merge blocks:   (dn/BSIZE) * (BSIZE * T_cpu)

  2((dn/BSIZE) * (BSIZE * T_i/o)) + (dn/BSIZE) * (BSIZE * T_cpu)

A5.1.4 Sequential Indexing Model

Combining the equations declared in sections A5.1.1 to A5.1.3 gives the following sequential synthetic indexing model:

  INDEX_seq(d,n,BSIZE) = dn log(n) * T_cpu + 3((dn/BSIZE) * (BSIZE * T_i/o)) + (dn/BSIZE) * (BSIZE * T_cpu)

A5.2 PARALLEL MODELS FOR INDEXING

A5.2.1 Distributing Documents to Nodes

  dn/f * T_comm

A5.2.2 Global Merge Phase

  (dn/BSIZE) * (P * T_comm)

A5.2.3 DocId Indexing Models

Using the function defined in section A5.1.4 and the equation defined in section A5.2.1, we can define synthetic models for DocId indexing. With distributed build (INDEX_Distr_DocId) we also add the distribution component for text data (the equation from section A5.2.1):

  INDEX_Local_DocId(d,n,P,BSIZE) = (INDEX_seq(d,n,BSIZE)/P) * LI[P]

  INDEX_Distr_DocId(d,n,f,P,BSIZE) = ((INDEX_seq(d,n,BSIZE)/P) * LI[P]) + (dn/f * T_comm)
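As a reading aid, here is a minimal Python sketch of INDEX_seq and the two DocId build models. The unit costs T_CPU, T_IO and T_COMM and the example arguments are hypothetical placeholders, not calibrated values from the thesis.

# Sketch of the synthetic indexing models (hypothetical constants).
import math

T_CPU, T_IO, T_COMM = 1e-7, 1e-6, 1e-5   # illustrative unit costs in seconds
LI = {2: 1.02, 4: 1.05, 8: 1.1}          # subset of Table A5-1

def index_seq(d: int, n: int, bsize: int) -> float:
    """INDEX_seq: analyse + intermediate saves/loads/writes + merge."""
    analyse = d * n * math.log(n) * T_CPU
    io = 3 * (d * n / bsize) * (bsize * T_IO)
    merge = (d * n / bsize) * (bsize * T_CPU)
    return analyse + io + merge

def index_local_docid(d, n, p, bsize):
    """Local build: the text is already on the nodes; divide by P, scale by LI."""
    return (index_seq(d, n, bsize) / p) * LI[p]

def index_distr_docid(d, n, f, p, bsize):
    """Distributed build: local build plus distributing dn/f units of text."""
    return index_local_docid(d, n, p, bsize) + (d * n / f) * T_COMM

# Example: 100,000 documents of 500 words, block size 10,000, 8 nodes.
print(index_seq(100_000, 500, 10_000))
print(index_distr_docid(100_000, 500, 10, 8, 10_000))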

A5.2.4 TermId Indexing Model

The distributed build TermId model (INDEX_Distr_TermId) must redefine one aspect of the sequential indexing model from section A5.1.4: the merge component of section A5.1.3 is doubled for the TermId model, and the extra communication costs from section A5.2.2 are added. The revised index computation component is divided by the number of processors and multiplied by the load imbalance estimate:

  INDEX_Distr_TermId(d,n,f,P,BSIZE) = dn/f * T_comm + (dn/BSIZE) * (P * T_comm) + ((dn log(n) * T_cpu + 6((dn/BSIZE) * (BSIZE * T_i/o)) + 2(dn/BSIZE) * (BSIZE * T_cpu)) / P) * LI[P]

A5.3 SEQUENTIAL MODEL FOR PROBABILISTIC SEARCH

The sequential model for probabilistic search is made up of the following:

Load q keyword sets:      Load_kw_seq(q,s) = q * T_i/o[s]
Weight q keyword sets:    Weight_kw_seq(q,s) = s*q * T_cpu
Merge q-1 keyword sets:   Merge_kw_seq(q,s) = (q-1)*(s+s) * T_cpu
Sort final results set:   Sort_set_seq(q,s) = R[q,s] log(R[q,s]) * T_cpu

Put together, these functions make up the synthetic search model for sequential probabilistic search:

  SEARCH_seq(s,q) = Load_kw_seq(q,s) + Weight_kw_seq(q,s) + Merge_kw_seq(q,s) + Sort_set_seq(q,s)
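A sketch of SEARCH_seq follows. T_i/o[s] and R[q,s] are defined elsewhere in the thesis; the linear stand-ins used here are assumptions for illustration only.

# Sketch of SEARCH_seq with stand-in definitions for T_i/o[s] and R[q,s].
import math

T_CPU, T_IO_UNIT = 1e-7, 1e-6

def t_io(s: int) -> float:
    """T_i/o[s]: assumed linear in the posting-set size s."""
    return s * T_IO_UNIT

def R(q: int, s: int) -> int:
    """R[q,s]: assumed size of the merged result set for q terms."""
    return q * s

def search_seq(s: int, q: int) -> float:
    load = q * t_io(s)                          # Load_kw_seq
    weight = s * q * T_CPU                      # Weight_kw_seq
    merge = (q - 1) * (s + s) * T_CPU           # Merge_kw_seq
    sort = R(q, s) * math.log(R(q, s)) * T_CPU  # Sort_set_seq
    return load + weight + merge + sort

print(search_seq(s=50_000, q=5))  # e.g. 5 query terms, 50k postings each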

A5.4 PARALLEL MODELS FOR PROBABILISTIC SEARCH

A5.4.1 DocId Partitioning

The parallel model using DocId partitioning for probabilistic search is made up of the following:

Communications costs for DocId: Comms_Search_docid(P) = 3(P * T_comm)
  Send P requests for term frequencies:        P * T_comm
  Send P queries (with term frequencies):      P * T_comm
  Gather results from P nodes for set size s:  P * T_comm

Load q keyword sets:      Load_kw_docid(q,s,P) = q * T_i/o[s/P]
Weight q keyword sets:    Weight_kw_par(q,s,P) = (Weight_kw_seq(q,s)/P) * LI[P]
Merge q-1 keyword sets:   Merge_kw_par(q,s,P) = (Merge_kw_seq(q,s)/P) * LI[P]
Sort final results set:   Sort_set_par(q,s,P) = ((R[q,s]/P) log(R[q,s]/P) * T_cpu) * LI[P]

The DocId partitioning synthetic search model is therefore:

  SEARCH_docid(s,q,P) = Comms_Search_docid(P) + Load_kw_docid(q,s,P) + Weight_kw_par(q,s,P) + Merge_kw_par(q,s,P) + Sort_set_par(q,s,P)
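The DocId model composes directly, as the sketch below shows; it re-declares the illustrative constants and stand-in functions from the previous sketches so that it runs on its own.

# Sketch of the DocId search model (illustrative constants and stand-ins).
import math

T_CPU, T_IO_UNIT, T_COMM = 1e-7, 1e-6, 1e-5
LI = {2: 1.02, 4: 1.05, 8: 1.1}

def t_io(s): return s * T_IO_UNIT   # assumed linear T_i/o[s]
def R(q, s): return q * s           # assumed result-set size R[q,s]

def search_docid(s, q, p):
    comms = 3 * p * T_COMM                            # Comms_Search_docid
    load = q * t_io(s / p)                            # Load_kw_docid
    weight = (s * q * T_CPU / p) * LI[p]              # Weight_kw_par
    merge = ((q - 1) * 2 * s * T_CPU / p) * LI[p]     # Merge_kw_par
    r = R(q, s) / p
    sort = (r * math.log(r) * T_CPU) * LI[p]          # Sort_set_par
    return comms + load + weight + merge + sort

print(search_docid(s=50_000, q=5, p=8))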

A5.4.2 TermId Partitioning 1 - Sequential Sort

The parallel model using TermId partitioning for probabilistic search with a sequential sort is made up of the following:

Communications costs for TermId: Comms_Search_termid(s,q,P,SSIZE) = ((R[q,s]/SSIZE) * P * T_comm) + (P * T_comm)
  Send P queries (with term frequencies):      P * T_comm
  Gather results from P nodes for set size s:  (R[q,s]/SSIZE) * P * T_comm

Load q keyword sets:      Load_kw_termid(q,s,P) = q/P * T_i/o[s/P[q]]
Weight q keyword sets:    Weight_kw_par(q,s,P[q])
Merge q-1 keyword sets:   Merge_kw_par(q,s,P[q])
Sort final results set:   Sort_set_seq(q,s)

The TermId partitioning synthetic search model with sequential sort is therefore:

  SEARCH_termid(s,q,P,SSIZE) = Comms_Search_termid(s,q,P,SSIZE) + Load_kw_termid(q,s,P) + Weight_kw_par(q,s,P[q]) + Merge_kw_par(q,s,P[q]) + Sort_set_seq(q,s)

A5.4.3 TermId Partitioning 2 - Parallel Sort

The parallel model using TermId partitioning for probabilistic search with a parallel sort is made up of the following:

Communications costs for TermId2: Comms_Search_termid2(s,q,P,SSIZE) = 3((R[q,s]/SSIZE) * P * T_comm) + (P * T_comm)
  Send P queries (with term frequencies):      P * T_comm
  Gather results from P nodes for set size s:  3(R[q,s]/SSIZE) * P * T_comm

Load q keyword sets:      Load_kw_termid(q,s,P)
Weight q keyword sets:    Weight_kw_par(q,s,P[q])
Merge q-1 keyword sets:   Merge_kw_par(q,s,P[q])
Sort final results set:   Sort_set_par(q,s,P)

The TermId partitioning synthetic search model with parallel sort is therefore:

  SEARCH_termid2(s,q,P,SSIZE) = Comms_Search_termid2(s,q,P,SSIZE) + Load_kw_termid(q,s,P) + Weight_kw_par(q,s,P[q]) + Merge_kw_par(q,s,P[q]) + Sort_set_par(q,s,P)

A5.5 SEQUENTIAL MODEL FOR PASSAGE RETRIEVAL

Service q terms on PR documents, each with (a(a-1))/2 inspected passages:

  Compute_Pass(PR,q,a) = T_cpu * PR * q * ((a(a-1))/2)

A sort on the top PR documents is required to re-rank the final results set, at a cost of T_cpu PR log(PR).

  PASSAGE_seq(s,q,a,PR) = SEARCH_seq(s,q) + Compute_Pass(PR,q,a) + T_cpu PR log(PR)

A5.6 PARALLEL MODELS FOR PASSAGE RETRIEVAL

A5.6.1 DocId Models

The DocId method simply applies P processors to the Compute_Pass computation defined in section A5.5:

  Compute_Pass_par(PR,q,a,P) = (Compute_Pass(PR,q,a)/P) * LI[P]

The local passage processing cost model is constructed by adding the probabilistic DocId cost model from section A5.4.1 to the Compute_Pass_par model:

  PASSAGE_docid_local(s,q,a,PR,P) = SEARCH_docid(s,q,P) + Compute_Pass_par(PR,q,a,P)
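A short sketch of the sequential passage retrieval cost follows. The search_time argument stands in for SEARCH_seq(s,q) from the A5.3 sketch; the constants and example figures are hypothetical.

# Sketch of PASSAGE_seq: probabilistic search plus passage scoring and
# a re-ranking sort over the top PR documents (hypothetical constants).
import math

T_CPU = 1e-7

def compute_pass(pr: int, q: int, a: int) -> float:
    """q terms scored over (a(a-1))/2 candidate passages in PR documents."""
    return T_CPU * pr * q * (a * (a - 1) / 2)

def passage_seq(search_time: float, q: int, a: int, pr: int) -> float:
    """search_time stands in for SEARCH_seq(s,q)."""
    return search_time + compute_pass(pr, q, a) + T_CPU * pr * math.log(pr)

# Example: top 1000 documents, 5 query terms, 20 passage atoms per document.
print(passage_seq(search_time=0.5, q=5, a=20, pr=1000))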

The distributed passage retrieval method must also gather data from the nodes in order to choose the best PR documents in the collection. This requires four stages:

i) Gather the data from an initial probabilistic search (the top PR documents)
ii) Scatter this full set to the processors
iii) Gather up the full set from all the processors
iv) Do a final rank on the top PR documents

The estimate for this overhead is therefore:

i) Gather data, (PR/SSIZE)/P from each of P nodes:  PR/SSIZE (the P is eliminated)
ii) Scatter PR elements to P processors:            T_comm (PR/SSIZE)*P
iii) Gather PR elements from P processors:          T_comm (PR/SSIZE)*P
iv) Sort PR elements to obtain the final rank:      T_cpu PR log(PR)

The model for overheads on distributed passage processing is therefore:

  OVERHEAD_pass(PR,P,SSIZE) = T_comm((2(PR/SSIZE)*P) + PR/SSIZE) + T_cpu PR log(PR)

The DocId distributed passage processing cost model is constructed by adding the probabilistic DocId cost model from section A5.4.1 and the Compute_Pass_par model together with the OVERHEAD_pass cost model:

  PASSAGE_docid_distr(s,q,a,PR,SSIZE,P) = SEARCH_docid(s,q,P) + Compute_Pass_par(PR,q,a,P) + OVERHEAD_pass(PR,P,SSIZE)

A5.6.2 TermId Models

In TermId we must communicate the data for (a(a-1))/2 passages for PR documents on P processors:

  OVERHEAD_passtid(a,PR,P) = T_comm(PR*P*((a(a-1))/2))

The TermId distributed passage processing cost models are constructed by adding the probabilistic TermId cost models from sections A5.4.2 and A5.4.3 to the Compute_Pass_par model together with the OVERHEAD_pass and OVERHEAD_passtid cost models:

  PASSAGE_termid(s,q,a,PR,SSIZE,P) = SEARCH_termid(s,q,P,SSIZE) + Compute_Pass_par(PR,q,a,P[q]) + OVERHEAD_pass(PR,P,SSIZE) + OVERHEAD_passtid(a,PR,P)

  PASSAGE_termid2(s,q,a,PR,SSIZE,P) = SEARCH_termid2(s,q,P,SSIZE) + Compute_Pass_par(PR,q,a,P[q]) + OVERHEAD_pass(PR,P,SSIZE) + OVERHEAD_passtid(a,PR,P)

A5.7 SEQUENTIAL MODELS FOR TERM SELECTION

A5.7.1 Evaluation

The cost of an evaluation is broken down into the following:

Merge set for term with accumulated set:        T_cpu * (s+s)
Merge relevance judgements with temporary set:  T_cpu * (s+r)
Rank the temporary set using a sort:            T_cpu * (R[q,s] log(R[q,s]))

Put together, these equations form the model for the cost of a single evaluation:

  EVAL(q,s,r) = T_cpu * ((s+s) + R[q,s] log(R[q,s]) + (s+r))

A5.7.2 Total Number of Evaluations

The maximum number of evaluations for the find-best algorithm is q*i. Not all keywords are inspected in i iterations: one term is always chosen, so after each iteration one less term is inspected, and i*(i+1)*0.5 accumulates the total number of keywords not inspected in i iterations. Combined with an estimate u of the proportion of terms skipped, the function for inspected terms is:

  INSPECTED(q,i) = (qi - (i(i+1)*0.5)) - u(qi - (i(i+1)*0.5))

A5.7.3 Load Costs for Keywords

The cost of loading term data is as follows:

Load q terms from disk, each with set size s:  q * T_i/o[s]
Weight q terms, each with set size s:          q * s * T_cpu

Putting these equations together yields the following load cost:

  LOAD(q,s) = q(T_i/o[s] + s*T_cpu)
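The three building blocks translate to code as follows; u (the proportion of terms skipped) and the stand-in for R[q,s] are assumptions.

# Sketch of the term selection building blocks EVAL, INSPECTED and LOAD
# (hypothetical constants).
import math

T_CPU, T_IO_UNIT = 1e-7, 1e-6
U = 0.3  # assumed proportion of candidate terms skipped

def r_set(q, s): return q * s   # stand-in for R[q,s]

def eval_cost(q, s, r):
    # merge with accumulated set + rank sort + merge relevance judgements
    rs = r_set(q, s)
    return T_CPU * ((s + s) + rs * math.log(rs) + (s + r))

def inspected(q, i):
    # one fewer term inspected each iteration; fraction U of the rest skipped
    skipped = i * (i + 1) * 0.5
    return (q * i - skipped) * (1 - U)

def load(q, s):
    return q * (s * T_IO_UNIT + s * T_CPU)   # LOAD(q,s)

print(inspected(q=100, i=10))                                   # evaluations
print(inspected(100, 10) * eval_cost(100, 10_000, 50))          # total EVAL cost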

A5.7.4 Sequential Models for Term Selection

Using the models defined in sections A5.7.1 to A5.7.3 we can now define the sequential cost models for term selection. For add-only operation (ROUTING_seq) this is a simple matter of multiplying the evaluation cost (section A5.7.1) by the number of terms inspected (section A5.7.2) and adding the load costs (section A5.7.3). The model for add-reweight (ROUTING_seqw) is constructed by factoring the total evaluation cost by the reweight variable w.

  ROUTING_seq(s,r,i,q) = (INSPECTED(q,i) * EVAL(q,s,r)) + LOAD(q,s)

  ROUTING_seqw(s,r,i,q,w) = (INSPECTED(q,i) * EVAL(q,s,r) * w) + LOAD(q,s)

A5.8 PARALLEL MODELS FOR TERM SELECTION

The basic term selection models with no synchronisation or communication costs are as follows:

  ROUTING_par(s,r,i,q,P) = (ROUTING_seq(s,r,i,q)/P) * LI[P]

  ROUTING_parw(s,r,i,q,P,w) = (ROUTING_seqw(s,r,i,q,w)/P) * LI[P]

A5.8.1 DocId Models

The cost model for intra-set parallelism is:

Merge set costs:       Merge_Route_docid(s,r,P) = (T_cpu(s + r + s)/P) * LI[P]
Sort costs:            Sort_Route_docid(s,P) = (T_cpu(s/P) log(s/P)) * LI[P]
Communication costs:   Comms_Route_docid(s,P,SSIZE) = (((s/SSIZE)/P) + 2P) * T_comm

Putting these functions together gives the evaluation cost model for DocId term selection:

  EVAL_docid(s,r,i,q,P,SSIZE) = INSPECTED(q,i) * (Merge_Route_docid(s,r,P) + Sort_Route_docid(s,P) + Comms_Route_docid(s,P,SSIZE))

We also measure overheads at the synchronisation point for merging the chosen term into the accumulated set and communicating the best term identifier in one iteration:

Communication costs for best term:            P*T_comm
Merge best term set into accumulated set:     ((s*T_cpu)/P) * LI[P]

We assume latency is the dominant factor in communication costs. Putting these equations together gives the estimate of overheads for the DocId term selection cost model:

  OVERHEAD_docid(s,i,P) = i*((((s*T_cpu)/P) * LI[P]) + (P*T_comm))

The models for term selection are constructed by taking the load cost model (defined in section A5.7.3) and adding the evaluation and overhead cost models defined above in this section. The load cost model is further refined by dividing by the number of processors and factoring the result by the load imbalance estimate (LI[P]):

  ROUTING_docid(s,r,i,q,P,SSIZE) = ((LOAD(q,s)/P) * LI[P]) + EVAL_docid(s,r,i,q,P,SSIZE) + OVERHEAD_docid(s,i,P)

  ROUTING_docidw(s,r,i,q,P,SSIZE,w) = ((LOAD(q,s)/P) * LI[P]) + (w*EVAL_docid(s,r,i,q,P,SSIZE)) + OVERHEAD_docid(s,i,P)
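Composed, the DocId term selection model looks like this in sketch form (illustrative constants; T_i/o assumed linear; LI values from Table A5-1). Note that INSPECTED simplifies to (qi - i(i+1)/2)(1 - u).

# Sketch of the DocId term selection model (hypothetical constants).
import math

T_CPU, T_IO_UNIT, T_COMM = 1e-7, 1e-6, 1e-5
LI = {2: 1.02, 4: 1.05, 8: 1.1}
U = 0.3  # assumed skip fraction used by INSPECTED

def inspected(q, i):
    return (q * i - i * (i + 1) * 0.5) * (1 - U)

def eval_docid(s, r, i, q, p, ssize):
    merge = (T_CPU * (s + r + s) / p) * LI[p]                 # Merge_Route_docid
    sort = (T_CPU * (s / p) * math.log(s / p)) * LI[p]        # Sort_Route_docid
    comms = (((s / ssize) / p) + 2 * p) * T_COMM              # Comms_Route_docid
    return inspected(q, i) * (merge + sort + comms)

def overhead_docid(s, i, p):
    return i * (((s * T_CPU) / p) * LI[p] + p * T_COMM)

def routing_docid(s, r, i, q, p, ssize):
    load = (q * (s * T_IO_UNIT + s * T_CPU) / p) * LI[p]
    return load + eval_docid(s, r, i, q, p, ssize) + overhead_docid(s, i, p)

print(routing_docid(s=10_000, r=50, i=10, q=100, p=8, ssize=1024))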

A5.8.2 TermId Models

The interaction at the synchronisation point is more complicated than for the DocId models, because the data for the best term must be retrieved from the relevant node and merged into the accumulated set, which is then broadcast to all nodes. Overheads for the TermId models are calculated as follows:

Get the identifier of the best term in one iteration:  P*T_comm
Request the best term set:                             1*T_comm
Retrieve the best set from the relevant node:          (s/SSIZE) * T_comm
Broadcast the best set to all other nodes:             (P-1)*(s/SSIZE) * T_comm
Merge the best term data into the accumulated set:     s*T_cpu + T_i/o[s]

Put together, these equations form the cost model for routing overheads on the TermId partitioning scheme:

  OVERHEAD_termid(s,i,P,SSIZE) = i*((T_comm*((P+1) + (P*s/SSIZE))) + (s*T_cpu + T_i/o[s]))

Construction of the routing models can be done by re-using the basic term selection models and adding the OVERHEAD_termid cost:

  ROUTING_termid(s,r,i,q,P,SSIZE) = (ROUTING_par(s,r,i,q,P) * LI[P]) + OVERHEAD_termid(s,i,P,SSIZE)

  ROUTING_termidw(s,r,i,q,P,SSIZE,w) = (ROUTING_parw(s,r,i,q,P,w) * LI[P]) + OVERHEAD_termid(s,i,P,SSIZE)

The extra LI[P] factor here reflects the assumption that ROUTING_termid imbalance will probably be worse than ROUTING_rep in particular, and the other models in general, because terms are statically allocated to a node (see chapter 4, sub-section 4.4.2.3 for a discussion of term allocation schemes).

A5.8.3 Replication Models

Latency is presumed to be the main communication problem for the replication distribution scheme. The overheads for the replication cost models are calculated as follows:

Get the identifier of the best term from P processors:  P * T_comm
Send the identifier of the best term to P processors:   P * T_comm
Merge the best term data into the accumulated set:      (s*T_cpu) + T_i/o[s]

Putting these equations together gives the following cost model:

  OVERHEAD_rep(s,i,P) = i*(s*T_cpu + T_i/o[s] + (2P*T_comm))

The cost models for the replication distribution scheme can be constructed by simply adding the overheads to the basic term selection cost models:

  ROUTING_rep(s,r,i,q,P) = ROUTING_par(s,r,i,q,P) + OVERHEAD_rep(s,i,P)

  ROUTING_repw(s,r,i,q,P,w) = ROUTING_parw(s,r,i,q,P,w) + OVERHEAD_rep(s,i,P)

A5.8.4 On-the-fly Distribution Models

The overhead cost model for the on-the-fly distribution scheme is:

  OVERHEAD_load(q,s,SSIZE,P) = ((qs/SSIZE)+P) * T_comm

There are also overhead costs at the synchronisation point for transferring set data, formed as follows:

Get the identifier of the best term in one iteration:  P*T_comm
Broadcast the best set to all nodes:                   (P * s/SSIZE) * T_comm
Merge the best term data into the accumulated set:     (s*T_cpu) + T_i/o[s]

Putting these equations together, we have the overhead at the synchronisation point:

  OVERHEAD_large(s,i,SSIZE,P) = i*((((P*(s/SSIZE))+P) * T_comm) + (s*T_cpu + T_i/o[s]))

We cannot use the basic parallel term selection cost models, as some aspects of them (such as load) must be done sequentially. We apply a parallel cost model to the total evaluation cost,

together with the load cost model defined in section A5.7.3 and the overhead cost models defined above in this section:

  ROUTING_parfly(s,r,i,q,SSIZE,P) = LOAD(q,s) + OVERHEAD_load(q,s,SSIZE,P) + OVERHEAD_large(s,i,SSIZE,P) + ((INSPECTED(q,i) * EVAL(q,s,r))/P) * LI[P]

  ROUTING_parflyw(s,r,i,q,SSIZE,P,w) = LOAD(q,s) + OVERHEAD_load(q,s,SSIZE,P) + OVERHEAD_large(s,i,SSIZE,P) + ((INSPECTED(q,i) * EVAL(q,s,r) * w)/P) * LI[P]

A5.9 SEQUENTIAL MODELS FOR INDEX UPDATE

A5.9.1 Adding a Document to the Buffer: Update Transaction

The client/server update model is formed by the following steps:

Scan words and put them in the client tree:   n log(n) * T_cpu
Marshalling/unmarshalling term data:          (n + n) * T_cpu
Sending data:                                 (n/SSIZE) * T_comm
Merge word data with the server buffer:       n log(dict) * T_cpu

Putting these equations together gives a cost model for update on a single inverted file:

  UPDATE_seq(n,dict,SSIZE) = T_cpu(n log(n) + 2n + n log(dict)) + T_comm * (n/SSIZE)

A5.9.2 Transactions While the Index Is Updated

The cost model for a transaction under contention is calculated by applying the contention factor c to the particular function being examined. The search cost model is taken from section A5.3 and the update cost model from the previous section:

  UPDATE_seqc(n,dict,SSIZE) = (UPDATE_seq(n,dict,SSIZE) * c[1]) + UPDATE_seq(n,dict,SSIZE)

  SEARCH_seqc(s,q) = (SEARCH_seq(s,q) * c[1]) + SEARCH_seq(s,q)

A5.9.3 Transaction Estimate

Taking the cost models defined in sections A5.9.1 and A5.9.2, we can construct a cost model for transactions in which the proportion of total transaction time spent updating the index can be varied. This allows us to vary the effect on transactions and study the theoretical performance penalty on transactions during a simultaneous index update. In the TRANSACTION_seq model, setting ro to zero eliminates contention, while setting ro to 1 means that all transactions are affected by contention.

  TRANSACTION_seq(ur,sr,ro,n,dict,s,q,SSIZE) = ((1-ro)(ur*UPDATE_seq(n,dict,SSIZE) + sr*SEARCH_seq(s,q)) + ro(ur*UPDATE_seqc(n,dict,SSIZE) + sr*SEARCH_seqc(s,q))) / (ur + sr)
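A sketch of the update and transaction models follows. UPDATE_seq is the name reconstructed above, the contention factors c[P] are invented for illustration, and the X_c = X*c + X pattern is applied by a helper.

# Sketch of the update and transaction-under-contention models
# (hypothetical constants and contention factors).
import math

T_CPU, T_COMM = 1e-7, 1e-5
C = {1: 0.25, 2: 0.3, 4: 0.4}  # assumed contention factors c[P]

def update_seq(n: int, dictn: int, ssize: int) -> float:
    return T_CPU * (n * math.log(n) + 2 * n + n * math.log(dictn)) \
           + T_COMM * (n / ssize)

def with_contention(base: float, p: int = 1) -> float:
    """X_c = X * c[P] + X, the pattern used by all the contention models."""
    return base * C[p] + base

def transaction_seq(ur, sr, ro, upd, srch, upd_c, srch_c):
    """Average transaction cost; ro is the fraction affected by contention."""
    normal = (1 - ro) * (ur * upd + sr * srch)
    contended = ro * (ur * upd_c + sr * srch_c)
    return (normal + contended) / (ur + sr)

u = update_seq(n=500, dictn=200_000, ssize=1024)
s = 0.05  # stand-in for SEARCH_seq(s,q)
print(transaction_seq(ur=1, sr=9, ro=0.5, upd=u, srch=s,
                      upd_c=with_contention(u), srch_c=with_contention(s)))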

A5.9.4 Reorganisation of the Inverted File

The reorganisation model is made up of the following synthetic cost functions:

Insert m buffer words in dict:       Insert_Words_buff(m,b,dict) = T_cpu(m(log(dict/b)+b))
Read t+m keyword lists from disk:    List_Disk_Trans(t,m) = T_i/o[s] * (t+m)
Write t+m keyword lists to disk:     List_Disk_Trans(t,m)
Merge m keyword lists:               Merge_kw_lists(m,s) = T_cpu(m(s+s))
Read in (dict/b) keyword blocks:     Read_kw_blocks(dict,b) = T_i/o[b] * (dict/b)

The reorganisation (index update) cost model is constructed from the four cost models defined above (List_Disk_Trans is used twice):

  REORG_seq(n,dict,b,m,t,s) = Insert_Words_buff(m,b,dict) + 2List_Disk_Trans(t,m) + Merge_kw_lists(m,s) + Read_kw_blocks(dict,b)

The contention model for reorganising the index is:

  REORG_seqc(n,dict,b,m,t,s) = (REORG_seq(n,dict,b,m,t,s) * c[1]) + REORG_seq(n,dict,b,m,t,s)
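REORG_seq in sketch form, again with hypothetical constants and a linear T_i/o; n appears in the model's signature but not in its body, so it is omitted here.

# Sketch of the sequential reorganisation model REORG_seq.
import math

T_CPU, T_IO_UNIT = 1e-7, 1e-6

def reorg_seq(dictn: int, b: int, m: int, t: int, s: int) -> float:
    insert = T_CPU * m * (math.log(dictn / b) + b)   # Insert_Words_buff
    disk = 2 * (s * T_IO_UNIT) * (t + m)             # read + write t+m lists
    merge = T_CPU * m * (s + s)                      # Merge_kw_lists
    blocks = (b * T_IO_UNIT) * (dictn / b)           # Read_kw_blocks
    return insert + disk + merge + blocks

# Example: 200k-word dictionary, 1000-entry blocks, 5k buffered words,
# 50k existing lists, average list size 2000 postings.
print(reorg_seq(dictn=200_000, b=1000, m=5000, t=50_000, s=2000))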

A5.10 PARALLEL MODELS FOR INDEX UPDATE

A5.10.1 DocId Transaction Model

In this data distribution method we simply re-use the sequential cost model defined in section A5.9.1 above:

  UPDATE_docid(n,dict,SSIZE) = UPDATE_seq(n,dict,SSIZE)

The contention model also re-uses the sequential model:

  UPDATE_docidc(n,dict,SSIZE,P) = (UPDATE_seq(n,dict,SSIZE) * c[P]) + UPDATE_seq(n,dict,SSIZE)

In order to construct the contention model for DocId search, we re-use the function defined in section A5.4.1 above:

  SEARCH_docidc(s,q,P) = (SEARCH_docid(s,q,P) * c[P]) + SEARCH_docid(s,q,P)

The transaction model for DocId partitioning is constructed in exactly the same way as the sequential version described in section A5.9.3 above:

  TRANSACTION_docid(ur,sr,ro,n,dict,s,q,P,SSIZE) = ((1-ro)(ur*UPDATE_docid(n,dict,SSIZE) + sr*SEARCH_docid(s,q,P)) + ro(ur*UPDATE_docidc(n,dict,SSIZE,P) + sr*SEARCH_docidc(s,q,P))) / (ur + sr)

A5.10.2 TermId Transaction Model

With the TermId distribution method a new cost model must be defined, as merging the data with the buffer is parallelised:

  UPDATE_termid(n,dict,P,SSIZE) = T_cpu(n log(n) + ((n log(dict))/P) * LI[P] + 2n) + (P*T_comm * (n/SSIZE))

The contention model re-uses the model defined above:

  UPDATE_termidc(n,dict,P,SSIZE) = (UPDATE_termid(n,dict,P,SSIZE) * c[P]) + UPDATE_termid(n,dict,P,SSIZE)

In order to construct the contention model for TermId search, we re-use the function defined in section A5.4.3 above (we utilise the parallel sort cost model):

  SEARCH_termidc(s,q,P,SSIZE) = (SEARCH_termid2(s,q,P,SSIZE) * c[P]) + SEARCH_termid2(s,q,P,SSIZE)

The transaction model for TermId partitioning is constructed in exactly the same way as the sequential version described in section A5.9.3 above:

  TRANSACTION_termid(ur,sr,ro,n,dict,s,q,P,SSIZE) = ((1-ro)(ur*UPDATE_termid(n,dict,P,SSIZE) + sr*SEARCH_termid2(s,q,P,SSIZE)) + ro(ur*UPDATE_termidc(n,dict,P,SSIZE) + sr*SEARCH_termidc(s,q,P,SSIZE))) / (ur + sr)

A5.10.3 DocId Reorganisation Model

The DocId index update cost model is:

Insert m buffer words in dict:       Insert_Words_buff_docid(m,b,dict,P) = T_cpu(m*i[P]*(log(((dict/b)/P)*i[P])+b)) * LI[P]
Read t+m keyword lists from disk:    List_Disk_Trans_docid(t,m,s,P) = T_i/o[p[P]*(s/P)] * (t+m) * i[P] * LI[P]
Write t+m keyword lists to disk:     List_Disk_Trans_docid(t,m,s,P)
Merge m keyword lists:               Merge_kw_lists_docid(m,s,P) = T_cpu(m*i[P]*(s/P + s/P)) * LI[P]
Read in (dict/b) keyword blocks:     Read_kw_blocks_docid(dict,b,P) = T_i/o[b] * ((dict/b)/P) * i[P] * LI[P]

The index update model for DocId partitioning is constructed as follows:

  REORG_docid(n,dict,b,m,t,s,P) = Insert_Words_buff_docid(m,b,dict,P) + 2List_Disk_Trans_docid(t,m,s,P) + Merge_kw_lists_docid(m,s,P) + Read_kw_blocks_docid(dict,b,P)

This function is re-used in the construction of the cost model with contention as follows:

  REORG_docidc(n,dict,b,m,t,s,P) = (REORG_docid(n,dict,b,m,t,s,P) * c[P]) + REORG_docid(n,dict,b,m,t,s,P)

A5.10.4 TermId Reorganisation Model

The TermId index update cost model is:

Insert m buffer words in dict:       Insert_Words_buff_termid(m,b,dict,P) = (T_cpu(m(log(dict/b)+b))/P) * LI[P]
Read t+m keyword lists from disk:    List_Disk_Trans_termid(t,m,s,P) = ((T_i/o[s] * (t+m))/P) * LI[P]
Write t+m keyword lists to disk:     List_Disk_Trans_termid(t,m,s,P)
Merge m keyword lists:               Merge_kw_lists_termid(m,s,P) = (T_cpu(m(s+s))/P) * LI[P]
Read in (dict/b) keyword blocks:     Read_kw_blocks_termid(dict,b,P) = ((T_i/o[b] * (dict/b))/P) * LI[P]

The index update model for TermId partitioning is constructed as follows:

  REORG_termid(n,dict,b,m,t,s,P) = Insert_Words_buff_termid(m,b,dict,P) + 2List_Disk_Trans_termid(t,m,s,P) + Merge_kw_lists_termid(m,s,P) + Read_kw_blocks_termid(dict,b,P)

This function is re-used in the construction of the cost model with contention as follows:

  REORG_termidc(n,dict,b,m,t,s,P) = (REORG_termid(n,dict,b,m,t,s,P) * c[P]) + REORG_termid(n,dict,b,m,t,s,P)
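Finally, the TermId reorganisation model: every REORG_seq component is divided by P and scaled by LI[P]. The sketch below mirrors that construction with the same hypothetical constants as the earlier sketches.

# Sketch of the TermId reorganisation model and its contended variant.
import math

T_CPU, T_IO_UNIT = 1e-7, 1e-6
LI = {2: 1.02, 4: 1.05, 8: 1.1}
C = {2: 0.3, 4: 0.4, 8: 0.5}  # assumed contention factors c[P]

def reorg_termid(dictn, b, m, t, s, p):
    insert = (T_CPU * m * (math.log(dictn / b) + b) / p) * LI[p]
    disk = 2 * ((s * T_IO_UNIT * (t + m)) / p) * LI[p]
    merge = (T_CPU * m * (s + s) / p) * LI[p]
    blocks = ((b * T_IO_UNIT) * (dictn / b) / p) * LI[p]
    return insert + disk + merge + blocks

def reorg_termidc(dictn, b, m, t, s, p):
    """Contended variant: X_c = X * c[P] + X."""
    base = reorg_termid(dictn, b, m, t, s, p)
    return base * C[p] + base

print(reorg_termid(dictn=200_000, b=1000, m=5000, t=50_000, s=2000, p=8))
print(reorg_termidc(dictn=200_000, b=1000, m=5000, t=50_000, s=2000, p=8))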