APPENDICES

Appendix A1: Fig A1-1. Examples of build methods for distributed inverted files

Key:
- Distributed build
- Local build
- A node in the parallel machine
- Indicates presence of text files on a node
- Inverted file partition on a node
- Network connection between nodes

Appendix A2: Extra Probabilistic Search Results

Fig A2-1. BASE [TermId]: search average elapsed time in seconds (sequential sort: CF distribution)

Leaf nodes   Title Only (NPOS / POS)   Whole Topic (NPOS / POS)
2            63% / 4%                  55% / 34%
3            69% / 48%                 63% / 44%
4            73% / 53%                 69% / 54%
5            75% / 55%                 72% / 57%
6            78% / 58%                 75% / 6%
7            8% / 6%                   73% / 62%

Table A2-1. BASE [TermId]: search overheads in % of total time (sequential sort: CF distribution)

Fig A2-2. BASE [TermId]: search speedup (sequential sort: CF distribution)

Fig A2-3. BASE [TermId]: search parallel efficiency (sequential sort: CF distribution)

Fig A2-4. BASE [TermId]: search load imbalance (sequential sort: CF distribution)

Fig A2-5. BASE [TermId]: search average elapsed time in seconds (sequential sort: TF distribution)

Fig A2-6. BASE [TermId]: search speedup (sequential sort: TF distribution)

Fig A2-7. BASE [TermId]: search parallel efficiency (sequential sort: TF distribution)

Fig A2-8. BASE [TermId]: search load imbalance (sequential sort: TF distribution)

Leaf nodes   Title Only (NPOS / POS)   Whole Topic (NPOS / POS)
2            64% / 4%                  53% / 34%
3            68% / 47%                 67% / 44%
4            72% / 52%                 74% / 55%
5            7% / 56%                  73% / 54%
6            77% / 57%                 76% / 58%
7            79% / 59%                 8% / 64%

Table A2-2. BASE [TermId]: search overheads in % of total time (sequential sort: TF distribution)

Fig A2-9. BASE [TermId]: search throughput in queries/hour (CF and TF distributions)
[Series: NO POS and POS runs for the CF and TF distributions, title-only (to) and whole-topic (wt) queries]

Appendix A3. Further Retrieval Results for on-the-fly distribution: Routing/Filtering task

Fig A3-1. ZIFF-DAVIS [On-the-fly]: speedup for term selection algorithms (Network)
[Series plotted against slave nodes: FB ADD, FB A/R, FB RW, CFP ADD, CFP A/R, CFP RW, CSP ADD, CSP A/R, CSP RW]

Fig A3-2. ZIFF-DAVIS [On-the-fly]: parallel efficiency for term selection algorithms (Network)

Fig A3-3. ZIFF-DAVIS [On-the-fly]: outer iterations to service term selection (Network)

Fig A3-4. ZIFF-DAVIS [On-the-fly]: outer iterations to service term selection (AP)

Appendix A4. Further details of update and index maintenance experiments

Fig A4-1. BASE [DocId]: parallel efficiency for update transactions (postings only)

Fig A4-2. BASE [TermId]: parallel efficiency for update transactions (postings only)

Fig A4-3. BASE [DocId]: parallel efficiency for update transactions (position data)

Fig A4-4. BASE [TermId]: parallel efficiency for update transactions (position data)

Fig A4-5. BASE [DocId]: parallel efficiency for all transactions (postings only)

Fig A4-6. BASE [TermId]: parallel efficiency for all transactions (postings only)

Fig A4-7. BASE [DocId]: parallel efficiency for all transactions (position data)

Fig A4-8. BASE [TermId]: parallel efficiency for all transactions (position data)

Fig A4-9. BASE [DocId]: % increase from normal average transaction elapsed time during index update (postings only)

Fig A4-10. BASE [DocId]: % increase from normal average transaction elapsed time during index update (position data)

Fig A4-11. BASE [TermId]: % increase from normal average transaction elapsed time during index update (postings only)

Fig A4-12. BASE [TermId]: % increase from normal average transaction elapsed time during index update (position data)

Fig A4-13. BASE [DocId]: Parallel efficiency for index reorganisation (postings only)

Fig A4-14. BASE [DocId]: Parallel efficiency for index reorganisation (position data)

Fig A4-15. BASE [TermId]: Parallel efficiency for index reorganisation (postings only)

Fig A4-16. BASE [TermId]: Parallel efficiency for index reorganisation (position data)

Fig A4-17. BASE [DocId]: Accumulated total time for index reorganisation (postings only)

Fig A4-18. BASE [DocId]: Accumulated total time for index reorganisation (position data)

Fig A4-19. BASE [TermId]: Accumulated total time for index reorganisation (postings only)

Fig A4-20. BASE [TermId]: Accumulated total time for index reorganisation (position data)

Table A4-1. BASE/BASE [DocId]: scalability on index reorganisation
[Rows: total time (secs) and scalability on total time, for position data and for postings only (no positions), per collection and document batch size]

Appendix A5 - Synthetic models chapter appendix

Table A5-1. Load imbalance estimates for the LI[P] variable
[Columns: P, LI[P]]

A5.1 SEQUENTIAL MODEL FOR INDEXING

A5.1.1 Analyse Documents

Strip words from documents and insert each word into a block: d * nlog(n)

Total: dnlog(n) * T_cpu

A5.1.2 Save Intermediate Results

Number of intermediate saves:     (dn/BSIZE)
Cost per intermediate save:       (BSIZE * T_i/o)

Total: (dn/BSIZE) * (BSIZE * T_i/o)

A5.1.3 Merge Phase

Load blocks:      (dn/BSIZE) * (BSIZE * T_i/o)
Write blocks:     (dn/BSIZE) * (BSIZE * T_i/o)
Merge blocks:     (dn/BSIZE) * (BSIZE * T_cpu)

Total: 2((dn/BSIZE) * (BSIZE * T_i/o)) + (dn/BSIZE) * (BSIZE * T_cpu)
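A minimal Python sketch of how the three component costs above might be evaluated for a hypothetical workload; the parameter values (d, n, BSIZE) and the timing constants T_cpu and T_i/o are illustrative placeholders only, not figures from the experiments.

```python
import math

# Illustrative placeholder constants (not measured values from the thesis).
T_CPU = 1e-7   # cost of one CPU operation, seconds
T_IO = 1e-5    # cost of one I/O operation, seconds

def analyse_documents(d, n):
    """A5.1.1: strip words from d documents of n words and insert into a block."""
    return d * n * math.log(n) * T_CPU

def save_intermediate(d, n, bsize):
    """A5.1.2: (dn/BSIZE) intermediate saves, each costing BSIZE * T_i/o."""
    return (d * n / bsize) * (bsize * T_IO)

def merge_phase(d, n, bsize):
    """A5.1.3: load, write and merge blocks."""
    io = 2 * (d * n / bsize) * (bsize * T_IO)    # load + write blocks
    cpu = (d * n / bsize) * (bsize * T_CPU)      # merge blocks
    return io + cpu

# Example: 100,000 documents of 500 words, block size 10,000 postings.
d, n, bsize = 100_000, 500, 10_000
print(analyse_documents(d, n), save_intermediate(d, n, bsize), merge_phase(d, n, bsize))
```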

A5.1.4 Sequential Indexing Model

Combining the equations declared in sections A5.1.1 to A5.1.3 gives us the following sequential synthetic indexing model;

INDEX_seq(d,n,BSIZE) = dnlog(n) * T_cpu + 3((dn/BSIZE) * (BSIZE * T_i/o)) + (dn/BSIZE) * (BSIZE * T_cpu)

A5.2 PARALLEL MODELS FOR INDEXING

A5.2.1 Distributing Documents to Nodes

dn/f * T_comm

A5.2.2 Global Merge Phase

(dn/BSIZE) * (P * T_comm)

A5.2.3 DocId Indexing Models

Using the function defined in section A5.1.4 and the equation defined in section A5.2.1, we can define synthetic models for DocId indexing. With distributed build (INDEX_Distr_DocId) we also add the distribution component for text data (the equation from section A5.2.1);

INDEX_Local_DocId(d,n,P,BSIZE) = (INDEX_seq(d,n,BSIZE)/P) * LI[P]

INDEX_Distr_DocId(d,n,f,P,BSIZE) = ((INDEX_seq(d,n,BSIZE)/P) * LI[P]) + (dn/f * T_comm)
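A small Python sketch of the DocId build models above, showing how the load-imbalance factor LI[P] and the text-distribution term interact with the sequential cost; the constants (T_cpu, T_i/o, T_comm), the LI values and the interpretation of f are hypothetical stand-ins, not the thesis parameterisation.

```python
import math

# Hypothetical constants; real values would come from benchmarking the target machine.
T_CPU, T_IO, T_COMM = 1e-7, 1e-5, 1e-6
LI = {1: 1.0, 2: 1.05, 4: 1.10, 8: 1.15}   # assumed load-imbalance estimates LI[P]

def index_seq(d, n, bsize):
    """A5.1.4: sequential synthetic indexing model."""
    return (d * n * math.log(n) * T_CPU
            + 3 * (d * n / bsize) * (bsize * T_IO)
            + (d * n / bsize) * (bsize * T_CPU))

def index_local_docid(d, n, p, bsize):
    """A5.2.3: local build, DocId partitioning."""
    return (index_seq(d, n, bsize) / p) * LI[p]

def index_distr_docid(d, n, f, p, bsize):
    """A5.2.3: distributed build adds the text distribution cost dn/f * T_comm."""
    return index_local_docid(d, n, p, bsize) + (d * n / f) * T_COMM

d, n, bsize, f = 100_000, 500, 10_000, 1000   # f: words per communication unit (assumed)
for p in (1, 2, 4, 8):
    print(p, index_local_docid(d, n, p, bsize), index_distr_docid(d, n, f, p, bsize))
```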

A5.2.4 TermId Indexing Model

The distributed build TermId model (INDEX_Distr_TermId) must redefine one aspect of the sequential indexing model defined in section A5.1.4. The merge component defined in section A5.1.3 is doubled for the TermId model, and the extra communication costs from section A5.2.2 are added. The revised index computation component is divided by the number of processors and multiplied by the load imbalance estimate;

INDEX_Distr_TermId(d,n,f,P,BSIZE) = dn/f * T_comm + (dn/BSIZE) * (P * T_comm) + ((dnlog(n) * T_cpu + 6((dn/BSIZE) * (BSIZE * T_i/o)) + 2(dn/BSIZE) * (BSIZE * T_cpu)) * LI[P]) / P

A5.3 SEQUENTIAL MODEL FOR PROBABILISTIC SEARCH

The sequential model for probabilistic search is made up of the following:

Load q keyword sets:       Load_kw_seq(q,s) = q * T_i/o[s]
Weight q keyword sets:     Weight_kw_seq(q,s) = s*q * T_cpu
Merge q-1 keyword sets:    Merge_kw_seq(q,s) = (q-1)*(s+s) * T_cpu
Sort final results set:    Sort_set_seq(q,s) = R[q,s]log(R[q,s]) * T_cpu

Put together, these functions make up the synthetic search model for sequential probabilistic search;

SEARCH_seq(s,q) = Load_kw_seq(q,s) + Weight_kw_seq(q,s) + Merge_kw_seq(q,s) + Sort_set_seq(q,s)

A5.4 PARALLEL MODELS FOR PROBABILISTIC SEARCH

A5.4.1 DocId Partitioning

The parallel model using DocId partitioning for probabilistic search is made up of the following:

Communications costs for DocId:   Comms_Search_docid(P) = 3(P * T_comm)
  Send P requests for term frequency:      P * T_comm
  Send P queries (with term frequency):    P * T_comm
  Gather results from P for set size s:    P * T_comm

Load q keyword sets:       Load_kw_docid(q,s,P) = q * T_i/o[s/P]
Weight q keyword sets:     Weight_kw_par(q,s,P) = (Weight_kw_seq(q,s)/P) * LI[P]
Merge q-1 keyword sets:    Merge_kw_par(q,s,P) = (Merge_kw_seq(q,s)/P) * LI[P]
Sort final results set:    Sort_set_par(q,s,P) = ((R[q,s]/P)log(R[q,s]/P) * T_cpu) * LI[P]

The DocId partitioning synthetic search model is therefore;

SEARCH_docid(s,q,P) = Comms_Search_docid(P) + Load_kw_docid(q,s,P) + Weight_kw_par(q,s,P) + Merge_kw_par(q,s,P) + Sort_set_par(q,s,P)

A5.4.2 TermId Partitioning 1 - Sequential Sort

The parallel model using TermId partitioning for probabilistic search with a sequential sort is made up of the following:

Communications costs for TermId:   Comms_Search_termid(s,q,P,SSIZE) = ((R[s,q]/SSIZE) * P * T_comm) + (P * T_comm)
  Send P queries (with term frequency):    P * T_comm
  Gather results from P for set size s:    (R[s,q]/SSIZE) * P * T_comm

Load q keyword sets:       Load_kw_termid(q,s,P) = q/P * T_i/o[s/P[q]]
Weight q keyword sets:     Weight_kw_par(q,s,P[q])
Merge q-1 keyword sets:    Merge_kw_par(q,s,P[q])
Sort final results set:    Sort_set_seq(q,s)

The TermId partitioning synthetic search model with sequential sort is therefore;

SEARCH_termid(s,q,P,SSIZE) = Comms_Search_termid(s,q,P,SSIZE) + Load_kw_termid(q,s,P) + Weight_kw_par(q,s,P[q]) + Merge_kw_par(q,s,P[q]) + Sort_set_seq(q,s)
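A brief Python sketch contrasting the sequential search model (A5.3) with the DocId-partitioned model (A5.4.1); the timing constants, the T_i/o[s] approximation, the R[q,s] result-set estimate and the LI[P] values are assumed placeholders rather than measured figures.

```python
import math

# Assumed placeholder constants and helper estimates (not the thesis values).
T_CPU, T_COMM = 1e-7, 1e-6
LI = {1: 1.0, 2: 1.05, 4: 1.10, 8: 1.15}

def t_io(set_size):
    """Assumed I/O cost for reading a posting set of the given size: T_i/o[s]."""
    return 1e-4 + 1e-7 * set_size

def result_set(q, s):
    """Assumed estimate of the merged result set size R[q,s]."""
    return min(q * s, 1_000_000)

def search_seq(s, q):
    """A5.3: load, weight and merge q keyword sets, then sort the result set."""
    load = q * t_io(s)
    weight = s * q * T_CPU
    merge = (q - 1) * (s + s) * T_CPU
    r = result_set(q, s)
    sort = r * math.log(r) * T_CPU
    return load + weight + merge + sort

def search_docid(s, q, p):
    """A5.4.1: each node holds s/P postings per term; comms cost is 3 * P * T_comm."""
    comms = 3 * p * T_COMM
    load = q * t_io(s / p)
    weight = (s * q * T_CPU / p) * LI[p]
    merge = ((q - 1) * (s + s) * T_CPU / p) * LI[p]
    r = result_set(q, s)
    sort = ((r / p) * math.log(r / p) * T_CPU) * LI[p]
    return comms + load + weight + merge + sort

s, q = 200_000, 10    # postings per query term, terms per query (illustrative)
for p in (1, 2, 4, 8):
    print(p, round(search_seq(s, q), 4), round(search_docid(s, q, p), 4))
```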

A5.4.3 TermId Partitioning 2 - Parallel Sort

The parallel model using TermId partitioning for probabilistic search with a parallel sort is made up of the following:

Communications costs for TermId2:   Comms_Search_termid2(s,q,P,SSIZE) = 3((R[s,q]/SSIZE) * P * T_comm) + (P * T_comm)
  Send P queries (with term frequency):    P * T_comm
  Gather results from P for set size s:    3(R[s,q]/SSIZE) * P * T_comm

Load q keyword sets:       Load_kw_termid(q,s,P)
Weight q keyword sets:     Weight_kw_par(q,s,P[q])
Merge q-1 keyword sets:    Merge_kw_par(q,s,P[q])
Sort final results set:    Sort_set_par(q,s,P)

The TermId partitioning synthetic search model with parallel sort is therefore;

SEARCH_termid2(s,q,P,SSIZE) = Comms_Search_termid2(s,q,P,SSIZE) + Load_kw_termid(q,s,P) + Weight_kw_par(q,s,P[q]) + Merge_kw_par(q,s,P[q]) + Sort_set_par(q,s,P)

A5.5 SEQUENTIAL MODEL FOR PASSAGE RETRIEVAL

Service q terms on PR documents, each with (a(a-1))/2 inspected passages:

Compute_Pass(PR,q,a) = T_cpu * PR * q * ((a(a-1))/2)

A sort on the top PR documents is required to re-rank the final results set, at a cost of T_cpu * PRlog(PR).

PASSAGE_seq(s,q,a,PR) = SEARCH_seq(s,q) + Compute_Pass(PR,q,a) + T_cpu * PRlog(PR)

A5.6 PARALLEL MODELS FOR PASSAGE RETRIEVAL

A5.6.1 DocId Models

The DocId method simply applies P processors to the Compute_Pass computation defined in section A5.5:

Compute_Pass_par(PR,q,a,P) = (Compute_Pass(PR,q,a)/P) * LI[P]
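A compact Python sketch of the passage-retrieval costs just defined (Compute_Pass, PASSAGE_seq and Compute_Pass_par); the SEARCH_seq value, the constants and the LI[P] entries are illustrative assumptions.

```python
import math

# Illustrative assumptions only.
T_CPU = 1e-7
LI = {1: 1.0, 2: 1.05, 4: 1.10, 8: 1.15}

def compute_pass(pr, q, a):
    """A5.5: service q terms on PR documents, each with a(a-1)/2 inspected passages."""
    return T_CPU * pr * q * (a * (a - 1) / 2)

def passage_seq(search_seq_cost, q, a, pr):
    """A5.5: sequential passage retrieval = search + passage computation + re-rank sort."""
    return search_seq_cost + compute_pass(pr, q, a) + T_CPU * pr * math.log(pr)

def compute_pass_par(pr, q, a, p):
    """A5.6.1: apply P processors to Compute_Pass, scaled by load imbalance LI[P]."""
    return (compute_pass(pr, q, a) / p) * LI[p]

# Example: top 1000 documents, 10 query terms, 20 atomic passage units per document.
pr, q, a = 1000, 10, 20
search_seq_cost = 0.05          # assumed SEARCH_seq(s,q) value in seconds
print(passage_seq(search_seq_cost, q, a, pr))
for p in (2, 4, 8):
    print(p, compute_pass_par(pr, q, a, p))
```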

The local passage processing cost model is constructed by simply adding the probabilistic DocId cost model from section A5.4.1 to the Compute_Pass_par model;

PASSAGE_docid_local(s,q,a,PR,P) = SEARCH_docid(s,q,P) + Compute_Pass_par(PR,q,a,P)

The distributed passage retrieval method must also gather up data from nodes in order to choose the best PR documents in the collection. This requires four stages:

i) Gather the data from an initial probabilistic search (the top PR documents)
ii) Scatter this full set to the processors
iii) Gather up the full set from all the processors
iv) Do a final rank on the top PR documents

The estimate for this overhead is therefore:

i) Gather data (PR/SSIZE)/P * P: PR/SSIZE (P eliminated)
ii) Scatter PR elements to P processors: T_comm * (PR/SSIZE) * P
iii) Gather PR elements from P processors: T_comm * (PR/SSIZE) * P
iv) Sort PR elements to obtain final rank: T_cpu * PRlog(PR)

The model for overheads on distributed passage processing is therefore;

OVERHEAD_pass(PR,P,SSIZE) = T_comm * ((2(PR/SSIZE) * P) + PR/SSIZE) + T_cpu * PRlog(PR)

The DocId distributed passage processing cost model is constructed by adding the probabilistic DocId cost model from section A5.4.1 to the Compute_Pass_par model, together with the OVERHEAD_pass cost model;

PASSAGE_docid_distr(s,q,a,PR,SSIZE,P) = SEARCH_docid(s,q,P) + Compute_Pass_par(PR,q,a,P) + OVERHEAD_pass(PR,P,SSIZE)

A5.6.2 TermId Models

In TermId we must communicate the data for (a(a-1))/2 passages for PR documents on P processors:

OVERHEAD_passtid(a,PR,P) = T_comm * (PR * P * ((a(a-1))/2))

The TermId distributed passage processing cost models are constructed by adding the probabilistic TermId cost models from sections A5.4.2 and A5.4.3 to the Compute_Pass_par model, together with the OVERHEAD_pass and OVERHEAD_passtid cost models;

PASSAGE_termid(s,q,a,PR,SSIZE,P) = SEARCH_termid(s,q,P,SSIZE) + Compute_Pass_par(PR,q,a,P[q]) + OVERHEAD_pass(PR,P,SSIZE) + OVERHEAD_passtid(a,PR,P)

PASSAGE_termid2(s,q,a,PR,SSIZE,P) = SEARCH_termid2(s,q,P,SSIZE) + Compute_Pass_par(PR,q,a,P[q]) + OVERHEAD_pass(PR,P,SSIZE) + OVERHEAD_passtid(a,PR,P)

A5.7 SEQUENTIAL MODELS FOR TERM SELECTION

A5.7.1 Evaluation

The cost of evaluation is broken down into the following;

Merge set for term with accumulated set:          T_cpu * (s+s)
Merge relevance judgements with temporary set:    T_cpu * (s+r)
Rank the temporary set using a sort:              T_cpu * (R[q,s]log(R[q,s]))

Put together, these equations form the model for the cost of a single evaluation;

EVAL(q,s,r) = T_cpu * ((s+s) + R[q,s]log(R[q,s]) + (s+r))

A5.7.2 Total number of evaluations

Maximum number of evaluations for the find best algorithm:    q*i
Keywords not inspected in i iterations:                       i*(i+1)*0.5

After each iteration one less term is inspected. This formula accumulates the total number of keywords not inspected in i iterations, as one term is always chosen. Put together with an estimate u of the total number of terms skipped, the function for inspected terms is:

INSPECTED(q,i) = (qi - (i(i+1)*0.5) - u(qi - (i(i+1)*0.5)))

A5.7.3 Load costs for keywords

The cost of loading term data is as follows;

Load q terms from disk, each with set size s:    q * T_i/o[s]
Weight q terms, each with set size s:            q * s * T_cpu

Putting these equations together yields the following load cost;

LOAD(q,s) = q(T_i/o[s] + s*T_cpu)
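A small Python rendering of the three building blocks above (EVAL, INSPECTED and LOAD), which the next section combines into the sequential term-selection models; the R[q,s] estimate, the skip estimate u and the timing constants are assumed placeholder values.

```python
import math

# Assumed placeholders, not thesis measurements.
T_CPU = 1e-7

def t_io(s):
    """Assumed I/O cost T_i/o[s] for a posting set of size s."""
    return 1e-4 + 1e-7 * s

def result_set(q, s):
    """Assumed result set size estimate R[q,s]."""
    return min(q * s, 1_000_000)

def eval_cost(q, s, r):
    """A5.7.1: merge term set, merge relevance judgements, rank temporary set."""
    rqs = result_set(q, s)
    return T_CPU * ((s + s) + rqs * math.log(rqs) + (s + r))

def inspected(q, i, u=0.1):
    """A5.7.2: terms inspected over i iterations, with skip estimate u (assumed)."""
    not_inspected = i * (i + 1) * 0.5
    return (q * i - not_inspected) - u * (q * i - not_inspected)

def load(q, s):
    """A5.7.3: load and weight q terms of set size s."""
    return q * (t_io(s) + s * T_CPU)

# A5.7.4 combines these as ROUTING_seq = INSPECTED * EVAL + LOAD:
q, s, r, i = 50, 100_000, 2_000, 20
print(inspected(q, i) * eval_cost(q, s, r) + load(q, s))
```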

A5.7.4 Sequential Models for Term Selection

Using the models defined in sections A5.7.1 to A5.7.3 we can now define the sequential cost models for term selection. For add-only operation (ROUTING_seq) this is a simple process of multiplying the evaluation cost (see section A5.7.1) by the number of terms inspected (see section A5.7.2), with the addition of load costs (see section A5.7.3). The model for add-reweight (ROUTING_seqw) is constructed by factoring the total evaluation cost by the reweight variable w.

ROUTING_seq(s,r,i,q) = (INSPECTED(q,i) * EVAL(q,s,r)) + LOAD(q,s)

ROUTING_seqw(s,r,i,q,w) = (INSPECTED(q,i) * EVAL(q,s,r) * w) + LOAD(q,s)

A5.8 PARALLEL MODELS FOR TERM SELECTION

The basic term selection models with no synchronisation or communication costs are as follows;

ROUTING_par(s,r,i,q,P) = (ROUTING_seq(s,r,i,q) * LI[P]) / P

ROUTING_parw(s,r,i,q,P,w) = (ROUTING_seqw(s,r,i,q,w) * LI[P]) / P

A5.8.1 DocId Models

The cost model for intra-set parallelism is;

Merge set costs:         Merge_Route_docid(s,r,P) = (T_cpu(s + r + s)/P) * LI[P]
Sort costs:              Sort_Route_docid(s,P) = (T_cpu(s/P)log(s/P)) * LI[P]
Communication costs:     Comms_Route_docid(s,P,SSIZE) = (((s/SSIZE)/P) + 2P) * T_comm

Putting these functions together gives us the evaluation cost model for DocId term selection:

EVAL_docid(s,r,i,q,P,SSIZE) = INSPECTED(q,i) * (Merge_Route_docid(s,r,P) + Sort_Route_docid(s,P) + Comms_Route_docid(s,P,SSIZE))

We also measure overheads at the synchronisation point for merging the chosen term into the accumulated set and communicating the best term identifier in one iteration;

Communication costs for best term:             P * T_comm
Merge best term set into accumulated set:      ((s*T_cpu)/P) * LI[P]

We assume latency is the dominant factor in communication costs. Putting these equations together gives us the estimate of overheads for the DocId term selection cost model.

OVERHEAD_docid(s,i,P) = i*((((s*T_cpu)/P) * LI[P]) + (P*T_comm))

The models for term selection are constructed by taking the load cost model (defined in section A5.7.3) and adding the evaluation and overhead cost models defined above in this section. The load cost model is further refined by dividing by the number of processors and factoring the result by the load imbalance estimate (LI[P]).

ROUTING_docid(s,r,i,q,P,SSIZE) = ((LOAD(q,s) * LI[P]) / P) + EVAL_docid(s,r,i,q,P,SSIZE) + OVERHEAD_docid(s,i,P)

ROUTING_docidw(s,r,i,q,P,SSIZE,w) = ((LOAD(q,s) * LI[P]) / P) + (w * EVAL_docid(s,r,i,q,P,SSIZE)) + OVERHEAD_docid(s,i,P)

A5.8.2 TermId Models

The interaction at the synchronisation point is more complicated than for the DocId models. This is because the data for the best term must be retrieved from the relevant node and merged into the accumulated set, which is then broadcast to all nodes. Overheads for TermId models are calculated as follows:

Get the identifier of best term in one iteration:       P * T_comm
Request for best term set:                              T_comm
Retrieving best set from relevant node:                 s/SSIZE * T_comm
Broadcast best set to all other nodes:                  (P-1) * s/SSIZE * T_comm
Merge the best term data into the accumulated set:      s*T_cpu + T_i/o[s]

Put together, these equations form the cost model for routing overheads on the TermId partitioning scheme;

OVERHEAD_termid(s,i,P,SSIZE) = i*((T_comm * ((P+1)*(P*s/SSIZE))) + (s*T_cpu + T_i/o[s]))

Construction of the routing models can be done by re-using the basic term selection models and adding the OVERHEAD_termid cost;

ROUTING_termid(s,r,i,q,P,SSIZE) = (ROUTING_par(s,r,i,q,P) * LI[P]) + OVERHEAD_termid(s,i,P,SSIZE)

ROUTING_termidw(s,r,i,q,P,SSIZE,w) = (ROUTING_parw(s,r,i,q,P,w) * LI[P]) + OVERHEAD_termid(s,i,P,SSIZE)

The extra LI[P] here assumes that ROUTING_termid imbalance will probably be worse than ROUTING_rep in particular, or than other models in general. This is because terms are statically allocated to a node (see the sub-section of chapter 4 that discusses term allocation schemes).

A5.8.3 Replication Models

Latency is presumed to be the main communication problem for the replication distribution scheme. The overheads for replication cost models are calculated as follows:

Get the identifier of best term from P processors:    P * T_comm
Send the identifier of best term to P processors:     P * T_comm
Merge the best term data into the accumulated set:    (s*T_cpu) + T_i/o[s]

Putting these equations together gives us the following cost model;

OVERHEAD_rep(s,i,P) = i*(s*T_cpu + T_i/o[s] + (2P*T_comm))

The cost models for the replication distribution scheme can be constructed by simply adding the overheads to the basic term selection cost models.

ROUTING_rep(s,r,i,q,P) = ROUTING_par(s,r,i,q,P) + OVERHEAD_rep(s,i,P)

ROUTING_repw(s,r,i,q,P,w) = ROUTING_parw(s,r,i,q,P,w) + OVERHEAD_rep(s,i,P)

A5.8.4 On-the-fly Distribution Models

The overhead cost model for the on-the-fly distribution scheme is;

OVERHEAD_load(q,s,SSIZE,P) = ((qs/SSIZE) + P) * T_comm

There are also overhead costs at the synchronisation point for transferring set data, formed as follows;

Get the identifier of best term in one iteration:     P * T_comm
Broadcast best set to all nodes:                      (P * s/SSIZE) * T_comm
Merge the best term data into the accumulated set:    (s*T_cpu) + T_i/o[s]

Putting these equations together, we have the overhead at the synchronisation point;

OVERHEAD_large(s,i,SSIZE,P) = i*((((P*(s/SSIZE)) + P) * T_comm) + (s*T_cpu + T_i/o[s]))

We cannot use the basic parallel term selection cost models, as some aspects of them (such as load) must be done sequentially.

We apply a parallel cost model to the total evaluation cost, together with the load cost model defined in section A5.7.3 and the overhead cost models defined above in this section.

ROUTING_parfly(s,r,i,q,SSIZE,P) = LOAD(q,s) + OVERHEAD_load(q,s,SSIZE,P) + OVERHEAD_large(s,i,SSIZE,P) + ((INSPECTED(q,i) * EVAL(q,s,r) * LI[P]) / P)

ROUTING_parflyw(s,r,i,q,SSIZE,P,w) = LOAD(q,s) + OVERHEAD_load(q,s,SSIZE,P) + OVERHEAD_large(s,i,SSIZE,P) + ((INSPECTED(q,i) * EVAL(q,s,r) * w * LI[P]) / P)

A5.9 SEQUENTIAL MODELS FOR INDEX UPDATE

A5.9.1 Adding a Document to the Buffer: Update Transaction

The client/server update model is formed by the following steps;

Scan words and put in client tree:        nlog(n) * T_cpu
Marshalling/unmarshalling term data:      (n + n) * T_cpu
Sending data:                             (n/SSIZE) * T_comm
Merge word data with server buffer:       nlog(dict) * T_cpu

Putting these equations together gives us a cost model for update on a single inverted file.

UPDATE_seq(n,dict,SSIZE) = T_cpu(nlog(n) + 2n + nlog(dict)) + T_comm * (n/SSIZE)

A5.9.2 Transaction while the index is updated

The cost model for a transaction under contention is calculated by adding the contention factor c to the particular function being examined. The search cost model is taken from section A5.3 and the update cost model from the previous section.

UPDATE_seqc(n,dict,SSIZE) = (UPDATE_seq(n,dict,SSIZE) * c[1]) + UPDATE_seq(n,dict,SSIZE)

SEARCH_seqc(s,q) = (SEARCH_seq(s,q) * c[1]) + SEARCH_seq(s,q)
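A minimal Python sketch of the single-file update cost and the contention wrapper from A5.9.1 and A5.9.2; the constants, the dictionary size and the contention factors c[P] are assumed values chosen only for illustration.

```python
import math

# Assumed illustrative constants.
T_CPU, T_COMM = 1e-7, 1e-6
C = {1: 0.5, 2: 0.6, 4: 0.7}   # assumed contention factors c[P]

def update_seq(n, dict_size, ssize):
    """A5.9.1: scan words, marshal, send, and merge into the server buffer."""
    cpu = T_CPU * (n * math.log(n) + 2 * n + n * math.log(dict_size))
    comm = T_COMM * (n / ssize)
    return cpu + comm

def with_contention(base_cost, p=1):
    """A5.9.2: a transaction under contention costs (base * c[P]) + base."""
    return base_cost * C[p] + base_cost

n, dict_size, ssize = 500, 1_000_000, 100   # words per document, dictionary terms, set chunk size
base = update_seq(n, dict_size, ssize)
print(base, with_contention(base))
```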

A5.9.3 Transaction estimate

Taking the cost models defined in sections A5.9.1 and A5.9.2 we can construct a cost model for transactions in which the percentage of total transaction time spent updating the index can be specified. This allows us to vary the effect on transactions and study the theoretical performance penalty on transactions while doing a simultaneous index update. In the TRANSACTION_seq model, setting ro to zero eliminates contention, while setting ro to 1 means that all transactions are affected by contention.

TRANSACTION_seq(ur,sr,ro,n,dict,s,q) = ((1-ro)(ur*UPDATE_seq(n,dict,SSIZE) + sr*SEARCH_seq(s,q)) + ro(ur*UPDATE_seqc(n,dict,SSIZE) + sr*SEARCH_seqc(s,q))) / (ur + sr)

A5.9.4 Reorganisation of Inverted File

The reorganisation model is made up of the following synthetic cost functions;

Insert m buffer words in dict:          Insert_Words_buff(m,b,dict) = T_cpu(m(log(dict/b) + b))
Read t+m keyword lists from disk:       List_Disk_Trans(t,m) = T_i/o[s] * (t+m)
Write t+m keyword lists to disk:        List_Disk_Trans(t,m)
Merge m keyword lists:                  Merge_kw_lists(m,s) = T_cpu(m(s+s))
Read in (dict/b) keyword blocks:        Read_kw_blocks(dict,b) = T_i/o[b] * (dict/b)

The reorganisation or index update cost model is constructed from the four cost models defined above (List_Disk_Trans is used twice);

REORG_seq(n,dict,b,m,t,s) = Insert_Words_buff(m,b,dict) + 2List_Disk_Trans(t,m) + Merge_kw_lists(m,s) + Read_kw_blocks(dict,b)

The contention model for reorganising the index is;

REORG_seqc(n,dict,b,m,t,s) = (REORG_seq(n,dict,b,m,t,s) * c[1]) + REORG_seq(n,dict,b,m,t,s)

A5.10 PARALLEL MODELS FOR INDEX UPDATE

A5.10.1 DocId Transaction Model

In this data distribution method we simply re-use the sequential cost model defined in section A5.9.1 above;

UPDATE_docid(n,dict,SSIZE) = UPDATE_seq(n,dict,SSIZE)

The contention model also re-uses the sequential model;

UPDATE_docidc(n,dict,SSIZE,P) = (UPDATE_seq(n,dict,SSIZE) * c[P]) + UPDATE_seq(n,dict,SSIZE)

In order to construct the contention model for DocId search we re-use the function defined in section A5.4.1 above;

SEARCH_docidc(s,q,P) = (SEARCH_docid(s,q,P) * c[P]) + SEARCH_docid(s,q,P)

The transaction model for DocId partitioning is constructed in exactly the same way as the sequential version described in section A5.9.3 above.

TRANSACTION_docid(ur,sr,ro,n,dict,s,q,P) = ((1-ro)(ur*UPDATE_docid(n,dict,P) + sr*SEARCH_docid(s,q,P)) + ro(ur*UPDATE_docidc(n,dict,SSIZE,P) + sr*SEARCH_docidc(s,q,P))) / (ur + sr)

A5.10.2 TermId Transaction Model

With the TermId distribution method a new cost model must be defined, as merging the data with the buffer is parallelised.

UPDATE_termid(n,dict,P,SSIZE) = T_cpu(nlog(n) + ((nlog(dict) * LI[P]) / P) + 2n) + (P*T_comm * (n/SSIZE))

The contention model re-uses the model defined above;

UPDATE_termidc(n,dict,P,SSIZE) = (UPDATE_termid(n,dict,P,SSIZE) * c[P]) + UPDATE_termid(n,dict,P,SSIZE)

In order to construct the contention model for TermId search we re-use the function defined in section A5.4.3 above (we utilise the parallel sort cost model);

SEARCH_termidc(s,q,P,SSIZE) = (SEARCH_termid2(s,q,P,SSIZE) * c[P]) + SEARCH_termid2(s,q,P,SSIZE)

The transaction model for TermId partitioning is constructed in exactly the same way as the sequential version described in section A5.9.3 above.

TRANSACTION_termid(ur,sr,ro,n,dict,s,q,P,SSIZE) = ((1-ro)(ur*UPDATE_termid(n,dict,P) + sr*SEARCH_termid2(s,q,P,SSIZE)) + ro(ur*UPDATE_termidc(n,dict,P,SSIZE) + sr*SEARCH_termidc(s,q,P,SSIZE))) / (ur + sr)

A5.10.3 DocId Reorganisation Model

The DocId index update cost model is;

Insert m buffer words in dict:       Insert_Words_buff_docid(m,b,dict,P) = T_cpu(m*i[P]*(log(((dict/b)/P)*i[P]) + b)) * LI[P]
Read t+m keyword lists from disk:    List_Disk_Trans_docid(t,m,s,P) = T_i/o[p[P]*(s/P)] * (t+m) * i[P] * LI[P]
Write t+m keyword lists to disk:     List_Disk_Trans_docid(t,m,s,P)
Merge m keyword lists:               Merge_kw_lists_docid(m,s,P) = T_cpu(m*i[P]*(s/P + s/P)) * LI[P]
Read in (dict/b) keyword blocks:     Read_kw_blocks_docid(dict,b,P) = T_i/o[b] * ((dict/b)/P) * i[P] * LI[P]

The index update model for DocId partitioning is constructed as follows;

REORG_docid(n,dict,b,m,t,s,P) = Insert_Words_buff_docid(m,b,dict,P) + 2List_Disk_Trans_docid(t,m,s,P) + Merge_kw_lists_docid(m,s,P) + Read_kw_blocks_docid(dict,b,P)

This function is re-used in the construction of the cost model with contention as follows;

REORG_docidc(n,dict,b,m,t,s,P) = (REORG_docid(n,dict,b,m,t,s,P) * c[P]) + REORG_docid(n,dict,b,m,t,s,P)

A5.10.4 TermId Reorganisation Model

The TermId index update cost model is;

Insert m buffer words in dict:       Insert_Words_buff_termid(m,b,dict,P) = (T_cpu(m(log(dict/b) + b))/P) * LI[P]
Read t+m keyword lists from disk:    List_Disk_Trans_termid(t,m,s,P) = ((T_i/o[s] * (t+m))/P) * LI[P]
Write t+m keyword lists to disk:     List_Disk_Trans_termid(t,m,s,P)
Merge m keyword lists:               Merge_kw_lists_termid(m,s,P) = (T_cpu(m(s+s))/P) * LI[P]
Read in (dict/b) keyword blocks:     Read_kw_blocks_termid(dict,b,P) = ((T_i/o[b] * (dict/b))/P) * LI[P]

The index update model for TermId partitioning is constructed as follows;

REORG_termid(n,dict,b,m,t,s,P) = Insert_Words_buff_termid(m,b,dict,P) + 2List_Disk_Trans_termid(t,m,s,P) + Merge_kw_lists_termid(m,s,P) + Read_kw_blocks_termid(dict,b,P)

This function is re-used in the construction of the cost model with contention as follows;

REORG_termidc(n,dict,b,m,t,s,P) = (REORG_termid(n,dict,b,m,t,s,P) * c[P]) + REORG_termid(n,dict,b,m,t,s,P)
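To close, a short Python sketch of the TermId reorganisation model and its contention variant as defined above; the component cost functions follow the formulas in A5.10.4, while the constants, the T_i/o approximation and the LI[P] and c[P] values are assumptions for illustration only.

```python
import math

# Assumed illustrative constants (not thesis measurements).
T_CPU = 1e-7
LI = {2: 1.05, 4: 1.10, 8: 1.15}
C = {2: 0.6, 4: 0.7, 8: 0.8}

def t_io(size):
    """Assumed I/O cost T_i/o[x] for reading or writing x postings or a block of x entries."""
    return 1e-4 + 1e-7 * size

def reorg_termid(dict_size, b, m, t, s, p):
    """A5.10.4: TermId index reorganisation, each component divided by P and scaled by LI[P]."""
    insert = (T_CPU * m * (math.log(dict_size / b) + b) / p) * LI[p]
    disk = ((t_io(s) * (t + m)) / p) * LI[p]          # used twice: read and write keyword lists
    merge = (T_CPU * m * (s + s) / p) * LI[p]
    read_blocks = ((t_io(b) * (dict_size / b)) / p) * LI[p]
    return insert + 2 * disk + merge + read_blocks

def reorg_termidc(dict_size, b, m, t, s, p):
    """Contention variant: (REORG_termid * c[P]) + REORG_termid."""
    base = reorg_termid(dict_size, b, m, t, s, p)
    return base * C[p] + base

# Example: 1M-term dictionary, 100-entry blocks, 50k buffered terms, 500k existing terms,
# average set size 2000 postings (all assumed).
dict_size, b, m, t, s = 1_000_000, 100, 50_000, 500_000, 2_000
for p in (2, 4, 8):
    print(p, reorg_termid(dict_size, b, m, t, s, p), reorg_termidc(dict_size, b, m, t, s, p))
```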
