Appendix A1: Fig A1-1. Examples of build methods for distributed inverted files
APPENDICES

Appendix A1

Fig A1-1. Examples of build methods for distributed inverted files

Key:
- Distributed build
- Local build
- A node in the parallel machine
- Presence of text files on a node
- Inverted file partition on a node
- Network connection between nodes
Appendix A2: Extra Probabilistic Search Results

Fig A2-1. BASE [TermId]: search average elapsed time in seconds (sequential sort: CF distribution)

Leaf nodes   Title Only         Whole Topic
             NPOS    POS        NPOS    POS
2            63%     41%        55%     34%
3            69%     48%        63%     44%
4            73%     53%        69%     54%
5            75%     55%        72%     57%
6            78%     58%        75%     61%
7            81%     61%        73%     62%

Table A2-1. BASE [TermId]: search overheads in % of total time (sequential sort: CF distribution)

Fig A2-2. BASE [TermId]: search speedup (sequential sort: CF distribution)

Fig A2-3. BASE [TermId]: search parallel efficiency (sequential sort: CF distribution)

Fig A2-4. BASE [TermId]: search load imbalance (sequential sort: CF distribution)

Fig A2-5. BASE [TermId]: search average elapsed time in seconds (sequential sort: TF distribution)
Fig A2-6. BASE [TermId]: search speedup (sequential sort: TF distribution)

Fig A2-7. BASE [TermId]: search parallel efficiency (sequential sort: TF distribution)

Fig A2-8. BASE [TermId]: search load imbalance (sequential sort: TF distribution)

Leaf nodes   Title Only         Whole Topic
             NPOS    POS        NPOS    POS
2            64%     41%        53%     34%
3            68%     47%        67%     44%
4            72%     52%        74%     55%
5            71%     56%        73%     54%
6            77%     57%        76%     58%
7            79%     59%        81%     64%

Table A2-2. BASE [TermId]: search overheads in % of total time (sequential sort: TF distribution)

Fig A2-9. BASE [TermId]: search throughput in queries/hour for the NPOS and POS runs (CF and TF distributions, title-only and whole-topic queries)
Appendix A3. Further Retrieval Results for On-the-fly distribution: Routing/Filtering task

Fig A3-1. ZIFF-DAVIS [On-the-fly]: speedup for term selection algorithms (Network); series: FB, CFP and CSP, each with ADD, A/R and RW

Fig A3-2. ZIFF-DAVIS [On-the-fly]: parallel efficiency for term selection algorithms (Network); series: FB, CFP and CSP, each with ADD, A/R and RW
Fig A3-3. ZIFF-DAVIS [On-the-fly]: outer iterations to service term selection (Network)

Fig A3-4. ZIFF-DAVIS [On-the-fly]: outer iterations to service term selection (AP)
Appendix A4. Further details of update and index maintenance experiments

Fig A4-1. BASE [DocId]: parallel efficiency for update transactions (postings only)
Fig A4-2. BASE [TermId]: parallel efficiency for update transactions (postings only)
Fig A4-3. BASE [DocId]: parallel efficiency for update transactions (position data)
Fig A4-4. BASE [TermId]: parallel efficiency for update transactions (position data)
Fig A4-5. BASE [DocId]: parallel efficiency for all transactions (postings only)
Fig A4-6. BASE [TermId]: parallel efficiency for all transactions (postings only)
Fig A4-7. BASE [DocId]: parallel efficiency for all transactions (position data)
Fig A4-8. BASE [TermId]: parallel efficiency for all transactions (position data)
Fig A4-9. BASE [DocId]: % increase from normal average transaction elapsed time during index update (postings only)
Fig A4-10. BASE [DocId]: % increase from normal average transaction elapsed time during index update (position data)
Fig A4-11. BASE [TermId]: % increase from normal average transaction elapsed time during index update (postings only)
Fig A4-12. BASE [TermId]: % increase from normal average transaction elapsed time during index update (position data)
Fig A4-13. BASE [DocId]: parallel efficiency for index reorganisation (postings only)
Fig A4-14. BASE [DocId]: parallel efficiency for index reorganisation (position data)
Fig A4-15. BASE [TermId]: parallel efficiency for index reorganisation (postings only)
Fig A4-16. BASE [TermId]: parallel efficiency for index reorganisation (position data)
Fig A4-17. BASE [DocId]: accumulated total time for index reorganisation (postings only)
Fig A4-18. BASE [DocId]: accumulated total time for index reorganisation (position data)
Fig A4-19. BASE [TermId]: accumulated total time for index reorganisation (postings only)
Fig A4-20. BASE [TermId]: accumulated total time for index reorganisation (position data)

Table A4-1. BASE/BASE1 [DocId]: scalability on index reorganisation (total time in seconds and scalability on total time, postings only and with position data, per collection)
Appendix A5 - Synthetic models chapter appendix

Table A5-1. Load imbalance estimates for the LI[P] variable (P against LI[P])

A5.1 SEQUENTIAL MODEL FOR INDEXING

A5.1.1 Analyse Documents

Strip words from documents and insert each word into a block: d * n log(n) operations, giving

dn log(n) * T_cpu

A5.1.2 Save Intermediate Results

Number of intermediate saves: dn/BSIZE
Cost per intermediate save: BSIZE * T_i/o

(dn/BSIZE) * (BSIZE * T_i/o)

A5.1.3 Merge Phase

Load blocks: (dn/BSIZE) * (BSIZE * T_i/o)
Write blocks: (dn/BSIZE) * (BSIZE * T_i/o)
Merge blocks: (dn/BSIZE) * (BSIZE * T_cpu)

2((dn/BSIZE) * (BSIZE * T_i/o)) + (dn/BSIZE) * (BSIZE * T_cpu)
A5.1.4 Sequential Indexing Model

Combining the equations declared in sections A5.1.1 to A5.1.3 gives us the following sequential synthetic indexing model:

INDEX_seq(d,n,BSIZE) = dn log(n) * T_cpu + 3((dn/BSIZE) * (BSIZE * T_i/o)) + (dn/BSIZE) * (BSIZE * T_cpu)

A5.2 PARALLEL MODELS FOR INDEXING

A5.2.1 Distributing Documents to Nodes

dn/f * T_comm

A5.2.2 Global Merge Phase

(dn/BSIZE) * (P * T_comm)

A5.2.3 DocId Indexing Models

Using the function defined in section A5.1.4 and the equation defined in section A5.2.1, we can define synthetic models for DocId indexing. With distributed build (INDEX_Distr_DocId) we also add the distribution component for text data (the equation from section A5.2.1):

INDEX_Local_DocId(d,n,P,BSIZE) = (INDEX_seq(d,n,BSIZE)/P) * LI[P]

INDEX_Distr_DocId(d,n,f,P,BSIZE) = ((INDEX_seq(d,n,BSIZE)/P) * LI[P]) + (dn/f * T_comm)

A5.2.4 TermId Indexing Model

The distributed build TermId model (INDEX_Distr_TermId) must redefine one aspect of the sequential indexing model defined in section A5.1.4. The merge component defined in section A5.1.3 is doubled for the TermId model, and the extra communication costs from section A5.2.2 are added. The revised index computation component is divided by the number of processors and multiplied by the load imbalance estimate.
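As an illustration, the sequential and DocId indexing models above can be evaluated in code. This is a minimal sketch: the unit costs T_cpu, T_i/o and T_comm and the load-imbalance function li(p) are illustrative assumptions, not values from the text (the Table A5-1 estimates for LI[P] are not reproduced here).

```python
import math

# Illustrative unit costs in seconds; these are assumptions, not thesis values.
T_CPU, T_IO, T_COMM = 1e-8, 1e-6, 1e-5

def li(p):
    """Hypothetical load-imbalance estimate LI[P]; grows mildly with P."""
    return 1.0 + 0.05 * (p - 1)

def index_seq(d, n, bsize):
    """INDEX_seq(d,n,BSIZE): analyse documents, then save/load/write and merge blocks."""
    return (d * n * math.log(n) * T_CPU                 # dn log(n) * T_cpu
            + 3 * (d * n / bsize) * (bsize * T_IO)      # save + load + write
            + (d * n / bsize) * (bsize * T_CPU))        # merge

def index_local_docid(d, n, p, bsize):
    """INDEX_Local_DocId: sequential cost split over P nodes, scaled by LI[P]."""
    return (index_seq(d, n, bsize) / p) * li(p)

def index_distr_docid(d, n, f, p, bsize):
    """INDEX_Distr_DocId: local build plus the text distribution cost dn/f * T_comm."""
    return index_local_docid(d, n, p, bsize) + (d * n / f) * T_COMM
```

With these assumed constants the model behaves as expected: the local DocId build on P nodes is cheaper than the sequential build, and the distributed build adds the communication term on top.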
INDEX_Distr_TermId(d,n,f,P,BSIZE) = dn/f * T_comm + (dn/BSIZE) * (P * T_comm) + ((dn log(n) * T_cpu + 6((dn/BSIZE) * (BSIZE * T_i/o)) + 2(dn/BSIZE) * (BSIZE * T_cpu)) * LI[P]) / P

A5.3 SEQUENTIAL MODEL FOR PROBABILISTIC SEARCH

The sequential model for probabilistic search is made up of the following:

Load q keyword sets: Load_kw_seq(q,s) = q * T_i/o[s]
Weight q keyword sets: Weight_kw_seq(q,s) = s*q * T_cpu
Merge q-1 keyword sets: Merge_kw_seq(q,s) = (q-1)*(s+s) * T_cpu
Sort final results set: Sort_set_seq(q,s) = R[q,s] log(R[q,s]) * T_cpu

Put together, these functions make up the synthetic search model for sequential probabilistic search:

SEARCH_seq(s,q) = Load_kw_seq(q,s) + Weight_kw_seq(q,s) + Merge_kw_seq(q,s) + Sort_set_seq(q,s)

A5.4 PARALLEL MODELS FOR PROBABILISTIC SEARCH

A5.4.1 DocId Partitioning

The parallel model using DocId partitioning for probabilistic search is made up of the following:

Communications costs for DocId: Comms_Search_docid(P) = 3(P * T_comm)
  Send P requests for term frequencies: P * T_comm
  Send P queries (with term frequencies): P * T_comm
  Gather results from P nodes for set size s: P * T_comm

Load q keyword sets: Load_kw_docid(q,s,P) = q * T_i/o[s/P]
Weight q keyword sets: Weight_kw_par(q,s,P) = (Weight_kw_seq(q,s)/P) * LI[P]
Merge q-1 keyword sets: Merge_kw_par(q,s,P) = (Merge_kw_seq(q,s)/P) * LI[P]
Sort final results set: Sort_set_par(q,s,P) = ((R[q,s]/P) log(R[q,s]/P) * T_cpu) * LI[P]
The DocId partitioning synthetic search model is therefore:

SEARCH_docid(s,q,P) = Comms_Search_docid(P) + Load_kw_docid(q,s,P) + Weight_kw_par(q,s,P) + Merge_kw_par(q,s,P) + Sort_set_par(q,s,P)

A5.4.2 TermId Partitioning 1 - Sequential Sort

The parallel model using TermId partitioning for probabilistic search with a sequential sort is made up of the following:

Communications costs for TermId: Comms_Search_termid(s,q,P,SSIZE) = ((R[s,q]/SSIZE) * P * T_comm) + (P * T_comm)
  Send P queries (with term frequencies): P * T_comm
  Gather results from P nodes for set size s: (R[s,q]/SSIZE) * P * T_comm

Load q keyword sets: Load_kw_termid(q,s,P) = q/P * T_i/o[s/P[q]]
Weight q keyword sets: Weight_kw_par(q,s,P[q])
Merge q-1 keyword sets: Merge_kw_par(q,s,P[q])
Sort final results set: Sort_set_seq(q,s)

The TermId partitioning synthetic search model with sequential sort is therefore:

SEARCH_termid(s,q,P,SSIZE) = Comms_Search_termid(s,q,P,SSIZE) + Load_kw_termid(q,s,P) + Weight_kw_par(q,s,P[q]) + Merge_kw_par(q,s,P[q]) + Sort_set_seq(q,s)

A5.4.3 TermId Partitioning 2 - Parallel Sort

The parallel model using TermId partitioning for probabilistic search with a parallel sort is made up of the following:
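The sequential and DocId search models can be sketched the same way. The text leaves R[q,s] and T_i/o[s] abstract, so the versions below are hypothetical stand-ins, as are the unit costs and li(p):

```python
import math

T_CPU, T_COMM = 1e-8, 1e-5

def t_io(size):
    """Hypothetical T_i/o[size]: a fixed seek plus a per-entry transfer cost."""
    return 1e-4 + 1e-7 * size

def r(q, s):
    """Hypothetical estimate R[q,s] of the result-set size for q terms of set size s."""
    return min(q * s, 10_000)

def li(p):
    """Hypothetical load-imbalance estimate LI[P]."""
    return 1.0 + 0.05 * (p - 1)

def search_seq(s, q):
    """SEARCH_seq: load + weight + merge the keyword sets, then sort the results."""
    load = q * t_io(s)
    weight = s * q * T_CPU
    merge = (q - 1) * (s + s) * T_CPU
    rs = r(q, s)
    sort = rs * math.log(rs) * T_CPU
    return load + weight + merge + sort

def search_docid(s, q, p):
    """SEARCH_docid: each node holds s/P of every set; three comms rounds."""
    comms = 3 * p * T_COMM
    load = q * t_io(s / p)
    weight = (s * q * T_CPU / p) * li(p)
    merge = ((q - 1) * 2 * s * T_CPU / p) * li(p)
    rp = r(q, s) / p
    sort = (rp * math.log(rp) * T_CPU) * li(p)
    return comms + load + weight + merge + sort
```

Under these assumptions the parallel DocId search is cheaper than the sequential search for moderate P, matching the intent of the model.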
Communications costs for TermId2: Comms_Search_termid2(s,q,P,SSIZE) = 3((R[s,q]/SSIZE) * P * T_comm) + (P * T_comm)
  Send P queries (with term frequencies): P * T_comm
  Gather results from P nodes for set size s: 3(R[s,q]/SSIZE) * P * T_comm

Load q keyword sets: Load_kw_termid(q,s,P)
Weight q keyword sets: Weight_kw_par(q,s,P[q])
Merge q-1 keyword sets: Merge_kw_par(q,s,P[q])
Sort final results set: Sort_set_par(q,s,P)

The TermId partitioning synthetic search model with parallel sort is therefore:

SEARCH_termid2(s,q,P,SSIZE) = Comms_Search_termid2(s,q,P,SSIZE) + Load_kw_termid(q,s,P) + Weight_kw_par(q,s,P[q]) + Merge_kw_par(q,s,P[q]) + Sort_set_par(q,s,P)

A5.5 SEQUENTIAL MODEL FOR PASSAGE RETRIEVAL

Service q terms on PR documents, each with a(a-1)/2 inspected passages:

Compute_Pass(PR,q,a) = T_cpu * PR * q * (a(a-1)/2)

A sort on the top PR documents is required to re-rank the final results set; the cost is T_cpu PR log(PR).

PASSAGE_seq(s,q,a,PR) = SEARCH_seq(s,q) + Compute_Pass(PR,q,a) + T_cpu PR log(PR)

A5.6 PARALLEL MODELS FOR PASSAGE RETRIEVAL

A5.6.1 DocId Models

The DocId method simply applies P processors to the Compute_Pass computation defined in section A5.5:

Compute_Pass_par(PR,q,a,P) = (Compute_Pass(PR,q,a)/P) * LI[P]

The local passage processing cost model is constructed by adding the probabilistic DocId cost model from section A5.4.1 to the Compute_Pass_par model:
PASSAGE_docid_local(s,q,a,PR,P) = SEARCH_docid(s,q,P) + Compute_Pass_par(PR,q,a,P)

The distributed passage retrieval method must also gather data from the nodes in order to choose the best PR documents in the collection. This requires four stages:

i) Gather the data from an initial probabilistic search (the top PR documents)
ii) Scatter this full set to the processors
iii) Gather up the full set from all the processors
iv) Do a final rank on the top PR documents

The estimate for this overhead is therefore:

i) Gather data, (PR/SSIZE)/P * P: PR/SSIZE (P eliminated)
ii) Scatter PR elements to P processors: T_comm (PR/SSIZE)*P
iii) Gather PR elements from P processors: T_comm (PR/SSIZE)*P
iv) Sort PR elements to obtain final rank: T_cpu PR log(PR)

The model for overheads on distributed passage processing is therefore:

OVERHEAD_pass(PR,P,SSIZE) = T_comm((2(PR/SSIZE)*P) + PR/SSIZE) + T_cpu PR log(PR)

The DocId distributed passage processing cost model is constructed by adding the probabilistic DocId cost model from section A5.4.1 to the Compute_Pass_par model, together with the OVERHEAD_pass cost model:

PASSAGE_docid_distr(s,q,a,PR,SSIZE,P) = SEARCH_docid(s,q,P) + Compute_Pass_par(PR,q,a,P) + OVERHEAD_pass(PR,P,SSIZE)

A5.6.2 TermId Models

In TermId we must communicate the data for a(a-1)/2 passages for PR documents on P processors:

OVERHEAD_passtid(a,PR,P) = T_comm(PR * P * (a(a-1)/2))

The TermId distributed passage processing cost models are constructed by adding the probabilistic TermId cost models from sections A5.4.2 and A5.4.3 to the Compute_Pass_par model, together with the OVERHEAD_pass and OVERHEAD_passtid cost models:

PASSAGE_termid(s,q,a,PR,SSIZE,P) =
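The passage retrieval components above can be sketched numerically, under the same illustrative constants (T_cpu, T_comm and li(p) are assumptions, not values from the text):

```python
import math

T_CPU, T_COMM = 1e-8, 1e-5

def compute_pass(pr, q, a):
    """Compute_Pass(PR,q,a): q terms over PR documents, a(a-1)/2 passages each."""
    return T_CPU * pr * q * (a * (a - 1) / 2)

def li(p):
    """Hypothetical load-imbalance estimate LI[P]."""
    return 1.0 + 0.05 * (p - 1)

def compute_pass_par(pr, q, a, p):
    """Compute_Pass_par: the sequential passage cost over P processors, times LI[P]."""
    return (compute_pass(pr, q, a) / p) * li(p)

def overhead_pass(pr, p, ssize):
    """OVERHEAD_pass: gather, scatter and re-gather the top-PR set, then a final sort."""
    return (T_COMM * (2 * (pr / ssize) * p + pr / ssize)
            + T_CPU * pr * math.log(pr))
```

The parallel passage computation is cheaper than the sequential one whenever LI[P] < P, so the interesting question the models pose is whether OVERHEAD_pass eats that gain.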
SEARCH_termid(s,q,P,SSIZE) + Compute_Pass_par(PR,q,a,P[q]) + OVERHEAD_pass(PR,P,SSIZE) + OVERHEAD_passtid(a,PR,P)

PASSAGE_termid2(s,q,a,PR,SSIZE,P) = SEARCH_termid2(s,q,P,SSIZE) + Compute_Pass_par(PR,q,a,P[q]) + OVERHEAD_pass(PR,P,SSIZE) + OVERHEAD_passtid(a,PR,P)

A5.7 SEQUENTIAL MODELS FOR TERM SELECTION

A5.7.1 Evaluation

The cost of evaluation is broken down into the following:

Merge set for term with accumulated set: T_cpu * (s+s)
Merge relevance judgements with temporary set: T_cpu * (s+r)
Rank the temporary set using a sort: T_cpu * (R[q,s] log(R[q,s]))

Put together, these equations form the model for the cost of a single evaluation:

EVAL(q,s,r) = T_cpu * ((s+s) + R[q,s] log(R[q,s]) + (s+r))

A5.7.2 Total Number of Evaluations

The maximum number of evaluations for the find best algorithm is q*i. Not all keywords are inspected in i iterations: i*(i+1)*0.5 are skipped, since after each iteration one less term is inspected. This formula accumulates the total number of keywords not inspected in i iterations, as one term is always chosen. Put together with an estimate u of the total number of terms skipped, the function for inspected terms is:

INSPECTED(q,i) = (qi - (i(i+1)*0.5)) - u(qi - (i(i+1)*0.5))

A5.7.3 Load Costs for Keywords

The cost of loading term data is as follows:

Load q terms from disk, each with set size s: q * T_i/o[s]
Weight q terms, each with set size s: q * s * T_cpu

Putting these equations together yields the following load cost:

LOAD(q,s) = q(T_i/o[s] + s*T_cpu)
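The evaluation-counting arithmetic above is easy to check with a short sketch; the skip fraction u and the result-set estimate R[q,s] below are hypothetical stand-ins:

```python
import math

T_CPU = 1e-8

def r(q, s):
    """Hypothetical estimate R[q,s] of the result-set size."""
    return min(q * s, 10_000)

def eval_cost(q, s, rel):
    """EVAL(q,s,r): merge the term set, rank the temporary set, merge judgements."""
    rs = r(q, s)
    return T_CPU * ((s + s) + rs * math.log(rs) + (s + rel))

def inspected(q, i, u=0.2):
    """INSPECTED(q,i): qi evaluations minus the i(i+1)/2 never-inspected terms,
    further reduced by a hypothetical skip fraction u."""
    not_inspected = i * (i + 1) * 0.5
    return (q * i - not_inspected) - u * (q * i - not_inspected)
```

For example, with q = 100 terms and i = 10 iterations, qi = 1000 and i(i+1)/2 = 55, so with u = 0 the model inspects 945 terms.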
A5.7.4 Sequential Models for Term Selection

Using the models defined in sections A5.7.1 to A5.7.3 we can now define the sequential cost models for term selection. For add-only operation (ROUTING_seq) this is a simple matter of multiplying the evaluation cost (section A5.7.1) by the number of terms inspected (section A5.7.2), with the addition of load costs (section A5.7.3). The model for add-reweight (ROUTING_seqw) is constructed by factoring the total evaluation cost by the reweight variable w.

ROUTING_seq(s,r,i,q) = (INSPECTED(q,i) * EVAL(q,s,r)) + LOAD(q,s)

ROUTING_seqw(s,r,i,q,w) = (INSPECTED(q,i) * EVAL(q,s,r) * w) + LOAD(q,s)

A5.8 PARALLEL MODELS FOR TERM SELECTION

The basic term selection models, with no synchronisation or communication costs, are as follows:

ROUTING_par(s,r,i,q,P) = (ROUTING_seq(s,r,i,q) * LI[P]) / P

ROUTING_parw(s,r,i,q,P,w) = (ROUTING_seqw(s,r,i,q,w) * LI[P]) / P

A5.8.1 DocId Models

The cost model for intra-set parallelism is:

Merge set costs: Merge_Route_docid(s,r,P) = (T_cpu(s + r + s)/P) * LI[P]
Sort costs: Sort_Route_docid(s,P) = (T_cpu(s/P) log(s/P)) * LI[P]
Communication costs: Comms_Route_docid(s,P,SSIZE) = (((s/SSIZE)/P) + 2P) * T_comm

Putting these functions together gives us the evaluation cost model for DocId term selection:

EVAL_docid(s,r,i,q,P,SSIZE) = INSPECTED(q,i) * (Merge_Route_docid(s,r,P) + Sort_Route_docid(s,P) + Comms_Route_docid(s,P,SSIZE))

We also measure overheads at the synchronisation point for merging the chosen term into the accumulated set and communicating the best term identifier in one iteration:

Communication costs for best term: P*T_comm
Merge best term set into accumulated set: ((s*T_cpu)/P) * LI[P]
We assume latency is the dominant factor in communication costs. Putting these equations together gives us the estimate of overheads for the DocId term selection cost model:

OVERHEAD_docid(s,i,P) = i*((((s*T_cpu)/P) * LI[P]) + (P*T_comm))

The models for term selection are constructed by taking the load cost model (defined in section A5.7.3) and adding the evaluation and overhead cost models defined above in this section. The load cost model is further refined by dividing by the number of processors and factoring the result by the load imbalance estimate (LI[P]).

ROUTING_docid(s,r,i,q,P,SSIZE) = ((LOAD(q,s) * LI[P]) / P) + EVAL_docid(s,r,i,q,P,SSIZE) + OVERHEAD_docid(s,i,P)

ROUTING_docidw(s,r,i,q,P,SSIZE,w) = ((LOAD(q,s) * LI[P]) / P) + (w * EVAL_docid(s,r,i,q,P,SSIZE)) + OVERHEAD_docid(s,i,P)

A5.8.2 TermId Models

The interaction at the synchronisation point is more complicated than for the DocId models. This is because the data for the best term must be retrieved from the relevant node and merged into the accumulated set, which is then broadcast to all nodes. Overheads for TermId models are calculated as follows:

Get the identifier of the best term in one iteration: P*T_comm
Request for best term set: 1 * T_comm
Retrieving best set from relevant node: s/SSIZE * T_comm
Broadcast best set to all other nodes: ((P-1) * s/SSIZE) * T_comm
Merge the best term data into the accumulated set: s*T_cpu + T_i/o[s]

Put together, these equations form the cost model for routing overheads on the TermId partitioning scheme:

OVERHEAD_termid(s,i,P,SSIZE) = i*((T_comm*((P+1) + (P*s/SSIZE))) + (s*T_cpu + T_i/o[s]))

Construction of the routing models can be done by re-using the basic term selection models and adding the OVERHEAD_termid cost:

ROUTING_termid(s,r,i,q,P,SSIZE) = (ROUTING_par(s,r,i,q,P) * LI[P]) + OVERHEAD_termid(s,i,P,SSIZE)
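The contrast between the DocId and TermId synchronisation overheads is easy to see numerically. A sketch with assumed unit costs and hypothetical T_i/o[s] and LI[P] functions:

```python
T_CPU, T_COMM = 1e-8, 1e-5

def li(p):
    """Hypothetical load-imbalance estimate LI[P]."""
    return 1.0 + 0.05 * (p - 1)

def t_io(size):
    """Hypothetical T_i/o[size]: seek plus per-entry transfer."""
    return 1e-4 + 1e-7 * size

def overhead_docid(s, i, p):
    """Per-iteration sync cost: merge the best term in parallel, broadcast its id."""
    return i * (((s * T_CPU) / p) * li(p) + p * T_COMM)

def overhead_termid(s, i, p, ssize):
    """Per-iteration sync cost: fetch and broadcast the best term's set, then merge."""
    return i * (T_COMM * ((p + 1) + p * s / ssize) + (s * T_CPU + t_io(s)))
```

Because TermId ships the term's set data across the network each iteration rather than just an identifier, its synchronisation overhead dominates DocId's under these assumptions.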
ROUTING_termidw(s,r,i,q,P,SSIZE,w) = (ROUTING_parw(s,r,i,q,P,w) * LI[P]) + OVERHEAD_termid(s,i,P,SSIZE)

The extra LI[P] here reflects the assumption that ROUTING_termid imbalance will probably be worse than ROUTING_rep in particular, and the other models in general. This is because terms are statically allocated to a node (see chapter 4, sub-section, for a discussion of term allocation schemes).

A5.8.3 Replication Models

Latency is presumed to be the main communication problem for the replication distribution scheme. The overheads for the replication cost models are calculated as follows:

Get the identifier of the best term from P processors: P*T_comm
Send the identifier of the best term to P processors: P*T_comm
Merge the best term data into the accumulated set: (s*T_cpu) + T_i/o[s]

Putting these equations together gives us the following cost model:

OVERHEAD_rep(s,i,P) = i*(s*T_cpu + T_i/o[s] + (2P*T_comm))

The cost models for the replication distribution scheme can be constructed by simply adding the overheads to the basic term selection cost models:

ROUTING_rep(s,r,i,q,P) = ROUTING_par(s,r,i,q,P) + OVERHEAD_rep(s,i,P)

ROUTING_repw(s,r,i,q,P,w) = ROUTING_parw(s,r,i,q,P,w) + OVERHEAD_rep(s,i,P)

A5.8.4 On-the-fly Distribution Models

The overhead cost model for the On-the-fly distribution scheme is:

OVERHEAD_load(q,s,SSIZE,P) = ((qs/SSIZE) + P) * T_comm

There are also overhead costs at the synchronisation point for transferring set data, formed as follows:

Get the identifier of the best term in one iteration: P*T_comm
Broadcast best set to all nodes: (P * s/SSIZE) * T_comm
Merge the best term data into the accumulated set: (s*T_cpu) + T_i/o[s]

Putting these equations together, we have the overhead at the synchronisation point:

OVERHEAD_large(s,i,SSIZE,P) = i*((((P*(s/SSIZE)) + P) * T_comm) + (s*T_cpu + T_i/o[s]))

We cannot use the basic parallel term selection cost models, as some aspects of them (such as load) must be done sequentially.
We apply a parallel cost model to the total evaluation cost, together with the load cost model defined in section A5.7.3 and the overhead cost models defined above in this section.

ROUTING_parfly(s,r,i,q,SSIZE,P) = LOAD(q,s) + OVERHEAD_load(q,s,SSIZE,P) + OVERHEAD_large(s,i,SSIZE,P) + ((INSPECTED(q,i) * EVAL(q,s,r) * LI[P]) / P)

ROUTING_parflyw(s,r,i,q,SSIZE,P,w) = LOAD(q,s) + OVERHEAD_load(q,s,SSIZE,P) + OVERHEAD_large(s,i,SSIZE,P) + ((INSPECTED(q,i) * EVAL(q,s,r) * w * LI[P]) / P)

A5.9 SEQUENTIAL MODELS FOR INDEX UPDATE

A5.9.1 Adding a Document to the Buffer: Update Transaction

The client/server update model is formed by the following steps:

Scan words and put in client tree: n log(n) * T_cpu
Marshalling/unmarshalling term data: (n + n) * T_cpu
Sending data: (n/SSIZE) * T_comm
Merge word data with server buffer: n log(dict) * T_cpu

Putting these equations together gives us a cost model for update on a single inverted file:

UPDATE_seq(n,dict,SSIZE) = T_cpu(n log(n) + 2n + n log(dict)) + T_comm * (n/SSIZE)

A5.9.2 Transactions While the Index Is Updated

The cost model for a transaction is calculated by adding the contention factor c to the particular function being examined. The search cost model is taken from section A5.3 and the update cost model from the previous section.

UPDATE_seqc(n,dict,SSIZE) = (UPDATE_seq(n,dict,SSIZE) * c[1]) + UPDATE_seq(n,dict,SSIZE)

SEARCH_seqc(s,q) = (SEARCH_seq(s,q) * c[1]) + SEARCH_seq(s,q)

A5.9.3 Transaction Estimate

Taking the cost models defined in sections A5.9.1 and A5.9.2, we can construct a cost model for transactions in which the percentage of total transaction time spent updating the index can be varied. This allows us to vary the effect on transactions and study the theoretical performance penalty on transactions while doing a simultaneous index update. In the TRANSACTION_seq model we eliminate contention by setting ro to zero, while ro set to 1 means that all transactions are affected by contention.
TRANSACTION_seq(ur,sr,ro,n,dict,s,q) = [(1-ro)(ur*UPDATE_seq(n,dict,SSIZE) + sr*SEARCH_seq(s,q)) + ro(ur*UPDATE_seqc(n,dict,SSIZE) + sr*SEARCH_seqc(s,q))] / (ur + sr)

A5.9.4 Reorganisation of the Inverted File

The reorganisation model is made up of the following synthetic cost functions:

Insert m buffer words in dict: Insert_Words_buff(m,b,dict) = T_cpu(m(log(dict/b) + b))
Read t+m keyword lists from disk: List_Disk_Trans(t,m) = T_i/o[s] * (t+m)
Write t+m keyword lists to disk: List_Disk_Trans(t,m)
Merge m keyword lists: Merge_kw_lists(m,s) = T_cpu(m(s+s))
Read in (dict/b) keyword blocks: Read_kw_blocks(dict,b) = T_i/o[b] * (dict/b)

The reorganisation, or index update, cost model is constructed using the four cost models defined above (List_Disk_Trans is used twice):

REORG_seq(n,dict,b,m,t,s) = Insert_Words_buff(m,b,dict) + 2 List_Disk_Trans(t,m) + Merge_kw_lists(m,s) + Read_kw_blocks(dict,b)

The contention model for reorganising the index is:

REORG_seqc(n,dict,b,m,t,s) = (REORG_seq(n,dict,b,m,t,s) * c[1]) + REORG_seq(n,dict,b,m,t,s)

A5.10 PARALLEL MODELS FOR INDEX UPDATE

A5.10.1 DocId Transaction Model

In this data distribution method we simply re-use the sequential cost model defined in section A5.9.1 above:

UPDATE_docid(n,dict,SSIZE) = UPDATE_seq(n,dict,SSIZE)

The contention model also re-uses the sequential model:

UPDATE_docidc(n,dict,SSIZE,P) = (UPDATE_seq(n,dict,SSIZE) * c[P]) + UPDATE_seq(n,dict,SSIZE)

In order to construct the contention model for DocId search we re-use the function defined in section A5.4.1 above:

SEARCH_docidc(s,q,P) = (SEARCH_docid(s,q,P) * c[P]) + SEARCH_docid(s,q,P)

The transaction model for DocId partitioning is constructed in exactly the same way as the sequential version described in section A5.9.3 above.

TRANSACTION_docid(ur,sr,ro,n,dict,s,q,P) =
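The transaction blend in TRANSACTION_seq is a contention-weighted average of update and search costs, normalised by ur + sr. A sketch, with illustrative unit costs and a contention factor c that is an assumption rather than a measured value:

```python
import math

T_CPU, T_COMM = 1e-8, 1e-5

def update_seq(n, dic, ssize):
    """UPDATE_seq: scan into the client tree, (un)marshal, send, merge into buffer."""
    return (T_CPU * (n * math.log(n) + 2 * n + n * math.log(dic))
            + T_COMM * (n / ssize))

def with_contention(cost, c):
    """Apply a contention factor c: the contended operation costs cost*c extra."""
    return cost * c + cost

def transaction(ur, sr, ro, upd, srch, upd_c, srch_c):
    """TRANSACTION model: blend clean and contended costs; ro is the contended fraction."""
    clean = (1 - ro) * (ur * upd + sr * srch)
    contended = ro * (ur * upd_c + sr * srch_c)
    return (clean + contended) / (ur + sr)
```

With ro = 0 the contended terms vanish and the model reduces to a plain weighted average of the clean update and search costs; with ro = 1 every transaction pays the contention penalty.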
[(1-ro)(ur*UPDATE_docid(n,dict,SSIZE) + sr*SEARCH_docid(s,q,P)) + ro(ur*UPDATE_docidc(n,dict,SSIZE,P) + sr*SEARCH_docidc(s,q,P))] / (ur + sr)

A5.10.2 TermId Transaction Model

With the TermId distribution method a new cost model must be defined, as merging the data with the buffer is parallelised.

UPDATE_termid(n,dict,P,SSIZE) = (T_cpu(n log(n) + (n log(dict) * LI[P]) + 2n) + (P*T_comm * (n/SSIZE))) / P

The contention model re-uses the model defined above:

UPDATE_termidc(n,dict,P,SSIZE) = (UPDATE_termid(n,dict,P,SSIZE) * c[P]) + UPDATE_termid(n,dict,P,SSIZE)

In order to construct the contention model for TermId search we re-use the function defined in section A5.4.3 above (we utilise the parallel sort cost model):

SEARCH_termidc(s,q,P,SSIZE) = (SEARCH_termid2(s,q,P,SSIZE) * c[P]) + SEARCH_termid2(s,q,P,SSIZE)

The transaction model for TermId partitioning is constructed in exactly the same way as the sequential version described in section A5.9.3 above.

TRANSACTION_termid(ur,sr,ro,n,dict,s,q,P,SSIZE) = [(1-ro)(ur*UPDATE_termid(n,dict,P,SSIZE) + sr*SEARCH_termid2(s,q,P,SSIZE)) + ro(ur*UPDATE_termidc(n,dict,P,SSIZE) + sr*SEARCH_termidc(s,q,P,SSIZE))] / (ur + sr)

A5.10.3 DocId Reorganisation Model

The DocId index update cost model is:

Insert m buffer words in dict: Insert_Words_buff_docid(m,b,dict,P) = T_cpu(m*i[P]*(log(((dict/b)/P)*i[P]) + b)) * LI[P]
Read t+m keyword lists from disk: List_Disk_Trans_docid(t,m,s,P) = T_i/o[p[P]*(s/P)] * (t+m) * i[P] * LI[P]
Write t+m keyword lists to disk: List_Disk_Trans_docid(t,m,s,P)
Merge m keyword lists: Merge_kw_lists_docid(m,s,P) = T_cpu(m*i[P]*(s/P + s/P)) * LI[P]
Read in (dict/b) keyword blocks: Read_kw_blocks_docid(dict,b,P) = T_i/o[b] * ((dict/b)/P) * i[P] * LI[P]

The index update model for DocId partitioning is constructed as follows:

REORG_docid(n,dict,b,m,t,s,P) = Insert_Words_buff_docid(m,b,dict,P) + 2 List_Disk_Trans_docid(t,m,s,P) + Merge_kw_lists_docid(m,s,P) + Read_kw_blocks_docid(dict,b,P)

This function is re-used in the construction of the cost model with contention as follows:
REORG_docidc(n,dict,b,m,t,s,P) = (REORG_docid(n,dict,b,m,t,s,P) * c[P]) + REORG_docid(n,dict,b,m,t,s,P)
A5.10.4 TermId Reorganisation Model

The TermId index update cost model is:

Insert m buffer words in dict: Insert_Words_buff_termid(m,b,dict,P) = (T_cpu(m(log(dict/b) + b))/P) * LI[P]
Read t+m keyword lists from disk: List_Disk_Trans_termid(t,m,s,P) = ((T_i/o[s] * (t+m))/P) * LI[P]
Write t+m keyword lists to disk: List_Disk_Trans_termid(t,m,s,P)
Merge m keyword lists: Merge_kw_lists_termid(m,s,P) = (T_cpu(m(s+s))/P) * LI[P]
Read in (dict/b) keyword blocks: Read_kw_blocks_termid(dict,b,P) = ((T_i/o[b] * (dict/b))/P) * LI[P]

The index update model for TermId partitioning is constructed as follows:

REORG_termid(n,dict,b,m,t,s,P) = Insert_Words_buff_termid(m,b,dict,P) + 2 List_Disk_Trans_termid(t,m,s,P) + Merge_kw_lists_termid(m,s,P) + Read_kw_blocks_termid(dict,b,P)

This function is re-used in the construction of the cost model with contention as follows:

REORG_termidc(n,dict,b,m,t,s,P) = (REORG_termid(n,dict,b,m,t,s,P) * c[P]) + REORG_termid(n,dict,b,m,t,s,P)
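Since the TermId reorganisation model is just REORG_seq with every component divided by P and scaled by LI[P], a sketch is short. The t_io and li functions are hypothetical stand-ins, and the unit costs are illustrative:

```python
import math

T_CPU = 1e-8

def t_io(size):
    """Hypothetical T_i/o[size]: a seek plus a per-entry transfer cost."""
    return 1e-4 + 1e-7 * size

def li(p):
    """Hypothetical load-imbalance estimate LI[P]."""
    return 1.0 + 0.05 * (p - 1)

def reorg_seq(dic, b, m, t, s):
    """REORG_seq: insert buffer words, read + write keyword lists, merge, read blocks."""
    insert = T_CPU * m * (math.log(dic / b) + b)
    lists = t_io(s) * (t + m)      # List_Disk_Trans, used for both the read and the write
    merge = T_CPU * m * (s + s)
    blocks = t_io(b) * (dic / b)
    return insert + 2 * lists + merge + blocks

def reorg_termid(dic, b, m, t, s, p):
    """TermId reorganisation: every REORG_seq component divided by P, times LI[P]."""
    return (reorg_seq(dic, b, m, t, s) / p) * li(p)
```

As long as LI[P] < P, the model predicts that TermId reorganisation beats the sequential reorganisation, which is the behaviour the appendix's parallel efficiency figures probe.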
More informationA Batched GPU Algorithm for Set Intersection
A Batched GPU Algorithm for Set Intersection Di Wu, Fan Zhang, Naiyong Ao, Fang Wang, Xiaoguang Liu, Gang Wang Nankai-Baidu Joint Lab, College of Information Technical Science, Nankai University Weijin
More informationAnalyzing the performance of top-k retrieval algorithms. Marcus Fontoura Google, Inc
Analyzing the performance of top-k retrieval algorithms Marcus Fontoura Google, Inc This talk Largely based on the paper Evaluation Strategies for Top-k Queries over Memory-Resident Inverted Indices, VLDB
More informationChapter 13: Query Processing
Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing
More informationReuters collection example (approximate # s)
BSBI Reuters collection example (approximate # s) 800,000 documents from the Reuters news feed 200 terms per document 400,000 unique terms number of postings 100,000,000 BSBI Reuters collection example
More informationPage 1. Program Performance Metrics. Program Performance Metrics. Amdahl s Law. 1 seq seq 1
Program Performance Metrics The parallel run time (Tpar) is the time from the moment when computation starts to the moment when the last processor finished his execution The speedup (S) is defined as the
More informationCO Computer Architecture and Programming Languages CAPL. Lecture 15
CO20-320241 Computer Architecture and Programming Languages CAPL Lecture 15 Dr. Kinga Lipskoch Fall 2017 How to Compute a Binary Float Decimal fraction: 8.703125 Integral part: 8 1000 Fraction part: 0.703125
More informationEvaluation of Parallel Programs by Measurement of Its Granularity
Evaluation of Parallel Programs by Measurement of Its Granularity Jan Kwiatkowski Computer Science Department, Wroclaw University of Technology 50-370 Wroclaw, Wybrzeze Wyspianskiego 27, Poland kwiatkowski@ci-1.ci.pwr.wroc.pl
More informationCSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable
CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable References Bigtable: A Distributed Storage System for Structured Data. Fay Chang et. al. OSDI
More informationOutline. CSC 447: Parallel Programming for Multi- Core and Cluster Systems
CSC 447: Parallel Programming for Multi- Core and Cluster Systems Performance Analysis Instructor: Haidar M. Harmanani Spring 2018 Outline Performance scalability Analytical performance measures Amdahl
More informationChapter 12: Query Processing
Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join
More informationKartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18
Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation
More informationChapter 12: Query Processing. Chapter 12: Query Processing
Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Sorting Join
More informationQuery Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016
Query Processing Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Slides re-used with some modification from www.db-book.com Reference: Database System Concepts, 6 th Ed. By Silberschatz,
More information! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for
Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and
More informationChapter 13: Query Processing Basic Steps in Query Processing
Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and
More informationCS 222/122C Fall 2017, Final Exam. Sample solutions
CS 222/122C Fall 2017, Final Exam Principles of Data Management Department of Computer Science, UC Irvine Prof. Chen Li (Max. Points: 100 + 15) Sample solutions Question 1: Short questions (15 points)
More informationDatabase Systems External Sorting and Query Optimization. A.R. Hurson 323 CS Building
External Sorting and Query Optimization A.R. Hurson 323 CS Building External sorting When data to be sorted cannot fit into available main memory, external sorting algorithm must be applied. Naturally,
More informationLecture 15: The Details of Joins
Lecture 15 Lecture 15: The Details of Joins (and bonus!) Lecture 15 > Section 1 What you will learn about in this section 1. How to choose between BNLJ, SMJ 2. HJ versus SMJ 3. Buffer Manager Detail (PS#3!)
More information! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large
Chapter 20: Parallel Databases Introduction! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems!
More informationChapter 20: Parallel Databases
Chapter 20: Parallel Databases! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems 20.1 Introduction!
More informationChapter 20: Parallel Databases. Introduction
Chapter 20: Parallel Databases! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems 20.1 Introduction!
More informationNews Article Matcher. Team: Rohan Sehgal, Arnold Kao, Nithin Kunala
News Article Matcher Team: Rohan Sehgal, Arnold Kao, Nithin Kunala Abstract: The news article matcher is a search engine that allows you to input an entire news article and it returns articles that are
More informationIndex Construction Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson
Index Construction Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Index Construction Overview Introduction
More informationAdvanced Databases: Parallel Databases A.Poulovassilis
1 Advanced Databases: Parallel Databases A.Poulovassilis 1 Parallel Database Architectures Parallel database systems use parallel processing techniques to achieve faster DBMS performance and handle larger
More informationECE/CS 552: Introduction to Computer Architecture ASSIGNMENT #1 Due Date: At the beginning of lecture, September 22 nd, 2010
ECE/CS 552: Introduction to Computer Architecture ASSIGNMENT #1 Due Date: At the beginning of lecture, September 22 nd, 2010 This homework is to be done individually. Total 9 Questions, 100 points 1. (8
More informationText Analytics. Index-Structures for Information Retrieval. Ulf Leser
Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf
More informationUniversity of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015
University of Virginia Department of Computer Science CS 4501: Information Retrieval Fall 2015 2:00pm-3:30pm, Tuesday, December 15th Name: ComputingID: This is a closed book and closed notes exam. No electronic
More informationInforma)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies
Informa)on Retrieval and Map- Reduce Implementa)ons Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies mas4108@louisiana.edu Map-Reduce: Why? Need to process 100TB datasets On 1 node:
More informationDefining Performance. Performance 1. Which airplane has the best performance? Computer Organization II Ribbens & McQuain.
Defining Performance Performance 1 Which airplane has the best performance? Boeing 777 Boeing 777 Boeing 747 BAC/Sud Concorde Douglas DC-8-50 Boeing 747 BAC/Sud Concorde Douglas DC- 8-50 0 100 200 300
More informationChapter 18: Parallel Databases
Chapter 18: Parallel Databases Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery
More informationChapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction
Chapter 18: Parallel Databases Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery Parallelism Intraoperation Parallelism Interoperation Parallelism Design of
More informationParallelization of Sequential Programs
Parallelization of Sequential Programs Alecu Felician, Pre-Assistant Lecturer, Economic Informatics Department, A.S.E. Bucharest Abstract The main reason of parallelization a sequential program is to run
More informationIndex construction CE-324: Modern Information Retrieval Sharif University of Technology
Index construction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch.
More informationDatabase Management System
Database Management System Lecture Join * Some materials adapted from R. Ramakrishnan, J. Gehrke and Shawn Bowers Today s Agenda Join Algorithm Database Management System Join Algorithms Database Management
More informationLesson 1 4. Prefix Sum Definitions. Scans. Parallel Scans. A Naive Parallel Scans
Lesson 1 4 Prefix Sum Definitions Prefix sum given an array...the prefix sum is the sum of all the elements in the array from the beginning to the position, including the value at the position. The sequential
More informationCMSC424: Database Design. Instructor: Amol Deshpande
CMSC424: Database Design Instructor: Amol Deshpande amol@cs.umd.edu Databases Data Models Conceptual representa1on of the data Data Retrieval How to ask ques1ons of the database How to answer those ques1ons
More informationDatabase System Concepts
Chapter 13: Query Processing s Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2008/2009 Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth
More informationInformation Retrieval
Information Retrieval WS 2016 / 2017 Lecture 2, Tuesday October 25 th, 2016 (Ranking, Evaluation) Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University
More informationUniversity of Waterloo Midterm Examination Solution
University of Waterloo Midterm Examination Solution Winter, 2011 1. (6 total marks) The diagram below shows an extensible hash table with four hash buckets. Each number x in the buckets represents an entry
More informationIndex construc-on. Friday, 8 April 16 1
Index construc-on Informa)onal Retrieval By Dr. Qaiser Abbas Department of Computer Science & IT, University of Sargodha, Sargodha, 40100, Pakistan qaiser.abbas@uos.edu.pk Friday, 8 April 16 1 4.1 Index
More informationAdvanced Topics UNIT 2 PERFORMANCE EVALUATIONS
Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Structure Page Nos. 2.0 Introduction 4 2. Objectives 5 2.2 Metrics for Performance Evaluation 5 2.2. Running Time 2.2.2 Speed Up 2.2.3 Efficiency 2.3 Factors
More informationAnalytical Modeling of Parallel Programs
2014 IJEDR Volume 2, Issue 1 ISSN: 2321-9939 Analytical Modeling of Parallel Programs Hardik K. Molia Master of Computer Engineering, Department of Computer Engineering Atmiya Institute of Technology &
More informationDISTRIBUTED HIGH-SPEED COMPUTING OF MULTIMEDIA DATA
DISTRIBUTED HIGH-SPEED COMPUTING OF MULTIMEDIA DATA M. GAUS, G. R. JOUBERT, O. KAO, S. RIEDEL AND S. STAPEL Technical University of Clausthal, Department of Computer Science Julius-Albert-Str. 4, 38678
More informationInformation Retrieval
Introduction to Information Retrieval Lecture 4: Index Construction 1 Plan Last lecture: Dictionary data structures Tolerant retrieval Wildcards Spell correction Soundex a-hu hy-m n-z $m mace madden mo
More informationAdvanced Databases. Lecture 1- Query Processing. Masood Niazi Torshiz Islamic Azad university- Mashhad Branch
Advanced Databases Lecture 1- Query Processing Masood Niazi Torshiz Islamic Azad university- Mashhad Branch www.mniazi.ir Overview Measures of Query Cost Selection Operation Sorting Join Operation Other
More informationAteles performance assessment report
Ateles performance assessment report Document Information Reference Number Author Contributor(s) Date Application Service Level Keywords AR-4, Version 0.1 Jose Gracia (USTUTT-HLRS) Christoph Niethammer,
More informationPathStack : A Holistic Path Join Algorithm for Path Query with Not-predicates on XML Data
PathStack : A Holistic Path Join Algorithm for Path Query with Not-predicates on XML Data Enhua Jiao, Tok Wang Ling, Chee-Yong Chan School of Computing, National University of Singapore {jiaoenhu,lingtw,chancy}@comp.nus.edu.sg
More informationUniversity of Waterloo Midterm Examination Sample Solution
1. (4 total marks) University of Waterloo Midterm Examination Sample Solution Winter, 2012 Suppose that a relational database contains the following large relation: Track(ReleaseID, TrackNum, Title, Length,
More informationChapter 13 Strong Scaling
Chapter 13 Strong Scaling Part I. Preliminaries Part II. Tightly Coupled Multicore Chapter 6. Parallel Loops Chapter 7. Parallel Loop Schedules Chapter 8. Parallel Reduction Chapter 9. Reduction Variables
More informationJoint Entity Resolution
Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute
More informationCSE 344 MAY 7 TH EXAM REVIEW
CSE 344 MAY 7 TH EXAM REVIEW EXAMINATION STATIONS Exam Wednesday 9:30-10:20 One sheet of notes, front and back Practice solutions out after class Good luck! EXAM LENGTH Production v. Verification Practice
More informationChapter 17: Parallel Databases
Chapter 17: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery Parallelism Intraoperation Parallelism Interoperation Parallelism Design of Parallel Systems Database Systems
More informationData Set Buffering. Introduction
Data Set Buffering Introduction In IBM InfoSphere DataStage job data flow, the data is moved between stages (or operators) through a data link, in the form of virtual data sets. An upstream operator will
More informationQuery Evaluation Strategies
Introduction to Search Engine Technology Term-at-a-Time and Document-at-a-Time Evaluation Ronny Lempel Yahoo! Research (Many of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa
More informationStructured Parallel Programming Patterns for Efficient Computation
Structured Parallel Programming Patterns for Efficient Computation Michael McCool Arch D. Robison James Reinders ELSEVIER AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD PARIS SAN DIEGO SAN FRANCISCO
More informationJoin algorithm costs revisited
The VLDB Journal (1996) 5: 64 84 The VLDB Journal c Springer-Verlag 1996 Join algorithm costs revisited Evan P. Harris, Kotagiri Ramamohanarao Department of Computer Science, The University of Melbourne,
More informationDesign of Parallel Algorithms. Course Introduction
+ Design of Parallel Algorithms Course Introduction + CSE 4163/6163 Parallel Algorithm Analysis & Design! Course Web Site: http://www.cse.msstate.edu/~luke/courses/fl17/cse4163! Instructor: Ed Luke! Office:
More informationMatrix Multiplication
Matrix Multiplication Material based on Chapter 10, Numerical Algorithms, of B. Wilkinson et al., PARALLEL PROGRAMMING. Techniques and Applications Using Networked Workstations and Parallel Computers c
More informationAdvances in Data Management Query Processing and Query Optimisation A.Poulovassilis
1 Advances in Data Management Query Processing and Query Optimisation A.Poulovassilis 1 General approach to the implementation of Query Processing and Query Optimisation functionalities in DBMSs 1. Parse
More informationCOMP6237 Data Mining Searching and Ranking
COMP6237 Data Mining Searching and Ranking Jonathon Hare jsh2@ecs.soton.ac.uk Note: portions of these slides are from those by ChengXiang Cheng Zhai at UIUC https://class.coursera.org/textretrieval-001
More informationMidterm Examination CSE 455 / CIS 555 Internet and Web Systems Spring 2009 Zachary Ives
Midterm Examination CSE 455 / CIS 555 Internet and Web Systems Spring 2009 Zachary Ives Name: _Solution 6 questions, 100 pts, 80 minutes 1. (20 pts) Compare Hadoop (plus HDFS) to the Chord DHT. (a) What
More informationDistributing the Derivation and Maintenance of Subset Descriptor Rules
Distributing the Derivation and Maintenance of Subset Descriptor Rules Jerome Robinson, Barry G. T. Lowden, Mohammed Al Haddad Department of Computer Science, University of Essex Colchester, Essex, CO4
More informationLecture 5: Information Retrieval using the Vector Space Model
Lecture 5: Information Retrieval using the Vector Space Model Trevor Cohn (tcohn@unimelb.edu.au) Slide credits: William Webber COMP90042, 2015, Semester 1 What we ll learn today How to take a user query
More informationTRADITIONAL search engines utilize hard disk drives
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 1.119/TC.216.268818,
More informationParallel Programming in C with MPI and OpenMP
Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 8 Matrix-vector Multiplication Chapter Objectives Review matrix-vector multiplication Propose replication of vectors Develop three
More informationDate Lesson TOPIC Homework. The Intersection of a Line with a Plane and the Intersection of Two Lines
UNIT 4 - RELATIONSHIPS BETWEEN LINES AND PLANES Date Lesson TOPIC Homework Oct. 4. 9. The Intersection of a Line with a Plane and the Intersection of Two Lines Pg. 496 # (4, 5)b, 7, 8b, 9bd, Oct. 6 4.
More informationShort Summary of DB2 V4 Through V6 Changes
IN THIS CHAPTER DB2 Version 6 Features DB2 Version 5 Features DB2 Version 4 Features Short Summary of DB2 V4 Through V6 Changes This appendix provides short checklists of features for the most recent versions
More informationQuery Processing. Solutions to Practice Exercises Query:
C H A P T E R 1 3 Query Processing Solutions to Practice Exercises 13.1 Query: Π T.branch name ((Π branch name, assets (ρ T (branch))) T.assets>S.assets (Π assets (σ (branch city = Brooklyn )(ρ S (branch)))))
More informationIndex construction CE-324: Modern Information Retrieval Sharif University of Technology
Index construction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch.
More informationChapter 5: Analytical Modelling of Parallel Programs
Chapter 5: Analytical Modelling of Parallel Programs Introduction to Parallel Computing, Second Edition By Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar Contents 1. Sources of Overhead in Parallel
More informationDesigning for Performance. Patrick Happ Raul Feitosa
Designing for Performance Patrick Happ Raul Feitosa Objective In this section we examine the most common approach to assessing processor and computer system performance W. Stallings Designing for Performance
More information1.1 - Basics of Query Processing in SQL Server
Department of Computer Science and Engineering 2013/2014 Database Administration and Tuning Lab 3 2nd semester In this lab class, we will address query processing. For students with a particular interest
More informationIncreasing Database Performance through Optimizing Structure Query Language Join Statement
Journal of Computer Science 6 (5): 585-590, 2010 ISSN 1549-3636 2010 Science Publications Increasing Database Performance through Optimizing Structure Query Language Join Statement 1 Ossama K. Muslih and
More information