On Scalable Information Retrieval Systems
1 On Scalable Information Retrieval Systems Ophir Frieder 1
2 Scalable Search Structured Semi-structured Text, video, etc. Answer Engine 2
3 Scalable Information Systems: Characteristics
- Ingest data from multiple sources: duplicate document detection
- Process multiple types of data sources: structured & unstructured data integration (SIRE)
- Use scalable (parallel) technology systems: Parallel SIRE
- Integrate retrieved data to yield answers: IIT Mediator
4 Duplicate Document Detection
- The union of data obtained from multiple sources often contains duplicates.
- Duplicates affect both retrieval effectiveness and retrieval efficiency.
- Duplicate detection is either syntactic or semantic; semantic detection is far more challenging.
5 What is a Duplicate Document?
Semantic similarity: if a document contains roughly the same semantic content, it is a duplicate whether or not it is a precise syntactic match.
6 Duplicate Detection Techniques
Main duplicate detection approaches:
- Hash-based approaches (syntactic)
- Information retrieval techniques
- Resemblance: $r(A, B) = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|}$, where $S(D)$ is the set of shingles of document $D$
7 Duplicate Detection with IR
- Using documents as queries, rank all documents in the collection with similar terms.
- Documents with equivalent weights are duplicates.
- For each query term, the corresponding posting list entries must be retrieved; for large collections, the I/O costs are prohibitive.
8 Duplicate Detection with Resemblance
- Calculate the resemblance of each document to every other document with matching features.
- Divide the document into shingles (X terms each); each shingle is used to create a unique hash.
- Calculate the resemblance based on hashes rather than terms.
- $N^2$ comparison approaches are not feasible for large collections.
- Optimizations filter which shingles to use, e.g., every 25th shingle or a combination of multiple shingles.
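The resemblance computation described on slides 6 and 8 can be sketched in a few lines. This is a minimal illustration, not the evaluated implementation: the function names, the whitespace tokenizer, and the default shingle size of 3 are assumptions, and the shingles are compared directly as tuples rather than through hashes.

```python
def shingles(text, size=3):
    """Return the set of word-level shingles (runs of `size` consecutive tokens)."""
    tokens = text.lower().split()
    if len(tokens) < size:
        # A short document yields a single shingle holding all of its tokens.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def resemblance(a, b, size=3):
    """Overlap of the two shingle sets: |S(A) & S(B)| / |S(A) | S(B)|."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)
```

Comparing every pair of documents this way is the quadratic step the slide warns about, which is why the optimized variant keeps only a sample of the shingles.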
9 Issues with Prior Approaches
- Hash techniques: not resilient to small changes in document representation.
- IR techniques: slow for large collections.
- Resemblance: documents are clustered into multiple clusters due to partitioning, so duplicate classification is difficult.
10 Combined (I-Match) Algorithm
1. Tokenize the document.
2. Create a list of unique tokens.
3. Filter tokens (what to filter?).
4. Create a unique hash of the remaining tokens.
5. Search the collection for duplicate hashes.
11 Filtration Based On Collection Statistics
1. Sort terms according to $idf = \log(N / n)$, where $N$ = number of documents in the collection and $n$ = number of documents the term occurs in.
2. Filter unwanted components.
Bands shown: Hi & Low 25%, Low 25%, High 25%, Mid 50%.
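The I-Match steps on slide 10, combined with the idf-based filtration on slide 11, might look roughly like the sketch below. The tokenizer, the use of SHA-1, the natural logarithm, and the `min_idf`/`max_idf` band parameters are all assumptions made for illustration; the slides do not say which band or hash function was used.

```python
import hashlib
import math
import re


def tokenize(text):
    """Lower-case alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_table(docs):
    """Map each term to idf = log(N / n) over the collection."""
    df = {}
    for doc in docs:
        for term in set(tokenize(doc)):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(len(docs) / n) for term, n in df.items()}


def imatch_signature(text, idf, min_idf, max_idf):
    """Hash the sorted unique tokens whose idf falls inside the kept band."""
    kept = sorted({t for t in tokenize(text)
                   if min_idf <= idf.get(t, 0.0) <= max_idf})
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


def find_duplicates(docs, min_idf, max_idf):
    """Group document positions that share an I-Match signature."""
    idf = idf_table(docs)
    groups = {}
    for position, doc in enumerate(docs):
        signature = imatch_signature(doc, idf, min_idf, max_idf)
        groups.setdefault(signature, []).append(position)
    return [group for group in groups.values() if len(group) > 1]
```

Because each document contributes a single hash, duplicates are found by grouping on that hash instead of comparing document pairs.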
12 LA Times Collection
Create random duplicates to test effectiveness. For every i-th word, pick a random number from 1 to 10. If the number is higher than the random threshold (call it alpha), pick a number from 1 to 3. If it is 1, remove the word. If it is 2, flip the word with the word at position i+1. If it is 3, add a word (randomly picked from the term list). Insert the duplicate into the collection.
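One possible reading of this duplicate-generation procedure is sketched below. The function name, the `step` parameter (how "every i-th word" is interpreted), and the `seed` argument are assumptions added for illustration.

```python
import random


def make_near_duplicate(words, vocabulary, alpha, step=1, seed=None):
    """Return a perturbed copy of `words`; a higher alpha means fewer changes."""
    rng = random.Random(seed)
    src = list(words)          # work on a copy so the input is not modified
    out = []
    for i in range(len(src)):
        if i % step == 0 and rng.randint(1, 10) > alpha:
            action = rng.randint(1, 3)
            if action == 1:
                continue                                   # remove the word
            if action == 2 and i + 1 < len(src):
                src[i], src[i + 1] = src[i + 1], src[i]    # flip with word i+1
            elif action == 3:
                out.append(rng.choice(vocabulary))         # add a random term
        out.append(src[i])
    return out
```

With `alpha=10` no draw can exceed the threshold, so the copy is identical to the input; lowering alpha increases the share of perturbed positions.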
13 Document Clusters Formed
[Table: clusters formed per LA Times test document, with columns Document, Resemblance, Resemblance-Opt, and Combined, plus an Average row; values not recoverable.]
I-Match did not produce any false positives, while Resemblance did.
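14 Processing Time 2GB
[Table columns: Algorithm, Mean Time, Std Deviation, Median Time. Rows: Resemblance, Resemblance-Opt, I-Match, Syntactic. Only the Syntactic row is recoverable: mean time 65, std deviation N/A, median time N/A.]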
15 Scalable Information Systems: Characteristics
- Ingest data from multiple sources: duplicate document detection
- Process multiple types of data sources: structured & unstructured data integration (SIRE)
- Use scalable (parallel) technology systems: Parallel SIRE
- Integrate retrieved data to yield answers: IIT Mediator
16 SIRE Goals
- Integrate structured and semi-structured data using a framework that also integrates unstructured data.
- Improve accuracy of retrieved results.
- Support scalability: data volume and retrieval speed.
- Support legacy data.
17 Portability
The information retrieval prototype was implemented on the following relational platforms:
- NCR Teradata DBC-machines
- Microsoft SQL Server
- Sybase
- Oracle
- IBM DB2 and SQL/DS
18 Relational Inverted Index
All inverted index entries have the form <term> <list of documents>, e.g., vehicle: D1, D3, D4, which results in:
  term     docid
  vehicle  D1
  vehicle  D3
  vehicle  D4
19 Text Retrieval Conference (TREC) Sample Document
<DOC>
<DOCNO> AP </DOCNO>
<FILEID>AP-NR EST</FILEID>
<FIRST>u i BC-Japan-Stocks </FIRST>
<SECOND>BC-Japan-Stocks,0026</SECOND>
<HEAD>Stocks Up In Tokyo</HEAD>
<DATELINE>TOKYO (AP) </DATELINE>
<TEXT> The Nikkei Stock Average closed at 29, points up points on the Tokyo Stock Exchange Wednesday. </TEXT>
</DOC>
20 Relational Document Representation (Term Processing)
DOCUMENT
  docid  docname  headline            dateline
  28     AP       Stocks Up In Tokyo  TOKYO (AP)
INDEX
  docid  termcnt  term
  28     1        nikkei
  28     2        stock
  28     1        average
  28     1        closed
  28     2        points
  28     1        up
  28     1        tokyo
  28     1        exchange
  28     1        wednesday
TERM (columns term, df, idf; df and idf values not recoverable)
  average, closed, exchange, nikkei, points, stock, tokyo, up, wednesday
21 Simplistic Models: Keyword and Boolean Searches 21
22 Relational Approach: Keyword Search Techniques
Keyword search:
  select i.docid
  from INDEX i, QUERY q
  where i.term = q.term
Keyword search with stop word list:
  select i.docid
  from INDEX i, QUERY q
  where i.term = q.term
    and i.term not in (select s.term from STOPLIST s)
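One way to write the stop-word filter so that it holds for stop lists of any size is an anti-join. The sketch below runs it in an in-memory SQLite database; the table names `idx`, `qry`, and `stoplist` are stand-ins chosen here (INDEX is a reserved word in SQLite), and the `termcnt` column is omitted for brevity.

```python
import sqlite3


def keyword_search(postings, query_terms, stop_terms):
    """Return sorted docids matching any query term that is not a stop term."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE idx (docid TEXT, term TEXT);
        CREATE TABLE qry (term TEXT);
        CREATE TABLE stoplist (term TEXT);
    """)
    conn.executemany("INSERT INTO idx VALUES (?, ?)", postings)
    conn.executemany("INSERT INTO qry VALUES (?)", [(t,) for t in query_terms])
    conn.executemany("INSERT INTO stoplist VALUES (?)", [(t,) for t in stop_terms])
    rows = conn.execute("""
        SELECT DISTINCT i.docid
        FROM idx i, qry q
        WHERE i.term = q.term
          AND NOT EXISTS (SELECT 1 FROM stoplist s WHERE s.term = i.term)
        ORDER BY i.docid
    """).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

Joining STOPLIST directly with an inequality predicate would let a stop term through whenever the stop list holds any other term, which is why the anti-join form is used here.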
23 Relational Approach: Boolean Search Techniques
OR query, written as a union:
  select docid from INDEX where term = term1
  union
  select docid from INDEX where term = term2
  union
  select docid from INDEX where term = term3
  ...
  union
  select docid from INDEX where term = termn
OR query, written with OR predicates:
  select docid
  from INDEX
  where term = term1 OR
        term = term2 OR
        term = term3 OR
        ...
        term = termn
24 Relational Approach: Boolean Search Techniques
AND query, written as an intersection:
  select docid from INDEX where term = term1
  intersect
  select docid from INDEX where term = term2
  intersect
  select docid from INDEX where term = term3
  ...
  intersect
  select docid from INDEX where term = termn
AND query, written as a self-join (one copy of INDEX per query term):
  select a.docid
  from INDEX a, INDEX b, INDEX c, ... INDEX n
  where a.term = term1 AND
        b.term = term2 AND
        c.term = term3 AND
        ...
        n.term = termn AND
        a.docid = b.docid AND
        b.docid = c.docid AND
        ...
        (n-1).docid = n.docid
25 Fixed Join-Count AND Queries
Find all documents that contain all of the terms found in the QUERY relation:
  select i.docid
  from INDEX i, QUERY q
  where i.term = q.term
  group by i.docid
  having count(distinct i.term) = (select count(*) from QUERY)
26 TAND Queries
Find all documents that contain at least X of the terms found in the QUERY relation:
  select i.docid
  from INDEX i, QUERY q
  where i.term = q.term
  group by i.docid
  having count(distinct i.term) >= X
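The fixed join-count AND query and the TAND query can be tried against a small in-memory SQLite database. In this sketch the table names `idx` and `qry` are stand-ins for INDEX and QUERY (INDEX is reserved in SQLite), and the helper names are hypothetical.

```python
import sqlite3


def _load(postings, query_terms):
    """Create the idx and qry relations in a fresh in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE idx (docid TEXT, term TEXT)")
    conn.execute("CREATE TABLE qry (term TEXT)")
    conn.executemany("INSERT INTO idx VALUES (?, ?)", postings)
    conn.executemany("INSERT INTO qry VALUES (?)", [(t,) for t in query_terms])
    return conn


def tand_query(postings, query_terms, at_least):
    """Documents containing at least `at_least` distinct query terms."""
    conn = _load(postings, query_terms)
    rows = conn.execute("""
        SELECT i.docid
        FROM idx i, qry q
        WHERE i.term = q.term
        GROUP BY i.docid
        HAVING COUNT(DISTINCT i.term) >= ?
        ORDER BY i.docid
    """, (at_least,)).fetchall()
    conn.close()
    return [row[0] for row in rows]


def and_query(postings, query_terms):
    """Documents containing every query term; the join count stays fixed."""
    conn = _load(postings, query_terms)
    rows = conn.execute("""
        SELECT i.docid
        FROM idx i, qry q
        WHERE i.term = q.term
        GROUP BY i.docid
        HAVING COUNT(DISTINCT i.term) = (SELECT COUNT(*) FROM qry)
        ORDER BY i.docid
    """).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

Unlike the self-join form, the number of joins here does not grow with the number of query terms; only the contents of the query relation change.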
27 Relevance Ranking: Vector Space & Probabilistic Models 27
28 Vector Space Model
- Term Frequency ($tf_{ik}$): number of occurrences of term $t_k$ in document $i$
- Document Frequency ($df_j$): number of documents which contain $t_j$
- Inverse Document Frequency ($idf_j$): $\log(d / df_j)$, where $d$ is the total number of documents
Notes:
- idf is a measure of the uniqueness of a term across the collection.
- tf is the frequency of a term in a given document.
29 Vector Space Model: Sample Relational Query
List all documents in the order of their similarity coefficient, where the coefficient is computed using the dot product.
  SELECT d.docid, d.docname, SUM(i.termcnt * t.idf * q.termcnt * t.idf)
  FROM DOCUMENT d, QUERY q, INDEX i, TERM t
  WHERE q.term = i.term AND q.term = t.term AND d.docid = i.docid
  GROUP BY d.docid, d.docname
  ORDER BY 3 DESC
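The dot-product ranking query can be exercised end to end with SQLite. This sketch makes several assumptions: the tables are renamed `document`, `idx`, `term`, and `qry`; idf is computed in Python as a base-10 logarithm (the slides do not fix the base); and `docname` is left out.

```python
import math
import sqlite3


def rank_by_dot_product(docs, query):
    """Rank documents by the inner product of tf*idf query and document weights.

    docs: {docid: {term: count}}, query: {term: count}.
    Returns [(docid, score), ...] with the best match first.
    """
    df = {}
    for counts in docs.values():
        for term in counts:
            df[term] = df.get(term, 0) + 1
    d = len(docs)

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE document (docid INTEGER PRIMARY KEY);
        CREATE TABLE idx (docid INTEGER, term TEXT, termcnt INTEGER);
        CREATE TABLE term (term TEXT PRIMARY KEY, df INTEGER, idf REAL);
        CREATE TABLE qry (term TEXT, termcnt INTEGER);
    """)
    conn.executemany("INSERT INTO document VALUES (?)", [(k,) for k in docs])
    conn.executemany(
        "INSERT INTO idx VALUES (?, ?, ?)",
        [(k, t, c) for k, counts in docs.items() for t, c in counts.items()],
    )
    conn.executemany(
        "INSERT INTO term VALUES (?, ?, ?)",
        [(t, n, math.log10(d / n)) for t, n in df.items()],  # assumed log base 10
    )
    conn.executemany("INSERT INTO qry VALUES (?, ?)", list(query.items()))
    rows = conn.execute("""
        SELECT d.docid, SUM(i.termcnt * t.idf * q.termcnt * t.idf) AS score
        FROM document d, qry q, idx i, term t
        WHERE q.term = i.term AND q.term = t.term AND d.docid = i.docid
        GROUP BY d.docid
        ORDER BY score DESC
    """).fetchall()
    conn.close()
    return rows
```

Documents that share no term with the query never join and therefore do not appear in the ranking at all.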
30 Similarity Coefficients
Several similarity coefficients based on the query vector X and the document vector Y are defined:
- Inner product: $\sum_{i=1}^{t} x_i y_i$
- Cosine coefficient: $\frac{\sum_{i=1}^{t} x_i y_i}{\sqrt{\sum_{i=1}^{t} x_i^2 \, \sum_{i=1}^{t} y_i^2}}$
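Both coefficients are short to compute over sparse vectors. The sketch below assumes each vector is a dictionary from term to weight; the function names are illustrative.

```python
import math


def inner_product(x, y):
    """Sum of x_i * y_i over the terms the two sparse vectors share."""
    return sum(weight * y.get(term, 0.0) for term, weight in x.items())


def cosine(x, y):
    """Inner product normalized by the lengths of both vectors."""
    norm = math.sqrt(sum(w * w for w in x.values()) * sum(w * w for w in y.values()))
    return inner_product(x, y) / norm if norm else 0.0
```

The cosine form removes the advantage that long documents get under the plain inner product, since both vectors are scaled to unit length.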
31 SQL for Probabilistic Similarity Measure
$\sum_{i=1}^{num\_terms} \log\left(\frac{numdocs - df_i + 0.5}{df_i + 0.5}\right) \cdot \frac{2.2 \, tf_{id}}{0.3 + 0.75 \, (doclength / avgdoclength) + tf_{id}} \cdot qtf_i$
  SELECT d.docid, d.docname,
         SUM( LOG(((NumDocs - t.df) + 0.5) / (t.df + 0.5))
              * ((2.2 * i.tf) / (.3 + ((.75 * d.doclen) / avgdoclen) + i.tf))
              * q.termcnt )
  FROM INDEX i, TERM t, DOCUMENT d, QUERY q
  WHERE i.term = t.term AND i.docid = d.docid AND t.term = q.term
  GROUP BY d.docid, d.docname
  ORDER BY 3 DESC;
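The per-document score that this SQL accumulates can be written as a plain function, which makes the three factors easier to inspect. The sketch copies the constants 2.2, 0.3, and 0.75 from the SQL; the function and parameter names are assumptions, and the natural logarithm is assumed for LOG.

```python
import math


def probabilistic_score(query_tf, doc_tf, df, num_docs, doc_len, avg_doc_len):
    """Sum, over matching query terms, of idf part * tf part * query tf."""
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # the SQL join drops terms the document does not contain
        idf_part = math.log((num_docs - df[term] + 0.5) / (df[term] + 0.5))
        tf_part = (2.2 * tf) / (0.3 + 0.75 * doc_len / avg_doc_len + tf)
        score += idf_part * tf_part * qtf
    return score
```

The length term in the denominator dampens the contribution of term frequency in documents that are longer than average.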
32 Relational Document Representation (Term Processing)
[Repeats the DOCUMENT, INDEX, and TERM relations for document 28 shown on slide 20.]
33 Relational Query Representation (Term Processing)
ORIGINAL QUERY: nikkei stock exchange american stock exchange
QUERY
  term      termcnt
  nikkei    1
  stock     2
  exchange  2
  american  1
SQL (Query Weight * Document Weight):
  SELECT d.docid, d.docname, SUM(a.termcnt * c.idf * b.termcnt * c.idf)
  FROM QUERY a, INDEX b, TERM c, DOCUMENT d
  WHERE a.term = b.term AND a.term = c.term AND b.docid = d.docid
  GROUP BY d.docid, docname
  ORDER BY 3 DESC
34 Sample Term Query Result (Inner/Dot Product)
[Table columns: Term, Q-Termcnt, Q-Weight, D-Termcnt, D-Weight, Q-Wt * D-Wt. Rows: nikkei, stock, exchange, american, followed by the Similarity Coefficient total. Numeric values not recoverable.]
35 Simple Phrase Parsing
Simple phrase parser with the following rules:
- Phrases do not include stop terms.
- Phrases do not span across punctuation.
Example: The Nikkei Stock Average closed at 29, points up points, on the Tokyo Stock Exchange Wednesday.
Phrases: nikkei stock, stock average, average closed, points up, tokyo stock, stock exchange, exchange wednesday
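The two parsing rules can be sketched as a single pass over the tokens. The stop list below is hypothetical, and treating numbers as phrase boundaries is an assumption inferred from the slide's phrase list, in which no phrase contains a number.

```python
import re

STOP_TERMS = {"the", "at", "on", "a", "an", "of", "in"}  # hypothetical stop list

# Numbers (with embedded commas or periods), words, or single punctuation marks.
_TOKEN = re.compile(r"\d+(?:[.,]\d+)*|[A-Za-z]+|[^\sA-Za-z\d]")


def phrases(text, stop_terms=STOP_TERMS):
    """Return adjacent two-word phrases that avoid stop terms, numbers, and punctuation."""
    found, previous = [], None
    for token in _TOKEN.findall(text):
        word = token.lower()
        if not word.isalpha() or word in stop_terms:
            previous = None  # a stop term, number, or punctuation mark ends the run
            continue
        if previous is not None:
            found.append(f"{previous} {word}")
        previous = word
    return found
```

For instance, with placeholder figures standing in for the values lost from the example sentence, `phrases("The Nikkei Stock Average closed at 29,000.50 points up 120.25 points on the Tokyo Stock Exchange Wednesday.")` yields the seven phrases listed on the slide.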
36 Relational Document Representation (Phrase Processing)
DOCUMENT
  docid  docname  headline            dateline
  28     AP       Stocks Up In Tokyo  TOKYO (AP)
INDEX
  docid  termcnt  phrase
  28     1        nikkei stock
  28     1        stock average
  28     1        average closed
  28     1        points up
  28     1        tokyo stock
  28     1        stock exchange
  28     1        exchange wednesday
PHRASE (columns phrase, df, idf; df and idf values not recoverable)
  average closed, exchange wednesday, nikkei stock, points up, stock average, stock exchange, tokyo stock
37 Enhancing Accuracy With Relevance Feedback 37
38 Relevance Feedback
The modification of the search process so as to improve accuracy by incorporating information obtained from prior relevance judgments.
[Diagram: Q0 -> database search -> top relevant documents -> new terms; Q0 plus the new terms gives Q1 -> database search -> matching documents.]
39 Relevance Feedback Example
Q0: tunnel under English Channel
Top-ranked document: "The tunnel under the English channel is often called a Chunnel"
Q1: tunnel under English Channel Chunnel
[Diagrams 1 and 2: the document collection with the retrieved, relevant, and relevant-retrieved document sets for each query; counts not recoverable.]
40 Feedback Mechanisms
- Manual: relevant documents are identified manually, and new terms are selected either manually or automatically.
- Automatic: relevant documents are identified automatically by assuming the top-ranked documents are relevant, and new terms are selected automatically.
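Automatic feedback can be sketched as follows: treat the top-ranked documents as relevant, score the terms they contain, and append the best new terms to the query. Scoring candidates by summed idf is an assumption for illustration only; the slides list term selection as a tunable parameter and do not fix a method.

```python
def expand_query(query_terms, ranked_docs, idf, top_docs=2, new_terms=1):
    """Return the query extended with the best-scoring terms from the top documents.

    ranked_docs: token lists, best match first. idf: {term: idf weight}.
    """
    scores = {}
    for doc in ranked_docs[:top_docs]:
        for term in set(doc):
            if term not in query_terms:
                scores[term] = scores.get(term, 0.0) + idf.get(term, 0.0)
    # Highest score first; ties are broken alphabetically for determinism.
    best = sorted(scores, key=lambda t: (-scores[t], t))[:new_terms]
    return list(query_terms) + best
```

Applied to the Chunnel example, a rare term that appears in the top-ranked document but not in the original query is the one that gets added.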
41 Relevance Feedback Parameters
Various techniques can be used to improve the relevance feedback process:
- Number of top-ranked documents
- Number of feedback terms
- Feedback term selection techniques
- Term weighting
- Document clustering
- Relevance feedback thresholding
- Term frequency cutoff points
- Query expansion using a thesaurus
42 Relevance Feedback Evaluation
[Figure: improvement from relevance feedback with nidf weights. Two curves, "nidf, no feedback" and "nidf, feedback 10 terms", plotted against recall at 0.00, 0.20, 0.40, 0.60, 0.80, and 1.00.]
43 Comparative TREC-8 Results
[Table columns: Run, IIT Avg. Precision, # Above Median, # At Median, # Below Median, # Best, # Worst. Runs: iit00t, iit00td, iit00tde, iit00m. Values not recoverable.]
44 Performance Optimizations: Query Thresholds & Clustered Indexes 44
45 Query Thresholds
- Consider a query with terms $t_1, t_2, t_3, \ldots, t_n$.
- Sort the terms by their frequency across the collection (least frequent terms first).
- Define the threshold as the percentage of the original query's terms that are kept in a newly created reduced query.
[Figure: Term 1 through Term 10 in sorted order, with cut-offs marked at Threshold = 20, Threshold = 50, and Threshold = 80.]
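The threshold reduction is a sort followed by a cut. In this sketch the function name is hypothetical, and rounding the cut-off up to a whole term is an assumption, since the slide only shows thresholds that divide a ten-term query evenly.

```python
def reduce_query(terms, collection_freq, threshold_pct):
    """Keep the rarest `threshold_pct` percent of the query terms."""
    # Least frequent terms first; ties are broken alphabetically.
    ordered = sorted(terms, key=lambda t: (collection_freq.get(t, 0), t))
    keep = (len(ordered) * threshold_pct + 99) // 100  # integer ceiling
    return ordered[:keep]
```

With ten terms, thresholds of 20, 50, and 80 keep the 2, 5, and 8 rarest terms respectively, so the most frequent terms, which have the longest posting lists, are the first to be dropped.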
46 Relevant Retrieved as a Function of Query Thresholds
[Figure: relevant retrieved versus query threshold (percent); data points not recoverable.]
47 Run Time as a Function of Query Thresholds
[Figure: CPU versus query threshold (percent); data labels not recoverable.]
48 Relevant New Documents Per CPU Cycle
[Table columns: Threshold, Relevant Retrieved, CPU Cycles, New Relevant Docs per Cycle. Values not recoverable.]
49 Caveat: Logical Design versus Physical Implementation
While the logical design shown replicates the document identifier, the physical implementation actually uses clustered tables. That is, attribute values that are logically repeated many times are physically clustered by attribute value, which eliminates the replication by storing only one copy of each unique attribute value (note the clustered tables in Oracle implementations). The I/O to retrieve a posting list is then a grouped block read rather than retrieval across distributed storage.
50 Clustered Indexes: Posting List Processing
TRADITIONAL (columns term, docid, tf; one row per posting):
  stock  1   1
  stock  28  2
  stock  ... (seven further postings; values not recoverable)
CLUSTERED (columns term, docid, tf; the term is stored once):
  stock  { (1 1), (28 2), ... }
51 Clustered Indexes: Relevance Feedback Processing
TRADITIONAL (columns docid, termcnt, term; one row per term):
  28  1  nikkei
  28  2  stock
  28  1  average
  28  1  closed
  28  2  points
  28  1  up
  28  1  tokyo
  28  1  exchange
  28  1  wednesday
CLUSTERED (columns docid, termcnt, term; the docid is stored once):
  28  { (1 nikkei), (2 stock), (1 average), (1 closed), (2 points), (1 up), (1 tokyo), (1 exchange), (1 wednesday) }
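The difference between the traditional and clustered layouts can be mimicked in memory by grouping rows on the repeated attribute. This is only an analogy for the storage layout, not how a DBMS implements clustered tables; the function name is hypothetical.

```python
from collections import defaultdict


def cluster_postings(rows):
    """Group (term, docid, tf) rows so that each term is stored once."""
    clustered = defaultdict(list)
    for term, docid, tf in rows:
        clustered[term].append((docid, tf))
    return dict(clustered)
```

After grouping, fetching the posting list for a term is a single lookup that returns all of its (docid, tf) pairs together, which mirrors the grouped block read described on slide 49.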
52 Technology Transfer
Concern: an academic solution without a public, commercial-world problem.
Resolution: the National Institutes of Health's National Center for Complementary and Alternative Medicine Citation Index.
53 NIH-NCCAM Application
- Short query lengths necessitate query expansion.
- Advanced search techniques are needed, since they are used by roughly 30% of the users.
- Efficient processing is critical; filtration is an option.
- Scalability is not currently a concern, but future needs may dictate it.
54 NCCAM - Search Interface 54
55 NCCAM Results Page Interface 55
56 System Architecture Internet Servlet Engine User Oracle DBMS HTTP Server Sun ES
57 Servlet Architecture HTTP Search Request Generic Search Servlet Dynamic HTML MRU Cache Query Engine RDBMS Connection Pool RDBMS 57
58 Scalable Information Systems: Characteristics
- Ingest data from multiple sources: duplicate document detection
- Process multiple types of data sources: structured & unstructured data integration (SIRE)
- Use scalable (parallel) technology systems: Parallel SIRE
- Integrate retrieved data to yield answers: IIT Mediator
59 Scalability via Parallelism: Not My Problem: It's the Database Vendors' Problem
60 Parallel Information Retrieval
- Most parallel information retrieval systems require custom hardware and software, which reduces portability across systems.
- Parallelism in a relational information retrieval system is a function of the database management system and does not require custom hardware or software.
61-64 TPC-C Benchmarks
TPC-C benchmark results, Revision 3.X, valid as of 7/5/2000 9:46:09 AM.
[Flattened results table spanning four slides; the tpmC, $/tpmC, and total system cost values are not recoverable. Column headings as transcribed: Company, System, Spec. Rev, tpmC, $/tpmC, Total Sys., Currency, Database, Operating [system], TP Monitor, Server CPU, # Server CPUs, Cluster, # Front Ends, Date Submitted, Availability.]
Vendors listed: ALR, Acer, Amdahl, Bull, Compaq, DG, Dell, Fujitsu Si[...], Fujitsu/ICL, HP, IBM, Intergraph, Itautec, Motorola, NEC, SGI, Sequent, Sun, Sybase, Tandem, Unisys.
Databases listed: Microsoft, Oracle, Sybase, Informix, IBM DB2, Fujitsu/ICL.
65 Experimental Platform: NCR Teradata DBC/1012
[Architecture diagram. Components shown: workstations, client computer (client-resident software), channels, LAN, COP, IFPs, PE, YNET, AMPs with disks, AP with UNIX disk and LAN.]
- IFP: Interface Processor
- COP: Communications Processor
- Ynet: Interprocessor Bus
- AMP: Access Module Processor
- AP: Application Processor
- PE: Parsing Engine
66 Sample Relevance Ranking Query
  SELECT c.qryid, b.docid,
         SUM(((1+LOG(a.termcnt))/((b.logavgtf)* ( (.20*b.disterm))))*(c.nidf*((1+LOG(c.termcnt))/(d.logavgtf))))
  FROM trec6$d5$idx a, trec6$d5$docavgtf b, trec6$q6$qrynidf c, trec6$q6$qryavgtf d
  WHERE a.docid = b.docid AND c.qryid = d.qryid AND a.term = c.term AND c.qryid = 301
  GROUP BY c.qryid, b.docid
  UNION
  SELECT c.qryid, b.docid,
         SUM(((1+LOG(a.termcnt))/((b.logavgtf)* ( (.20*b.disterm))))*(c.nidf*((1+LOG(c.termcnt))/(d.logavgtf))))
  FROM trec6$d4$idx a, trec6$d4$docavgtf b, trec6$q6$qrynidf c, trec6$q6$qryavgtf d
  WHERE a.docid = b.docid AND c.qryid = d.qryid AND a.term = c.term AND c.qryid = 301
  GROUP BY c.qryid, b.docid
  ORDER BY 3 DESC;
67 Breakdown of DBC/1012 Processing Steps
- Step 1: A read lock is placed on tables trec6$d5$idx, trec6$d5$docavgtf, trec6$d4$idx, and trec6$d4$docavgtf.
- Step 2: A single processor is used to select and join rows via a merge join from trec6$q6$qrynidf and trec6$q6$qryavgtf where the value of qryid = 301. The results are stored on spool 3.
- Step 3: An all-processor join is used to select and join rows via a row hash match scan from trec6$d5$idx and spool 3 where the values of the term attribute match. The results are sorted and stored on spool 4.
- Step 4: An all-processor join is used to select and join rows via a row hash match scan from trec6$d5$docavgtf and spool 4 where the values of the docid attribute match. The results are stored on spool 2, which is built locally on each processor.
- Step 5: The SUM value for the aggregate function is calculated from the data on spool 2, and the results are stored on spool 5.
(The next two steps, 6a and 6b, are executed in parallel.)
- Step 6a: The data from spool 5 is retrieved and distributed via a hash code to spool 1, which encompasses all processors.
- Step 6b: An all-processor join is used to select and join rows via a row hash match scan from trec6$d4$idx and spool 3 where the values of the term attribute match. The results are sorted and stored on spool 9.
- Step 7: An all-processor join is used to select and join rows via a row hash match scan from trec6$d4$docavgtf and spool 9 where the values of the docid attribute match. The results are stored on spool 7, which is built locally on each processor.
- Step 8: The SUM value for the aggregate function is calculated from the data on spool 7, and the results are stored on spool 10.
- Step 9: The data from spool 10 is retrieved and distributed via a hash code to spool 1, which encompasses all processors. A sort is then done to remove duplicates from the data on spool 1.
- Step 10: An END TRANSACTION step is sent to all processors involved, and the contents of spool 1 are sent back to the user.
68 Parallel Performance
Parallel efficiency measures how efficiently the processors are distributing the workload:
$PE = \frac{avg\_cpu}{max\_cpu}$
where avg_cpu = average CPU time across all processors and max_cpu = maximum CPU time across all processors.
[Table columns: configuration, average CPU time per processor, maximum CPU time per processor, data storage imbalance across processors, parallel efficiency. CPU times and imbalance values not recoverable.]
  Configuration                                            Parallel Efficiency
  4 processors, disk 2 only, primary index on term         84.9%
  4 processors, disk 2 and [...], primary index on term    86.9%
  24 processors, disk 2 only, primary index on term        40.6%
  24 processors, disk 2 and [...], primary index on term   44.2%
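The parallel efficiency ratio is simple to compute from per-processor CPU times. The sketch below uses a hypothetical function name and made-up CPU times purely to show the arithmetic.

```python
def parallel_efficiency(cpu_times):
    """Average CPU time divided by the maximum CPU time across processors."""
    return (sum(cpu_times) / len(cpu_times)) / max(cpu_times)
```

A perfectly balanced run, such as `parallel_efficiency([10, 10, 10, 10])`, gives 1.0, while one overloaded processor, as in `parallel_efficiency([1, 1, 1, 5])`, pulls the value down to 0.4 because every other processor waits on the slowest one.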
69 Hashing Algorithm
Term hashing, 4 processors:
  A, E, I, M, Q, U, Y -> Proc #1
  B, F, J, N, R, V, Z -> Proc #2
  C, G, K, O, S, W    -> Proc #3
  D, H, L, P, T, X    -> Proc #4
Term hashing, 24 processors:
  A    -> Proc #1
  B    -> Proc #2
  C    -> Proc #3
  ...
  V    -> Proc #22
  W, X -> Proc #23
  Y, Z -> Proc #24
70 Term Distribution
[Figure: distribution of terms based on starting letter; number of terms plotted against starting letter, a through z.]
71 Parallel Performance
Parallel efficiency before and after balancing data storage across 24 processors.
[Table columns: configuration, average CPU time per processor, maximum CPU time per processor, data storage imbalance across processors, parallel efficiency. CPU times and imbalance values not recoverable.]
  Configuration                                                      Parallel Efficiency
  24 processors, disk 2 only, primary index on term                  40.6%
  24 processors, disk 2 and [...], primary index on term             44.2%
  24 processors, disk 2 only, primary index on docid and term        91.6%
  24 processors, disk 2 and [...], primary index on docid and term   93.8%
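The effect of the primary index choice on load balance can be illustrated with a toy partitioner. The letter-based function follows the 4-processor layout from slide 69; the `(docid, term)` variant uses CRC-32 purely as a stand-in, since the slides do not describe the DBMS's actual hash function. All names here are hypothetical.

```python
import zlib


def letter_processor(term, processors=4):
    """1-based processor for a term under the first-letter layout of slide 69."""
    return (ord(term[0].lower()) - ord("a")) % processors + 1


def hashed_processor(docid, term, processors):
    """1-based processor chosen from a hash of the (docid, term) pair."""
    key = f"{docid}:{term}".encode("utf-8")
    return zlib.crc32(key) % processors + 1


def loads(assignments, processors):
    """Count how many rows land on each processor."""
    counts = [0] * processors
    for processor in assignments:
        counts[processor - 1] += 1
    return counts
```

Every posting for a frequent term such as "stock" lands on one processor when rows are placed by term alone, whereas placing rows by `(docid, term)` spreads the same posting list across processors.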
72 Scalable Information Systems: Characteristics
- Ingest data from multiple sources: duplicate document detection
- Process multiple types of data sources: structured & unstructured data integration (SIRE)
- Use scalable (parallel) technology systems: Parallel SIRE
- Integrate retrieved data to yield answers: IIT Mediator
73 Current Enterprise Portals 73
74 Next Generation Search Engines 74
79 IIT Production Mediator Logs
- 271 queries
- 142 with user feedback
Feedback breakdown: Happy 38%, Unhappy 23%, Satisfied 13%, Ok 13%, Dissatisfied 13%.
80 Technology Transfer
Industrial:
- America Online
- BIT Systems
- Harris Corporation
- IITRI
- NCR
- Unnamed others (proprietary)
- Assorted dot-dead companies (hopefully not due to our technology!)
Government:
- National Institutes of Health
- Additional others
81 Information Retrieval Laboratory
Faculty members: O. Frieder, D. Grossman, N. Goharian, X. Li, P. Wan
Senior affiliates: A. Chowdhury (AOL), D. Holmes (NCR), M. C. McCabe (US Gov.)
Students: Steven Beitzel, Rebecca Cathey, Ankit Jain, Eric Jensen, Vincent Nguyen, Angelo Pilotto, Michael Saelee, Chih-Wei Yi, Wang Yu
82 References
- D. A. Grossman, O. Frieder, D. O. Holmes, and D. C. Roberts, "Integrating Structured Data and Text: A Relational Approach," Journal of the American Society of Information Science, February.
- C. Lundquist, O. Frieder, D. Holmes, and D. Grossman, "A Parallel Relational Database Management System Approach to Relevance Feedback in Information Retrieval," Journal of the American Society of Information Science, April.
- O. Frieder, D. Grossman, A. Chowdhury, and G. Frieder, "Efficiency Considerations in Very Large Information Retrieval Servers," Journal of Digital Information (British Computer Society), 1(5), April. Invited paper.
- A. Chowdhury, O. Frieder, D. Grossman, and M. C. McCabe, "Analyses of Multiple-Evidence Combinations for Retrieval Strategies," ACM Twentieth SIGIR, New Orleans, Louisiana, September.
- D. Grossman, S. Beitzel, E. Jensen, and O. Frieder, "IIT Intranet Mediator: Bringing Data Together on a Corporate Intranet," IEEE IT PRO, January/February.
- A. Chowdhury, O. Frieder, D. Grossman, and M. McCabe, "Collection Statistics for Fast Duplicate Document Detection," ACM Transactions on Information Systems (TOIS), April.
More informationInformation Retrieval. (M&S Ch 15)
Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion
More informationCS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University
CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and
More informationChapter 6: Information Retrieval and Web Search. An introduction
Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods
More informationText Analytics. Index-Structures for Information Retrieval. Ulf Leser
Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf
More informationInfoBrief. Dell 2-Node Cluster Achieves Unprecedented Result with Three-tier SAP SD Parallel Standard Application Benchmark on Linux
InfoBrief Dell 2-Node Cluster Achieves Unprecedented Result with Three-tier SAP SD Parallel Standard Application Benchmark on Linux Leveraging Oracle 9i Real Application Clusters (RAC) Technology and Red
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval (Supplementary Material) Zhou Shuigeng March 23, 2007 Advanced Distributed Computing 1 Text Databases and IR Text databases (document databases) Large collections
More informationIBM's Regatta Still Lags NCR Teradata in Data Warehousing
Decision Framework, A. Butler, K. Strange Research Note 17 September 2002 's Regatta Still Lags NCR Teradata in Data Warehousing 's new Regatta server is maturing fast in the online transaction processing
More informationUNITED STATES. Performance Report. IBM Netfinity HUVLRQÃ -DQXDU\Ã
Performance Report IBM Netfinity 7000 9HUVLRQÃ -DQXDU\Ã Š Executive Overview The performance of the IBM Netfinity* 7000 server, announced worldwide in September 1997, was evaluated using the following
More informationText Analytics. Index-Structures for Information Retrieval. Ulf Leser
Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf
More informationIn = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most
In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100
More informationExploiting Parallelism to Support Scalable Hierarchical Clustering
Exploiting Parallelism to Support Scalable Hierarchical Clustering Rebecca Cathey, Eric Jensen, Steven Beitzel, Ophir Frieder, David Grossman Information Retrieval Laboratory http://ir.iit.edu Background
More informationDesigning and Building an Automatic Information Retrieval System for Handling the Arabic Data
American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far
More informationCS 6320 Natural Language Processing
CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic
More informationDoes the TPC still have relevance? H. Reza Taheri HPTS 2017, 9-Oct-2017
Does the TPC still have relevance? H. Reza Taheri HPTS 2017, 9-Oct-2017 2016 VMware Inc. All rights reserved. Outline History of the TPC Where things stand today Why the decline? The way forward Not gonna
More informationCS54701: Information Retrieval
CS54701: Information Retrieval Basic Concepts 19 January 2016 Prof. Chris Clifton 1 Text Representation: Process of Indexing Remove Stopword, Stemming, Phrase Extraction etc Document Parser Extract useful
More informationChapter 2. Architecture of a Search Engine
Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them
More informationKeyword Search in Databases
Keyword Search in Databases Wei Wang University of New South Wales, Australia Outline Based on the tutorial given at APWeb 2006 Introduction IR Preliminaries Systems Open Issues Dr. Wei Wang @ CSE, UNSW
More informationBasic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval
Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval 1 Naïve Implementation Convert all documents in collection D to tf-idf weighted vectors, d j, for keyword vocabulary V. Convert
More informationDQpowersuite. Superior Architecture. A Complete Data Integration Package
DQpowersuite Superior Architecture Since its first release in 1995, DQpowersuite has made it easy to access and join distributed enterprise data. DQpowersuite provides an easy-toimplement architecture
More informationElementary IR: Scalable Boolean Text Search. (Compare with R & G )
Elementary IR: Scalable Boolean Text Search (Compare with R & G 27.1-3) Information Retrieval: History A research field traditionally separate from Databases Hans P. Luhn, IBM, 1959: Keyword in Context
More informationRelease Notes. BMC Performance Manager Express for Hardware by Sentry Software Version January 18, 2007
Release Notes BMC Performance Manager Express for Hardware Version 2.3.00 January 18, 2007 Sentry Software is releasing version 2.3.00 of the BMC Performance Manager Express for Hardware. These release
More informationDatabase Server. 2. Allow client request to the database server (using SQL requests) over the network.
Database Server Introduction: Client/Server Systems is networked computing model Processes distributed between clients and servers. Client Workstation (usually a PC) that requests and uses a service Server
More informationCompaq AlphaServer ES40 Model 6/833 4 CPU. Client Server System. Total System Cost TPC-C Throughput Price/Performance Availability Date
Compaq AlphaServer ES40 Model 6/833 4 CPU Client Server System TPC-C Rev. 3.5 Report Date: February 26, 2001 Total System Cost TPC-C Throughput Price/Performance Availability Date $712,376 37,274 $19.11
More informationLecture 14: I/O Benchmarks, Busses, and Automated Data Libraries Professor David A. Patterson Computer Science 252 Spring 1998
Lecture 14: I/O Benchmarks, Busses, and Automated Data Libraries Professor David A. Patterson Computer Science 252 Spring 1998 DAP Spr. 98 UCB 1 Review: A Little Queuing Theory Queue System server Proc
More informationIIT at TREC-10. A. Chowdhury AOL Inc. D. Holmes NCR Corporation
IIT at TREC-10 M. Aljlayl, S. Beitzel, E. Jensen Information Retrieval Laboratory Department of Computer Science Illinois Institute of Technology Chicago, IL 60616 {aljlayl, beitzel, jensen } @ ir.iit.edu
More information7. Query Processing and Optimization
7. Query Processing and Optimization Processing a Query 103 Indexing for Performance Simple (individual) index B + -tree index Matching index scan vs nonmatching index scan Unique index one entry and one
More informationCSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable
CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable References Bigtable: A Distributed Storage System for Structured Data. Fay Chang et. al. OSDI
More informationmodern database systems lecture 4 : information retrieval
modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation
More informationData warehouse and Data Mining
Data warehouse and Data Mining Lecture No. 13 Teradata Architecture and its compoenets Naeem A. Mahoto Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and
More informationPerformance Optimization for Informatica Data Services ( Hotfix 3)
Performance Optimization for Informatica Data Services (9.5.0-9.6.1 Hotfix 3) 1993-2015 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic,
More informationContent Management in Large-Scale Information Retrieval Systems
Content Management in Large-Scale Information Retrieval Systems S. Beitzel Information Retrieval Laboratory Computer Science Department Illinois Institute of Technology Chicago, IL, U.S.A. steve@ir.iit.edu
More informationInformation Retrieval and Organisation
Information Retrieval and Organisation Dell Zhang Birkbeck, University of London 2015/16 IR Chapter 04 Index Construction Hardware In this chapter we will look at how to construct an inverted index Many
More informationWalking Four Machines by the Shore
Walking Four Machines by the Shore Anastassia Ailamaki www.cs.cmu.edu/~natassa with Mark Hill and David DeWitt University of Wisconsin - Madison Workloads on Modern Platforms Cycles per instruction 3.0
More informationCompSci 516: Database Systems. Lecture 20. Parallel DBMS. Instructor: Sudeepa Roy
CompSci 516 Database Systems Lecture 20 Parallel DBMS Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements HW3 due on Monday, Nov 20, 11:55 pm (in 2 weeks) See some
More informationDialog (interactive) data input. Reporting. Printing processing
Tutorials, D. Prior Research Note 24 February 2003 Who Sets the Pace in the SAP Performance 'Olympics'? SAP and its hardware vendors use many different application performance benchmarks. But records for
More informationParallel DBMS. Chapter 22, Part A
Parallel DBMS Chapter 22, Part A Slides by Joe Hellerstein, UCB, with some material from Jim Gray, Microsoft Research. See also: http://www.research.microsoft.com/research/barc/gray/pdb95.ppt Database
More informationDatabase Applications (15-415)
Database Applications (15-415) DBMS Internals- Part VI Lecture 17, March 24, 2015 Mohammad Hammoud Today Last Two Sessions: DBMS Internals- Part V External Sorting How to Start a Company in Five (maybe
More informationHP ProLiant DL580 G5. HP ProLiant BL680c G5. IBM p570 POWER6. Fujitsu Siemens PRIMERGY RX600 S4. Egenera BladeFrame PB400003R.
HP ProLiant DL58 G5 earns #1 overall four-processor performance; ProLiant BL68c takes #2 four-processor performance on Windows in two-tier SAP Sales and Distribution Standard Application Benchmark HP leadership
More informationTrack Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross
Track Join Distributed Joins with Minimal Network Traffic Orestis Polychroniou Rajkumar Sen Kenneth A. Ross Local Joins Algorithms Hash Join Sort Merge Join Index Join Nested Loop Join Spilling to disk
More informationOutline. Parallel Database Systems. Information explosion. Parallelism in DBMSs. Relational DBMS parallelism. Relational DBMSs.
Parallel Database Systems STAVROS HARIZOPOULOS stavros@cs.cmu.edu Outline Background Hardware architectures and performance metrics Parallel database techniques Gamma Bonus: NCR / Teradata Conclusions
More informationInformation Retrieval
Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,
More information4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015)
4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015) Benchmark Testing for Transwarp Inceptor A big data analysis system based on in-memory computing Mingang Chen1,2,a,
More informationQuickSpecs. ISG Navigator for Universal Data Access M ODELS OVERVIEW. Retired. ISG Navigator for Universal Data Access
M ODELS ISG Navigator from ISG International Software Group is a new-generation, standards-based middleware solution designed to access data from a full range of disparate data sources and formats.. OVERVIEW
More informationHP ProLiant delivers #1 overall TPC-C price/performance result with the ML350 G6
HP ProLiant ML350 G6 sets new TPC-C price/performance record ProLiant ML350 continues its leadership for the small business HP Leadership with the ML350 G6» The industry s best selling x86 2-processor
More informationQLE10000 Series Adapter Provides Application Benefits Through I/O Caching
QLE10000 Series Adapter Provides Application Benefits Through I/O Caching QLogic Caching Technology Delivers Scalable Performance to Enterprise Applications Key Findings The QLogic 10000 Series 8Gb Fibre
More informationParallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism
Parallel DBMS Parallel Database Systems CS5225 Parallel DB 1 Uniprocessor technology has reached its limit Difficult to build machines powerful enough to meet the CPU and I/O demands of DBMS serving large
More informationIntroduction to IR Systems: Supporting Boolean Text Search
Introduction to IR Systems: Supporting Boolean Text Search Ramakrishnan & Gehrke: Chapter 27, Sections 27.1 27.2 CPSC 404 Laks V.S. Lakshmanan 1 Information Retrieval A research field traditionally separate
More informationData Warehouse Tuning. Without SQL Modification
Data Warehouse Tuning Without SQL Modification Agenda About Me Tuning Objectives Data Access Profile Data Access Analysis Performance Baseline Potential Model Changes Model Change Testing Testing Results
More informationIntel Enterprise Solutions
Intel Enterprise Solutions Catalin Morosanu Business Development Manager High Performance Computing catalin.morosanu@intel.com Intel s figures 2003/Q104 Revenue 2003: $ 31 billion first Quarter 2004: $
More information6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS
Chapter 6 Indexing Results 6. INTRODUCTION The generation of inverted indexes for text databases is a computationally intensive process that requires the exclusive use of processing resources for long
More informationOverview of DB & IR. ICS 624 Spring Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa
ICS 624 Spring 2011 Overview of DB & IR Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 1 Example
More informationCIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing. University of Florida, CISE Department Prof.
CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing University of Florida, CISE Department Prof. Daisy Zhe Wang Text To Knowledge IR and Boolean Search Text to Knowledge (IE)
More informationArcInfo 9.0 System Requirements
ArcInfo 9.0 System Requirements This PDF contains system requirements information, including hardware requirements, best performance configurations, and limitations, for ArcInfo 9.0. HP HP-UX 11i (11.11)
More informationTrade-ins from qualified competitor products to Informix Dynamic Server V9
Software Announcement September 28, 2004 Trade-ins from qualified competitor products to Informix Dynamic Server V9 Overview This trade-in offering for Informix Dynamic Server (IDS) V9 gives you another
More informationBVRIT HYDERABAD College of Engineering for Women. Department of Computer Science and Engineering. Course Hand Out
BVRIT HYDERABAD College of Engineering for Women Department of Computer Science and Engineering Course Hand Out Subject Name : Information Retrieval Systems Prepared by : Dr.G.Naga Satish, Associate Professor
More informationInformation Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes
CS630 Representing and Accessing Digital Information Information Retrieval: Retrieval Models Information Retrieval Basics Data Structures and Access Indexing and Preprocessing Retrieval Models Thorsten
More informationIBM Systems: Helping the world use less servers
Agenda Server Consolidation Reasons Server Consolidation Methodology Power Systems Server Consolidation Server Consolidation Examples Demo of SCON Tool Mike Rede Field Technical Sales Specialist mrede@us.ibm.com
More informationHostname System Configuration Documentation
Hostname System Configuration Documentation Version 0.0 25-Jan-18 Delivered January 25, 2018 Version 0.0 By: Gary Neshanian (consultant) Nish Consulting 2336 Elden Ave., Suite G Costa Mesa, CA 92627 Phone
More informationSystems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15
Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2014/15 Lecture X: Parallel Databases Topics Motivation and Goals Architectures Data placement Query processing Load balancing
More informationDistributed computing: index building and use
Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput
More informationAccelerating Microsoft SQL Server Performance With NVDIMM-N on Dell EMC PowerEdge R740
Accelerating Microsoft SQL Server Performance With NVDIMM-N on Dell EMC PowerEdge R740 A performance study with NVDIMM-N Dell EMC Engineering September 2017 A Dell EMC document category Revisions Date
More informationnumber of documents in global result list
Comparison of different Collection Fusion Models in Distributed Information Retrieval Alexander Steidinger Department of Computer Science Free University of Berlin Abstract Distributed information retrieval
More informationIn-Memory Data Management
In-Memory Data Management Martin Faust Research Assistant Research Group of Prof. Hasso Plattner Hasso Plattner Institute for Software Engineering University of Potsdam Agenda 2 1. Changed Hardware 2.
More informationOracle Database Competency Center
Oracle Database Competency Center Suchai Yenruedee Consulting & Customer Support Director Advanced Solutions Application Hosting Services Database Competency Center Space: 167.54 sqm. Location: 7th Floor
More informationDatabase Applications (15-415)
Database Applications (15-415) DBMS Internals- Part VI Lecture 14, March 12, 2014 Mohammad Hammoud Today Last Session: DBMS Internals- Part V Hash-based indexes (Cont d) and External Sorting Today s Session:
More informationCS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University
CS377: Database Systems Text data and information retrieval Li Xiong Department of Mathematics and Computer Science Emory University Outline Information Retrieval (IR) Concepts Text Preprocessing Inverted
More informationSenior Technical Manager, ATG, Oracle Corporation. Vamsi Mudumba. High Availability. High Availability
High Availability High Availability Vamsi Mudumba Senior Technical Manager, ATG, Oracle Corporation Agenda HA Overview Availability Defined HA Importance Designing Solutions for HA Causes of Downtime HA
More informationIndexing in Search Engines based on Pipelining Architecture using Single Link HAC
Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Anuradha Tyagi S. V. Subharti University Haridwar Bypass Road NH-58, Meerut, India ABSTRACT Search on the web is a daily
More informationOracle9i Real Application Clusters. Principal Sales Consultant DB Tech. Team Oracle Corporation
Oracle9i Real Application Clusters Principal Sales Consultant DB Tech. Team Oracle Corporation What is a Cluster? Group of servers acting as single system Requires hardware (interconnect) software (clusterware)
More informationWhat happens. 376a. Database Design. Execution strategy. Query conversion. Next. Two types of techniques
376a. Database Design Dept. of Computer Science Vassar College http://www.cs.vassar.edu/~cs376 Class 16 Query optimization What happens Database is given a query Query is scanned - scanner creates a list
More informationCOURSE 12. Parallel DBMS
COURSE 12 Parallel DBMS 1 Parallel DBMS Most DB research focused on specialized hardware CCD Memory: Non-volatile memory like, but slower than flash memory Bubble Memory: Non-volatile memory like, but
More informationArcSDE 8.1 Questions and Answers
ArcSDE 8.1 Questions and Answers 1. What is ArcSDE 8.1? ESRI ArcSDE software is the GIS gateway that facilitates managing spatial data in a database management system (DBMS). ArcSDE allows you to manage
More informationJames Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!
James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation
More informationExperience the GRID Today with Oracle9i RAC
1 Experience the GRID Today with Oracle9i RAC Shig Hiura Pre-Sales Engineer Shig_Hiura@etagon.com 2 Agenda Introduction What is the Grid The Database Grid Oracle9i RAC Technology 10g vs. 9iR2 Comparison
More informationNEC Express5800 A2040b 22TB Data Warehouse Fast Track. Reference Architecture with SW mirrored HGST FlashMAX III
NEC Express5800 A2040b 22TB Data Warehouse Fast Track Reference Architecture with SW mirrored HGST FlashMAX III Based on Microsoft SQL Server 2014 Data Warehouse Fast Track (DWFT) Reference Architecture
More informationData, Information, and Databases
Data, Information, and Databases BDIS 6.1 Topics Covered Information types: transactional vsanalytical Five characteristics of information quality Database versus a DBMS RDBMS: advantages and terminology
More informationData about data is database Select correct option: True False Partially True None of the Above
Within a table, each primary key value. is a minimal super key is always the first field in each table must be numeric must be unique Foreign Key is A field in a table that matches a key field in another
More informationRouting and Ad-hoc Retrieval with the. Nikolaus Walczuch, Norbert Fuhr, Michael Pollmann, Birgit Sievers. University of Dortmund, Germany.
Routing and Ad-hoc Retrieval with the TREC-3 Collection in a Distributed Loosely Federated Environment Nikolaus Walczuch, Norbert Fuhr, Michael Pollmann, Birgit Sievers University of Dortmund, Germany
More informationVeritas NetBackup 6.5 Clients and Agents
Veritas NetBackup 6.5 Clients and Agents The Veritas NetBackup Platform Next-Generation Data Protection Overview Veritas NetBackup provides a simple yet comprehensive selection of innovative clients and
More informationItanium 2. Itanium.
Itanium 2 Itanium 2 Itanium www.intel.com/itanium2 ... 2... 2... 4... 4... 4... 4... 5... 5... 5... 6 Itanium 9MB L3 Itanium 2 1.60GHz Itanium Itanium 2 Itanium 2 Itanium 2 25% 1 5 15% IA-32 Itanium 2
More informationSAP SD Benchmark with DB2 and Red Hat Enterprise Linux 5 on IBM System x3850 M2
SAP SD Benchmark using DB2 and Red Hat Enterprise Linux 5 on IBM System x3850 M2 Version 1.0 November 2008 SAP SD Benchmark with DB2 and Red Hat Enterprise Linux 5 on IBM System x3850 M2 1801 Varsity Drive
More informationDatabase Group Research Overview. Immanuel Trummer
Database Group Research Overview Immanuel Trummer Talk Overview User Query Data Analysis Result Processing Talk Overview Fact Checking Query User Data Vocalization Data Analysis Result Processing Query
More informationCS47300: Web Information Search and Management
CS47300: Web Information Search and Management Prof. Chris Clifton 27 August 2018 Material adapted from course created by Dr. Luo Si, now leading Alibaba research group 1 AD-hoc IR: Basic Process Information
More informationSearching the Deep Web
Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index
More informationSoftware withdrawal: IBM VisualAge Pacbase V3.0 features Replacement available
Withdrawal Announcement October 14, 2003 Software withdrawal: IBM VisualAge Pacbase V3.0 features Replacement available Overview Effective January 9, 2004, IBM will withdraw from marketing VisualAge Pacbase
More informationBusinessObjects Enterprise XI 3.0 for Linux
Revision Date: February 22, 2010 BusinessObjects Enterprise XI 3.0 for Linux Overview Contents This document lists specific platforms and configurations for the BusinessObjects Enterprise XI 3.0 for Linux.
More informationCopyright 2016 Ramez Elmasri and Shamkant B. Navathe
CHAPTER 19 Query Optimization Introduction Query optimization Conducted by a query optimizer in a DBMS Goal: select best available strategy for executing query Based on information available Most RDBMSs
More informationTHE WEB SEARCH ENGINE
International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com
More informationVERITAS Storage Foundation 4.0 TM for Databases
VERITAS Storage Foundation 4.0 TM for Databases Powerful Manageability, High Availability and Superior Performance for Oracle, DB2 and Sybase Databases Enterprises today are experiencing tremendous growth
More informationInformation Retrieval
Information Retrieval Natural Language Processing: Lecture 12 30.11.2017 Kairit Sirts Homework 4 things that seemed to work Bidirectional LSTM instead of unidirectional Change LSTM activation to sigmoid
More informationFall 2018: Introduction to Data Science GIRI NARASIMHAN, SCIS, FIU
Fall 2018: Introduction to Data Science GIRI NARASIMHAN, SCIS, FIU !2 MapReduce Overview! Sometimes a single computer cannot process data or takes too long traditional serial programming is not always
More informationOutline. Database Management and Tuning. Outline. Join Strategies Running Example. Index Tuning. Johann Gamper. Unit 6 April 12, 2012
Outline Database Management and Tuning Johann Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Unit 6 April 12, 2012 1 Acknowledgements: The slides are provided by Nikolaus Augsten
More informationNew Magic Quadrant Definitions
Markets, A. Butler Research Note 11 September 2003 Magic Quadrant for Enterprise Servers, 2003 This new Magic Quadrant addresses the changes in server workloads in large organizations. It covers rack-optimized
More information